diff --git a/.gitignore b/.gitignore index 5ecafef..9350b36 100644 --- a/.gitignore +++ b/.gitignore @@ -1,7 +1,3 @@ -build -dist -.vscode - *.pyc __pycache__ python/build diff --git a/README.md b/README.md index bee9f66..9c6075c 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,10 @@ -**NOTE: this software is part of the BenchBot software stack, and not intended to be run in isolation (although it can be installed independently through pip & run on results files if desired). For a working BenchBot system, please install the BenchBot software stack by following the instructions [here](https://github.com/roboticvisionorg/benchbot).** +**NOTE: this software is part of the BenchBot software stack, and not intended to be run in isolation (although it can be installed independently through pip and run on results files if desired). For a working BenchBot system, please install the BenchBot software stack by following the instructions [here](https://github.com/qcr/benchbot).** # BenchBot Evaluation -BenchBot Evaluation is a library of functions used to evaluate the performance of a BenchBot system in two core semantic scene understanding tasks: semantic SLAM, and scene change detection. The easiest way to use this module is through the helper scripts provided with the [BenchBot software stack](https://github.com/roboticvisionorg/benchbot). -## Installing & performing evaluation with BenchBot Evaluation +BenchBot Evaluation is a library of functions used to call evaluation methods. These methods are installed through the [BenchBot Add-ons Manager](https://github.com/qcr/benchbot-addons), and evaluate the performance of a BenchBot system against the metric. The easiest way to use this module is through the helper scripts provided with the [BenchBot software stack](https://github.com/qcr/benchbot). + +## Installing and performing evaluation with BenchBot Evaluation BenchBot Evaluation is a Python package, installable with pip. Run the following in the root directory of where this repository was cloned: @@ -11,184 +12,44 @@ BenchBot Evaluation is a Python package, installable with pip. Run the following u@pc:~$ pip install . ``` -Although evaluation is best run from within the BenchBot software stack, it can be run in isolation if desired. The following code snippet shows how to perform evaluation from Python: +Although evaluation is best run from within the BenchBot software stack, it can be run in isolation if desired. 
The following code snippet shows how to perform evaluation with the `'omq'` method from Python: ```python -from benchbot_eval.evaluator import Evaluator - -results_filename = "/path/to/your/proposed/map/results.json" -ground_truth_folder = "/path/to/your/ground_truth/folder" -save_file = "/path/to/save/file/for/scores.txt" +from benchbot_eval.evaluator import Evaluator, Validator -my_evaluator = Evaluator(results_filename, ground_truth_folder, save_file) -my_evaluator.evaluate() +Validator(results_file).validate_results_data() +Evaluator('omq', scores_file).evaluate() ``` -This prints the final scores to the screen and saves them to the named file: -- `results_filename`: points to the JSON file with the output from your experiment (in the format [described below](#the-results-format)) +This prints the final scores to the screen and saves them to a file using the following inputs: + +- `results_file`: points to the JSON file with the output from your experiment - `ground_truth_folder`: the directory containing the relevant environment ground truth JSON files - `save_file`: is where final scores are to be saved -## The results format - -Results for both semantic SLAM & scene change detection tasks consist of an object-based semantic map, and associated task metadata. Results from the two types of task differ only in that objects in scene change detection tasks require a probability distribution describing the suggested state change (`'state_probs'`). See further below for more details. - -Our results format defines an object-based semantic map as a list of objects with information about both their semantic label & spatial geometry: -- Semantic label information is provided through a probability distribution (`'label_probs'`) across either a provided (`'class_list'`), or our default class list: - ``` - bottle, cup, knife, bowl, wine glass, fork, spoon, banana, apple, orange, cake, - potted plant, mouse, keyboard, laptop, cell phone, book, clock, chair, table, - couch, bed, toilet, tv, microwave, toaster, refrigerator, oven, sink, person, - background - ``` -- Spatial geometry information is provided by describing the location (`'centroid'`) of a cuboid in 3D space, whose dimensions are `'extent'`. - -The entire required results format is outlined below: -``` -{ - 'task_details': { - 'type': , - 'control_mode': , - 'localisation_mode': - }, - 'environment_details': { - 'name': , - 'numbers': [] - }, - 'objects': [ - { - 'label_probs': [], - 'centroid': [, , ], - 'extent': [, , ], - 'state_probs': [, , ] - }, - ... - ] - 'class_list': [] -} -``` - -Notes: -- `'task_details'` and `'environment_details'` define what task was being solved when the results file was produced, & what environment the agent was operating in. These values are defined in the BenchBot software stack when `benchbot_run` is executed to start a new task. For example: - ``` - u@pc:~$ benchbot_run --task semantic_slam:passive:ground_truth --env miniroom:1 - ``` - - would produce the following: - - ```json - "task_details": { - "type": "semantic_slam", - "control_mode": "passive", - "localisation_mode": "ground_truth" - }, - "environment_details": { - "name": "miniroom", - "numbers": [1] - } - ``` - - The above dicts can be obtained at runtime through the `BenchBot.task_details` & `BenchBot.environment_details` [API properties](https://github.com/roboticvisionorg/benchbot_api). 
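  As a minimal sketch (assuming the old results format described in this section, and using only the API properties named above), that metadata can be copied straight into a results dict before the `'objects'` list is filled in:

  ```python
  from benchbot_api import BenchBot

  b = BenchBot()
  results = {
      'task_details': b.task_details,                # as set by benchbot_run
      'environment_details': b.environment_details,
      'objects': []                                  # filled in per the notes below
  }
  ```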
-- For `'task_details'`: - - `'type'` must be either `'semantic_slam'` or `'scd'` - - `'control_mode'` must be either `'passive'` or `'active'` - - `'localisation_mode'` must be either `'ground_truth'` or `'dead_reckonoing'` -- For `'environment_details'`: - - `'name'` must be present (any name is accepted, but will fail at evaluation time if an associated ground truth cannot be found) - - `'numbers'` must be a list of integers (or strings which can be converted to integers) between 1 & 5 inclusive -- For each object in `'objects'` the objects list: - - `'label_probs'` is the probability distribution for the suggested object label corresponding to the class list in `'class_list'`, or our default class list above (must be a list of numbers) - - `'centroid'` is the 3D coordinates for the centre of the object's cuboid (must be a list of 3 numbers) - - `'extent'` is the **full** width, height, & depth of the cuboid (must be a list of 3 numbers) - - **Note** the cuboid described by `'centroid'` & `'extent'` must be axis-aligned in global coordinates, & use metres for units - - `'state_probs'` must be a list of 3 numbers corresponding to the probability that the object was added, removed, or changed respectively (**only** required when `'type'` is `'scd'` in `'task_details'`) - - **Note** if your system is not probabilistic, simply use all 0s & a single 1 for any of the distributions above (e.g. `'state_probs'` of `[1, 0, 0]` for an added object) -- `'class_list'` is a list of strings defining a custom order for the probabilities in the `'label_probs'` distribution field of objects (if not provided the default class list & order is assumed). Other notes on `class_list`: - - there is some support given for class name synonyms in `./benchbot_eval/class_list.py`. - - any class names given that are not in our class list, & don't have an appropriate synonym, will have their probability added to the `'background'` class (this avoids over-weighting label predictions solely because your detector had classes we don't support) -- all probability distributions in `'label_probs'` & `'state_probs'` are normalized if their total probability is greater than 1, or have the missing probability added to the final class (`'background'` or `'unchanged'`) - -## Generating results for evaluation - -An algorithm attempting to solve a semantic scene understanding task only has to fill in the list of `'objects'` and the `'class_list'` field (only if a custom class list has been used); everything else can be pre-populated using the [provided BenchBot API methods](https://github.com/roboticvisionorg/benchbot_api). Using these helper methods, only a few lines of code is needed to create results that can be used with our evaluator: - -```python -from benchbot_api import BenchBot - -b = BenchBot() - -my_map = ... # TODO solve the task, & get a map with a set of objects - -results = b.empty_results() # new results with pre-populated metadata -for o in my_map.objects: - new_obj = b.empty_object() # new object result with correct fields - new_obj[...] = ... # TODO insert data from object in my_map - results['objects'].append(new_obj) # add new object result to results -``` - -Alternatively, users of the `Agent` class can use the data provided in the `Agent.save_result()` function call: - -```python -from benchbot_api import Agent - -class MyAgent(Agent): - def __init__(self): - ... - my_map = ... - - ... 
- - def save_result(self, filename, empty_results, empty_object_fn): - for o in self.my_map: - new_obj = empty_object_fn() # new object result with correct fields - new_obj[...] = ... # TODO insert data from object in my_map - empty_results['objects'].append(new_obj) # add new object result to results - - # TODO write the results to 'filename'... - -``` - -## Specific details of the evaluation process - -Solutions to our semantic scene understanding tasks require the generation of an object-based semantic map for given environments that the robot has explored. The tasks merely differ in whether the object-based semantic map is of all the objects in an environment, or the objects that have changed between two scenes of an environment. - +## How add-ons interact with BenchBot Evaluation -### Object Map Quality (OMQ) +Two types of add-ons are used in the BenchBot Evaluation process: format definitions, and evaluation methods. An evaluation method's YAML file defines what results formats and ground truth formats the method supports. This means: -To evaluate object-based semantic maps we introduce a novel metric called Object Map Quality (OMQ). Object map quality compares the object cuboids of a ground-truth object-based semantic map with the object cuboids provided by a generated object-based semantic map. The metric compares both geometric overlap of the generated map with the ground-truth map, and the accuracy of the semantically labelling in the generated map. +- this package requires installation of the [BenchBot Add-ons Manager](https://github.com/qcr/benchbot_addons) for interacting with installed add-ons +- the `results_file` must be a valid instance of a supported format +- there must be a valid ground truth available in a supported format, for the same environment as the results +- validity is determined by the format-specific validation function described in the format's YAML file -The steps for calculating OMQ for a generated object-based semantic map are as follows: +Please see the [BenchBot Add-ons Manager's documentation](https://github.com/qcr/benchbot_addons) for further details on the different types of add-ons. -1. Compare each object in the generated map to all ground-truth objects, by calculating a *pairwise object quality* score for each pair. The pairwise object quality is the geometric mean of all object sub-qualities. Standard OMQ has two object sub-quality scores: - - *spatial quality* is the 3D IoU of the ground-truth and generated cuboids being compared - - *label quality* is the probability assigned to the class label of the ground-truth object being compared to +## Creating valid results and ground truth files - Note that the pairwise object quality will be zero if either sub-quality score is zero - like when there is no overlap between object cuboids. +The [BenchBot software stack](https://github.com/qcr/benchbot) includes tools to assist in creating results and ground truth files: -2. Each object in the generated map is assigned the ground-truth object with the highest non-zero pairwise quality, establishing the *"true positives"* (some non-zero quality), *false negatives* (ground-truth objects with no pairwise quality match), and *false positives* (objects in the generated map with no pairwise quality match). +- **results:** are best created using the `empty_results()` and `results_functions()` helper functions in the [BenchBot API](https://github.com/qcr/benchbot_api), which automatically populate metadata for your current task and environment. 
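  A rough sketch of that workflow follows (only the helper names above come from the API documentation; what each call returns is format-specific and not shown here):

  ```python
  from benchbot_api import BenchBot

  b = BenchBot()
  results = b.empty_results()      # results dict with task/environment metadata pre-filled
  helpers = b.results_functions()  # format-specific helpers for populating the results
  ```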
+- **ground truths:** this package includes a `GroundTruthCreator` class to aid in creating ground truths of a specific format, for a specific environment. Example use includes: -3. A *false positive cost*, defined as the maximum confidence given to a non-background class, is given for all false positive objects in the generated map + ```python + from benchbot_eval.ground_truth_creator import GroundTruthCreator -4. Overall OMQ score is calculated as the sum of all "true positive" qualities divided by the sum of: number of "true positives", number of false negatives, & total false positive cost - -Notes: -- Average pairwise qualities for the set of "true positive" objects are often also provided with the overall OMQ score. Average pairwise qualities include an average overall quality, as well as averages for each of the object sub-qualities (spatial & label for standard OMQ) -- OMQ is based on the probabilistic object detection quality measure PDQ, which is described in [our paper](http://openaccess.thecvf.com/content_WACV_2020/papers/Hall_Probabilistic_Object_Detection_Definition_and_Evaluation_WACV_2020_paper.pdf) and [accompanying code](https://github.com/david2611/pdq_evaluation)). - -### Evaluating Semantic SLAM with OMQ - -![semantic_slam_object_map](./docs/semantic_slam_obmap.png) - -Evaluation compares the object-based semantic map (as shown above) generated by the robot with the ground-truth map of the environment using OMQ. The evaluation metric is used exactly as described above. - -### Evaluating scene change detection (SCD) with OMQ - -![scene_change_detection_object_map](./docs/scd_obmap.png) - -Scene change detection (SCD) creates object-based semantic maps comprising of the objects that have *changed* between two scenes. Valid changes are the addition or removal of an object, with unchanged provided as a third state to capture uncertainty in state change. - -Evaluation of SCD compares the object-based semantic map of changed objects generated by the robot with the ground-truth map of object changes between two scenes in an environment. The comparison is done using OMQ, but a third pairwise object sub-quality is added to capture the quality of the detected state change: -- *state quality* is the probability given to the correct change state (e.g. [0.4, 0.5, 0.1] on an added object would get a state score of 0.4) -- *final pairwise quality* ([step 1 of OMQ](#object-map-quality-(omq))) is now the geometric mean of three sub-quality scores (spatial, label, & state) -- *false positive cost* ([step 3 of OMQ](#object-map-quality-(omq))) is now the geometric mean of both the maximum label confidence given to a non-background class, & the maximum state confidence of a added or removed state change (i.e. not unchanged). This means both overconfidence in label & state change will increase the false positive cost. -- **Note:** including state quality changes the quality scores for pairs of generated & ground-truth objects due to averaging over 3 terms instead of 2 (i.e. pairwise scores will be different between semantic SLAM & SCD tasks) + gtc = GroundTruthCreator('object_map_ground_truth', 'miniroom:1') + gt = gtc.create_empty(); + print(gtc.functions()) # ['create', 'create_object'] + gt['ground_truth']['objects'][0] = gtc.functions('create_object') + ``` diff --git a/benchbot_eval/__init__.py b/benchbot_eval/__init__.py index a7d1af0..4759874 100644 --- a/benchbot_eval/__init__.py +++ b/benchbot_eval/__init__.py @@ -1,5 +1,6 @@ -from . 
import evaluator, omq, class_list, iou_tools +from . import evaluator, validator from .evaluator import Evaluator +from .validator import Validator -__all__ = ['evaluator', 'iou_tools', 'class_list', 'omq'] +__all__ = ['evaluator', 'validator'] diff --git a/benchbot_eval/class_list.py b/benchbot_eval/class_list.py deleted file mode 100644 index 78dea86..0000000 --- a/benchbot_eval/class_list.py +++ /dev/null @@ -1,68 +0,0 @@ -CLASS_LIST = [ - 'bottle', 'cup', 'knife', 'bowl', 'wine glass', 'fork', 'spoon', 'banana', - 'apple', 'orange', 'cake', 'potted plant', 'mouse', 'keyboard', 'laptop', - 'cell phone', 'book', 'clock', 'chair', 'table', 'couch', 'bed', 'toilet', - 'tv', 'microwave', 'toaster', 'refrigerator', 'oven', 'sink', 'person', - 'background' -] - -CLASS_IDS = {class_name: idx for idx, class_name in enumerate(CLASS_LIST)} - -# Some helper synonyms, to handle cases where multiple words mean the same -# class. This list is used to sanitise class name lists both in ground truth & -# submitted maps -SYNONYMS = { - 'television': 'tv', - 'tvmonitor': 'tv', - 'tv monitor': 'tv', - 'computer monitor': 'tv', - 'coffee table': 'table', - 'dining table': 'table', - 'kitchen table': 'table', - 'desk': 'table', - 'stool': 'chair', - 'sofa': 'couch', - 'diningtable': 'dining table', - 'pottedplant': 'potted plant', - 'plant': 'potted plant', - 'cellphone': 'cell phone', - 'mobile phone': 'cell phone', - 'mobilephone': 'cell phone', - 'wineglass': 'wine glass', - - # background classes - 'none': 'background', - 'bg': 'background', - '__background__': 'background' -} - - -def get_nearest_class_id(class_name): - """ - Given a class string, find the id of that class - This handles synonym lookup as well - :param class_name: the name of the class being looked up (can be synonym from SYNONYMS) - :return: an integer corresponding to nearest ID in CLASS_LIST, or None - """ - class_name = class_name.lower() - if class_name in CLASS_IDS: - return CLASS_IDS[class_name] - elif class_name in SYNONYMS: - return CLASS_IDS[SYNONYMS[class_name]] - return None - - -def get_nearest_class_name(class_name): - """ - Given a string that might be a class name, - return a string that is definitely a class name. - Again, uses synonyms to map to known class names - :param potential_class_name: the queried class name - :return: the nearest class name from CLASS_LIST, or None - """ - class_name = class_name.lower() - if class_name in CLASS_IDS: - return class_name - elif class_name in SYNONYMS: - return SYNONYMS[class_name] - return None diff --git a/benchbot_eval/evaluator.py b/benchbot_eval/evaluator.py index 494e420..80931d9 100644 --- a/benchbot_eval/evaluator.py +++ b/benchbot_eval/evaluator.py @@ -1,495 +1,98 @@ -from __future__ import print_function - -import inspect import json -import os +import pickle import pprint -import re -import numpy as np -import subprocess -import sys -import warnings -import zipfile -from .omq import OMQ -from . import class_list as cl +from benchbot_addons import manager as bam -# Needed to simply stop it printing the source code text with the warning... -warnings.formatwarning = (lambda msg, cat, fn, ln, line: "%s:%d: %s: %s\n" % - (fn, ln, cat.__name__, msg)) +from . 
import helpers class Evaluator: - _TYPE_SEMANTIC_SLAM = 'semantic_slam' - _TYPE_SCD = 'scd' - - _REQUIRED_RESULTS_STRUCTURE = { - 'task_details': { - 'type': (lambda value: value in - [Evaluator._TYPE_SEMANTIC_SLAM, Evaluator._TYPE_SCD]), - 'control_mode': - lambda value: value in ['passive', 'active'], - 'localisation_mode': - lambda value: value in ['ground_truth', 'dead_reckoning'] - }, - 'environment_details': { - 'name': - None, - 'numbers': (lambda value: type(value) is list and all( - int(x) in range(1, 6) for x in value)) - }, - 'objects': None - } - - _REQUIRED_OBJECT_STRUCTURE = { - 'label_probs': None, - 'centroid': lambda value: len(value) == 3, - 'extent': lambda value: len(value) == 3 - } - - _REQUIRED_SCD_OBJECT_STRUCTURE = { - 'state_probs': lambda value: len(value) == 3 - } - - _ZIP_IGNORE = ["submission.json"] - - __LAMBDA_REGEX = [ - (r'Evaluator._TYPE_([^,^\]]*)', lambda x: "'%s'" % x.group(1).lower()), - (r' +', r' '), (r'^.*?(\(|lambda)', r'\1'), (r', *$', r''), - (r'\n', r''), (r'^\((.*)\)$', r'\1') - ] - def __init__(self, - results_filenames, - ground_truth_dir, + evaluation_method, scores_filename, - print_all=True, - required_task=None, - required_envs=None): - # Confirm we have a valid submission file, & ground truth directory - if not os.path.exists(ground_truth_dir): - raise ValueError("ERROR: Ground truths directory " - "'%s' does not exist." % ground_truth_dir) - for r in results_filenames: - if not os.path.exists(r): - raise ValueError("ERROR: Results file '%s' does not exist." % - r) - - # We have valid parameters, save them & return - self.results_filenames = results_filenames - self.ground_truth_dir = ground_truth_dir + skip_load=False, + print_all=True): + self.evaluation_method_data = bam.get_match( + "evaluation_methods", [("name", evaluation_method)], + return_data=True) self.scores_filename = scores_filename - self.print_all = print_all - self.required_task = required_task - self.required_envs = required_envs - - @staticmethod - def __lambda_to_text(l): - s = inspect.getsource(l) - for a, b in Evaluator.__LAMBDA_REGEX: - s = re.sub(a, b, s, re.DOTALL) - return s - - @staticmethod - def __validate(data_dict, reference_dict, name): - for k, v in reference_dict.items(): - if k not in data_dict: - raise ValueError("Required key '%s' not found in '%s'" % - (k, name)) - elif type(v) is dict and type(data_dict[k]) is not dict: - raise ValueError("Value for '%s' in '%s' is not a dict" % - (k, name)) - elif type(v) is dict: - Evaluator.__validate(data_dict[k], v, name + "['%s']" % k) - elif v is not None and not v(data_dict[k]): - raise ValueError( - "Key '%s' in '%s' has value '%s', " - "which fails the check:\n\t%s" % - (k, name, data_dict[k], Evaluator.__lambda_to_text(v))) - @staticmethod - def _create_scores(task_details, - environment_details, - scores_omq, - scores_avg_pairwise, - scores_avg_label, - scores_avg_spatial, - scores_avg_fp_quality, - scores_avg_state_quality=None): - return { - 'task_details': task_details, - 'environment_details': environment_details, - 'scores': { - 'OMQ': - scores_omq, - 'avg_pairwise': - scores_avg_pairwise, - 'avg_label': - scores_avg_label, - 'avg_spatial': - scores_avg_spatial, - 'avg_fp_quality': - scores_avg_fp_quality, - **({} if scores_avg_state_quality is None else { - 'avg_state_quality': scores_avg_state_quality - }) - } - } - - @staticmethod - def _evaluate_scd(results_data, ground_truth_data): - # Takes in results data from a BenchBot submission and evaluates the - # difference map to results - - # Use the 
ground truth object-based semantic maps for each scene to - # derive the ground truth scene change semantic map (use empty lists to - # handle missing fields just in case) - # NOTE: ground truth uses a flag to determine change state rather than - # the distribution provided in the submission results - es = Evaluator._get_env_strings(results_data['environment_details']) - gt_objects_1 = (ground_truth_data[es[0]]['objects'] - if 'objects' in ground_truth_data[es[0]] else []) - gt_objects_2 = (ground_truth_data[es[1]]['objects'] - if 'objects' in ground_truth_data[es[1]] else []) - gt_changes = [{ - **o, 'state': 'removed' - } for o in gt_objects_1 if o not in gt_objects_2] - gt_changes += [{ - **o, 'state': 'added' - } for o in gt_objects_2 if o not in gt_objects_1] - - # Grab an evaluator instance, & use it to return some results - evaluator = OMQ(scd_mode=True) - return Evaluator._create_scores( - task_details=results_data['task_details'], - environment_details=results_data['environment_details'], - scores_omq=evaluator.score([(gt_changes, results_data['objects']) - ]), - scores_avg_pairwise=evaluator.get_avg_overall_quality_score(), - scores_avg_label=evaluator.get_avg_label_score(), - scores_avg_spatial=evaluator.get_avg_spatial_score(), - scores_avg_fp_quality=evaluator.get_avg_fp_score(), - scores_avg_state_quality=evaluator.get_avg_state_score()) - - @staticmethod - def _evaluate_semantic_slam(results_data, ground_truth_data): - # Takes in results data from a BenchBot submission, evaluates the - # result using the ground truth data, & then spits out a dict of scores - # data - - # Get ground truth objects from the correct ground truth set - gt_data = ground_truth_data[Evaluator._get_env_string( - results_data['environment_details'])] - gt_objects = (gt_data['objects'] if 'objects' in gt_data else []) + self.evaluate_fns = bam.load_functions(self.evaluation_method_data) + assert 'evaluate' in self.evaluate_fns.keys(), ( + "No 'evaluate' function was found when loading method '%s'", + evaluation_method) + self.evaluate_fn = self.evaluate_fns['evaluate'] - # Grab an evaluator instance, & use it to return some results - evaluator = OMQ() - return Evaluator._create_scores( - task_details=results_data['task_details'], - environment_details=results_data['environment_details'], - scores_omq=evaluator.score([(gt_objects, results_data['objects']) - ]), - scores_avg_pairwise=evaluator.get_avg_overall_quality_score(), - scores_avg_label=evaluator.get_avg_label_score(), - scores_avg_spatial=evaluator.get_avg_spatial_score(), - scores_avg_fp_quality=evaluator.get_avg_fp_score()) + self.results_data = None - @staticmethod - def _get_task_string(task_details): - return "%s:%s:%s" % (task_details['type'], - task_details['control_mode'], - task_details['localisation_mode']) + self.required_task = None + self.required_envs = None - @staticmethod - def _get_env_string(environment_details): - return ":".join([environment_details['name']] + - [str(x) for x in environment_details['numbers']]) - - @staticmethod - def _get_env_strings(environment_details): - # Returns a list of strings referring to an individual environment - # scene - return ([ - "%s:%s" % (environment_details['name'], i) - for i in environment_details['numbers'] - ]) - - @staticmethod - def _ground_truth_file(ground_truth_dir, name, number): - filename = subprocess.check_output( - "find %s -name '%s_%s.json'" % (ground_truth_dir, name, number), - shell=True).decode(sys.stdout.encoding).split('\n')[0] - if not os.path.exists(filename): - 
raise ValueError( - "Results request a ground truth for variation " - "#%s of environment '%s', but a corresponding ground truth " - "file (%s) could not be found in '%s'." % - (number, name, filename, ground_truth_dir)) - return filename - - @staticmethod - def _load_ground_truth_data(ground_truth_dir, envs_details_list): - # Takes a list of envs, & loads the associated ground truth files - gtd = {} # Dict of ground truth data, with env_string as keys - for e in envs_details_list: - env_strs = Evaluator._get_env_strings(e) - for i, s in zip(e['numbers'], env_strs): - if s not in gtd: - fn = Evaluator._ground_truth_file(ground_truth_dir, - e['name'], i) - print("Loading ground truth data from '%s' ..." % fn) - with open(fn, 'r') as f: - # NOTE should remove format step in time - gtd[s] = Evaluator._sanitise_ground_truth( - (json.load(f))) - print("\tDone.") - return gtd - - @staticmethod - def _load_results_data(results_filenames): - # Takes a list of filenames & pulls all data from JSON & *.zip files - # (sorry... nesting abomination...) - results = {} # Dict of provided data, with filenames as keys - for r in results_filenames: - print("Loading data from '%s' ..." % r) - if zipfile.is_zipfile(r): - with zipfile.ZipFile(r, 'r') as z: - for f in z.filelist: - if f.filename in Evaluator._ZIP_IGNORE: - print("\tIgnoring file '%s'" % f.filename) - else: - with z.open(f, 'r') as zf: - d = None - try: - d = json.load(zf) - except: - print("\tSkipping file '%s'" % f.filename) - continue # Failure is fine / expected here! - print("\tExtracting data from file '%s'" % - f.filename) - results[z.filename + ':' + f.filename] = ( - Evaluator.sanitise_results_data(d)) - else: - with open(r, 'r') as f: - results[r] = Evaluator.sanitise_results_data(json.load(f)) - print("\tDone.") - return results - - @staticmethod - def _sanitise_ground_truth(ground_truth_data): - # This code is only needed as we have a discrepancy between the format - # of ground_truth_data produced in ground truth generation, & what the - # evaluation process expects. Long term, the discrepancy should be - # rectified & this code removed. - for o in ground_truth_data['objects']: - o['class_id'] = cl.get_nearest_class_id( - o.pop('class')) # swap name for ID - return ground_truth_data - - @staticmethod - def _validate_object_data(object_data, object_number, scd=False): - # Validates whether an object has all of the required fields - try: - Evaluator.__validate(object_data, - Evaluator._REQUIRED_OBJECT_STRUCTURE, - 'object') - if scd: - Evaluator.__validate(object_data, - Evaluator._REQUIRED_SCD_OBJECT_STRUCTURE, - 'object') - except Exception as e: - raise ValueError("Validation of object #%s failed: %s" % - (object_number, e)) - - @staticmethod - def _validate_results_data(results_data): - # Validates whether a results dict has all of the required fields - try: - Evaluator.__validate(results_data, - Evaluator._REQUIRED_RESULTS_STRUCTURE, - 'results_data') - except Exception as e: - raise ValueError("Results validation failed: %s" % e) - - @staticmethod - def _validate_results_set(results_set, - required_task=None, - required_envs=None): - # Validates whether a set of results meets requirements as a set (i.e. 
- # all tasks the same, & possible task / environment contstraints) - task_str = required_task - env_strs = [] - for f, d in results_set.items(): - s = Evaluator._get_task_string(d['task_details']) - if task_str is None: - task_str = s - elif s != task_str and required_task is None: - raise ValueError( - "JSON result files can only be evaluated together if " - "they are for the same task. File '%s' was for task '%s', " - "whereas file '%s' was for task '%s'." % - (results_set[0][0], task_str, f, s)) - elif s != task_str: - raise ValueError( - "Evaluator was configured to only accept results for task " - "'%s', but results file '%s' is for task '%s'" % - (required_task, f, s)) - - s = Evaluator._get_env_string(d['environment_details']) - if (required_envs is not None and s not in required_envs): - raise ValueError( - "Evaluator was configured to require environments: %s. " - "Results file '%s' is for environment '%s' which is not " - "in the list." % (", ".join(required_envs), f, s)) - elif s in env_strs: - raise ValueError( - "Evaluator received multiple results for environment '%s'. " - "Only one result is permitted per environment." % s) - env_strs.append(s) - - # Lastly, ensure we have all required environments if relevant - if required_envs is not None: - for e in required_envs: - if e not in env_strs: - raise ValueError( - "Evaluator was configured to require environments: " - "%s. No result was found for environment '%s'." % - (", ".join(required_envs), e)) - - @staticmethod - def sanitise_results_data(results_data): - is_scd = results_data['task_details']['type'] == Evaluator._TYPE_SCD - - # Validate the provided results data - Evaluator._validate_results_data(results_data) - for i, o in enumerate(results_data['objects']): - Evaluator._validate_object_data(o, i, scd=is_scd) - - # Use the default class_list if none is provided - if 'class_list' not in results_data or not results_data['class_list']: - warnings.warn( - "No 'class_list' field provided; assuming results have used " - "our default class list") - results_data['class_list'] = cl.CLASS_LIST - - # Sanitise all probability distributions for labels & states if - # applicable (sanitising involves dumping unused bins to the background - # / uncertain class, normalising the total probability to 1, & - # optionally rearranging to match a required order) - for o in results_data['objects']: - if len(o['label_probs']) != len(results_data['class_list']): - raise ValueError( - "The label probability distribution for object %d has a " - "different length (%d) \nto the used class list (%d). " % - (i, len(o['label_probs']), len( - results_data['class_list']))) - o['label_probs'] = Evaluator.sanitise_prob_dist( - o['label_probs'], results_data['class_list']) - if is_scd: - o['state_probs'] = Evaluator.sanitise_prob_dist( - o['state_probs']) - - # We have applied our default class list to the label probs, so update - # the class list in results_data - results_data['class_list'] = cl.CLASS_LIST - - return results_data - - @staticmethod - def sanitise_prob_dist(prob_dist, current_class_list=None): - # This code makes the assumption that the last bin is the background / - # "I'm not sure" class (it is an assumption because this function can - # be called with no explicit use of a class list) - BACKGROUND_CLASS_INDEX = -1 - - # Create a new prob_dist if we were given a current class list by - # converting all current classes to items in our current class - # list, & amalgamating all duplicate values (e.g. 
anything not - # found in our list will be added to the background class) - if current_class_list is not None: - new_prob_dist = [0.0] * len(cl.CLASS_LIST) - for i, c in enumerate(current_class_list): - new_prob_dist[BACKGROUND_CLASS_INDEX if cl. - get_nearest_class_id(c) is None else cl. - get_nearest_class_id(c)] += prob_dist[i] - prob_dist = new_prob_dist + self.print_all = print_all - # Either normalize the distribution if it has a total > 1, or dump - # missing probability into the background / "I'm not sure" class - total_prob = np.sum(prob_dist) - if total_prob > 1: - prob_dist /= total_prob - else: - prob_dist[BACKGROUND_CLASS_INDEX] += 1 - total_prob + if not skip_load: + self.load_validator() - return prob_dist + def load_validator(self): + with open(helpers.DUMP_LOCATION, 'rb') as f: + self.__dict__.update(pickle.load(f).__dict__) def evaluate(self): - # Iteratively load data from each results file (turning *.zips into a - # list of JSON results), & sanitise the data - print("LOADING REQUIRED DATA FOR %d PROVIDED FILES:\n" % - len(self.results_filenames)) - results_set = Evaluator._load_results_data(self.results_filenames) - - # Ensure the results set meets any requirements that may exist (all - # must be same task type, may have to be a required task type, may have - # to match a required list of environments) - Evaluator._validate_results_set(results_set, self.required_task, - self.required_envs) + # Get our valid results + valid_results = { + k: v + for k, v in self.results_data.items() if not v[helpers.SKIP_KEY] + } + if not valid_results: + print("Skipping to summary, as no valid results were provided.\n") - # Try & load all of the requested ground truth maps (failing loudly if - # a required ground truth can't be found) - ground_truth_data = Evaluator._load_ground_truth_data( - self.ground_truth_dir, - [r['environment_details'] for r in results_set.values()]) - print('\n' + '-' * 80 + '\n') + # Load available ground truth data + # TODO ideally this would only load what is required, but the manager + # in benchbot_addons would need to support nested 'get_match' queries + gt_data = bam.load_yaml_list(bam.find_all("ground_truths", 'json')) - # Iteratively evaluate each of the results JSONs provided, saving the - # scores so we can amalgamate them after + # Iterate through results, attempting evaluation + gt_formats = self.evaluation_method_data['valid_ground_truth_formats'] scores_data = [] - for f, d in results_set.items(): - print("EVALUATING PERFORMANCE OF RESULTS IN '%s':\n" % f) - - # Perform evaluation, selecting the appropriate evaluation function - scores_data.append( - (self._evaluate_scd - if d['task_details']['type'] == Evaluator._TYPE_SCD else - self._evaluate_semantic_slam)(d, ground_truth_data)) - - # Print the results if allowed, otherwise just say we're done + for k, v in valid_results.items(): + # Find a usable ground truth (one of supported formats, & for same + # environment as this result) + gts = [] + for e in v['environment_details']: + gt = next( + (g for g in gt_data + if bam.env_string(g['environment']) == bam.env_string(e) + and g['format']['name'] in gt_formats), None) + assert gt is not None, ( + "No ground truth could was found for '%s' with " + "any of the following formats:\n\t%s" % + (bam.env_string(e), gt_formats)) + gts.append(gt) + + # Score the result + scores_data.append(self.evaluate_fn(v, gts)) + + # Print the results (if allowed) if self.print_all: - print("\nScores for '%s':\n" % f) - pprint.pprint(scores_data[-1]) + print("\nScores 
for '%s':\n" % k) + pprint.pprint(scores_data[-1]['scores']) else: print("Done") print('\n' + '-' * 80 + '\n') - # Amalgamate all of the produced scores - scores = Evaluator._create_scores( - task_details=scores_data[0]['task_details'], - environment_details=[ - s['environment_details'] for s in scores_data - ], - scores_omq=np.mean([s['scores']['OMQ'] for s in scores_data]), - scores_avg_pairwise=np.mean( - [s['scores']['avg_pairwise'] for s in scores_data]), - scores_avg_label=np.mean( - [s['scores']['avg_label'] for s in scores_data]), - scores_avg_spatial=np.mean( - [s['scores']['avg_spatial'] for s in scores_data]), - scores_avg_fp_quality=np.mean( - [s['scores']['avg_fp_quality'] for s in scores_data]), - scores_avg_state_quality=(np.mean([ - s['scores']['avg_state_quality'] for s in scores_data - ]) if 'avg_state_quality' in scores_data[0]['scores'] else None), - ) + # Amalgamate all of the produced scores if supported by evaluation + # method + if 'combine' in self.evaluate_fns: + scores = self.evaluate_fns['combine'](scores_data) + print("\nFinal scores for the '%s' task:\n" % + (scores['task_details']['name'] + if 'name' in scores['task_details'] else 'UNKNOWN')) + pprint.pprint(scores['scores']) + else: + scores = scores_data - # Print the results, save them, & finish - print(("\nFinal scores for the '%s:%s:%s' task:\n" % - (scores['task_details']['type'], - scores['task_details']['control_mode'], - scores['task_details']['localisation_mode'])).upper()) - pprint.pprint(scores) + # Save our results & finish with open(self.scores_filename, 'w') as f: json.dump(scores, f) print("\nDone.") diff --git a/benchbot_eval/ground_truth_creator.py b/benchbot_eval/ground_truth_creator.py new file mode 100644 index 0000000..ca39bcf --- /dev/null +++ b/benchbot_eval/ground_truth_creator.py @@ -0,0 +1,31 @@ +from benchbot_addons import manager as bam + +from . import helpers + + +class GroundTruthCreator: + def __init__(self, ground_truth_format_name, environment_name): + self.ground_truth_format = bam.get_match( + "formats", [("name", ground_truth_format_name)], return_data=True) + self.environment = bam.get_match( + "environments", [("name", bam.env_name(environment_name)), + ("variant", bam.env_variant(environment_name))], + return_data=True) + + self._functions = bam.load_functions(self.ground_truth_format) + + def create_empty(self, *args, **kwargs): + return { + 'environment': + self.environment, + 'format': + self.ground_truth_format, + 'ground_truth': (self._functions['create'](*args, **kwargs) + if 'create' in self._functions else {}) + } + + def function(self, name): + return self._functions[name] if name in self._functions else None + + def functions(self): + return self._functions.keys() diff --git a/benchbot_eval/helpers.py b/benchbot_eval/helpers.py new file mode 100644 index 0000000..563a661 --- /dev/null +++ b/benchbot_eval/helpers.py @@ -0,0 +1,41 @@ +import importlib +import json +import os +import re +import sys +import yaml +import zipfile + +DUMP_LOCATION = '/tmp/benchbot_eval_validator_dump' + +SKIP_KEY = "SKIP" + +ZIP_IGNORE = ['submission.json'] + + +def load_results(results_filenames): + # Takes a list of filenames & pulls all data from JSON & *.zip files + # (sorry... nesting abomination...) + results = {} # Dict of provided data, with filenames as keys + for r in results_filenames: + print("Loading data from '%s' ..." 
% r) + if zipfile.is_zipfile(r): + with zipfile.ZipFile(r, 'r') as z: + for f in z.filelist: + if f.filename in ZIP_IGNORE: + print("\tIgnoring file '%s'" % f.filename) + else: + with z.open(f, 'r') as zf: + d = None + try: + print("\tExtracting data from file '%s'" % + f.filename) + results[z.filename + ':' + + f.filename] = json.load(zf) + except: + print("\tSkipping file '%s'" % f.filename) + else: + with open(r, 'r') as f: + results[r] = json.load(f) + print("\tDone.\n") + return results diff --git a/benchbot_eval/iou_tools.py b/benchbot_eval/iou_tools.py deleted file mode 100644 index 5563b31..0000000 --- a/benchbot_eval/iou_tools.py +++ /dev/null @@ -1,262 +0,0 @@ - -# Software License Agreement (BSD License) - -# Modified by Suman Raj Bista -# for Robotic Vision Evaluation and Benchmarking Project at ACRV/QUT -# Adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/datasets/voc_eval.py -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials provided -# with the distribution. -# * Neither the name of ACRV/QUT nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS -# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT -# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS -# FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE -# COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, -# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, -# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER -# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT -# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN -# ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. -# - -import numpy as np -from shapely.geometry import Polygon - - -class IoU: - - def __init__(self): - """ initilise with args - """ - - self.__gt_bb = [] - self.__est_bb = [] - self.__gtd = [] - self.__gtc = [] - self.__ed = [] - self.__ec = [] - - - - def __rotz(self, t): - """Rotation about the z-axis.""" - c = np.cos(t) - s = np.sin(t) - return np.array([[c, -s, 0], - [s, c, 0], - [0, 0, 1]]) - - - - def get_bbox3D(self, box_size, heading_angle, center): - """ Calculate 3D bounding box corners from its parameterization. 
- Input: - box_size: (length,wide,height) , heading_angle: rad , center: (x,y,z) - Output: - corners3D: 3D box corners, vol: volume of 3D bounding box - """ - - l,w,h = box_size - #print(box_size) - vol = (l)*(w)*(h) - - R = self.__rotz(heading_angle) - - corners3D = 0.5 * np.array([[-l , -l, l , l , -l , -l , l , l ], - [w , -w , -w , w , w , -w , -w , w ], - [-h , -h , -h , -h , h , h , h , h ], - ]) - - #corners3D = np.dot(R,corners3D) - - corners3D[0,:] = corners3D[0,:] + center[0] - corners3D[1,:] = corners3D[1,:] + center[1] - corners3D[2,:] = corners3D[2,:] + center[2] - - #print(corners3D) - return corners3D, vol - - - - def get_boundingCuboid(self, extent, centroid): - """ Calculate 3D bounding box corners from centre and 3 dimensions - """ - corners3D, vol = self.get_bbox3D(extent,0.0,centroid) - return corners3D - - - - def get_boundingbox_bev(self, bbox_3D): - return Polygon(np.transpose(bbox_3D[(0,1), 0:4])) - - def __get_boundingVolume(self, bbox_3D): - l = np.max(bbox_3D[0]) - np.min(bbox_3D[0]) - b = np.max(bbox_3D[1]) - np.min(bbox_3D[1]) - h = np.max(bbox_3D[2]) - np.min(bbox_3D[2]) - return l*b*h - - - def __get_ovelapHeight(self, bbox_3Da, bbox_3Db): - """ - Calculate overlapping height between two 3D bounding box - Input: 8 corners of bounding box - Output: overlapping height - """ - - h_a_min = np.min(bbox_3Da[2]) - h_a_max = np.max(bbox_3Da[2]) - - h_b_min = np.min(bbox_3Db[2]) - h_b_max = np.max(bbox_3Db[2]) - - max_of_min = np.max([h_a_min, h_b_min]) - min_of_max = np.min([h_a_max, h_b_max]) - - hoverlap = np.max([0, min_of_max - max_of_min]) - #print("h_overlap = ", hoverlap) - - return hoverlap - - - - def __get_intersestionin2D(self, poly_a, poly_b): - """ - Calculate overlapping area between two polygons - Input: 2 polygons in 2d - Output: overlapping polygon and area of overlap - """ - poly_xy_int = poly_a.intersection(poly_b) - return poly_xy_int, poly_xy_int.area - - - - def __get_polygonXY(self, bbox3D): - """ - birdeye view of 3d bbox - Input: 3d box - Output: projection of 3d bbox in xyplane and area of the polygon in xy - """ - - poly_xy = Polygon(zip(*bbox3D[0:2, 0:4])) - return poly_xy, poly_xy.area - - - - def __iou(self, set_a, set_b, set_intersect): - union = set_a + set_b - set_intersect - return set_intersect / union if union else 0. 
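# Illustrative sketch only (not part of this module): the axis-aligned 3D IoU
# that the helpers above compute, with 'extent' taken as the full
# width/height/depth of each cuboid as described in the README.
import numpy as np

def aabb_iou_3d(centroid_a, extent_a, centroid_b, extent_b):
    ca, ea = np.asarray(centroid_a), np.asarray(extent_a)
    cb, eb = np.asarray(centroid_b), np.asarray(extent_b)
    lo = np.maximum(ca - ea / 2, cb - eb / 2)    # per-axis overlap lower bounds
    hi = np.minimum(ca + ea / 2, cb + eb / 2)    # per-axis overlap upper bounds
    inter = np.prod(np.clip(hi - lo, 0, None))   # intersection volume
    union = np.prod(ea) + np.prod(eb) - inter
    return inter / union if union else 0.0

# Two unit cubes offset by 0.5 m along x: IoU = 0.5 / (1 + 1 - 0.5) = 1/3
print(aabb_iou_3d([0, 0, 0], [1, 1, 1], [0.5, 0, 0], [1, 1, 1]))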
- - - - def cal_IoU(self, bbox_3Da, bbox_3Db): - - vol_a = self.__get_boundingVolume(bbox_3Da) - vol_b = self.__get_boundingVolume(bbox_3Db) - - poly_a, area_a = self.__get_polygonXY(bbox_3Da) - poly_b, area_b = self.__get_polygonXY(bbox_3Db) - - poly_xy, area_xy = self.__get_intersestionin2D(poly_a, poly_b) - - h_overlap = self.__get_ovelapHeight(bbox_3Da, bbox_3Db) - - vol_ab = h_overlap * area_xy - - iou_bev = self.__iou(area_a, area_b, area_xy) - iou_3D = self.__iou(vol_a, vol_b, vol_ab) - - return iou_bev, iou_3D - - - def calculate(self, box_size_gt, center_gt, box_size_est, center_est): - - corners_gt, v_gt = self.get_bbox3D(box_size_gt, 0.0 ,center_gt) - corners_est, v_est = self.get_bbox3D(box_size_est, 0.0 ,center_est) - - - poly_gt, area_gt = self.__get_polygonXY(corners_gt) - poly_est, area_est = self.__get_polygonXY(corners_est) - - poly_int_xy, area_int_xy = self.__get_intersestionin2D(poly_gt, poly_est) - - h_int = self.__get_ovelapHeight(corners_gt, corners_est) - - vol_int = h_int * area_int_xy - - iou_bev = self.__iou(area_gt, area_est, area_int_xy) - iou_3D = self.__iou(v_gt, v_est, vol_int) - - - #print(v_gt, v_est, vol_int, area_gt, area_est, area_int_xy, h_int) - - return iou_bev, iou_3D - - - def dict_iou(self, dict1, dict2): - corners1, v1 = self.get_bbox3D(dict1['extent'], 0.0 ,dict1['centroid']) - corners2, v2 = self.get_bbox3D(dict2['extent'], 0.0 ,dict2['centroid']) - - - poly1, area1 = self.__get_polygonXY(corners1) - poly2, area2 = self.__get_polygonXY(corners2) - - poly_int_xy, area_int_xy = self.__get_intersestionin2D(poly1, poly2) - - h_int = self.__get_ovelapHeight(corners1, corners2) - - vol_int = h_int * area_int_xy - - iou_bev = self.__iou(area1, area2, area_int_xy) - iou_3D = self.__iou(v1, v2, vol_int) - - - #print(v_gt, v_est, vol_int, area_gt, area_est, area_int_xy, h_int) - - return iou_bev, iou_3D - - def dict_prop_fraction(self, prop_dict, gt_dict): - """ - Function for determining the proportion of a proposal is covered by the ground-truth - :param prop_dict: - :param gt_dict: - :return: - """ - corners1, v1 = self.get_bbox3D(prop_dict['extent'], 0.0, prop_dict['centroid']) - corners2, v2 = self.get_bbox3D(gt_dict['extent'], 0.0, gt_dict['centroid']) - - poly1, area1 = self.__get_polygonXY(corners1) - poly2, area2 = self.__get_polygonXY(corners2) - - poly_int_xy, area_int_xy = self.__get_intersestionin2D(poly1, poly2) - - h_int = self.__get_ovelapHeight(corners1, corners2) - - vol_int = h_int * area_int_xy - - return vol_int / v1 - - - - - - - - - - - - diff --git a/benchbot_eval/omq.py b/benchbot_eval/omq.py deleted file mode 100644 index 949b9f9..0000000 --- a/benchbot_eval/omq.py +++ /dev/null @@ -1,508 +0,0 @@ -from __future__ import absolute_import, division, print_function, unicode_literals - -import numpy as np -from scipy.optimize import linear_sum_assignment -from scipy.stats import gmean -from . import iou_tools - -_IOU_TOOL = iou_tools.IoU() -_STATE_IDS = {"added": 0, "removed": 1, "constant": 2} - -# NOTE For now we will ignore the concept of foreground and background quality in favor of -# spatial quality being just the IoU of a detection. - - -class OMQ(object): - """ - Class for evaluating results from a Semantic SLAM system. 
- Method is object map quality (OMQ) - Based upon the probability-based detection quality (PDQ) system found below - https://github.com/david2611/pdq_evaluation - """ - - def __init__(self, scd_mode=False): - """ - Initialisation function for OMQ evaluator - :param: scd_mode: flag for whether OMQ is evaluating a scene change detection system which has - an extra consideration made for the uncertainty in the map for the state of an object (added, removed, same) - """ - super(OMQ, self).__init__() - self._tot_overall_quality = 0.0 - self._tot_spatial_quality = 0.0 - self._tot_label_quality = 0.0 - self._tot_fp_cost = 0.0 - self._tot_TP = 0 - self._tot_FP = 0 - self._tot_FN = 0 - self._tot_state_quality = 0.0 - self.scd_mode = scd_mode - - def reset(self): - """ - Reset all internally stored evaluation measures to zero. - :return: None - """ - self._tot_overall_quality = 0.0 - self._tot_spatial_quality = 0.0 - self._tot_label_quality = 0.0 - self._tot_fp_cost = 0.0 - self._tot_TP = 0 - self._tot_FP = 0 - self._tot_FN = 0 - self._tot_state_quality = 0.0 - - def add_map_eval(self, gt_objects, proposed_objects): - """ - Adds a single map's object proposals and ground-truth to the overall evaluation analysis. - :param gt_objects: list of ground-truth dictionaries present in the given map. - :param proposed_objects: list of detection dictionaries objects provided for the given map - :return: None - """ - results = _calc_qual_map(gt_objects, proposed_objects, self.scd_mode) - self._tot_overall_quality += results['overall'] - self._tot_spatial_quality += results['spatial'] - self._tot_label_quality += results['label'] - self._tot_fp_cost += results['fp_cost'] - self._tot_TP += results['TP'] - self._tot_FP += results['FP'] - self._tot_FN += results['FN'] - self._tot_state_quality += results['state_change'] - - def get_current_score(self): - """ - Get the current score for all scenes analysed at the current time. - :return: The OMQ score across all maps as a float - (average pairwise quality over the number of object-detection pairs observed with FPs weighted by confidence). - """ - denominator = self._tot_TP + self._tot_fp_cost + self._tot_FN - if denominator > 0.0: - return self._tot_overall_quality / denominator - return 0.0 - - def score(self, param_lists): - """ - Calculates the average quality score for a set of object proposals on - a set of ground truth objects over a series of maps. - The average is calculated as the average pairwise quality over the number of object-detection pairs observed - with FPs weighted by confidence. - Note that this removes any evaluation information that had been stored for previous maps. - Assumes you want to score just the full list you are given. - :param param_lists: A list of tuples where each tuple holds a list of ground-truth dicts and a list of - detection dicts. Each map observed is an entry in the main list. 
- :return: The OMQ score across all maps as a float - """ - self.reset() - - for map_params in param_lists: - map_results = self._get_map_evals(map_params) - self._tot_overall_quality += map_results['overall'] - self._tot_spatial_quality += map_results['spatial'] - self._tot_label_quality += map_results['label'] - self._tot_fp_cost += map_results['fp_cost'] - self._tot_TP += map_results['TP'] - self._tot_FP += map_results['FP'] - self._tot_FN += map_results['FN'] - self._tot_state_quality += map_results['state_change'] - - return self.get_current_score() - - def get_avg_spatial_score(self): - """ - Get the average spatial quality score for all assigned object proposals in all maps analysed at the current time. - Note that this is averaged over the number of assigned object proposals (TPs) and not the full set of TPs, FPs, - and FNs like the final Semantic SLAM OMQ score. - :return: average spatial quality of every assigned detection - """ - if self._tot_TP > 0.0: - return self._tot_spatial_quality / float(self._tot_TP) - return 0.0 - - def get_avg_label_score(self): - """ - Get the average label quality score for all assigned object proposals in all maps analysed at the current time. - Note that this is averaged over the number of assigned object proposals (TPs) and not the full set of TPs, FPs, - and FNs like the final OMQ score. - :return: average label quality of every assigned detection - """ - if self._tot_TP > 0.0: - return self._tot_label_quality / float(self._tot_TP) - return 0.0 - - def get_avg_state_score(self): - """ - Get the average state quality score for all assigned object proposals in all maps analysed at the current time. - Note that this is averaged over the number of assigned object proposals (TPs) and not the full set of TPs, FPs, - and FNs like the final OMQ score. - :return: average label quality of every assigned detection - """ - if self._tot_TP > 0.0: - return self._tot_state_quality / float(self._tot_TP) - return 0.0 - - def get_avg_overall_quality_score(self): - """ - Get the average overall pairwise quality score for all assigned object proposals - in all maps analysed at the current time. - Note that this is averaged over the number of assigned object proposals (TPs) and not the full set of TPs, FPs, - and FNs like the final OMQ score. - :return: average overall pairwise quality of every assigned detection - """ - if self._tot_TP > 0.0: - return self._tot_overall_quality / float(self._tot_TP) - return 0.0 - - def get_avg_fp_score(self): - """ - Get the average quality (1 - cost) for all false positive object proposals across all maps analysed at the - current time. - Note that this is averaged only over the number of FPs and not the full set of TPs, FPs, and FNs. - Note that at present false positive cost/quality is based solely upon label scores. - :return: average false positive quality score - """ - if self._tot_FP > 0.0: - return (self._tot_FP - self._tot_fp_cost) / self._tot_FP - return 1.0 - - def get_assignment_counts(self): - """ - Get the total number of TPs, FPs, and FNs across all frames analysed at the current time. 
- :return: tuple containing (TP, FP, FN) - """ - return self._tot_TP, self._tot_FP, self._tot_FN - - def _get_map_evals(self, parameters): - """ - Evaluate the results for a given image - :param parameters: tuple containing list of ground-truth dicts and object proposal dicts - :return: results dictionary containing total overall quality, total spatial quality on positively assigned - object proposals, total label quality on positively assigned object proposals, - total false positive cost for each false positive object proposal, number of true positives, - number of false positives, number false negatives, and total state quality (relevant only in SCD). - Format {'overall':, 'spatial': , 'label': , - 'fp_cost': , 'TP': , 'FP': , - 'FN': , 'state_change': } - """ - gt_objects, proposed_objects = parameters - results = _calc_qual_map(gt_objects, proposed_objects, self.scd_mode) - return results - - -def _vectorize_map_gts(gt_objects, scd_mode): - """ - Vectorizes the required elements for all ground-truth object dicts as necessary for a given map. - These elements are the ground-truth cuboids, class ids and state ids if given (0: added, 1: removed, 2: constant) - :param gt_objects: list of all ground-truth dicts for a given image - :return: (gt_cuboids, gt_labels, gt_state_ids). - gt_cuboids: g length list of cuboid centroids and extents stored as dictionaries for each of the - g ground-truth dicts. (dictionary format: {'centroid': , 'extent': }) - gt_labels: g, numpy array of class labels as an integer for each of the g ground-truth dicts - """ - gt_labels = np.array([gt_obj['class_id'] for gt_obj in gt_objects], - dtype=np.int) # g, - gt_cuboids = [{ - "centroid": gt_obj['centroid'], - "extent": gt_obj["extent"] - } for gt_obj in gt_objects] # g, - if scd_mode: - gt_state_ids = [_STATE_IDS[gt_obj["state"]] for gt_obj in gt_objects] - else: - gt_state_ids = [-1 for gt_obj in gt_objects] - return gt_cuboids, gt_labels, gt_state_ids - - -def _vectorize_map_props(proposed_objects, scd_mode): - """ - Vectorize the required elements for all object proposal dicts as necessary for a given map. - These elements are the detection boxes, the probability distributions for the labels, and the probability - distribution of the state change estimations (if provided). - :param proposed_objects: list of all proposed object dicts for a given image. - :return: (prop_cuboids, prop_class_probs, det_) - prop_cuboids: p length list of cuboid centroids and extents stored as dictionaries for the p object proposals - (dictionary format: {'centroid': , 'extent': }) - prop_class_probs: p x c numpy array of class label probability scores across all c classes for each of the - p object proposals - prop_state_probs: p x 3 numpy array of state probability scores across all 3 object states for each of the - p object proposals. Note this is only relevant for SCD and states are (in order) [added, removed, same]. 
- """ - prop_class_probs = np.stack( - [np.array(prop_obj['label_probs']) for prop_obj in proposed_objects], - axis=0) # d x c - prop_cuboids = [{ - "centroid": prop_obj['centroid'], - "extent": prop_obj["extent"] - } for prop_obj in proposed_objects] # d, - if scd_mode: - # Currently assuming that format of state_probs is 3 dimensional - # format [, , ] - prop_state_probs = np.stack([ - np.array(prop_obj['state_probs']) for prop_obj in proposed_objects - ], - axis=0) # d x 3 - else: - prop_state_probs = np.ones((len(proposed_objects), 3)) * -1 - return prop_cuboids, prop_class_probs, prop_state_probs - - -def _calc_spatial_qual(gt_cuboids, prop_cuboids): - """ - Calculate the spatial quality for all object proposals on all ground truth objects for a given map. - :param: gt_cuboids: g length list of all ground-truth cuboid dictionaries defining centroid and extent of objects - :param: prop_cuboids: p length list of all proposed object cuboid dictionaries defining centroid and - extent of objects - :return: spatial_quality: g x p spatial quality score between zero and one for each possible combination of - g ground truth objects and p object proposals. - """ - # TODO optimize in some clever way in future rather than two for loops - spatial_quality = np.zeros((len(gt_cuboids), len(prop_cuboids)), - dtype=np.float) # g x d - for gt_id, gt_cub_dict in enumerate(gt_cuboids): - for prop_id, prop_cub_dict in enumerate(prop_cuboids): - spatial_quality[gt_id, prop_id] = _IOU_TOOL.dict_iou( - gt_cub_dict, prop_cub_dict)[1] - - return spatial_quality - - -def _calc_label_qual(gt_labels, prop_class_probs): - """ - Calculate the label quality for all object proposals on all ground truth objects for a given image. - :param gt_labels: g, numpy array containing the class label as an integer for each ground-truth object. - :param prop_class_probs: p x c numpy array of class label probability scores across all c classes - for each of the p object proposals. - :return: label_qual_mat: g x p label quality score between zero and one for each possible combination of - g ground truth objects and p object proposals. - """ - label_qual_mat = prop_class_probs[:, gt_labels].T.astype( - np.float32) # g x d - return label_qual_mat - - -def _calc_state_change_qual(gt_state_ids, prop_state_probs): - """ - Calculate the label quality for all object proposals on all ground truth objects for a given image. - :param gt_state_ids: g, numpy array containing the state label as an integer for each object. - (0 = added, 1 = removed, 2 = same) - :param prop_state_probs: p x 3 numpy array of label probability scores across all 3 states - for each of the d object proposals. - :return: state_qual_mat: g x p state quality score between zero and one for each possible combination of - g ground truth objects and p object proposals. - """ - # if we are not doing scd we return None - if np.max(gt_state_ids) == -1: - return None - state_qual_mat = prop_state_probs[:, gt_state_ids].T.astype( - np.float32) # g x d - return state_qual_mat - - -def _calc_overall_qual(label_qual_mat, spatial_qual_mat, state_qual_mat): - """ - Calculate the overall quality for all object proposals on all ground truth objects for a given image - :param label_qual_mat: g x p label quality score between zero and one for each possible combination of - g ground truth objects and p object proposals. - :param spatial_qual_mat: g x p spatial quality score between zero and one for each possible combination of - g ground truth objects and p object proposals. 
- :return: overall_qual_mat: g x p overall label quality between zero and one for each possible combination of - g ground truth objects and p object proposals. - """ - # For both the Semantic SLAM and SCD case, combined quality is the geometric mean of evaluated quality components - if state_qual_mat is None: - combined_mat = np.dstack((label_qual_mat, spatial_qual_mat)) - else: - combined_mat = np.dstack( - (label_qual_mat, spatial_qual_mat, state_qual_mat)) - - # Calculate the geometric mean between all qualities - # Note we ignore divide by zero warnings here for log(0) calculations internally. - with np.errstate(divide='ignore'): - overall_qual_mat = gmean(combined_mat, axis=2) - - return overall_qual_mat - - -def _gen_cost_tables(gt_objects, object_proposals, scd_mode): - """ - Generate the cost tables containing the cost values (1 - quality) for each combination of ground truth objects and - object proposals within a given map. - :param gt_objects: list of all ground-truth object dicts for a given map. - :param object_proposals: list of all object proposal dicts for a given map. - :return: dictionary of g x p cost tables for each combination of ground truth objects and object proposals. - Note that all costs are simply 1 - quality scores (required for Hungarian algorithm implementation) - Format: {'overall': overall summary cost table, 'spatial': spatial quality cost table, - 'label': label quality cost table, 'state': state cost table} - """ - # Initialise cost tables - n_pairs = max(len(gt_objects), len(object_proposals)) - overall_cost_table = np.ones((n_pairs, n_pairs), dtype=np.float32) - spatial_cost_table = np.ones((n_pairs, n_pairs), dtype=np.float32) - label_cost_table = np.ones((n_pairs, n_pairs), dtype=np.float32) - state_cost_table = np.ones((n_pairs, n_pairs), dtype=np.float32) - - # Generate all the matrices needed for calculations - gt_cuboids, gt_labels, gt_state_ids = _vectorize_map_gts( - gt_objects, scd_mode) - prop_cuboids, prop_class_probs, prop_state_probs = _vectorize_map_props( - object_proposals, scd_mode) - - # Calculate spatial, label and state qualities (state only used in SCD) - label_qual_mat = _calc_label_qual(gt_labels, prop_class_probs) - spatial_qual = _calc_spatial_qual(gt_cuboids, prop_cuboids) - state_change_qual = _calc_state_change_qual(gt_state_ids, prop_state_probs) - - # Generate the overall cost table (1 - overall quality) - overall_cost_table[:len(gt_objects - ), :len(object_proposals)] -= _calc_overall_qual( - label_qual_mat, spatial_qual, state_change_qual) - - # Generate the spatial, label and (optionally) state cost tables - spatial_cost_table[:len(gt_objects), :len(object_proposals - )] -= spatial_qual - label_cost_table[:len(gt_objects), :len(object_proposals - )] -= label_qual_mat - if scd_mode: - state_cost_table[:len(gt_objects), :len(object_proposals - )] -= state_change_qual - - return { - 'overall': overall_cost_table, - 'spatial': spatial_cost_table, - 'label': label_cost_table, - 'state': state_cost_table - } - - -def _calc_qual_map(gt_objects, object_proposals, scd_mode): - """ - Calculates the sum of qualities for the best matches between ground truth objects and object proposals for a map. - Each ground truth object can only be matched to a single object proposal and vice versa as an gt-proposal pair. - Note that if a ground truth object or proposal does not have a match, the quality is counted as zero. 
- This represents a theoretical gt-proposal pair with the object or detection and a counterpart which - does not describe it at all. - Any provided proposal with a zero-quality match will be counted as a false positive (FP). - Any ground-truth object with a zero-quality match will be counted as a false negative (FN). - All other matches are counted as "true positives" (TP) - If there are no ground-truth objects or object proposals for the map, the system returns zero and this map - will not contribute to average score. - :param gt_objects: list of ground-truth dictionaries describing the ground truth objects in the current map. - :param object_proposals: list of object proposal dictionaries describing the object proposals for the current map. - :return: results dictionary containing total overall spatial quality, total spatial quality on positively assigned - object proposals, total label quality on positively assigned object proposals, total false positive cost, - number of true positives, number of false positives, number false negatives, and total state change quality on - positively assigned object proposals (relevant only for SCD). - Format {'overall':, 'spatial': , 'label': , - 'fp_cost': , 'TP': , 'FP': , 'FN': , - 'state_change': } - """ - - tot_fp_cost = 0.0 - # if there are no object proposals or gt instances respectively the quality is zero - if len(gt_objects) == 0 or len(object_proposals) == 0: - if len(object_proposals) > 0: - # Calculate FP quality - # NOTE background class is the final class in the distribution which is ignored when calculating FP cost - tot_fp_cost = np.sum([ - np.max(det_instance['label_probs'][:-1]) - for det_instance in object_proposals - ]) - - return { - 'overall': 0.0, - 'spatial': 0.0, - 'label': 0.0, - 'fp_cost': tot_fp_cost, - 'TP': 0, - 'FP': len(object_proposals), - 'FN': len(gt_objects), - 'state_change': 0.0 - } - - # For each possible pairing, calculate the quality of that pairing and convert it to a cost - # to enable use of the Hungarian algorithm. 
- cost_tables = _gen_cost_tables(gt_objects, object_proposals, scd_mode) - - # Use the Hungarian algorithm with the cost table to find the best match between ground truth - # object and detection (lowest overall cost representing highest overall pairwise quality) - row_idxs, col_idxs = linear_sum_assignment(cost_tables['overall']) - - # Transform the loss tables back into quality tables with values between 0 and 1 - overall_quality_table = 1 - cost_tables['overall'] - spatial_quality_table = 1 - cost_tables['spatial'] - label_quality_table = 1 - cost_tables['label'] - state_change_quality_table = 1 - cost_tables['state'] - - # Go through all optimal assignments and summarize all pairwise statistics - # Calculate the number of TPs, FPs, and FNs for the image during the process - true_positives = 0 - false_positives = 0 - false_negatives = 0 - false_positive_idxs = [] - - for match_idx, match in enumerate(zip(row_idxs, col_idxs)): - row_id, col_id = match - # Handle "true positives" - if overall_quality_table[row_id, col_id] > 0: - true_positives += 1 - else: - # Handle false negatives - if row_id < len(gt_objects): - false_negatives += 1 - # Handle false positives - if col_id < len(object_proposals): - # Check if false positive is actually a proposal of an isgroup object that has a better match - # only ignored if best match otherwise would be isgroup object (non-zero quality) and max class matches - if np.sum(overall_quality_table[:, col_id]) > 0: - # check if max match class is a grouped object - best_gt_idx = np.argmax(overall_quality_table[:, col_id]) - if 'isgroup' in gt_objects[best_gt_idx] and gt_objects[best_gt_idx]['isgroup']: - # check if the class of the proposal matches the class of the object - # (ignoring final class which should be background) - if np.argmax(object_proposals[col_id]['label_probs'][:-1]) == gt_objects[best_gt_idx]['class_id']: - # Check if at least 50% of the proposal is within the ground-truth object - if _IOU_TOOL.dict_prop_fraction(object_proposals[col_id], gt_objects[best_gt_idx]) >= 0.5: - # if all criteria met, skip this detection in both fp quality and number of fps - continue - false_positives += 1 - false_positive_idxs.append(col_id) - - # Calculate the sum of quality at the best matching pairs to calculate total qualities for the image - tot_overall_img_quality = np.sum(overall_quality_table[row_idxs, col_idxs]) - - # Force spatial and label qualities to zero for total calculations as there is no actual association between - # object proposals and therefore no TP when this is the case. 
- spatial_quality_table[overall_quality_table == 0] = 0.0 - label_quality_table[overall_quality_table == 0] = 0.0 - state_change_quality_table[overall_quality_table == 0] = 0.0 - - # Calculate the sum of spatial and label qualities only for TP samples - tot_tp_spatial_quality = np.sum(spatial_quality_table[row_idxs, col_idxs]) - tot_tp_label_quality = np.sum(label_quality_table[row_idxs, col_idxs]) - tot_tp_state_change_quality = np.sum( - state_change_quality_table[row_idxs, col_idxs]) - - # Calculate the penalty for assigning a high label probability to false positives - # NOTE background class is final class in the class list and is not considered - # This will be the geometric mean between the maximum label quality and maximum state estimated (ignore same) - if scd_mode: - with np.errstate(divide='ignore'): - tot_fp_cost = np.sum([ - gmean([ - np.max(object_proposals[i]['label_probs'][:-1]), - np.max(object_proposals[i]['state_probs'][:-1]) - ]) for i in false_positive_idxs - ]) - else: - tot_fp_cost = np.sum([ - np.max(object_proposals[i]['label_probs'][:-1]) - for i in false_positive_idxs - ]) - - return { - 'overall': tot_overall_img_quality, - 'spatial': tot_tp_spatial_quality, - 'label': tot_tp_label_quality, - 'fp_cost': tot_fp_cost, - 'TP': true_positives, - 'FP': false_positives, - 'FN': false_negatives, - 'state_change': tot_tp_state_change_quality - } diff --git a/benchbot_eval/validator.py b/benchbot_eval/validator.py new file mode 100644 index 0000000..3ba1d29 --- /dev/null +++ b/benchbot_eval/validator.py @@ -0,0 +1,120 @@ +import pickle +import textwrap + +from benchbot_addons import manager as bam + +from . import helpers + + +class Validator: + def __init__(self, + results_filenames, + required_task=None, + required_envs=None): + self.results_data = helpers.load_results(results_filenames) + + self.required_task = required_task + self.required_envs = required_envs + + self.validate_results_data() + self.dump() + + def _validate_result(self, result_data): + # Attempt to load the results format + format_data = bam.get_match( + "formats", + [("name", result_data['task_details']['results_format'])], + return_data=True) + assert format_data, textwrap.fill( + "Results declare their format as '%s', " + "but this format isn't installed in your BenchBot installation" % + result_data['task_details']['results_format'], 80) + + # Call the validation function if it exists + fns = bam.load_functions(format_data) + if not fns or 'validate' not in fns: + print("\t\tWARNING: skipping format validation " + "('%s' has no validate fn)" % format_data['name']) + return + fns['validate'](result_data['results']) + + def dump(self): + with open(helpers.DUMP_LOCATION, 'wb') as f: + pickle.dump(self, f) + + def validate_results_data(self): + print("Validating data in %d results files:" % + len(self.results_data.keys())) + for k, v in self.results_data.items(): + print("\t%s ..." 
% k)
+            assert ('task_details' in v and 'name' in v['task_details']
+                    and 'results_format' in v['task_details']), (
+                        "Results are missing the bare minimum task details "
+                        "('name' & 'results_format')")
+            assert (
+                'environment_details' in v
+                and len(v['environment_details']) >= 1
+                and all('name' in e for e in v['environment_details'])), (
+                    "Results are missing the bare minimum environment details "
+                    "(at least 1 item with a 'name' field)")
+            assert all(
+                e['name'] == v['environment_details'][0]['name']
+                for e in v['environment_details']), (
+                    "Results have multiple environments, rather than multiple "
+                    "variants of a single environment")
+            assert 'results' in v, "Results are missing a 'results' field"
+            self._validate_result(v)
+            print("\t\tPassed.")
+            v[helpers.SKIP_KEY] = False
+
+        if self.required_task:
+            print(
+                "\nFollowing results will be skipped due to 'required_task=%s':"
+                % self.required_task)
+            for k, v in self.results_data.items():
+                v[helpers.SKIP_KEY] = (v['task_details']['name'] !=
+                                       self.required_task)
+                if v[helpers.SKIP_KEY]:
+                    print("\t%s" % k)
+
+        if self.required_envs:
+            print("\n%s\n%s" %
+                  ("Following results will be skipped due to",
+                   textwrap.fill("'required_envs=%s':" % self.required_envs,
+                                 80)))
+            env_strings = {
+                k: bam.env_string(v['environment_details'])
+                for k, v in self.results_data.items()
+            }
+            skipped = [
+                k for k, v in self.results_data.items()
+                if env_strings[k] not in self.required_envs
+            ]
+            if skipped:
+                print("\t", end="")
+                print("\n\t".join("%s (%s)" % (s, env_strings[s])
+                                  for s in skipped))
+                for s in skipped:
+                    self.results_data[s][helpers.SKIP_KEY] = True
+            else:
+                print("\tNone.")
+
+            missed = [
+                e for e in self.required_envs if e not in env_strings.values()
+            ]
+            if missed:
+                print("\n%s" % textwrap.fill(
+                    "WARNING: the following required environments have "
+                    "no results (an empty result will be used instead):", 80))
+                print("\t", end="")
+                print("\n\t".join(missed))
+
+        tasks = list(
+            set([
+                r['task_details']['name'] for r in self.results_data.values()
+            ]))
+        assert len(tasks) == 1, textwrap.fill(
+            "Results are not all for the same task, "
+            "& therefore can't be evaluated together. 
" + "Received results for tasks '%s' " % tasks, 80) + return True diff --git a/docs/scd_obmap.png b/docs/scd_obmap.png deleted file mode 100644 index 26502e5..0000000 Binary files a/docs/scd_obmap.png and /dev/null differ diff --git a/docs/semantic_slam_obmap.png b/docs/semantic_slam_obmap.png deleted file mode 100644 index ada3140..0000000 Binary files a/docs/semantic_slam_obmap.png and /dev/null differ diff --git a/setup.py b/setup.py index 314795b..df8450b 100644 --- a/setup.py +++ b/setup.py @@ -3,25 +3,24 @@ with open("README.md", "r") as fh: long_description = fh.read() -setup( - name='benchbot_eval', - version='0.1.3', - author='Ben Talbot', - author_email='b.talbot@qut.edu.au', - description= - 'The BenchBot evaluation suite for use with the ACRV Scene Understanding Challenge', - long_description=long_description, - long_description_content_type='text/markdown', - packages=find_packages(), - install_requires=['numpy', 'shapely'], - classifiers=( - "Programming Language :: Python :: 2", - "Programming Language :: Python :: 2.7", - "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.3", - "Programming Language :: Python :: 3.4", - "Programming Language :: Python :: 3.5", - "Programming Language :: Python :: 3.6", - "License :: OSI Approved :: BSD License", - "Operating System :: OS Independent", - )) +setup(name='benchbot_eval', + version='2.0.0', + author='Ben Talbot', + author_email='b.talbot@qut.edu.au', + description= + 'The BenchBot evaluation tools for use with the BenchBot software stack', + long_description=long_description, + long_description_content_type='text/markdown', + packages=find_packages(), + install_requires=['numpy', 'shapely', 'benchbot_addons'], + classifiers=( + "Programming Language :: Python :: 2", + "Programming Language :: Python :: 2.7", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.3", + "Programming Language :: Python :: 3.4", + "Programming Language :: Python :: 3.5", + "Programming Language :: Python :: 3.6", + "License :: OSI Approved :: BSD License", + "Operating System :: OS Independent", + ))