Basic random forest

Before you begin...

Download and install Titus. This article was tested with Titus 0.7.1; newer versions should work with no modification. Python >= 2.6 and < 3.0 is required.
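
Titus has historically been published to PyPI, so installation may be as simple as the following (the pinned version assumes the 0.7.1 release is available in your package index; otherwise, run python setup.py install in the titus directory of the Hadrian repository):

pip install "titus==0.7.1"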

Launch a Python prompt and import the following:

Python 2.7.6
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> import csv
>>> import math
>>> 
>>> import titus.prettypfa
>>> from titus.genpy import PFAEngine

And download the Iris dataset.
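
The scripts below expect a file named iris.csv in the working directory, with a header row and class labels prefixed with "Iris-". Its first lines look like this (the column names shown are illustrative; only the column order matters, since the header row is skipped):

sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...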

The basic form

Scoring a random forest involves scoring every tree and reporting the most frequent result (a.mode implements the majority vote). The tree-scoring process is identical to the one in the decision tree example. It is also common to report the fraction of trees that voted for each class; a.count and a.len can compute that fraction, as sketched after the model below.

pfaDocument = titus.prettypfa.jsonNode('''
types:
  Input = record(Input,
                 sepal_length: double,
                 sepal_width: double,
                 petal_length: double,
                 petal_width: double);

  TreeNode = record(TreeNode,
                    field: enum([sepal_length, sepal_width, petal_length, petal_width]),
                    operator: string,
                    value: double,
                    pass: union(string, TreeNode),
                    fail: union(string, TreeNode))

input: Input
output: string
cells:
  forest(array(TreeNode)) = []
action:
  var scores = a.map(forest, fcn(tree: TreeNode -> string) {
    model.tree.simpleWalk(input, tree, fcn(d: Input, t: TreeNode -> boolean) {
      model.tree.simpleTest(d, t)
    })
  });
  a.mode(scores)    // majority vote
''')
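
The same scores array can also be used to report vote fractions. Here is a sketch of the relevant expression (not part of the model above), using a.count to count occurrences of a value and a.len for the array length:

  var fraction = a.count(scores, "setosa") / a.len(scores);   // fraction voting "setosa"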

Producing a model

The following code is a literal copy of the decision tree builder with tweaks to turn it into a random forest. The CART decision tree algorithm optimizes its choice of field to split on and the cut point; random forests randomize the choice of field and optimize the cut point for that field. They also stop building after a certain depth (to make an ensemble of weak classifiers).

class Datum(object):
    def __init__(self, sepal_length, sepal_width, petal_length, petal_width, truth):
        self.sepal_length = float(sepal_length)
        self.sepal_width = float(sepal_width)
        self.petal_length = float(petal_length)
        self.petal_width = float(petal_width)
        self.truth = truth.replace("Iris-", "")

def attemptSplit(field, data):
    # Candidate cut points lie halfway between adjacent observed values of the
    # chosen field; each candidate is scored by the class entropy of the
    # passing and failing subsets it produces.
    bestGain = None
    bestSplit = None
    values = sorted(set(getattr(x, field) for x in data))
    splits = [(low + high)/2.0 for low, high in zip(values[:-1], values[1:])]
    for split in splits:
        gainPass = 0.0
        gainFail = 0.0
        for truth in "setosa", "versicolor", "virginica":
            numer = sum(x.truth == truth for x in data if getattr(x, field) < split)
            denom = sum(getattr(x, field) < split for x in data)
            if numer > 0.0:
                p = numer / float(denom)
                gainPass -= p * math.log(p, 2)
            numer = sum(x.truth == truth for x in data if getattr(x, field) >= split)
            denom = sum(getattr(x, field) >= split for x in data)
            if numer > 0.0:
                p = numer / float(denom)
                gainFail -= p * math.log(p, 2)
        gain = gainPass + gainFail
        if bestGain is None or gain > bestGain:
            bestGain = gain
            bestSplit = split
    if bestGain is None:
        return 0.0, 0.0
    else:
        return bestGain, bestSplit

def predominant(data):
    # Most common class label in data (ties and empty data fall through to virginica).
    setosa = sum(x.truth == "setosa" for x in data)
    versicolor = sum(x.truth == "versicolor" for x in data)
    virginica = sum(x.truth == "virginica" for x in data)
    if setosa > versicolor and setosa > virginica:
        return "setosa"
    elif versicolor > setosa and versicolor > virginica:
        return "versicolor"
    else:
        return "virginica"

def makeTree(data, depth):
    # The random forest twist: pick the field at random, then optimize the cut
    # point for that field.
    field = random.choice(["sepal_length", "sepal_width", "petal_length", "petal_width"])
    gain, split = attemptSplit(field, data)
    dataPass = [x for x in data if getattr(x, field) < split]
    dataFail = [x for x in data if getattr(x, field) >= split]
    if depth == 0:
        # At maximum depth, terminate each branch with its predominant class.
        branchPass = {"string": predominant(dataPass)}
        branchFail = {"string": predominant(dataFail)}
    else:
        branchPass = makeTree(dataPass, depth - 1)
        branchFail = makeTree(dataFail, depth - 1)
    return {"TreeNode": {"field": field,
                         "operator": "<",
                         "value": split,
                         "pass": branchPass,
                         "fail": branchFail}}

iris = csv.reader(open("iris.csv"))
iris.next()   # skip the header row
dataset = [Datum(*x) for x in iris]

forest = [makeTree(dataset, 5)["TreeNode"] for x in xrange(100)]   # 100 trees of depth 5
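
Before embedding the forest in PFA, it can be reassuring to inspect one of the generated trees. A quick sketch using the standard-library json module:

import json
print json.dumps(forest[0], indent=2)[:400]   # first part of one tree, as JSON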

Insert the model into PFA

Just as with the decision tree example, the trees are already formatted for PFA.

pfaDocument["cells"]["forest"]["init"] = forest

Test it!

Depending on how your random number generator is seeded, this forest correctly identifies about 95% of the Iris dataset. Compare with the ruleset example (95%) and the overtrained decision tree example (100%).

engine, = PFAEngine.fromJson(pfaDocument)

iris = csv.reader(open("iris.csv"))
iris.next()   # skip the header row

numer = 0
denom = 0
for sepal_length, sepal_width, petal_length, petal_width, truth in iris:
    result = engine.action({"sepal_length": sepal_length, "sepal_width": sepal_width, "petal_length": petal_length, "petal_width": petal_width})
    print result, "vs", truth.replace("Iris-", "")
    if result != truth.replace("Iris-", ""):
        numer += 1
    denom += 1

print "Fraction mislabeled:", float(numer) / float(denom)