Bernoulli Mixtures

Cluster and Visualize Multi-label Data

Bernoulli mixture model is a classic tool for multidimensional binary variable density estimation. Here we apply Bernoulli mixture to cluster and visualize multi-label data. In multi-label data, each instance has multiple labels or tags. For example, a Youtube video about Minecraft may be tagged with Minecraft, Game, Video game, and PC game. The set of relevant labels for an instance can be written as a multidimensional binary vector with the total size equal to the number of possible labels and each bit indicating the presence (1) or absence (0) of a particular label. Bernoulli mixture partitions (probabilistically) all possible labels into several clusters. Labels in the same cluster are related and tend to co-occur. Bernoulli mixture clustering provides a convenient way for us to understand the topics behind these labels.

Sample clusters

Below are a few clusters produced by applying Bernoulli mixture to the youtube dataset, which contains a total number of 4716 labels. The font size indicts the relative frequency of each label in the cluster. gadget

comics

game

vehicle

Producing the clusters

To run the Bernoulli mixture clusting algorithm, please just type

./pyramid config/cluster_labels.properties

where the cluster_labels.properties file (can be found in the config folder of the package) specifies all the algorithm parameters, as explained below.

Program Properties

The properties file (a plain text file with each line being a key value pair) specifies all input, output and hyper parameters required by the program. A sample properties file is shown below. The same file can also be found in the config folder associated with the code release. You can modify this file to set up the the correct dataset paths on your computer and experiment with different model parameters.

# input label file; each line lists all relevant labels for an instance
# labels are encoded as 0, 1, 2, ..., L-1, where L is the total number of labels
input.labels=/Users/chengli/tmp/youtube/labels.csv

# a file that specifies all label names; one name per line; should have L lines
input.labelNames=/Users/chengli/tmp/youtube/label_names.csv

# directory for program output
output.dir=/Users/chengli/tmp/youtube/out

# number of Bernoulli mixture clusters
numClusters=100

# number of EM training iterations
numIterations=50

# the internal Java class name for this application. 
# users do not need to modify this.
pyramid.class=ClusterLabels

Sample Data

Youtube labels

Youtube label names

These two files are taken from the public Youtube 8M dataset. Some prepossessing is done to transform the format. The first 1M instances from the training set are used for clustering.

All 100 clusters produced by Bernoulli Mixtures

Beyond Multi-label Clustering -- Conditional Bernoulli Mixtures for Multi-label Classification

Bernoulli mixture model performs clustering by building a mixture density estimation p(Y) where Y is a multi-label vector. The clustering is done purely on the label space without looking at the instance features (e.g. image features). Sometimes one is also interested in classification and wish to build a conditional density estimation p(Y|x) based on the training data, where x is a feature vector. Having such as model, one can then predict the set of relevant labels for a new instance x by picking the Y with the highest probability in p(Y|x). The conditional extension of Bernoulli mixture, named Conditional Bernoulli Mixtures (CBM), can be found here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly