My Fourth Year CS Project at the University of Oxford. In this repository I have implemented a framework for testing different compression algorithms' abilities to classify data via the normalised compression distance method. Included in the compression algorithms are novel BERT-based ones that I have developed.
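For background, the normalised compression distance between two sequences x and y is NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the length of s after compression. A minimal sketch of the idea, using zlib as a stand-in compressor (the framework itself swaps in other compressors, including the BERT-based ones):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance, with zlib as the compressor C."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog" * 4
b = b"pack my box with five dozen liquor jugs" * 4

# A sequence concatenated with itself compresses much better than two
# unrelated sequences, so the self-distance is smaller.
print(ncd(a, a) < ncd(a, b))
```

The intuition is that if x and y share structure, a compressor can exploit it when compressing their concatenation, pulling the distance towards 0.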
- Install the requirements given in `requirements.txt`.
- Open the Python shell, run `import nltk`, then `nltk.download('punkt')` and `nltk.download('reuters')`.
- In the `csproject` subdirectory of the repository root, run `python compressors/setup.py build_ext --inplace`.
- Create a directory named whatever you like (e.g. `build`) in the root of this repository, change directory into it, and run `cmake ../csproject_pairwise_dists/ && make`. This creates an executable which can be run as `build/csproject_pairwise_dists $DB`, where `$DB` is your database filename. It is just a more efficient way of computing the pairwise distances than doing it in Python!
- Pick a script in the `load/` folder to create a new DB and load the data. See below for a discussion.
- If the dataset requires unkification, `unkify.py` does this.
- Once the data has been loaded, call `pair_up.py`, which creates pairings of the data points to train on (recall that the NCD method operates on pairs of inputs). This makes use of a comma code. Warning: by default, this script creates a new element in the alphabet to denote commas. However, if you are using a pretrained model, this will cause an error at training time. In this case you must instead pass the token which denotes the separator. For BERT's NLP tokenisation, this is the `[SEP]` token, so you would add `--use-comma "[SEP]"` to the command. You must also use `--squash-start-end` for NLP BERT-tokenised data.
- Once the data has been paired up, it's time to start compressing it, using `compress.py`. It applies the compressor to all of the data in the DB. Some compressors, like the BERT compressor, have to train first.
- Once you have invoked the relevant compressors, `compute_ncd.py` will use the compressibilities achieved to compute all of the different NCD similarities we are interested in.
- Then, run the executable you should've built earlier using CMake, called `csproject_pairwise_dists`, passing the DB as the only parameter. This fills in a handy table called `PairwiseDistances`, which is useful for downstream classification.
- Once the similarities are computed, `classify.py` can be used to make classifications (you input the type of classifier you want to use). There are additional classification scripts, e.g. `classify_mst.py`.
- Then, `compute_accs.py` computes the accuracies of each prediction method.
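The final classification step consumes the pairwise dissimilarities computed upstream. As a toy illustration of that idea (this is a hedged sketch, not the actual interface of `classify.py`), here is 1-nearest-neighbour classification over one row of a precomputed distance matrix:

```python
# Toy 1-nearest-neighbour classifier over precomputed dissimilarities,
# in the spirit of the NCD pipeline above. The function name and
# argument layout are illustrative, not the classify.py interface.

def nn_classify(dist_row, train_labels):
    """Predict the label of the training point nearest the test point.

    dist_row[i] is the dissimilarity (e.g. an NCD value) between the
    test point and training point i.
    """
    nearest = min(range(len(dist_row)), key=dist_row.__getitem__)
    return train_labels[nearest]

# The test point is closest to training point 2, labelled "spam".
print(nn_classify([0.9, 0.8, 0.1], ["ham", "ham", "spam"]))  # spam
```

Because NCD only ever yields pairwise dissimilarities (never feature vectors), distance-based classifiers like nearest-neighbour or minimum-spanning-tree methods are the natural fit.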
- There is one database file for each dataset (where a dataset is taken to mean labelled, partitioned sequences over some predetermined alphabet). Note that a dataset is permitted to have many labellings.
- Where possible, the scripts are parallelised internally, so there should be no reason to run multiple scripts in parallel yourself.
- Scripts are mostly organised by their "axis of parameterisation". Specifically, there are several independent things we would like to compare: whether a dataset has multiple labellings, the compressor itself, the NCD formula, and the classification method given the dissimilarities.
- All Python code is in `csproject/`. The files at its top level are all executable Python scripts.
  - The files in the `csproject/load/` folder are all executable scripts for loading various datasets.
  - The `csproject/compressors` folder is a package containing compression algorithms.
  - The `csproject/bnb` folder is a package containing code for producing different token orderings, given a model.
- The `csproject_pairwise_dists/` directory contains a CMake C++ project which is just an optimised version of a particular SQL query. SQLite makes me sad sometimes.
- The `load/gen_random.py` data loader is clearly not associated with any dataset in particular, and just generates arbitrary data for testing.
- The `load/jeopardy.py` dataset can be downloaded from the JSON file linked here.
- The `load/sent140.py` dataset is available from Kaggle.