
## Dataset Description

| File | Length | Vocabulary Size | Brief Description |
| --- | --- | --- | --- |
| **Real Data** | | | |
| webster | 41.1M | 98 | HTML data of the 1913 Webster Dictionary, from the Silesia corpus |
| text8 | 100M | 27 | First 100M of English text (only) extracted from enwiki9 |
| enwiki9 | 500M | 206 | First 500M of the English Wikipedia dump from 2006 |
| mozilla | 51.2M | 256 | Tarred executables of Mozilla 1.0, from the Silesia corpus |
| h. chr20 | 64.4M | 5 | Chromosome 20 of the H. sapiens GRCh38 reference sequence |
| h. chr1 | 100M | 5 | First 100M bases of chromosome 1 of the H. sapiens GRCh38 sequence |
| c.e. genome | 100M | 4 | C. elegans whole-genome sequence |
| ill-quality | 100M | 4 | 100MB of quality scores for PhiX virus reads sequenced with Illumina |
| np-bases | 300M | 5 | Nanopore-sequenced reads (bases only) of a human sample (first 300M symbols) |
| np-quality | 300M | 91 | Quality scores for a nanopore-sequenced human sample (first 300M symbols) |
| num-control | 159.5M | 256 | Control vector output between two minimization steps in weather-satellite data assimilation |
| obs-spitzer | 198.2M | 256 | Data from the Spitzer Space Telescope showing a slight darkening |
| msg-bt | 266.4M | 256 | NPB computational fluid dynamics pseudo-application bt |
| audio | 264.6M | 256 | First 600 files (combined) in the ESC dataset for environmental sound classification |
| **Synthetic Data** | | | |
| XOR-k | 10M | 2 | Pseudorandom sequence; entropy rate 0 bpc |
| HMM-k | 10M | 2 | Hidden Markov sequence; entropy rate 0.46899 bpc |

## Links to the Datasets and Trained Bootstrap Models

| File | Link | Bootstrap Model |
| --- | --- | --- |
| webster | http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia | webster |
| mozilla | http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia | mozilla |
| h. chr20 | ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr20.fa.gz | chr20 |
| h. chr1 | ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz | chr1 |
| c.e. genome | ftp://ftp.ensembl.org/pub/release-97/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz | celegchr |
| ill-quality | http://bix.ucsd.edu/projects/singlecell/nbt_data.html | phixq |
| text8 | http://www.mattmahoney.net/dc/textdata.html | text8 |
| enwiki9 | http://www.mattmahoney.net/dc/textdata.html | enwiki9 |
| np-bases | https://github.com/nanopore-wgs-consortium/NA12878 | npbases |
| np-quality | https://github.com/nanopore-wgs-consortium/NA12878 | npquals |
| num-control | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| obs-spitzer | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| msg-bt | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| audio | https://github.com/karolpiczak/ESC-50 | model |

## Synthetic Dataset Generation Example

1. Go to the `Datasets` directory.
2. For real datasets, run:

   ```bash
   bash get_data.sh
   ```

3. For synthetic datasets, run:

   ```bash
   # For generating the XOR-10 dataset
   python generate_data.py --data_type 0entropy --markovity 10 --file_name files_to_be_compressed/xor10.txt
   # For generating the HMM-10 dataset
   python generate_data.py --data_type HMM --markovity 10 --file_name files_to_be_compressed/hmm10.txt
   ```

4. This will generate a folder named `files_to_be_compressed`. This folder contains the parsed files, which can be used to recreate the results in our paper.
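To illustrate what the synthetic sequences look like, here is a minimal standalone sketch, not the repository's `generate_data.py`. The exact XOR recurrence and the HMM observation-noise probability are assumptions chosen to be consistent with the stated entropy rates (a bit-flip probability of 0.1 has binary entropy ≈ 0.469 bits, matching the 0.46899 bpc figure for HMM-k):

```python
import random


def xor_k(n, k, seed=42):
    """Sketch of an XOR-k pseudorandom bit sequence.

    Assumed recurrence: each new bit is the XOR of the bits 1 and k
    positions back, so after the first k random bits the sequence is
    fully deterministic (entropy rate 0 bpc given the recurrence).
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(k)]
    while len(bits) < n:
        bits.append(bits[-1] ^ bits[-k])
    return bits


def hmm_k(n, k, flip_p=0.1, seed=42):
    """Sketch of an HMM-k sequence.

    Assumed model: an XOR-k hidden state observed through a binary
    symmetric channel that flips each bit with probability flip_p.
    With flip_p = 0.1 the entropy rate is the binary entropy of 0.1,
    about 0.469 bpc.
    """
    rng = random.Random(seed)
    hidden = xor_k(n, k, seed)
    return [b ^ (1 if rng.random() < flip_p else 0) for b in hidden]
```

Both functions return lists of 0/1 symbols; writing them out as characters would mirror the two-symbol vocabulary listed in the table above.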