Skip to content

Practical Examples Using Cpp

masajiro edited this page May 25, 2020 · 2 revisions

Practical examples with a large-scale dataset for two types of primary NGT graphs (ANNG, ONNG) are described.

Dataset generation

First, to describe how to search large-scale datasets, an NGT dataset needs to be generated. Download the fastText dataset as follows.

curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip

The dataset above should be converted to a format that our sample scripts can read by using the following python script.

# dataset.py 
with open('wiki-news-300d-1M-subword.vec', 'r') as fi,\
     open('objects.tsv', 'w') as fov, open('words.tsv', 'w') as fow:
    n, dim = map(int, fi.readline().split())
    fov.write('{0}\t{1}\n'.format(n, dim))
    for line in fi:
        tokens = line.rstrip().split(' ')
        fow.write(tokens[0] + '\n')
        fov.write('{0}\n'.format('\t'.join(tokens[1:])))

ANNG Construction and Search

Below is an example of how to construct an ANNG with cosine similarity for metric space.

// construct-fasttext.cpp                                                                                                                    
#include        "NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
  ifstream      is("objects.tsv");
  string        line;
  getline(is, line);
  stringstream  linestream(line);
  NGT::Property property;
  size_t n;
  linestream >> n;
  linestream >> property.dimension;
  property.objectType   = NGT::ObjectSpace::ObjectType::Float;
  property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeCosine;
  string path("fasttext.anng");
  NGT::Index::create(path, property);
  NGT::Index    index(path);
  index.disableLog();
  while (getline(is, line)) {
    stringstream        linestream(line);
    vector<float>       object;
    while (!linestream.eof()) {
      float value;
      linestream >> value;
      object.push_back(value);
    }
    index.append(object);
  }
  index.createIndex(16);        // 16 is the number of threads to construct.                                                                 
  index.save();
  return 0;
}

The ANNG can be searched with a query by using the following program.

// search-fasttext.cpp                                                                                                                       
#include        "NGT/Index.h"
#include        <sstream>
#include        <iomanip>
using namespace std;
int main(int argc, char **argv)
{
  ifstream              is("words.tsv");
  string                word;
  vector<string>        words;
  while (getline(is, word)) {
    words.push_back(word);
  }
  NGT::Index            index("fasttext.anng", true);
  int                   queryID = 10001;
  vector<float>         query;
  index.getObjectSpace().getObject(queryID, query);
  NGT::SearchQuery      searchQuery(query);
  NGT::ObjectDistances  results;
  searchQuery.setResults(&results);
  searchQuery.setSize(10);
  searchQuery.setEpsilon(0.1);
  index.search(searchQuery);
  cout << "Query No." << queryID << endl;
  cout << "Rank\tID\tDistance" << setprecision(6) << fixed << endl;
  for (size_t i = 0; i < results.size(); i++) {
    cout << i + 1 << "\t" << results[i].id << "\t" << results[i].distance << "\t" << words[results[i].id - 1] << endl;
  }
  cout << endl;
  return 0;
}

Below are the search results.

Query No.10001
Rank	ID	Distance
1	10001	0.000000	Doctors
2	4632	0.244096	Doctor
3	79543	0.258944	Medics
4	2045	0.263412	doctors
5	339398	0.274972	Doctoring
6	20668	0.280508	Physicians
7	80647	0.292580	Dentists
8	24256	0.292752	Nurses
9	9481	0.322196	Scientists
10	623161	0.330500	Practioners

When a higher accuracy is needed, you can specify an epsilon value of SearchQuery higher than the default 0.1 as shown below.

  searchQuery.setEpsilon(0.15);

When a short query time is needed at the expense of accuracy, you can specify a smaller epsilon value.