Configuration

Gabor Szarnyas edited this page May 30, 2021 · 1 revision

Setup Hadoop

The LDBC data generator uses Apache Hadoop 3.2.1.

To install Hadoop, download hadoop-3.2.1.tar.gz and extract it into your home folder ~ (this example uses the home folder, but you can choose whichever folder best fits your needs):

cd ~
wget http://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz

This creates a directory named hadoop-3.2.1 in your home folder. Hadoop can run in three modes: Standalone, Pseudo-Distributed and Distributed. By default, Hadoop runs in Standalone mode, which allows at most one reducer per job. This works well for generating small data sets in a local environment. To configure and start Hadoop in Pseudo-Distributed or Distributed mode, see the Single Node Cluster and the Cluster Setup pages, respectively.

  1. Fine-tune the logger. We found that setting log levels in mapred-conf.xml does not yield the expected result, but the following two approaches work.

    • Use an environment variable:
    # reduce clutter in the Hadoop output
    export HADOOP_LOGLEVEL=WARN
    • Overwrite the $HADOOP_HOME/etc/hadoop/log4j.properties file with the src/main/resources/log4j.properties file.
  2. Set the number of threads. For example, to get 8 threads, run:

    echo "ldbc.snb.datagen.generator.numThreads:8" >> params.ini
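Because the command above appends with >>, running it more than once leaves several lines for the same key in params.ini. A minimal sketch of an alternative that replaces an existing entry instead; set_param is a hypothetical helper name, not part of Datagen, and it assumes the key:value line format shown above:

```shell
# Hypothetical helper (not part of Datagen): set KEY to VALUE in a file of
# "key:value" lines, replacing any existing entry for KEY.
# Usage: set_param KEY VALUE [FILE]   (FILE defaults to params.ini)
set_param() {
    key="$1"
    value="$2"
    file="${3:-params.ini}"
    touch "$file"
    # Keep every line that does not start with "KEY:" ...
    awk -v prefix="${key}:" 'index($0, prefix) != 1' "$file" > "$file.tmp"
    # ... then append the new entry and swap the file into place.
    printf '%s:%s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}
```

For instance, set_param ldbc.snb.datagen.generator.numThreads 8 would leave exactly one numThreads line in params.ini, no matter how often it is run.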

Configuring the run.sh script

Datagen is configured mainly through the params.ini file in the ldbc_snb_datagen directory. The available options are listed in Advanced Configuration.

We provide a run.sh script to ease the execution of Hadoop. The following variables are used to configure the script:

  • HADOOP_HOME: points to where Hadoop was installed. Following our example, this folder is ~/hadoop-3.2.1.
  • LDBC_SNB_DATAGEN_HOME: points to the LDBC data generator folder.
  • PARAM_GENERATION [deprecated]: controls whether the parameters for SNB queries are generated. Use it only with a standard scaleFactor (e.g., SF 1). Always disable PARAM_GENERATION when running the data generator with non-standard input parameters (e.g., when you set numYears instead of using scaleFactor).

Example configuration (you might want to save these in the .bashrc file):

export HADOOP_HOME=~/hadoop-3.2.1
# optional configurations
export LDBC_SNB_DATAGEN_HOME=`pwd` # set to the ldbc_snb_datagen repo's location
export HADOOP_CLIENT_OPTS="-Xmx2G" # increase for sizes above SF1

Finally, open ~/hadoop-3.2.1/etc/hadoop/hadoop-env.sh and set JAVA_HOME to point to your JDK folder.
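The same change can be scripted instead of made in an editor. The sketch below appends an export line to the end of hadoop-env.sh; since that file is a shell script, a later assignment overrides an earlier one. set_hadoop_java_home is a hypothetical helper name, and the JDK path you pass in depends on your system:

```shell
# Hypothetical helper: append an explicit JAVA_HOME export to a hadoop-env.sh file.
# Usage: set_hadoop_java_home JDK_DIR ENV_FILE
set_hadoop_java_home() {
    java_home="$1"
    env_file="$2"
    # Refuse to write a path that does not exist.
    if [ ! -d "$java_home" ]; then
        echo "JDK directory not found: $java_home" >&2
        return 1
    fi
    printf '\nexport JAVA_HOME="%s"\n' "$java_home" >> "$env_file"
}
```

Following the example above, the second argument would be ~/hadoop-3.2.1/etc/hadoop/hadoop-env.sh.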

To make sure the Hadoop job does not run out of memory, increase the heap size (-Xmx) to a sufficient value (see the Troubleshooting page for details).
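Before invoking run.sh, the variables described in this section can be sanity-checked. The following is only a sketch: check_datagen_env is a hypothetical helper, and it assumes that a Hadoop installation contains an executable bin/hadoop and that params.ini sits in the Datagen folder.

```shell
# Hypothetical pre-flight check for the variables used by run.sh.
# Returns 0 if both directories look usable, 1 otherwise.
check_datagen_env() {
    status=0
    if [ -z "${HADOOP_HOME:-}" ] || [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
        echo "HADOOP_HOME does not point to a Hadoop installation" >&2
        status=1
    fi
    if [ -z "${LDBC_SNB_DATAGEN_HOME:-}" ] || [ ! -f "$LDBC_SNB_DATAGEN_HOME/params.ini" ]; then
        echo "LDBC_SNB_DATAGEN_HOME does not contain params.ini" >&2
        status=1
    fi
    # A missing heap setting is only reported, not treated as an error.
    case "${HADOOP_CLIENT_OPTS:-}" in
        *-Xmx*) ;;
        *) echo "note: HADOOP_CLIENT_OPTS sets no -Xmx heap size" >&2 ;;
    esac
    return "$status"
}
```

A possible use is check_datagen_env && ./run.sh, so that the job only starts when both folders are in place.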

Parameter generation

If you'd like to skip parameter generation, add the following line to the Datagen configuration (params.ini):

ldbc.snb.datagen.parametergenerator.parameters:false

Generator parameters

The LDBC data generator is configured by means of the params.ini file, which is located in the root folder of the LDBC data generator. There are two ways to configure the size of the generated data: by setting the scale factor, or by setting the number of persons, the starting year and the number of years spanned by the generated data. The params.ini file contains the following options:

Besides these parameters, Datagen supports scale factors: predefined configurations (of numPersons, startYear, numYears, degreeDistribution, etc.) that generate data at different scales for specific benchmarks such as the LDBC SNB or Graphalytics. The semantics of a scale factor depend on the benchmark it belongs to. Currently, the following scale factors are defined:

  • snb.interactive.0.1
  • snb.interactive.0.3
  • snb.interactive.1
  • snb.interactive.3
  • snb.interactive.10
  • snb.interactive.30
  • snb.interactive.100
  • snb.interactive.300
  • snb.interactive.1000
  • graphalytics.1
  • graphalytics.3
  • graphalytics.10
  • graphalytics.30
  • graphalytics.100
  • graphalytics.300
  • graphalytics.1000
  • graphalytics.3000
  • graphalytics.10000
  • graphalytics.30000

A scale factor is selected with the ldbc.snb.datagen.generator.scaleFactor option. Scale factors are loaded at the beginning of params.ini parsing. To avoid conflicts, comment out (with "#") any other options that affect the amount of data generated. If both the scale factor and the number of persons, start year or number of years are set, the explicitly set values take priority over the scale factor.
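As an illustration of the scale factor route, the sketch below writes a params.ini that selects one of the predefined scale factors and leaves out numPersons, numYears and startYear so that nothing overrides it. write_scale_factor_config is a hypothetical helper name; the serializer classes are the ones used elsewhere on this page.

```shell
# Hypothetical helper: write a params.ini that relies on a predefined scale factor.
# Usage: write_scale_factor_config SCALE_FACTOR [FILE]   (FILE defaults to params.ini)
# Note: an existing FILE is overwritten.
write_scale_factor_config() {
    scale_factor="$1"
    out="${2:-params.ini}"
    cat > "$out" <<EOF
ldbc.snb.datagen.generator.scaleFactor:${scale_factor}

ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer

ldbc.snb.datagen.generator.numThreads:1
EOF
}
```

For example, write_scale_factor_config snb.interactive.1 produces a configuration for SNB Interactive at scale factor 1.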

An example of a configuration file (by number of persons, start year and number of years):

ldbc.snb.datagen.generator.numPersons:100000
ldbc.snb.datagen.generator.numYears:3
ldbc.snb.datagen.generator.startYear:2010

ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer

ldbc.snb.datagen.generator.numThreads:1

SNB workloads

For the LDBC SNB Interactive and BI workloads, Datagen uses the same configuration parameters and classes (snb.interactive.* for scale factors and ldbc.snb.datagen.serializer.snb.interactive.* for serializers).