spark-vlbfgs

This package is an implementation of the Vector-free L-BFGS solver and some scalable machine learning algorithms for Apache Spark.

Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications, such as ads CTR prediction and deep neural networks, require solving problems with billions of parameters on huge amounts of data. This requirement far exceeds the capacity of the existing MLlib algorithms, many of which use L-BFGS as the underlying solver. To fill this gap, we developed Vector-free L-BFGS for MLlib. Vector-free L-BFGS avoids the expensive dot product operations in the two-loop recursion and greatly improves computation efficiency with a high degree of parallelism. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets.
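To make the idea concrete, here is a minimal single-machine sketch of the vector-free two-loop recursion. It follows the commonly published formulation of vector-free L-BFGS and is not taken from this package's source; the object and method names (`VectorFreeTwoLoop`, `gram`, `coefficients`, `combine`, `classic`) are hypothetical. The basis is assumed to be ordered as the m correction vectors s, then the m gradient differences y, then the current gradient g. The recursion itself touches only the small (2m+1) x (2m+1) matrix of pairwise dot products, never the full parameter-sized vectors.

```scala
object VectorFreeTwoLoop {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  // Dot-product matrix B(i)(j) = b_i . b_j over the basis (s_0..s_{m-1}, y_0..y_{m-1}, g).
  // In a distributed setting this is the only step that needs the full vectors.
  def gram(basis: Array[Vec]): Array[Array[Double]] =
    basis.map(bi => basis.map(bj => dot(bi, bj)))

  // Two-loop recursion expressed on coefficients: the direction is sum_l delta(l) * b_l.
  def coefficients(B: Array[Array[Double]], m: Int): Array[Double] = {
    val n = 2 * m + 1
    val delta = new Array[Double](n)
    delta(2 * m) = -1.0 // start from the negative gradient
    val alpha = new Array[Double](m)
    // Dot product of the current direction with basis vector j, using only B.
    def proj(j: Int): Double = (0 until n).map(l => delta(l) * B(l)(j)).sum
    for (i <- m - 1 to 0 by -1) {
      alpha(i) = proj(i) / B(i)(m + i)
      delta(m + i) -= alpha(i)
    }
    // Initial Hessian scaling: (s_last . y_last) / (y_last . y_last).
    val gamma = B(m - 1)(2 * m - 1) / B(2 * m - 1)(2 * m - 1)
    for (l <- 0 until n) delta(l) *= gamma
    for (i <- 0 until m) {
      val beta = proj(m + i) / B(i)(m + i)
      delta(i) += alpha(i) - beta
    }
    delta
  }

  // Assemble the search direction from the coefficients.
  def combine(delta: Array[Double], basis: Array[Vec]): Vec = {
    val p = new Array[Double](basis.head.length)
    for (l <- basis.indices; k <- p.indices) p(k) += delta(l) * basis(l)(k)
    p
  }

  // Classic two-loop recursion on full vectors, kept only for comparison.
  def classic(s: Array[Vec], y: Array[Vec], g: Vec): Vec = {
    val m = s.length
    val q = g.map(x => -x)
    val alpha = new Array[Double](m)
    for (i <- m - 1 to 0 by -1) {
      alpha(i) = dot(s(i), q) / dot(s(i), y(i))
      for (k <- q.indices) q(k) -= alpha(i) * y(i)(k)
    }
    val gamma = dot(s(m - 1), y(m - 1)) / dot(y(m - 1), y(m - 1))
    for (k <- q.indices) q(k) *= gamma
    for (i <- 0 until m) {
      val beta = dot(y(i), q) / dot(s(i), y(i))
      for (k <- q.indices) q(k) += (alpha(i) - beta) * s(i)(k)
    }
    q
  }
}
```

Both routines should produce the same descent direction. The difference is that `coefficients` works on a matrix whose size depends only on the history length m, so the parameter-sized work reduces to the pairwise dot products and the final linear combination, both of which parallelize naturally.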

Supported algorithms

spark-vlbfgs currently supports the following algorithms:

Logistic Regression
Linear Regression
Softmax Regression

with regularization:

L1
L2
Elastic Net
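The three regularization options can be read as one penalty term. The sketch below assumes spark-vlbfgs follows the MLlib elastic-net convention, where a mixing parameter of 1 gives pure L1 and 0 gives pure L2. The object name `ElasticNetPenalty` is hypothetical and the parameter name `elasticNetParam` is borrowed from MLlib. This is a plain Scala illustration, not the package's implementation.

```scala
object ElasticNetPenalty {
  // regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)
  def penalty(w: Array[Double], regParam: Double, elasticNetParam: Double): Double = {
    require(elasticNetParam >= 0.0 && elasticNetParam <= 1.0, "mixing parameter must be in [0, 1]")
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2Squared)
  }
}
```

For example, with `w = Array(1.0, -2.0)` and `regParam = 0.5`, the penalty is 1.5 for pure L1, 1.25 for pure L2, and 1.375 for an even mix.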

To be supported:

Multilayer Perceptron Classifier

Build and run spark-vlbfgs

spark-vlbfgs is built using Apache Maven. To build spark-vlbfgs and its example programs, run:

mvn clean package -DskipTests

By default, this project is built against Spark 2.0.0 with Scala 2.11. To build against other versions, pass Maven -D parameters, for example:

mvn clean package -Dscala.binary.version=2.10 -Dspark.version=2.0.0

Then run the example:

spark-submit \
   --master yarn \
   --num-executors 10 \
   --executor-cores 2 \
   --class org.apache.spark.ml.example.VLORExample \
   /path/to/spark-vlbfgs-0.1-SNAPSHOT.jar [paramlist]

Example

You can train a logistic regression model via the spark-vlbfgs API, which is consistent with Apache Spark MLlib:

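import org.apache.spark.sql.Dataset

// `spark` is an existing SparkSession; the data is read in LIBSVM format.
val dataset: Dataset[_] = spark.read.format("libsvm").load("data/a9a")

// Configure the block layout, the partitioning, and the regularization strength.
val trainer = new VLogisticRegression()
  .setColsPerBlock(100)
  .setRowsPerBlock(10)
  .setColPartitions(3)
  .setRowPartitions(3)
  .setRegParam(0.5)
val model = trainer.fit(dataset)

println(s"Vector-free logistic regression coefficients: ${model.coefficients}")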
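As a rough guide to choosing the block sizes, the sketch below estimates how many blocks a dataset splits into. It assumes that `rowsPerBlock` counts instances per block, that `colsPerBlock` counts features per block, and that the last block in each dimension may be partially filled. That reading of the setters is an assumption and is not confirmed by this README. The helper `BlockGrid.numBlocks` is hypothetical.

```scala
object BlockGrid {
  // Ceiling division: the number of blocks needed to cover `total` items.
  def numBlocks(total: Long, perBlock: Int): Long = {
    require(total >= 0 && perBlock > 0, "total must be non-negative and perBlock positive")
    (total + perBlock - 1) / perBlock
  }
}
```

The a9a training set is usually listed with 32,561 instances and 123 features. Under the assumption above, the settings in the example would give `BlockGrid.numBlocks(32561L, 10)` row blocks and `BlockGrid.numBlocks(123L, 100)` column blocks, that is, 3257 by 2.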

Talks

Reference

Contact & Acknowledgements

If you have any questions or encounter bugs, feel free to submit an issue or contact:

We are immensely grateful to Xiangrui Meng for the initial work and guidance during the design and development of spark-vlbfgs.
