Brisbane bike stations clustering

Clusters bike station coordinates with Gaussian mixtures.

Since the data is low-dimensional (latitude and longitude only), I thought this would be a better choice than the regular K-means approach.
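As a rough illustration of the approach, here is a minimal sketch of Gaussian mixture clustering with Spark ML. It is not the code from src/main/scala/cluster: the object name GmmSketch, the clusterIds helper and the sample coordinates are all hypothetical, and it assumes the spark-mllib module is on the classpath.

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object GmmSketch {
  // Fits a k-component Gaussian mixture on (latitude, longitude) and
  // returns (station id, cluster index) pairs.
  def clusterIds(spark: SparkSession, k: Int): Array[(Long, Int)] = {
    import spark.implicits._

    // Hypothetical sample, NOT taken from Brisbane_CityBike.json.
    val stations = Seq(
      (1L, -27.470, 153.020), (2L, -27.472, 153.023),
      (3L, -27.468, 153.025), (4L, -27.474, 153.019),
      (5L, -27.500, 153.060), (6L, -27.503, 153.058),
      (7L, -27.498, 153.063), (8L, -27.505, 153.064)
    ).toDF("id", "latitude", "longitude")

    // Spark ML estimators expect a single vector column of features.
    val features = new VectorAssembler()
      .setInputCols(Array("latitude", "longitude"))
      .setOutputCol("features")
      .transform(stations)

    val model = new GaussianMixture().setK(k).setSeed(42L).fit(features)

    // "prediction" holds the most likely component for each row.
    model.transform(features)
      .select("id", "prediction")
      .collect()
      .map(row => (row.getLong(0), row.getInt(1)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gmm-sketch")
      .master("local[*]")
      .getOrCreate()
    clusterIds(spark, 2).foreach { case (id, cluster) => println(s"$id\t$cluster") }
    spark.stop()
  }
}
```

Unlike K-means, the fitted model also exposes a per-component probability for each point, which can help when a station sits between two clusters.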

Prerequisites

You need a JVM, Scala, Spark and Maven installed. Optionally, Eclipse for building or browsing the code.

Installing

Run 'mvn install' inside the downloaded repo to build an uber jar.

Running the tests

To process the given JSON data sample, you can run this command (cf. runSparkJob.sh):

spark-submit --class cluster.Brisbane --master local[*] --name "Brisbane bike station clustering" target/cluster-0.0.1-SNAPSHOT-uber.jar Brisbane_CityBike.json 4 clustered_ids

This runs a Spark job on a local cluster (--master local[*]) to find 4 cluster indices for the bike stations, and writes the results to the clustered_ids folder.

Change the --master option if you want to use a YARN or Mesos cluster, for example.

Each result line has this form:
"$station_id \t $cluster_index"
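To consume these lines afterwards, a small parser along the following lines could be used. This is only a sketch: ResultLines, parse and byCluster are hypothetical names that do not exist in the repository, and station ids are assumed to fit in an Int.

```scala
object ResultLines {
  // Parses one result line of the form "<station_id>\t<cluster_index>".
  // Fields are trimmed, so optional spaces around the tab are tolerated.
  def parse(line: String): Option[(Int, Int)] =
    line.split("\t").map(_.trim) match {
      case Array(id, cluster) =>
        scala.util.Try((id.toInt, cluster.toInt)).toOption
      case _ => None
    }

  // Groups station ids by cluster index, skipping malformed lines.
  def byCluster(lines: Seq[String]): Map[Int, Seq[Int]] =
    lines.flatMap(line => parse(line).toList)
      .groupBy(_._2)
      .map { case (cluster, pairs) => cluster -> pairs.map(_._1) }
}
```

For example, ResultLines.byCluster(Seq("1\t0", "2\t1", "3\t0")) groups stations 1 and 3 under cluster 0 and station 2 under cluster 1.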

Built With

Maven - Dependency Management
Scala 2.11.11
Spark 2.2.0
Java SE 1.8

Backlog

Both coordinate formats stored in the JSON are handled, but I did not take care of partial data such as station 7:

{
"id": 7,
"name": "7 - MARGARET STREET / EDWARD STREET",
"address": "Margaret St / Edward St",
"latitude": -27.47148,
"longitude": "not relevant"
}
and
{
"id": 7,
"name": "7 - MARGARET STREET / EDWARD STREET",
"address": "Margaret St / Edward St",
"latitude": "not relevant",
"longitude": 153.029647
}

Of course, I could have merged this kind of data to get a valid station, but I would have had to group by id, and the question is: in a real-case scenario, what should I do if there are more than two partial records? Take the first valid latitude and longitude? Average all the valid data?

So I decided to eliminate those records.
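The two options discussed above can be sketched in plain Scala as follows. The Record and Station case classes and both function names are hypothetical, chosen for illustration only; a coordinate is assumed to be None whenever the JSON value is missing or not numeric.

```scala
object PartialStations {
  // Hypothetical record shape: a coordinate is None when the JSON value
  // is missing or not numeric (e.g. "not relevant").
  final case class Record(id: Int, latitude: Option[Double], longitude: Option[Double])
  final case class Station(id: Int, latitude: Double, longitude: Double)

  // Strategy described above: drop every record lacking a coordinate.
  def dropPartial(records: Seq[Record]): Seq[Station] =
    records.collect {
      case Record(id, Some(lat), Some(lon)) => Station(id, lat, lon)
    }

  // Alternative: group by id and average all valid values per coordinate.
  // Ids with no valid latitude or no valid longitude are still dropped.
  def mergeByAverage(records: Seq[Record]): Seq[Station] = {
    def mean(xs: Seq[Double]): Option[Double] =
      if (xs.isEmpty) None else Some(xs.sum / xs.size)

    records.groupBy(_.id).toSeq.sortBy(_._1).flatMap { case (id, rs) =>
      val merged = for {
        lat <- mean(rs.flatMap(_.latitude.toList))
        lon <- mean(rs.flatMap(_.longitude.toList))
      } yield Station(id, lat, lon)
      merged.toList
    }
  }
}
```

With the two station 7 records shown earlier, dropPartial discards both, while mergeByAverage would combine them into a single station.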

In production, results should be stored by date, in separate folders.
Use Plotly's Scala graphing library to display the clusters.
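For the date-based storage item, one possible layout is a sub-folder per run date under the output folder. The sketch below assumes that layout; OutputPaths and datedFolder are hypothetical names, and the yyyy-MM-dd format is just one reasonable choice.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object OutputPaths {
  // Builds "<baseDir>/<yyyy-MM-dd>", e.g. "clustered_ids/2017-10-01".
  // A trailing slash on baseDir is ignored.
  def datedFolder(baseDir: String, date: LocalDate): String =
    s"${baseDir.stripSuffix("/")}/${date.format(DateTimeFormatter.ISO_LOCAL_DATE)}"
}
```

The job could then be given OutputPaths.datedFolder("clustered_ids", LocalDate.now()) as its output argument instead of the bare folder name.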

zubicaray/BrisbaneBikeStationClustering
