This repository has been archived by the owner on Apr 14, 2021. It is now read-only.

Creating a dataset generator #79

Open
wants to merge 3 commits into
base: master
79 changes: 79 additions & 0 deletions create-dataset/README.md
@@ -0,0 +1,79 @@
# Generating geographical points
`Note: Python 3.7+ must be installed.`

- [Generating categories](#creating-categories-and-its-scores-for-countries)
- [Creating the dataset indeed](#creating-the-dataset-indeed)

### What is `countries-default.json`
This file contains the geolocation points for every country.
Important to note:

Every point in this file represents a city with a population of at least 10,000 people. ***The points in this file are not random.***

The file is one array of arrays. In each inner array, the first position is the country's Alpha-2 code, and the second position is another array holding one `[latitude, longitude]` array per city.
For example:
```
[
  ["BR", [
      [-12.97111, -38.51083],
      [-23.5475, -46.63611],
      [-22.90278, -43.2075]
    ]
  ],
  ...
]
```
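
As a rough illustration of how this structure can be read, the sketch below parses a tiny inline sample in the same shape and walks over each country and city. The helper name `count_cities` and the inline sample are assumptions made for this example; the real `countries-default.json` is much larger.

```python
import json

# Inline sample in the same shape as countries-default.json.
# The three points are the ones from the example above.
sample = (
    '[["BR", [[-12.97111, -38.51083], '
    '[-23.5475, -46.63611], [-22.90278, -43.2075]]]]'
)


def count_cities(countries):
    """Return a dict mapping each Alpha-2 code to its number of city points."""
    return {code: len(points) for code, points in countries}


countries = json.loads(sample)
city_counts = count_cities(countries)

# Each entry unpacks to (code, points); each point unpacks to (lat, lon).
for code, points in countries:
    for lat, lon in points:
        print(code, lat, lon)
```

To try it on the real file, replace `json.loads(sample)` with `json.load(f)` on an opened `countries-default.json`.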

## Creating categories and its scores for countries
To create a geographical dataset, pay attention to the default values in `scores.json`.

This file holds an array of objects, each with 2 attributes: a category name and a countries array, like this:
```
{
  "category": "A",
  "countries": [
    {
      "country": "BR",
      "score": 0.55
    },
    {
      "country": "CA",
      "score": 0.25
    },
    {
      "country": "US",
      "score": 0.15
    }
  ]
}
```
***Note: it is very important to follow this pattern!***
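
Since the pattern matters, it may help to check a `scores.json` before running the generator. The sketch below is one possible check, not something `generate-dots.py` does itself: `validate_scores` is a hypothetical helper, and the rules it enforces (a string category, a two-letter country code, a numeric score) are assumptions read off the example above.

```python
def validate_scores(scores):
    """Raise ValueError if `scores` does not follow the scores.json pattern."""
    if not isinstance(scores, list):
        raise ValueError("top level must be an array of category objects")
    for entry in scores:
        if not isinstance(entry, dict):
            raise ValueError("each category must be an object")
        if not isinstance(entry.get("category"), str):
            raise ValueError("each category needs a string 'category'")
        countries = entry.get("countries")
        if not isinstance(countries, list):
            raise ValueError("each category needs a 'countries' array")
        for item in countries:
            if not isinstance(item, dict):
                raise ValueError("each country must be an object")
            code = item.get("country")
            score = item.get("score")
            # Assumption: country codes are two-letter Alpha-2 strings.
            if not (isinstance(code, str) and len(code) == 2):
                raise ValueError("'country' must be a two-letter code")
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(score, bool) or not isinstance(score, (int, float)):
                raise ValueError("'score' must be a number")
    return True


example = [
    {
        "category": "A",
        "countries": [
            {"country": "BR", "score": 0.55},
            {"country": "CA", "score": 0.25},
            {"country": "US", "score": 0.15},
        ],
    }
]
is_valid = validate_scores(example)
```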

## Creating the dataset indeed
After updating `scores.json` with your categories, countries, and scores, you have 2 alternatives to generate your dataset.
The first generates a larger file that plots every country in your graphic, giving unlisted countries a default score to improve the graphic's appearance.
The second generates points only for the countries listed in your categories. The file is smaller, but the graphic will have large blank areas.
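
The difference between the two alternatives can be sketched as follows. This is a simplified illustration, not the code in `generate-dots.py`: `build_dots` and its parameters are hypothetical names, and it checks whether a country is listed instead of comparing the score with the default value as the script does. The default score of `0.00001` is the one the script uses.

```python
DEFAULT_SCORE = 0.00001  # same default value as in generate-dots.py


def build_dots(countries, scores, include_default=True):
    """Flatten city points into [lat, lon, score, lat, lon, score, ...].

    countries: list of [alpha2_code, [[lat, lon], ...]] entries
    scores: dict mapping alpha2_code to score for one category
    include_default: if False, countries missing from `scores` are skipped
    """
    dots = []
    for code, points in countries:
        if code not in scores and not include_default:
            continue
        score = scores.get(code, DEFAULT_SCORE)
        for lat, lon in points:
            dots.extend([float(lat), float(lon), float(score)])
    return dots


countries = [
    ["BR", [[-12.97111, -38.51083]]],
    ["AR", [[-34.61315, -58.37723]]],  # not listed in the scores below
]
scores = {"BR": 0.55}

with_default = build_dots(countries, scores, include_default=True)
without_default = build_dots(countries, scores, include_default=False)
```

With the default enabled, the unlisted country still contributes a point with the tiny default score; with it disabled, only the listed country remains.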
### First option - With default values for countries that are not listed in your categories
This kind of graphic is normally used to plot distributions, and distributions are expressed as percentages. `For visual purposes`, it is therefore acceptable to plot a 0.001% score (`0.00001` in `generate-dots.py`) as the default value for all non-listed countries.

With that in mind, if you agree, open your terminal in this folder and run this command:
```
python generate-dots.py
```
A new file called `data.json` will appear in this folder. This file is your pre-configured dataset: copy it into your graphic's folder and remember to load it in `index.html` with this tag:
```
<script type="text/javascript" src="data.json"></script>
```

### Second option - Without default values on countries, only plot the countries listed in your scores.json

This one is easy: open your terminal in this folder and run this command:
```
python generate-dots.py default_dot=false
```
A new file called `data.json` will appear in this folder. This file is your pre-configured dataset: copy it into your graphic's folder and remember to load it in `index.html` with this tag:
```
<script type="text/javascript" src="data.json"></script>
```
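
Note that `generate-dots.py` writes `const data = ` before the JSON array, which is why the file is loaded through a `<script>` tag and is not plain JSON despite its extension. If you want to inspect the output from Python, a minimal sketch like the one below should work; `parse_data_js` and the inline sample text are assumptions made for this example.

```python
import json

PREFIX = "const data = "  # written by generate-dots.py before the JSON array


def parse_data_js(text):
    """Strip the JavaScript prefix and parse the remaining JSON array."""
    if not text.startswith(PREFIX):
        raise ValueError("expected the text to start with 'const data = '")
    return json.loads(text[len(PREFIX):])


# Inline sample in the same shape as the generated file:
# one [category, flat list of lat, lon, score values] entry per category.
sample_text = 'const data = [["A", [-12.97111, -38.51083, 0.55]]]'
parsed = parse_data_js(sample_text)

category, flat = parsed[0]
# Regroup the flat list into (lat, lon, score) triples.
triples = [flat[i:i + 3] for i in range(0, len(flat), 3)]
```

For the real output, read the file with `open('data.json').read()` and pass the text to `parse_data_js`.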
1 change: 1 addition & 0 deletions create-dataset/countries-default.json

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions create-dataset/generate-dots.py
@@ -0,0 +1,101 @@
# -*- coding: utf-8 -*-
"""
@author: heltonfabio
"""
import json
import time
import sys
import io


class Unbuffered(object):
    def __init__(self, stream):
        self.stream = stream

    def write(self, data):
        self.stream.write(data)
        self.stream.flush()

    def writelines(self, datas):
        self.stream.writelines(datas)
        self.stream.flush()

    def __getattr__(self, attr):
        return getattr(self.stream, attr)


sys.stdout = Unbuffered(sys.stdout)


class generator:
    start = time.time()
    _default_dot = True
    _score = []
    _default_dots = []
    _generated_dots = []

    def getScore(self):
        # Load the categories and their per-country scores.
        with open('scores.json', 'r') as f:
            self._score = json.load(f)

    def getDefaultCountriesDots(self):
        # Load the [alpha-2 code, [[lat, lon], ...]] entry of every country.
        with open('countries-default.json', 'r') as f:
            self._default_dots = json.load(f)

    def generateDots(self):
        for category in self._score:
            dots = []
            for default_dot in self._default_dots:
                score = self.getCountryScore(
                    category['category'], default_dot[0])
                # When default dots are disabled, skip countries that only
                # received the default score.
                if self._default_dot or score != 0.00001:
                    for dot in default_dot[1]:
                        # Flat layout: latitude, longitude, score.
                        dots.extend(dot)
                        dots.append(score)
            dots = [float(i) for i in dots]
            if len(dots) > 0:
                self._generated_dots.append([category['category'], dots])

    def getCountryScore(self, category, code2):
        # Return the score of country `code2` in `category`,
        # or the default 0.00001 when the country is not listed.
        for cat in self._score:
            if cat['category'] == category:
                for obj in cat['countries']:
                    if obj['country'] == code2:
                        print('Score from category {} | country {} | score {}'.format(
                            category, code2, obj['score']))
                        return obj['score']
        return 0.00001

    def printData(self, name_file):
        # The output is a JavaScript assignment, so it can be loaded
        # through a <script> tag. The `with` block closes the file.
        with open(name_file, 'w') as f:
            f.write('const data = ')
            json.dump(self._generated_dots, f)

    def init(self, val):
        self._default_dot = val
        print('\n'*40)
        print('Starting')
        if not val:
            print('Plotting dots, but without default dots')
        self.getScore()
        self.getDefaultCountriesDots()
        self.generateDots()
        self.printData('data.json')
        print('Dots generated and saved in {} seconds \n\nSaved as:\tdata.json'.format(
            (time.time() - self.start)))


gen = generator()
if len(sys.argv) > 1 and 'default_dot=false' in sys.argv[1]:
    gen.init(False)
else:
    gen.init(True)
33 changes: 33 additions & 0 deletions create-dataset/scores.json
@@ -0,0 +1,33 @@
[{
        "category": "A",
        "countries": [{
                "country": "BR",
                "score": 0.55
            },
            {
                "country": "CA",
                "score": 0.25
            },
            {
                "country": "US",
                "score": 0.15
            }
        ]
    },
    {
        "category": "B",
        "countries": [{
                "country": "JP",
                "score": 0.55
            },
            {
                "country": "FR",
                "score": 0.25
            },
            {
                "country": "KR",
                "score": 0.15
            }
        ]
    }
]