Figure out where to get 'fake data' #15

Open
karlicoss opened this issue Apr 12, 2020 · 5 comments
Labels: help wanted

Comments

@karlicoss (Owner)

It would be nice to have a public repository of raw data from different services, so it would be easy to test HPI and demonstrate it without having to give up your own data. Does such a thing exist?

P.S. Maybe this issue belongs here instead, in which case I'll transfer it.

karlicoss added the help wanted label May 8, 2020
@felubra commented May 9, 2020

Not sure about the existence of such raw data, but I think you could use faker to generate all the data you need in a predictable way (by seeding it).
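Something like this, just as a rough sketch (assuming the Python faker package; the record fields here are made up for illustration):

from faker import Faker

Faker.seed(0)  # fixed seed, so the generated data is reproducible across runs
fake = Faker()

# e.g. a handful of fake 'browsing history' records
records = [
    {
        'url': fake.url(),
        'title': fake.sentence(nb_words=4),
        'dt': fake.date_time_between(start_date='-1y', end_date='now').isoformat(),
    }
    for _ in range(10)
]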

@karlicoss (Owner, Author)

@felubra very nice, thanks! Ideally it would be good to get hands on some real data (I might just make some of mine public), but that's super helpful too.

karlicoss added a commit to karlicoss/dashboard that referenced this issue Sep 9, 2020
@karlicoss (Owner, Author)

Briefly tried faker (I think the Hypothesis testing framework also uses it). Had some issues with lots of duplicate data (similar to what's reported here), but I haven't investigated yet.

There is also mimesis, which claims to be faster.

I guess the general problem is that random data doesn't quite work for demos, because real data has some sort of 'narrative' and causal structure. But it's certainly useful to generate lots of it and then filter the datapoints so that they start making some causal sense.
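As a rough illustration of what I mean (just a sketch, not HPI code; the names are made up): instead of sampling each timestamp independently, one could walk forward in time and drop implausible events, which already gives the data a bit of a 'narrative':

import random
from datetime import datetime, timedelta

random.seed(0)  # reproducible

def fake_activity(days: int = 30):
    # walk forward in time instead of sampling each timestamp independently,
    # so consecutive events form something resembling a narrative
    dt = datetime(2020, 1, 1, 9, 0)
    for _ in range(days * 20):
        dt += timedelta(minutes=random.randint(5, 120))
        if 8 <= dt.hour <= 23:  # crude causal filter: only keep 'awake hours' activity
            yield {'dt': dt.isoformat(), 'duration_s': random.randint(30, 600)}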

@karlicoss (Owner, Author)

In terms of organizing the code: it seems that the fake data generation would belong well in the data access layers (DALs).

The idea is that the code that parses raw data and the code that generates fake raw data stay close together, so they don't go out of sync (this also gives you CI for data parsing for free: just run the parser against the fake data).
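Roughly like this (just a sketch; the module, field, and function names are hypothetical, not actual HPI/DAL code):

# dal.py -- a hypothetical data access layer for some service
import json
from pathlib import Path
from typing import Dict, Iterator, List, NamedTuple

class Entry(NamedTuple):
    dt: str
    duration_s: int

def parse(path: Path) -> Iterator[Entry]:
    # the 'real' parser for raw exports
    for row in json.loads(path.read_text()):
        yield Entry(dt=row['dt'], duration_s=row['duration_s'])

def fake_data_generator(*, rows: int) -> List[Dict]:
    # lives right next to parse(), so the two can't silently drift apart
    return [{'dt': f'2020-01-01T{i % 24:02d}:00:00', 'duration_s': 60} for i in range(rows)]

# test_dal.py -- parsing gets CI 'for free': run the parser against the fake data
def test_parse(tmp_path: Path) -> None:
    f = tmp_path / 'export.json'
    f.write_text(json.dumps(fake_data_generator(rows=100)))
    assert len(list(parse(f))) == 100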

Then the corresponding HPI module uses the DAL to generate fake data and sets it up as the module's input:

HPI/my/rescuetime.py, lines 78–84 at commit 28fcc1d:

with disabled_cachew(), override_config(config) as cfg, TemporaryDirectory() as td:
    tdir = Path(td)
    cfg.export_path = tdir  # point the module's config at the temporary directory
    f = tdir / 'rescuetime.json'
    import json
    f.write_text(json.dumps(dal.fake_data_generator(rows=rows)))  # write the fake export
    yield

It works as a context manager, e.g.

with my.rescuetime.fake_data():
    # the rescuetime module will run against fake data now
    ...

Here's an example: https://github.com/karlicoss/dashboard/blob/623555e09647cce20bcc60f8ba6e9f5e932d32a2/src/dashboard/tabs.py#L103-L116

And the end result: a Rescuetime data heatmap generated from completely fake data, with everything running on CI! https://karlicoss.github.io/dashboard/rescuetime.html

The snippets are a bit awkward at the moment, but I'll fix a couple of minor caveats, and I feel like this could work really well!

karlicoss pinned this issue Nov 4, 2020