Figure out where to get 'fake data' #15

Open
karlicoss opened this issue Apr 12, 2020 · 5 comments
Labels: help wanted

Comments

@karlicoss (Owner)

It would be nice to have a public repository of raw data from different services, so it would be easy to test HPI and demonstrate it without having to give up your own data. Does such a thing exist?

P.S. Maybe this issue belongs here instead, in which case I'll transfer it.

karlicoss added the help wanted label May 8, 2020
@felubra commented May 9, 2020

Not sure about the existence of such raw data, but I think you could use faker to generate all the data you need in a predictable way (by seeding it).
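Something like this, just as a rough sketch (assuming the Python faker package; the record fields here are made up for illustration):

from faker import Faker

Faker.seed(0)  # fixed seed, so the generated data is reproducible across runs
fake = Faker()

# e.g. a handful of fake 'browsing history' records
records = [
    {
        'url': fake.url(),
        'title': fake.sentence(nb_words=4),
        'dt': fake.date_time_between(start_date='-1y', end_date='now').isoformat(),
    }
    for _ in range(10)
]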

@karlicoss (Owner, Author)

@felubra very nice, thanks! Ideally it would be good to get hands on some real data (I might just make some of mine public), but that's super helpful too.

karlicoss added a commit to karlicoss/dashboard that referenced this issue Sep 9, 2020
@karlicoss (Owner, Author)

Briefly tried faker (I think the Hypothesis testing framework also uses it). Had some issues with lots of duplicate data (similar to what's reported here), but I haven't investigated yet.

There is also mimesis, which claims to be faster.

I guess the general problem is that random data doesn't quite work for demos, because real data has some sort of 'narrative' and causal structure. But it's certainly useful to generate lots of it and then filter the datapoints so that they start making some causal sense.
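As a rough illustration of what I mean (just a sketch, not HPI code; the names are made up): instead of sampling each timestamp independently, one could walk forward in time and drop implausible events, which already gives the data a bit of a 'narrative':

import random
from datetime import datetime, timedelta

random.seed(0)  # reproducible

def fake_activity(days: int = 30):
    # walk forward in time instead of sampling each timestamp independently,
    # so consecutive events form something resembling a narrative
    dt = datetime(2020, 1, 1, 9, 0)
    for _ in range(days * 20):
        dt += timedelta(minutes=random.randint(5, 120))
        if 8 <= dt.hour <= 23:  # crude causal filter: only keep 'awake hours' activity
            yield {'dt': dt.isoformat(), 'duration_s': random.randint(30, 600)}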

@karlicoss (Owner, Author)

In terms of organizing the code: it seems that the fake data generation would belong well in the data access layers (DALs).

The idea is that the code that parses raw data and the code that generates fake raw data stay close together, so they don't go out of sync (this also gives you CI for data parsing for free: just run the parser against the fake data).
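Roughly like this (just a sketch; the module, field, and function names are hypothetical, not actual HPI/DAL code):

# dal.py -- a hypothetical data access layer for some service
import json
from pathlib import Path
from typing import Dict, Iterator, List, NamedTuple

class Entry(NamedTuple):
    dt: str
    duration_s: int

def parse(path: Path) -> Iterator[Entry]:
    # the 'real' parser for raw exports
    for row in json.loads(path.read_text()):
        yield Entry(dt=row['dt'], duration_s=row['duration_s'])

def fake_data_generator(*, rows: int) -> List[Dict]:
    # lives right next to parse(), so the two can't silently drift apart
    return [{'dt': f'2020-01-01T{i % 24:02d}:00:00', 'duration_s': 60} for i in range(rows)]

# test_dal.py -- parsing gets CI 'for free': run the parser against the fake data
def test_parse(tmp_path: Path) -> None:
    f = tmp_path / 'export.json'
    f.write_text(json.dumps(fake_data_generator(rows=100)))
    assert len(list(parse(f))) == 100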

Then the corresponding HPI module uses the DAL to generate fake data and sets it up as the module's input:

HPI/my/rescuetime.py, lines 78–84 at commit 28fcc1d:

with disabled_cachew(), override_config(config) as cfg, TemporaryDirectory() as td:
    tdir = Path(td)
    cfg.export_path = tdir  # point the module's config at the temporary directory
    f = tdir / 'rescuetime.json'
    import json
    f.write_text(json.dumps(dal.fake_data_generator(rows=rows)))  # write the fake export
    yield

It works as a context manager, e.g.

with my.rescuetime.fake_data():
    # the rescuetime module will run against fake data now
    ...

Here's an example: https://github.com/karlicoss/dashboard/blob/623555e09647cce20bcc60f8ba6e9f5e932d32a2/src/dashboard/tabs.py#L103-L116

And the end result: a Rescuetime data heatmap generated from completely fake data, with everything running on CI! https://karlicoss.github.io/dashboard/rescuetime.html

The snippets are a bit awkward at the moment, but I'll fix a couple of minor caveats, and I feel like this could work really well!

karlicoss pinned this issue Nov 4, 2020