metalog:

* metalog.info('foo', {...}) for progress recording -- sent to log by default
* metalog.minor('foo', {...}) for verbose recording -- sent nowhere by default
* metalog.event('foo', {...}) cubifies the event and sends it to info
  - metalog.event('foo', {...}, 'silent') to cubify but not log
* retargetable:
  - metalog.loggers.minor = metalog.log to log minor events
  - metalog.loggers.info = metalog.silent to quash logging

Also separated the db test helpers into their own file and added fixture ability.
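The retargetable-logger idea above can be sketched in a few lines of plain JavaScript. This is a minimal illustration of the pattern, not cube's actual implementation; the captured `lines` array stands in for the real log sink, and details may differ in the real metalog module.

```javascript
// Minimal sketch of the retargetable metalog pattern (names follow the
// commit message above; the real cube implementation may differ).
const lines = [];  // captured output, stands in for the real log sink

const metalog = {
  log:    (name, details) => lines.push(name + ' ' + JSON.stringify(details)),
  silent: () => {},
  loggers: {}
};

// Defaults: info is logged, minor goes nowhere.
metalog.loggers.info  = metalog.log;
metalog.loggers.minor = metalog.silent;

metalog.info  = (name, details) => metalog.loggers.info(name, details);
metalog.minor = (name, details) => metalog.loggers.minor(name, details);

metalog.minor('verbose_thing', { n: 1 });   // dropped by default
metalog.info('progress', { n: 2 });         // logged by default

metalog.loggers.minor = metalog.log;        // retarget: now log minor events
metalog.minor('verbose_thing', { n: 3 });   // now captured

metalog.loggers.info = metalog.silent;      // quash info logging
metalog.info('progress', { n: 4 });         // now dropped
```

Because the dispatch goes through `metalog.loggers`, reassigning one field changes routing for all later calls without touching call sites.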
* authentication.authenticator(name) gives you the requested authenticator
  - server.js uses options['authenticator']
  - 'allow_all' is the default in bin/collector-config.js etc.
* authenticator.check(request, auth_ok, auth_no)
  - calls auth_ok if authenticated, auth_no if rejected
  - staples an 'authorized' member to the request object:
    - e.g. we use 'request.authorized.admin' to govern board editing
    - in the mongo_cookie authenticator, it's the user record
* the mongo_cookie authenticator compares a bcrypted cookie to a stored hashed secret
  - you must set the cookie and store that db record in your host client; see 'test/authenticator-test.js' for the format. Rails+devise snippet available on request.
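The `check(request, auth_ok, auth_no)` contract above can be illustrated with a self-contained sketch. The `allow_all` and `read_only` behaviors here are assumptions inferred from their names, and the shape of `request.authorized` is illustrative only; the real cube code may attach different fields.

```javascript
// Hypothetical sketch of the authenticator contract described above.
const authenticators = {
  allow_all: {
    check: (request, auth_ok, auth_no) => {
      request.authorized = { admin: true };    // everyone is fully trusted
      auth_ok(request);
    }
  },
  read_only: {
    check: (request, auth_ok, auth_no) => {
      if (request.method === 'GET') {
        request.authorized = { admin: false }; // reads allowed, no board editing
        auth_ok(request);
      } else {
        auth_no(request);                      // reject writes
      }
    }
  }
};

// Mirrors authentication.authenticator(name), defaulting to 'allow_all'.
function authenticator(name) {
  return authenticators[name] || authenticators.allow_all;
}
```

Downstream handlers can then branch on `request.authorized.admin` without caring which authenticator ran.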
* pr2_authentication: Added authentication: allow_all, read_only, or mongo_cookie
* marsup/mongodb-native-parser: Upgrade mongodb driver and use native BSON parser

Conflicts: package.json
Moved test.js to test_helper.js and renamed it everywhere it is used.
Some combination of these changes and the upgraded mongodb npm package seems to have made the intermittent failures in test/metrics-test.js go away. Untangled and commented the code, so if you are still seeing those bugs, this may be a better stepping-off point.
…ting server.
* test_helper.with_server -- starts the server, runs the tests once the server is up, and stops the server when the tests are done
* merged test/test_db.js into test/test_helper.js and documented it thoroughly
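The `with_server` lifecycle described above can be sketched with a fake server object. This is a shape-only illustration; the real helper starts and stops an actual cube server, and its signature may differ.

```javascript
// Minimal sketch of the test_helper.with_server pattern described above.
function with_server(server, run_tests, done) {
  server.start(() => {
    run_tests(() => {      // tests invoke this callback when finished
      server.stop(done);   // server is stopped only after tests complete
    });
  });
}

// Fake server that records the lifecycle, standing in for a cube server.
const trace = [];
const fakeServer = {
  start: (cb) => { trace.push('start'); cb(); },
  stop:  (cb) => { trace.push('stop');  cb(); }
};

with_server(
  fakeServer,
  (done) => { trace.push('tests'); done(); },
  ()     => { trace.push('done'); }
);
```

The point of the pattern is that teardown is guaranteed to run after the tests signal completion, so suites don't leak running servers.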
…inct, all tiers cascade to tensec, horizons)
For the record, I still experience a partially unresponsive evaluator if I ask for data past the horizon. I think the computation never gets done because the start and end dates are equal, so any further request on the same collection will stall. Another thing I noticed is pretty poor performance on grouped queries; a mongodb distinct on several million un-indexed rows is supposed to take a while, but still...
I hear that. I'm running this with personal data but I haven't thrown…
Have you given any thought to my new configuration proposal?
@RandomEtc After running the service for a while, I was forced to remove a "feature" infochimps added: cascading the cache down to the lowest tiers is a very bad idea. It eats MongoDB storage like hell, and is way too expensive for insignificant benefit considering a few seconds/minutes is not that long to re-compute, so I came back to the way things were in the current cube release. You might want to consider this before doing a release...
I understand your feedback and that's definitely a valid concern, but I want to clarify a little. First, as far as speed, a response time of seconds to minutes per query, given 20 to 30 queries on a page, wasn't acceptable for our UI requirements. So stored metrics did offer some speed improvements. Although that was an added bonus, our intent was not to cache calculations for speed, but to store data. Hopefully I can offer some insight into why we did it that way.

For our use, event data vastly outsized metric data, so we purposefully capped the events collection to make event records fall out. To preserve the data contained in those events, we saved the metrics at the lowest tier. With the lowest tier, we could build a higher-tier metric, like 5 minutes or 1 hour, back up from those low-tier 10-second metrics. We stored our permanent data in metrics with ephemeral events, as opposed to the previous situation of permanent events with ephemeral metric caches.

So, assuming one has large event records with many events per 10-second tier, storing only the metrics should use much less data. In our use case, it meant we were able to roll up thousands of multi-kB events into a handful of small, sub-kB metrics per query. We also had a separate "cleaner" cron job to remove metrics older than a day or so, to keep the data size down. We wrote our version with the intent of optimizing for high throughput while keeping a small, unsharded mongo. Storing metrics actually ended up being significantly more storage efficient for us.

I can see how for other data shapes / use cases, storing all metrics may not make sense. We definitely cut a couple of corners to fulfill our use case, because not everyone wants to predefine queries, drop events, and handle dense data. We lost some of the flexibility offered by cube in order to meet our needs. It sounds like our version didn't fit your use case; I'm glad you were able to change it to better meet your needs.
I hope that cleared up our intentions. If not, I'm happy to clarify further. |
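The "build a higher tier back up from 10-second metrics" idea described above can be sketched with a small rollup function. The sum reducer and tier sizes are assumptions chosen for illustration; cube supports other reducers, and this is not its actual code.

```javascript
// Hypothetical sketch of rebuilding a higher-tier metric from stored
// lowest-tier (10-second) metrics, as described in the comment above.
const TENSEC   = 10000;   // lowest tier, in ms
const FIVE_MIN = 300000;  // higher tier: 30 tensec slots

// metrics: array of { time: msSinceEpoch, value: number } at a lower tier
function rollUp(metrics, tier) {
  const buckets = new Map();
  for (const m of metrics) {
    const slot = Math.floor(m.time / tier) * tier;  // floor to tier boundary
    buckets.set(slot, (buckets.get(slot) || 0) + m.value);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([time, value]) => ({ time, value }));
}
```

Because every higher tier boundary aligns with tensec boundaries, the lowest tier is sufficient to reconstruct any coarser tier even after the raw events have fallen out of a capped collection.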
Hello Houston! First, reading it back, my previous comment seems more accusing than it was meant to be, so sorry for that. Now I can see why you would do that. In my case I keep everything, no capped collection at all, and I also have many events for any given time, so our situations should be similar; you might understand my pain seeing metrics grow horribly fast :) The difference might be that I'm on a sharded mongo, with many evaluators answering queries at the same time, and we mostly stream metrics (which is a difference in our fork), so full time ranges are not queried that often. Your version definitely improved many things and I'm grateful for that; I'm just worried such a default setting would disappoint newcomers, as it fills up several GB for only a few days/weeks of metrics. I think ideally this cascading aspect should be configurable, but that'll be the subject of another pull request ;) Anyway, it's nice to see you're still following things here!
Thanks both for keeping the discussion going. I'm sorry I've been silent on most matters. I haven't had a chance to try this new branch on real data, so I'm hesitant to express too strong an opinion about it. I'm open to landing it as a 0.3-pre branch here so there's a clearer target for new contributions/optimizations/docs. What do you think?
Well, this thread has lasted long enough, I think :) You have raised many concerns along the way, so I would say create that branch and close this pull request, but let's not forget anything here; maybe create a bunch of separate issues to track every doubt/task that needs to be dealt with before the final release.
Sounds like a plan. Thanks for your help triaging issues!
When is this planned to be merged?
This merge is (in my opinion) very important; why hasn't it been merged yet?
Hi @jeffhuys - this branch is stalled mainly because I ran out of time/bandwidth for the project, but also because we only have one person (@Marsup) who has run this code to date. I'd like to run it before I merge it, but I haven't had time.

If you've followed our discussion above, and the related issue where I asked for community input into the merge, you can see I was very optimistic about bringing all the Infochimps changes into the Square cube repo, but the actual process of doing this was a lot more complex and time consuming than I imagined. We don't run Cube in an official/production capacity at Square any more, so it's more or less a volunteer side-project for me. Major apologies to @Marsup for not using this integration work yet.

The next step remains setting up a new branch here for this work, and getting a few more people to try it out for their use case. I'll try to get that done soon, including trying it out at Square. Until then, please comment on this thread if you've run @Marsup's version and let us know how it goes.
Hi, just as a side note: the revert to a plain js object for the config in that branch made my life a lot easier.
I'm confident as well; I've had this branch running in production since September (with a few modifications since then, as the commit history will tell you), but beware that it doesn't only contain infochimps modifications, so not everything is documented as it should be. I'm also glad I came to reason on the config; imposing cfg in cube's core was not very clever, even though it's a very nice module and I still use it for my cube runners.
I'm going to test this branch with 150K events/day. @Marsup, can you tell me what the job of the "warmer" is, and whether there is any way to use only one config, instead of three that are almost the same?
I'm not the one who conceived it, so I'll describe it as best I can; I'm not using it either. As for configuration: no, afraid not, but you can still use cube as a module and apply the slight variations programmatically.
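The "one base config, slight variations" approach suggested above can be sketched in plain JavaScript. The option names below are invented for illustration; check the real bin/*-config.js files for the actual keys cube expects.

```javascript
// Hypothetical sketch: share one base config and derive the
// collector/evaluator/warmer variants programmatically.
const base = {
  'mongo-host': '127.0.0.1',
  'mongo-port': 27017,
  'mongo-database': 'cube'
};

// Each runner overrides only what differs from the shared base.
const collectorConfig = Object.assign({}, base, { 'http-port': 1080 });
const evaluatorConfig = Object.assign({}, base, { 'http-port': 1081 });
const warmerConfig    = Object.assign({}, base, { 'warmer-interval': 30000 });
```

Keeping the shared keys in one object means a change to, say, the mongo host only has to be made once.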
Hello. I've been using the current Cube version for some time. Great work, and thanks for open sourcing this project.

Question 1: What's the level of confidence and timeline that the InfoChimps branch will be merged? I need to add some additional features in our project (authentication). Since this contains a pluggable authentication system, it makes sense for me to go ahead and use what's here if it will be mainlined. It looks like there's been a lot of energy put into this branch, but it's long-running and hard for me to judge what's going on from the outside looking in. :)

Question 2: How can I override the configuration when using Cube as a library now? It looks like this line will always include the configuration file from Cube. Maybe I'm missing something.

Thanks.
@ticean my intention was to make a merge branch and update our readme to encourage people to try it out. Unfortunately Cube has become less and less of my day-job here at Square and since starting this merge, despite the heroic efforts of @Marsup (thank you!) I haven't carved out the time to make much progress. Also since we started this project infochimps was acquired, so I suspect they haven't been able to give it the attention they wanted either. I still have this on a TODO list, and hope to get to it one day soon, though I realize we are very likely to be losing goodwill and attention by letting this branch linger. Enough excuses... For using cube as a library, here's an example collector script that we have in our internal cube repo:
The require for
Hope that gets you started. If you have a chance to check out @Marsup's branch, please do; any feedback on that will help others work out which version to use. Until then, we now have 3 versions...
Hi @RandomEtc. Thanks for the help and the quick reply! You helped me realize that I was testing with the wrong branch. This PR is based on Marsup:infochimps-merge, but I'd mistakenly branched square:infochimps-merge for testing. So my bad there. 😁

Now that I'm using @Marsup's branch, I'm able to override the config like I need to (that wasn't possible in square:infochimps-merge).

I had some problems with the horizon feature not returning results when the request is "past_horizon". I see metalogging output, but the server doesn't return a response and hangs. I don't think I'm interested in this feature anyway, so I was able to work around it by removing the horizon configuration, which disables the feature. Some docs about horizons would be helpful, but you've already mentioned this a few times in the thread, so I know you know that. :)

Things are otherwise working well now that I'm using this branch. I'll keep testing and let you know if anything else comes up.
Ok, after more hands-on time with this code I found some issues.
Beware that full-merge is not exactly the same thing as this pull request; I've piled up other modifications for my own needs.
Hey guys, can someone explain the status of this merge for those of us wanting to start using cube with all of the infochimps work merged in? What would be the best place to start: clone this branch and go from there, or use the Marsup fork? GitHub is kind of useless right now in situations like this one, where the repo network gets complicated :(
I am declaring Cube-maintenance-failure for myself. I have updated the README here to indicate that nobody at Square is actively developing or maintaining Cube. Since I have failed to make progress on this branch I encourage people to help @Marsup with his integration branch and fork if you have any new features or bug fixes. I will be closing all issues here in a moment. |
For the record, I'm up and running with the @Marsup branch. Working great so far. |
As discussed in #123, here is a full merge from infochimps-labs/cube@master to the current master.
Considering the size of the merge, I hope to involve more people in the review so that we don't miss any critical part, even though the tests continue to pass (minus the one that was already failing).
I especially hope people from @infochimps-labs can give their feedback, since they know their codebase much better than I do.