
Comparison with other automatic ML libraries? #230

Closed
sergeyf opened this issue May 30, 2017 · 12 comments

@sergeyf

sergeyf commented May 30, 2017

First, thank you very much for the hard work and awesome project. I think it will get a lot of use in my workflow.

I was surveying the landscape of automatic ML solutions, and found your package along with tpot and auto-sklearn. I am trying to figure out what kind of strengths and weaknesses all these packages have. Would you mind discussing what auto_ml does differently and/or better?

Thanks again.

@ClimbsRocks
Owner

ClimbsRocks commented May 30, 2017

Hi @sergeyf !

Glad you found the package.

I can't claim to have used either package extensively, so my quick assessment might be wrong. Both of those packages appear to do great things! Happily, we all seem to focus on different things in our documentation.

  • tpot optimizes the entire pipeline using smart algorithms. Its focus is both the pipeline itself and the smart algorithm that optimizes everything.
  • auto-sklearn focuses on crazy amounts of ensembling. They'll train up like a billion different models and ensemble them together effectively. Their focus is ensembling.
  • auto_ml is focused on giving you a production-ready trained model (models can typically get predictions from individual records in about 1 millisecond each), and analytics (getting information from models that helps analysts and data scientists understand their datasets better).

Honestly, at the end of the day, model accuracy is going to be pretty similar across all projects. I'm sure each author has very good reason to think their project has an edge in model accuracy, but all of these projects are going to get you to numbers pretty close to each other. Which also means that if you're doing Kaggle competitions or the like, you can probably ensemble a model from each library together into a meta-estimator that's better than any one individually.
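The "ensemble a model from each library" idea can be sketched in a few lines. This is a hypothetical illustration, not any library's API: the `AveragingEnsemble` and `ConstantModel` names are made up, and each wrapped model is only assumed to expose a `.predict()` method.

```python
# Minimal sketch of a meta-estimator that averages predictions from
# several already-trained models, each assumed to expose .predict().

class AveragingEnsemble:
    def __init__(self, models):
        self.models = models

    def predict(self, rows):
        # Collect per-row predictions from every model, then average them.
        all_preds = [model.predict(rows) for model in self.models]
        return [sum(vals) / len(vals) for vals in zip(*all_preds)]


class ConstantModel:
    """Stand-in for a trained model from tpot / auto-sklearn / auto_ml."""
    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]


ensemble = AveragingEnsemble([ConstantModel(1.0), ConstantModel(3.0)])
print(ensemble.predict([{"x": 1}, {"x": 2}]))  # [2.0, 2.0]
```

In practice you would wrap real trained models from each library behind that same `.predict()` interface, and a weighted average or a stacked second-stage model usually beats a plain mean.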

So it really comes down to your use case.

  • If you're competing in a Kaggle competition, auto-sklearn is probably your best bet, because of all of its crazy ensembling (which takes a while to do, but Kaggle doesn't care about prediction speed).
  • If you're looking for the most traditional optimized ML pipeline, tpot's pipeline-optimization algorithm probably takes the cake.
  • If you're looking for a trained model to use in production at large companies, and for pretty verbose analytics from your model, auto_ml is designed for exactly that use case.

A few things that I'm particularly proud of with auto_ml (not sure whether the other libraries do these or not):

  • Support for deep learning
  • Support for a user-supplied feature engineering function as the first step of the pipeline (ensures consistency between production and research)
  • Feature learning: using deep learning to create 10 useful features, then feeding those features into a gradient boosted model to turn them into the most accurate prediction. This leverages both models for what they're best at.
  • Ability to take in a pandas DataFrame, a list of dictionaries, or a single dictionary (again, optimizing for production flexibility & speed)
  • Support for XGBoost and LightGBM
  • Categorical ensembling: train one model for every X (product type, market, customer segment, etc.), but interact with them all as if you had just a single model
  • One-line APIs to save and load models
  • Upcoming support for features like .predict_uncertainty() and feature_responses (linear-model-esque interpretations from tree-based models)
  • Longer-term, I plan on building in support for ensembles that make predictions at production-ready speeds
  • NLP functionality baked in automatically
  • Automatic feature engineering of date columns
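The categorical ensembling idea above (one model per segment, one interface) can be sketched in plain Python. This is an illustrative sketch, not auto_ml's actual implementation or API; `CategoricalEnsemble` and `SegmentModel` are hypothetical names.

```python
# Hedged sketch of "categorical ensembling": train one model per value of
# a categorical column, but route all predictions through one object so
# callers interact with it as if it were a single model.

class CategoricalEnsemble:
    def __init__(self, categorical_column, models):
        self.categorical_column = categorical_column
        self.models = models  # e.g. {"enterprise": model_a, "smb": model_b}

    def predict(self, row):
        # Dispatch each record to the model trained for its segment.
        segment = row[self.categorical_column]
        return self.models[segment].predict(row)


class SegmentModel:
    """Stand-in for a model trained on one segment's data."""
    def __init__(self, base_price):
        self.base_price = base_price

    def predict(self, row):
        return self.base_price * row["units"]


ensemble = CategoricalEnsemble(
    "segment",
    {"enterprise": SegmentModel(100), "smb": SegmentModel(10)},
)
print(ensemble.predict({"segment": "smb", "units": 3}))  # 30
```

The payoff is operational: training, saving, and serving stay identical whether you have one model or fifty per-segment models behind the wrapper.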

I'd love to hear your thoughts on the different packages!

@sergeyf
Author

sergeyf commented May 30, 2017

Thanks for the thorough description! That is very helpful. I don't have many well-formed thoughts about the different packages yet, but I'll certainly get back to you when I have something useful to contribute.

PS, your response might be useful for others who find the package - perhaps it would work well on the readme?

@sergeyf sergeyf closed this as completed May 30, 2017
@rhiever

rhiever commented Jun 1, 2017

As the author of TPOT, just a couple of notes on your answer:

  • TPOT can support deep learning models as long as they have a sklearn interface. The custom configuration interface allows you to use any combination of operators and parameters that meet the sklearn interface.

  • The same applies to XGBoost and any other sklearn-interfaced model or preprocessor.
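TPOT's custom configuration is a plain dictionary mapping operator import paths to candidate hyperparameter values. The `config_dict` format is TPOT's documented mechanism; the specific operators and values below are illustrative choices, and the `xgboost` entry assumes XGBoost is installed.

```python
# Sketch of a TPOT custom configuration dictionary: any estimator or
# preprocessor with a sklearn interface can be listed, keyed by its
# import path, with lists/ranges of parameter values to search over.

tpot_config = {
    "xgboost.XGBClassifier": {
        "n_estimators": [100, 500],
        "max_depth": range(2, 9),
        "learning_rate": [1e-2, 1e-1, 0.5],
    },
    # An empty dict means "use the operator with its default parameters".
    "sklearn.preprocessing.StandardScaler": {},
}

# Usage (commented out so the sketch stays dependency-free):
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(config_dict=tpot_config, generations=5)
# tpot.fit(X_train, y_train)
```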

@ClimbsRocks
Owner

Sweet, thanks for the response @rhiever !

Out of curiosity, have you explored tsfresh at all? They've got the scikit-learn interface. Seems like the kind of thing that TPOT in particular would be great for.

@rhiever

rhiever commented Jun 1, 2017

Haven't looked into tsfresh much, but definitely looks like something TPOT could wrap using a custom configuration. Maybe a PR is in order? :-)

@calz1
Contributor

calz1 commented Aug 24, 2017

The automatic feature engineering in Auto_ML was a big decider for me. It automatically handles categorical variables, dates, and NLP on text strings. Last I looked at TPOT, I had to do a lot of preprocessing to get everything into a numeric format.
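The date handling mentioned above amounts to expanding a timestamp column into model-ready numeric features. This is a hedged, stdlib-only sketch of the general technique; the feature names are illustrative, not auto_ml's exact output.

```python
# Sketch of automatic feature engineering for a date column: expand one
# timestamp into several numeric features a model can consume directly.
from datetime import datetime

def engineer_date_features(column_name, value):
    return {
        f"{column_name}_year": value.year,
        f"{column_name}_month": value.month,
        f"{column_name}_day_of_week": value.weekday(),  # 0 = Monday
        f"{column_name}_hour": value.hour,
        f"{column_name}_is_weekend": int(value.weekday() >= 5),
    }

features = engineer_date_features("signup", datetime(2017, 8, 26, 14, 30))
print(features["signup_is_weekend"])  # 1 (2017-08-26 was a Saturday)
```

Doing this inside the pipeline, rather than as ad-hoc preprocessing, is what keeps research and production feature generation consistent.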

@etemiz

etemiz commented Feb 13, 2018

Anyone who creates a dask-distributed interoperability will win the crowds! @mfeurer @rhiever @ClimbsRocks

@byrro

byrro commented Jul 14, 2018

@rhiever One important thing that Auto_ML offers (as well as auto-sklearn) is a permissive license. TPOT uses the LGPL, which is quite restrictive, and anyone pursuing commercial purposes should stay away from it. Auto_ML and auto-sklearn, on the other hand, use the MIT and BSD 3-clause licenses, respectively, which are very permissive for almost any kind of usage.

@rhiever

rhiever commented Jul 14, 2018

The only major limitation of the LGPL is the source disclosure clauses. LGPL can still be used in commercial domains. I suspect most TPOT users use TPOT to find a pipeline for their problem and export it, and that generated code falls outside the LGPL disclosure clause.

It's been a long-term goal of mine to rewrite TPOT under an MIT license, but TPOT depends heavily on another project that is LGPL-licensed.

@byrro

byrro commented Jul 14, 2018

Just because it can be used for commercial purposes doesn't mean you should. What are the implications? The LGPL's terms are confusing and obfuscated, and it's very hard to understand what you can really do with LGPL software without compromising intellectual property that you intend to keep proprietary/closed. Anyone doing business seriously, with an intention to build IP that is marketable in the future, should either A) spend a lot of money on lawyers to make sure you do everything right and that the LGPL library won't scare away any possible buyers/investors; or B) find another library that's licensed under simple and clear terms, such as BSD, MIT, or Apache 2.

@rhiever

rhiever commented Jul 16, 2018

Just a point of clarification: The common AutoML use case doesn't involve packaging the AutoML tool as a part of some product, which is the only case that the LGPL license will matter. Nearly all use cases I've seen in the wild involve using AutoML to find a pipeline for a problem, and exporting that pipeline to use independently from the AutoML tool. That use case is not affected by the LGPL.

The downside of the more permissive licenses (from a developer perspective) is that they make it easy to take an open source AutoML tool and build an "AutoMLaaS" company around it, effectively cashing in on the developers' hard work without giving anything back to them or the open source community. This has already happened to the auto-sklearn developers, and they were not happy about it.

@byrro

byrro commented Jul 16, 2018

I totally agree that using an open source project at the core of a SaaS business without giving anything back to the community is a shame, and I wouldn't be happy with it either.

But there are other scenarios where the GPL and its variations could become a problem. Say you're using TPOT in a SaaS, meaning no packaging. All good. A few years later, a big customer asks you to run your software on premises. Now you have a big problem: does the GPL allow you to do that without compromising your intellectual property in other areas of the project that interact with the GPL software? Or say you decide to patent other parts of your software that interact with or rely on TPOT... There are lots of subtleties with the GPL and the like that make it harder to answer these questions.

You might end up with a relieving "yes", but my point is: you can't really be sure without spending reasonable money on good lawyers to study your particular use case. When you're starting up, you want to avoid this future liability AND avoid spending money on lawyers. Thus, the best option for a small business is to stick with MIT, BSD, or Apache 2.0. That's all I'm saying. The LGPL will make it harder for SMEs to work with a project. I'm not saying they can't work with it, I'm just saying: if there's an MIT/BSD alternative, I'll definitely stick to it.


6 participants