Tools

Submitted by valya on

Here we discuss ML tools and solutions: the R and Python ecosystems, Spark, Keras, TensorFlow (TF), PyTorch, fast.ai, Weka, Vowpal Wabbit, XGBoost, MLaaS, etc.

ML for "standard" use-cases

In most cases you can rely on the R or Python ecosystem. In Python, scikit-learn is the de facto standard; in R, ML tools are available through third-party packages.

Many data scientists in Kaggle competitions use XGBoost, a distributed gradient boosting library (both R and Python APIs are available) based on a parallel tree boosting algorithm (also known as GBDT or GBM).
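
To make the tree boosting idea concrete, here is a minimal sketch in plain Python. It is not XGBoost's API or its optimized algorithm: it assumes a single numeric feature, squared loss, and depth-1 trees (decision stumps), and every function name and the toy data are made up for the illustration.

```python
# Gradient boosting for regression with squared loss and decision stumps.
# Illustrative only; all names and the toy data are assumptions.

def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest value so the right side of a split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_gbm(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # toy target: y = x^2
weak = fit_gbm(xs, ys, n_rounds=1)
strong = fit_gbm(xs, ys, n_rounds=50)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

The training error after 50 rounds is lower than after one round, because each new stump corrects what the previous ones left unexplained. Production libraries add regularization, multi-feature trees, and parallel split finding on top of this basic loop.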

Lesser-known libraries include:

  • Weka, the Waikato Environment for Knowledge Analysis, is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand (GUI environment)
  • StackNet is a computational, scalable and analytical meta-modeling framework (developed by the top-ranked Kaggle competitor KazAnova and used in many first-place competition entries). It is written in Java and uses Wolpert's stacked generalization to improve the accuracy of ML models. The network is built iteratively, one layer at a time, and each layer is trained against the final target.
  • H2O is an open source, fast, scalable machine learning platform for smarter applications (Deep Learning, Gradient Boosting, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML))
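
The stacked generalization idea mentioned for StackNet can be sketched in a few lines of plain Python. This is a minimal illustration, not StackNet's Java implementation: the two base models (fit_line, fit_nearest), the helper names, and the toy data are all assumptions made for the example, and the meta-model is restricted to a single blend weight chosen by grid search.

```python
import math


def fit_line(xs, ys):
    """Base model 1: ordinary least squares on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Base model 2: 1-nearest-neighbor regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k=5):
    """Predict every point with a model that never saw it during training."""
    preds = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_stack(xs, ys, base_fits, k=5):
    # Level 0: out-of-fold predictions become the meta-model's inputs.
    oof_a, oof_b = (out_of_fold(fit, xs, ys, k) for fit in base_fits)

    # Level 1: pick the blend weight that best predicts the final target.
    def loss(w):
        return sum((w * a + (1 - w) * b - y) ** 2
                   for a, b, y in zip(oof_a, oof_b, ys))

    weight = min((i / 20 for i in range(21)), key=loss)
    # Refit both base models on all data for use at prediction time.
    model_a, model_b = (fit(xs, ys) for fit in base_fits)
    return (lambda x: weight * model_a(x) + (1 - weight) * model_b(x)), weight


xs = [i / 5 for i in range(50)]
ys = [0.5 * x + math.sin(3 * x) for x in xs]  # toy data
stacked, weight = fit_stack(xs, ys, [fit_line, fit_nearest])
print(weight, stacked(2.0))
```

The key point is that the second level is trained only on out-of-fold predictions, so it learns how much to trust each base model on data that model has not seen.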

Neural Network frameworks

  • Torch is an open source machine learning library, a scientific computing framework, and a script language based on the Lua programming language.
  • Theano is a numerical computation library for Python that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. In Theano, computations are expressed using a NumPy-esque syntax and compiled to run efficiently on either CPU or GPU architectures.
  • Caffe is a deep learning framework (C++ and Python) made with expression, speed, and modularity in mind.
  • TensorFlow is an open-source software library (C++, Python, Go) for data-flow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.
  • PyTorch is a deep learning framework for fast, flexible experimentation. It provides tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
  • Apache MXNet framework (Python and R) is a modern deep learning framework
  • ONNX (onnx.ai) is the Open Neural Network Exchange format, which allows neural network models to be imported from and exported to different frameworks
  • Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Visualization of Neural Networks

  • TensorFlow playground: provides an intuitive web based interface to train Neural Networks for a given dataset
  • ConvNetJS is a JavaScript library for training Deep Learning models (Neural Networks) entirely in your browser
  • LSTMVis - visual analysis for Recurrent Neural Networks
  • Netron is a visualizer for Deep Learning and machine learning models
  • Ann-visualizer is a Python library for visualizing Artificial Neural Networks
  • Keras-vis is a high-level toolkit for visualizing and debugging your trained Keras neural net models
  • VisualDL is an open-source cross-framework web dashboard that richly visualizes the performance and data flowing through your neural network training

ML for BigData

  • Some datasets are too big to fit into memory, so models cannot be trained on them with “standard” in-memory tools like scikit-learn or R.
  • Gradient Boosting (GBM) is an ML technique which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Boosting is an ensemble technique in which the predictors are built sequentially rather than independently. Therefore a large dataset can be learned in “chunks” with GBM
  • Vowpal Wabbit is an online learning system designed to deal with terafeature datasets
  • Spark ML (MLlib): Apache Spark is a framework for processing large datasets, often deployed on the Hadoop platform, which now has a set of ML algorithms available as part of the platform
  • Apache MXNet is a deep learning software framework. It supports several programming languages (C++, Python, Matlab, Go, R, ...) and scales to multiple GPUs and multiple machines.
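
To illustrate how online learning sidesteps the memory limit, the sketch below trains a logistic regression one example at a time, so the full dataset never has to be loaded. It is only an assumption-laden illustration of the general approach (streaming updates plus feature hashing), not Vowpal Wabbit's implementation or API; the class name, the stream() generator, and the toy features are hypothetical.

```python
import math
import zlib


class OnlineLogisticRegression:
    """Logistic regression trained by SGD, one example at a time."""

    def __init__(self, n_bits=16, learning_rate=0.1):
        # Fixed-size weight table: memory use does not grow with the data.
        self.size = 1 << n_bits
        self.weights = [0.0] * self.size
        self.learning_rate = learning_rate

    def _indices(self, features):
        # Hashing trick: map feature names straight to weight slots.
        # crc32 is used because it is stable across Python processes.
        return [zlib.crc32(f.encode("utf-8")) % self.size for f in features]

    def predict_proba(self, features):
        score = sum(self.weights[i] for i in self._indices(features))
        return 1.0 / (1.0 + math.exp(-score))

    def learn_one(self, features, label):
        # Gradient of the log loss for binary features is (p - label).
        error = self.predict_proba(features) - label
        for i in self._indices(features):
            self.weights[i] -= self.learning_rate * error


def stream(n):
    """Stand-in for reading a very large file line by line."""
    for i in range(n):
        if i % 2 == 0:
            yield ["word=cheap", "word=offer"], 1
        else:
            yield ["word=meeting", "word=agenda"], 0


model = OnlineLogisticRegression()
for features, label in stream(2000):
    model.learn_one(features, label)

print(model.predict_proba(["word=cheap", "word=offer"]))
print(model.predict_proba(["word=meeting", "word=agenda"]))
```

Because each example is discarded after its update, the same loop works whether the stream holds two thousand rows or billions; only the weight table has to fit in memory.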

Recommendation systems

The out-of-the-box solutions are commercial ones: Amazon, Google and Microsoft Azure offer various ML platforms and products which can be used for recommendation systems. These are MLaaS offerings, so you need to pay for them. There are also MLaaS platforms like H2O, which were originally open source but have lately become commercial.

But a recommendation system is easy to build using open source tools such as the scikit-learn library, or you can use an existing complete solution. Here is a brief list:

Guides to building recommendation systems in the Python ecosystem:

  • Quick guide to build recommendation engine (link)
  • Recommender systems in python (link)
  • How to build recommendation system in python (link)
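
As a rough indication of how little code a basic recommender needs, here is a minimal item-based collaborative filtering sketch in plain Python. It is not taken from any of the guides above: the ratings data, the function names, and the scoring rule (cosine similarity between items, weighted by the user's own ratings) are assumptions chosen for the illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"matrix": 5, "inception": 5, "interstellar": 5},
    "bob": {"matrix": 5, "inception": 4},
    "carol": {"titanic": 5, "notebook": 5},
    "dave": {"titanic": 4, "notebook": 5, "matrix": 1},
}


def item_vectors(ratings):
    """Invert the data into item -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Rank unseen items by similarity to the items the user rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vector in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(cosine(vector, vectors[item]) * rating
                                for item, rating in seen.items())
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(ratings, "bob", top_n=1))
```

For "bob", who rated two science fiction films highly, the top suggestion is "interstellar", since it shares a rater with both of his films. Real systems add rating normalization, implicit feedback, and scalable similarity search, which is where the dedicated frameworks come in.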

Existing recommendation frameworks: