Data Modeling

Modeling

Here we provide brief information about various models used in analytics and Machine Learning to address different use cases. There are plenty of materials on the internet, e.g. this tutorial gives a good introduction to all the concepts discussed here.

Classification

Classification models should be used when someone needs to divide data into two or more categories, e.g. spam or not. The term binary classification refers to dividing data into two classes, while multi-class classification assigns data to more than two categories. The most popular models are listed below, followed by a short sketch:

  • Random Forest and other tree-based algorithms, which usually require a large number of trees to be effective
  • Support Vector Machines, which usually require fine-tuning (and/or a proper kernel choice) and long training times to be effective
  • Bayesian methods, which treat data probabilistically and nicely complement tree-based and SVM algorithms in an ensemble
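
As a flavor of how such a model is used in practice, here is a minimal binary-classification sketch in Python, assuming scikit-learn is available; the synthetic dataset and all parameter values are illustrative only:

```python
# Minimal binary-classification sketch (synthetic "spam or not" data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic data: 1000 samples, 20 features, 2 classes
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# a few hundred trees is a common starting point for a Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```
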
Regression

Regression models should be used when someone wants to predict future value(s) based on available (historical) data, e.g. predict the price of a house in the future. The most popular regression algorithms are listed below, with a short sketch after the list:

  • Ordinary Least Squares Regression
  • Linear Regression
  • Logistic Regression (which, despite its name, is typically used for classification)
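
A minimal regression sketch along the house-price example above, assuming scikit-learn; the toy sizes and prices are made up for illustration:

```python
# Minimal linear-regression sketch (toy house-price data).
import numpy as np
from sklearn.linear_model import LinearRegression

# toy feature: house size in square meters; target: price
X = np.array([[50], [80], [100], [120], [150]])
y = np.array([150_000, 240_000, 310_000, 360_000, 450_000])

reg = LinearRegression().fit(X, y)
print("predicted price for 110 m^2:", reg.predict([[110]])[0])
```
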
Clustering

Clustering models should be used when someone wants to find whether data points belong to one group or another; such groups are called clusters. The most common algorithm is k-Means, sketched below.
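
A minimal k-Means sketch, assuming scikit-learn; the blob data is synthetic:

```python
# Minimal k-Means sketch on synthetic 2-D data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])        # cluster assignment of the first 10 points
print(km.cluster_centers_)    # coordinates of the 3 cluster centers
```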

Ensemble methods

Ensemble methods are used to enrich predictions (binary or numerical) by using a set of models. A typical example would be to train several models and average their predictions; other examples include voting, bagging, boosting, and stacking different models. A small voting example follows the definitions below.

Bagging
building multiple models (typically of the same type) from different sub-samples of the training dataset

Boosting
building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain

Stacking
building multiple models (typically of different types) plus a supervisor model that learns how to best combine the predictions of the primary models

Weighting|Blending
combine multiple models into a single prediction using different weight functions
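
As a concrete illustration of averaging several models, here is a minimal soft-voting sketch, assuming scikit-learn; the choice of base models is arbitrary:

```python
# Minimal ensembling sketch: soft voting averages the predicted
# class probabilities of several heterogeneous models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average class probabilities instead of hard votes
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```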

Bagging vs Boosting

Bagging is usually better at fighting over-fitting (it reduces variance), while Boosting is better at reducing bias and thus achieving lower errors.

Similarities:

  • Both are ensemble methods that build N learners from a single base learner
  • Both generate several training sets by random sampling
  • Both make the final decision by averaging the N learners or taking a majority vote among them

Differences:

  • In Bagging the models are built independently, while Boosting adds new models that do well where previous models fail
  • Boosting re-weights the data in favor of the most difficult cases
  • Bagging uses an equally weighted average, while Boosting uses a weighted average that gives more weight to learners that perform better on the training set
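
To make the contrast concrete, here is a minimal sketch comparing the two styles on the same base learner, assuming scikit-learn >= 1.2 (the `estimator` parameter name changed in that release); the dataset is synthetic:

```python
# Bagging trains trees independently on bootstrap samples;
# AdaBoost trains them sequentially, re-weighting the hardest examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
base = DecisionTreeClassifier(max_depth=3)

bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```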

Stacking

Stacking (also called meta-ensembling) is a model-ensembling technique used to combine information from multiple predictive models to generate a new model. It usually outperforms the individual models used in the ensemble, e.g. GBM+RF+NN, and is most effective when the base models are independent. It may be applied at multiple levels, e.g. stacking a first set of models, then a second set, etc.
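
A minimal stacking sketch, assuming scikit-learn; the base models and the meta-model are arbitrary choices for illustration:

```python
# Minimal stacking sketch: a logistic-regression "supervisor" learns how
# to combine the out-of-fold predictions of two base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # the meta/supervisor model
    cv=5,  # base predictions are produced out-of-fold to avoid leakage
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```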

Loss functions

Loss functions play a significant role in any ML task. A loss function defines the objective the model is optimized against, either by minimizing or maximizing it during the training procedure.
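
Two common loss functions written out explicitly, using only NumPy: mean squared error for regression and binary cross-entropy (log loss) for classification:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; p_pred are predicted probabilities of class 1."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))   # 0.25
print(log_loss(np.array([1, 0]), np.array([0.9, 0.2])))  # ~0.164
```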

Regularization

Regularization methods are used to keep ML models from over-fitting. There are different types of regularizers (see the sketch after this list):

  • L1 or Lasso regularization adds a penalty which is the sum of the absolute values of the weights
  • L2 or Ridge regularization adds a penalty which is the sum of the squared values of the weights
  • Dropout is a technique introduced in the NN context where hidden nodes are dropped randomly, allowing the model to generalize better
  • Early Stopping is a training-time regularization technique which stops training based on a given criterion
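
A minimal sketch contrasting L2 (Ridge) and L1 (Lasso) on the same synthetic linear problem, assuming scikit-learn; note how Lasso drives some coefficients exactly to zero:

```python
# L1 vs L2 regularization: Lasso zeroes out uninformative coefficients,
# Ridge only shrinks them.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))  # note the exact zeros
```
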
Feature engineering
  • apply one-hot encoding, leave-one-out encoding, or word embeddings, and add the resulting features to the original data set
  • split dates into years, months, and days and treat them as categorical variables; even though they are numerical values, they really represent a small set of categories, and using them as categorical variables will enrich your model with additional information, e.g. weekends are not working days and people go shopping or on vacation
  • aggregate values, e.g. sum all numerical values in a row and/or use their mean/median; such information usually represents an overall trend in your data
  • handle missing values, e.g. apply the column mean or even apply additional training to infer them; filling in missing values may increase the sensitivity of your model, e.g. check whether your data has a hidden time series that you can use to fill in the missing values (some of these steps are sketched below)
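
A minimal feature-engineering sketch with pandas; the column names and values are hypothetical:

```python
# One-hot encoding, date splitting, and mean imputation on a toy frame.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "date": pd.to_datetime(["2023-01-15", "2023-02-20", "2023-03-25"]),
    "value": [1.0, None, 3.0],
})

df = pd.get_dummies(df, columns=["color"])            # one-hot encoding
df["year"] = df["date"].dt.year                       # split date parts
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["value"] = df["value"].fillna(df["value"].mean())  # mean-impute missing
print(df.drop(columns=["date"]))
```
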
Neural Networks

Recently Neural Networks have gained a lot of popularity due to advances in hardware and software. In particular, Deep Learning has become a hot field among researchers and businesses. NN models can be used for both classification and regression problems and are usually applied to complex problems such as image, audio, and video classification, speech recognition, etc.
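
A minimal neural-network sketch using scikit-learn's MLPClassifier on the small built-in digits dataset; for large image/audio/video problems one would typically reach for a dedicated deep-learning framework instead:

```python
# Small multi-layer perceptron on the built-in 8x8 digit images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# two hidden layers; sizes here are an arbitrary illustrative choice
nn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
nn.fit(X_train, y_train)
print("test accuracy:", nn.score(X_test, y_test))
```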

Recommendation systems

The out-of-the-box solutions are commercial ones, e.g. Amazon, Google, and Microsoft Azure offer various ML platforms/products which can be used for recommendation systems. These solutions are MLaaS, but you need to pay for them. There are also MLaaS platforms like H2O, which originally were open source but later became commercial.

But a recommendation system is straightforward to build using open-source tools, like the scikit-learn library, or even by reusing existing complete solutions; a minimal sketch is shown below, followed by a brief list of resources:
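
As an illustration, here is a minimal item-based collaborative-filtering sketch using only NumPy and scikit-learn; the ratings matrix is a made-up toy example:

```python
# Item-based collaborative filtering via cosine similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items, values = ratings (0 = not rated)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
])

item_sim = cosine_similarity(ratings.T)  # item-item similarity matrix
user = 1                                 # recommend for user 1
scores = ratings[user] @ item_sim        # weight items by similarity
scores[ratings[user] > 0] = -np.inf      # exclude already-rated items
print("recommended item:", int(np.argmax(scores)))
```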

Guides to building recommendation systems in the Python ecosystem:

  • Quick guide to build recommendation engine (link)
  • Recommender systems in python (link)
  • How to build recommendation system in python (link)

Existing recommendation frameworks: