Hyperparameter Tuning

  1. Select the most important parameters
  2. Understand how exactly they influence the training

Libraries -

  • HyperOpt
  • Scikit-optimize
  • GPyOpt
  • RoBo

HyperOpt

Bayesian optimisation takes into account past evaluations when choosing the hyperparameter set to evaluate next. By choosing its parameter combinations in an informed way, it can focus on those areas of the parameter space that it believes will bring the most promising validation scores.

Search Space

Bayesian optimisation samples each hyperparameter from a probability distribution that we specify for it. One could choose from the following (a minimal sketch follows the list) -

  • hp.choice
  • hp.uniform
  • hp.normal
  • hp.lognormal
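
For example, a minimal sketch of a search space built from these distributions (the parameter names and ranges here are illustrative assumptions, not recommendations):

from hyperopt import hp

space = {
    'max_depth': hp.choice('max_depth', [3, 4, 5, 6, 7]),        # discrete set of values
    'subsample': hp.uniform('subsample', 0.5, 1.0),              # uniform on [0.5, 1.0]
    'min_child_weight': hp.lognormal('min_child_weight', 0, 1),  # log-normal, always positive
}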

Objective Function

The objective function takes in a set of hyperparameters and outputs a single real-valued score that we want to minimize (or maximize) - a score indicating how well that set of hyperparameters performs on the validation set. For example - RMSE

The entire concept of Bayesian model-based optimization is to reduce the number of times the objective function needs to be run by choosing only the most promising set of hyperparameters to evaluate based on previous calls to the evaluation function.

Objective Function is usually the loss function. By default hyperopt tries to minimize the objective function.
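
A minimal sketch of such an objective function, assuming an XGBoost regressor and pre-defined X_train, y_train, X_valid, y_valid splits, with validation RMSE as the score to minimize:

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def objective(params):
    # Fit a model with the proposed hyperparameters (X_train etc. assumed defined elsewhere)
    model = xgb.XGBRegressor(
        max_depth=int(params['max_depth']),
        subsample=params['subsample'],
        min_child_weight=params['min_child_weight'],
        n_estimators=100,
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return np.sqrt(mean_squared_error(y_valid, preds))  # RMSE, to be minimized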

Surrogate Function

The surrogate function is an approximation of the objective function. It is used to propose parameter sets to the objective function that are likely to yield an improvement. HyperOpt uses the Tree-structured Parzen Estimator (TPE).

Selection Function

The selection function is the criterion by which the next set of hyperparameters is chosen from the surrogate function. The most common choice of criterion is Expected Improvement.
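
Putting the pieces together, a hedged sketch of a full HyperOpt run with the TPE surrogate (fmin handles the selection step internally):

from hyperopt import fmin, tpe, Trials

trials = Trials()              # records every evaluation of the objective
best = fmin(
    fn=objective,              # objective function defined above
    space=space,               # search space defined above
    algo=tpe.suggest,          # Tree-structured Parzen Estimator as the surrogate
    max_evals=50,              # number of objective evaluations
    trials=trials,
)
print(best)                    # note: hp.choice parameters are reported as indices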

XGBoost

Increasing Model’s Capacity

  • max_depth - Increasing this value will make the model more complex and more likely to overfit. Usually start with a value in [3, 7]. Increase it only if there’s an improvement in score.
  • subsample - Controls the fraction of objects (rows) used when fitting each tree. Ranges between [0, 1]. Values below 1 have a regularising effect and make the model less prone to overfitting.
  • colsample_bytree - Controls the fraction of features used when building each tree. Ranges between [0, 1].
  • eta - Learning rate
  • num_round - Number of trees to build.

Decreasing Model’s Capacity

  • min_child_weight - The larger min_child_weight is, the more conservative the algorithm will be. Ranges $[0, \infty)$.
  • gamma - Minimum loss reduction required to make a further split. Larger values result in a more conservative model. Ranges $[0, \infty)$.
  • lambda - L2 regularization.
  • alpha - L1 regularization

Usually, after finding appropriate values for num_round and eta, multiply num_round by some factor and divide eta by the same factor; this usually gives a slightly better result.
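
As an illustration of that trade-off (all values are hypothetical starting points, not recommendations):

import xgboost as xgb

params = {'max_depth': 5, 'subsample': 0.9, 'colsample_bytree': 0.8,
          'eta': 0.1, 'min_child_weight': 1, 'lambda': 1, 'alpha': 0}
num_round = 200

# more trees with a proportionally smaller learning rate
factor = 2
params['eta'] /= factor
num_round *= factor

dtrain = xgb.DMatrix(X_train, label=y_train)  # X_train, y_train assumed defined
model = xgb.train(params, dtrain, num_boost_round=num_round)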

RandomForest/ExtraTrees

Unlike Gradient Boosting Trees & their variants, RF builds its trees independently (in parallel), hence increasing the number of trees won’t overfit the model.

  • n_estimators - Start from a small value, say 10, & check the time required to fit these 10 trees. If it is not too high, then use a larger value (see the sketch after this list).
  • max_depth - Usually the tree depth for RF is greater than for GBDTs.
  • max_features
  • min_samples_leaf
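
A short sketch of the n_estimators timing check mentioned above (X_train, y_train assumed defined):

import time
from sklearn.ensemble import RandomForestRegressor

# Time a small forest first; if it fits quickly enough, grow n_estimators
start = time.time()
rf = RandomForestRegressor(n_estimators=10, max_depth=None,
                           min_samples_leaf=1, max_features='sqrt', n_jobs=-1)
rf.fit(X_train, y_train)
print(f'10 trees took {time.time() - start:.1f}s')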

Neural Networks

Increasing Model’s Capacity

  • Number of neurons per layer
  • Number of layers
  • Adaptive methods (Adam, Adagrad) - sometimes lead to overfitting
  • Batch Size - A larger batch size can lead to overfitting

Decreasing Model’s Capacity

  • SGD+momentum - Even though it leads to slower learning, the model generalises better.
  • Regularization - L1/L2, Dropout/DropConnect

Some people follow the rule - If you increase your batch size by a factor of alpha, increase your learning rate by the same factor as well.
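
A tiny sketch of that rule combined with the SGD+momentum choice above, in PyTorch (the model and numbers are placeholders):

import torch

model = torch.nn.Linear(100, 1)        # placeholder model
base_lr, base_batch_size = 0.01, 64
alpha = 4                              # scale batch size and learning rate together
batch_size = base_batch_size * alpha   # 64 -> 256
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * alpha, momentum=0.9)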

Cross Validation Strategy

The goal of a validation strategy is to create a validation setup that resembles what you are being tested on. The validation data should be consistent with the test data.

Time Based

Always have past data predicting future data. The time intervals also need to match the test data: if the test data is 3 months in the future, the validation data should also be 3 months in the future relative to the training data.
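
One way to get such past-predicts-future splits is scikit-learn's TimeSeriesSplit (a sketch; matching the exact time gap to the test set still has to be done by hand):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):    # X assumed sorted by time
    X_tr, X_val = X[train_idx], X[val_idx]  # validation always comes after training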

Stratified

Different entities than in train? Suppose the test data contains entities that are not present in the training data; then the validation set should be created so that it also contains entities not present in the training data.
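
One way to build such splits is to group by the entity identifier, so that an entity never appears in both the training and validation folds (a sketch with scikit-learn's GroupKFold; entity_ids is an assumed array of identifiers):

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=entity_ids):
    # entities in the validation fold are absent from the training fold
    X_tr, X_val = X[train_idx], X[val_idx]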

Ensembling

  1. Smaller data requires simpler ensembling techniques like averaging.
  2. Bigger data works well with Stacking like techniques

Statistic and Distance based feature engineering

  • Calculating various statistics of one feature grouped by another (see the sketch after this list)
  • Features derived from neighbourhood analysis of a given point
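
For the first point, a sketch with pandas (the dataframe and the user_id/price columns are hypothetical):

import pandas as pd

# toy data; in practice df is your training dataframe
df = pd.DataFrame({'user_id': [1, 1, 2, 2, 2], 'price': [10.0, 20.0, 5.0, 7.0, 9.0]})
stats = df.groupby('user_id')['price'].agg(['mean', 'max']).reset_index()
stats.columns = ['user_id', 'user_price_mean', 'user_price_max']
df = df.merge(stats, on='user_id', how='left')   # per-user statistics as new columns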

Matrix Factorizations

Coursera Matrix Factorization

General Notes

  • MF is a very general approach for dimensionality reduction and feature selection
  • It can be applied for transforming categorical features into real-valued features

Implementation Notes

  • Matrix factorisation can be applied to only a subset of columns if desired
  • We can use matrix factorisation to get another representation of the same data, especially useful for ensemble models
  • Matrix factorisation is a lossy transformation - some information is lost
  • The number of latent factors to use must be treated as a hyperparameter & must be tuned.

NMF (Non-negative Matrix Factorisation)

It transforms data in a way that makes it more suitable for decision trees. NMF cannot be applied on negative values. “Standardized” means that every feature column has zero mean and unit variance. This implies that we have negative values and cannot apply NMF.
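
A short sketch of that constraint: NMF works on non-negative data (e.g. raw counts) but raises an error on standardized data containing negative values:

import numpy as np
from sklearn.decomposition import NMF

X_counts = np.random.poisson(3, size=(100, 20)).astype(float)  # non-negative toy data
X_nmf = NMF(n_components=5).fit_transform(X_counts)            # fine: all values >= 0
# NMF(n_components=5).fit_transform(X_counts - X_counts.mean(axis=0))  # would fail: negative entries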

import numpy as np
from sklearn.decomposition import PCA  # PCA used here as the factorisation method

X_all = np.concatenate([X_train, X_test])
pca = PCA(n_components=5)  # number of latent factors is a hyperparameter to tune
pca.fit(X_all)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

Manifold learning methods are, in general, non-linear methods of dimensionality reduction. Apply t-SNE to the concatenation of train and test and split the projection back.
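
A sketch of that approach with scikit-learn's t-SNE (fit on the concatenation, then split the projection back by the length of the training set):

import numpy as np
from sklearn.manifold import TSNE

X_all = np.concatenate([X_train, X_test])              # X_train, X_test assumed defined
X_all_2d = TSNE(n_components=2).fit_transform(X_all)   # t-SNE has no transform() for new data
X_train_2d = X_all_2d[:len(X_train)]
X_test_2d = X_all_2d[len(X_train):]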