How to win a data science competition (Week 4)
Notes from Week 4 of Coursera's 'How to win a data science competition'. This week the instructors cover different hyperparameter tuning libraries, HyperOpt in particular. They also discuss the commonly tuned hyperparameters of XGBoost & RF.
Hyperparameter Tuning
- Select the most important parameters
- Understand how exactly they influence the training
Libraries -
- HyperOpt
- Scikit-optimize
- GPyOpt
- RoBo
HyperOpt
Bayesian optimisation takes past evaluations into account when choosing the next hyperparameter set to evaluate. By choosing parameter combinations in an informed way, it focuses on the regions of the parameter space that it expects to yield the best validation scores.
Search Space
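Bayesian optimisation samples each hyperparameter from a probability distribution that we specify up front. In HyperOpt one could choose from the following, among others -
- hp.choice - one of a fixed list of options
- hp.uniform - uniform between a lower and an upper bound
- hp.normal - normally distributed with a given mean and standard deviation
- hp.lognormal - a value whose logarithm is normally distributed, so it is always positive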
Objective Function
The objective function takes a set of hyperparameters and returns a single real-valued score that we want to minimize (or maximize). The score indicates how well that set of hyperparameters performs on the validation set, for example the RMSE.
The entire concept of Bayesian model-based optimization is to reduce the number of times the objective function needs to be run, by evaluating only the most promising sets of hyperparameters based on previous calls to the objective function.
The objective function is usually the loss function. By default, hyperopt tries to minimize the objective function.
Surrogate Function
The surrogate function is an approximation of the objective function, built from past evaluations. It is used to propose parameter sets that are likely to yield an improvement when passed to the objective function. HyperOpt uses the Tree-structured Parzen Estimator (TPE).
Selection Function
The selection function is the criterion by which the next set of hyperparameters is chosen from the surrogate function. The most common criterion is Expected Improvement.
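To make Expected Improvement concrete, here is a small sketch of its closed form when the surrogate predicts a Gaussian mean and standard deviation for the loss at each candidate. This is the textbook Gaussian form, used only as an illustration; TPE arrives at its EI-based choice differently, by comparing densities of good and bad past trials. The candidate names and numbers are made up.

```python
import math


def expected_improvement(mu, sigma, best_loss):
    """EI for minimisation when the predicted loss is N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_loss - mu) * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean, std) for three candidate settings.
candidates = {"A": (0.50, 0.01), "B": (0.52, 0.10), "C": (0.60, 0.02)}
best_loss = 0.51  # best validation loss observed so far

scores = {name: expected_improvement(m, s, best_loss) for name, (m, s) in candidates.items()}
next_candidate = max(scores, key=scores.get)
print(next_candidate)
```

Candidate B wins here even though its predicted mean is slightly worse than the best loss so far: its large uncertainty leaves more room for a big improvement, which is how the criterion balances exploration against exploitation.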
XGBoost
Increasing Model’s Capacity
- max_depth - Increasing this value makes the model more complex and more likely to overfit. Usually start somewhere in [3, 7] and increase only if the score improves.
- subsample - Controls the fraction of objects to use when fitting a tree. Ranges in (0, 1]. Lower values have a regularising effect; higher values make the model more prone to overfitting.
- colsample_bytree - Controls the fraction of features sampled for each tree. Ranges in (0, 1].
- eta - Learning rate.
- num_round - Number of trees to build.
Decreasing Model’s Capacity
- min_child_weight - The larger min_child_weight is, the more conservative the algorithm will be. Ranges in $[0, \infty)$.
- gamma - Minimum loss reduction required to make a further split. Larger values result in a more conservative model. Ranges in $[0, \infty)$.
- lambda - L2 regularization.
- alpha - L1 regularization.
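Usually, after finding appropriate values for num_round and eta, divide eta by a constant factor alpha and multiply num_round by the same factor; this will usually give a slightly better result. (This alpha is a scaling factor, not the L1 regularization parameter listed above.)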
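The eta / num_round trick is plain arithmetic, sketched below. The helper name `rescale_schedule` and the example values are hypothetical; the factor corresponds to the alpha mentioned in the text.

```python
def rescale_schedule(eta, num_round, factor):
    """Divide the learning rate by `factor` and multiply the number of trees by it."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return eta / factor, int(round(num_round * factor))


# Suppose tuning settled on eta=0.1 with 300 rounds (illustrative values).
eta, num_round = rescale_schedule(eta=0.1, num_round=300, factor=2)
print(eta, num_round)  # 0.05 600
```

The price of the smaller learning rate is training time, which grows with the number of rounds.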
RandomForest/ExtraTrees
Unlike in gradient boosted trees & their variants, RF builds its trees independently of one another, so increasing the number of trees won’t overfit the model.
- n_estimators - Start from a small value, say 10, & check the time required to fit these 10 trees. If it is not too high, use a larger value.
- max_depth - Tree depth for RF is usually greater than for GBDTs.
- max_features
- min_samples_leaf
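The n_estimators advice can be sketched as follows: time a 10-tree fit, then extrapolate to a time budget. The synthetic dataset and the 60-second budget are assumptions for illustration, and the linear extrapolation is only a rough estimate.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the competition data.
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

start = time.perf_counter()
RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
seconds_for_10 = time.perf_counter() - start

# Trees are built independently, so fit time grows roughly linearly with n_estimators.
time_budget = 60.0  # hypothetical budget in seconds
affordable_trees = int(10 * time_budget / seconds_for_10)
print(f"10 trees took {seconds_for_10:.3f}s; about {affordable_trees} trees fit in the budget")
```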
Neural Networks
Increasing Model’s Capacity
- Number of neurons per layer
- Number of layers
- Adaptive methods (Adam, Adagrad) - These sometimes lead to overfitting
- Batch size - A larger batch size can lead to overfitting
Decreasing Model’s Capacity
- SGD+momentum - Leads to slower learning, but the model generalises better.
- Regularization - L2/L1, Dropout/Dropconnect
Some people follow the rule: if you increase your batch size by a factor of alpha, increase your learning rate by the same factor as well.
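As a quick worked example of that rule of thumb (the helper name and the numbers are hypothetical):

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by the same factor as the batch size."""
    alpha = new_batch_size / base_batch_size
    return base_lr * alpha


# Going from batch size 32 to 128 is a factor of 4, so the learning rate is multiplied by 4.
new_lr = scale_learning_rate(base_lr=0.01, base_batch_size=32, new_batch_size=128)
print(new_lr)
```

Treat the result as a starting point for tuning rather than a guaranteed optimum.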
Cross Validation Strategy
Creating a validation strategy means designing a validation approach that resembles what you are being tested on. The validation data should be consistent with the test data.
Time Based
Always have past data predicting future data. The intervals should also match those of the test data: if the test data lies 3 months in the future, the validation data should also lie 3 months in the future relative to the training data.
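A minimal sketch of such a split, assuming each row carries a date. The function name `time_based_split`, the dates and the 92-day horizon (roughly 3 months) are illustrative choices.

```python
import numpy as np


def time_based_split(dates, valid_start, horizon_days):
    """Train on rows before valid_start; validate on the horizon that follows it."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    valid_start = np.datetime64(valid_start, "D")
    valid_end = valid_start + np.timedelta64(horizon_days, "D")
    train_idx = np.where(dates < valid_start)[0]
    valid_idx = np.where((dates >= valid_start) & (dates < valid_end))[0]
    return train_idx, valid_idx


# One row per day of 2020; validate on the last ~3 months.
dates = np.arange("2020-01-01", "2021-01-01", dtype="datetime64[D]")
train_idx, valid_idx = time_based_split(dates, "2020-10-01", horizon_days=92)
print(len(train_idx), len(valid_idx))  # 274 92
```

Every validation row is strictly later than every training row, so the model is never scored on data older than what it was trained on.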
Stratified
Different entities than train? If the test data contains entities that are not present in the training data, the validation set should be built so that it also contains entities that are absent from the training set.
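One way to get validation folds with unseen entities is scikit-learn's GroupKFold, which never puts the same group in both train and validation. This is a sketch on synthetic data; the `groups` array stands in for whatever entity id the competition uses (user, shop, patient).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
groups = np.repeat(np.arange(10), 10)  # hypothetical entity ids, 10 rows per entity

no_shared_entities = []
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    no_shared_entities.append(len(shared) == 0)

print(no_shared_entities)
```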
Ensembling
- Smaller datasets call for simpler ensembling techniques, such as averaging.
- Bigger datasets work well with stacking-like techniques.
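The two approaches can be sketched side by side. The base models, the linear meta-model and the synthetic data are assumptions for illustration; the course does not prescribe them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=50, random_state=0)]

# Out-of-fold predictions on train become the meta-model's input features.
oof = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
# Refit each base model on all of train to predict the test set.
test_preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in base_models])

averaged = test_preds.mean(axis=1)  # simple averaging: no extra parameters to fit
meta_model = LinearRegression().fit(oof, y_train)  # stacking: learn how to combine
stacked = meta_model.predict(test_preds)
```

Stacking fits additional parameters on top of the base predictions, which is one reason it tends to need more data than plain averaging.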
Statistic and Distance based feature engineering
- Calculating various statistics of one feature grouped by another
- Features derived from neighbourhood analysis of a given point
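A small sketch of the first idea, statistics of one feature grouped by another, using pandas. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "price": [10.0, 30.0, 5.0, 15.0, 10.0],
})

grouped = df.groupby("user_id")["price"]
# transform() broadcasts each group's statistic back onto the group's rows.
df["price_mean_by_user"] = grouped.transform("mean")
df["price_max_by_user"] = grouped.transform("max")
df["price_diff_from_mean"] = df["price"] - df["price_mean_by_user"]
print(df)
```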
Matrix Factorizations
General Notes
- MF is a very general approach for dimensionality reduction and feature extraction
- It can be applied for transforming categorical features into real-valued features
Implementation Notes
- Matrix factorisation can be applied to only a subset of columns if desired
- We can use matrix factorisation to get another representation of the same data, especially useful for ensemble models
- It is a lossy transformation, so some information is lost
- The number of latent factors to use must be treated as a hyperparameter & must be tuned
NMF (Non-negative Matrix Factorisation)
It transforms data in a way that makes it more suitable for decision trees. NMF cannot be applied to negative values. For example, “standardized” data, where every feature column has zero mean and unit variance, necessarily contains negative values, so NMF cannot be applied to it.
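A minimal sketch of applying NMF to data that contains negative values, by rescaling to a non-negative range first. Using MinMaxScaler here is one possible choice, not something the course prescribes, and the data and `n_components=3` are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # synthetic data that contains negative values

# Rescale each column to [0, 1] so that NMF's non-negativity requirement holds.
X_nonneg = MinMaxScaler().fit_transform(X)

nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
X_nmf = nmf.fit_transform(X_nonneg)  # non-negative latent features, one row per sample
print(X_nmf.shape)
```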
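import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=5)  # number of components is an assumed value; tune it
X_all = np.concatenate([X_train, X_test])  # fit on train and test together
pca.fit(X_all)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)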
Manifold learning methods are, in general, non-linear methods of dimensionality reduction. Apply t-SNE to the concatenation of train and test, then split the projection back.
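The concatenate-then-split step could look like the sketch below. scikit-learn's TSNE has no separate transform method for new data, which is why train and test are projected together. The synthetic arrays and the perplexity value are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 10))  # synthetic stand-ins for the real features
X_test = rng.normal(size=(40, 10))

# Project train and test together, since t-SNE cannot embed unseen points afterwards.
X_all = np.concatenate([X_train, X_test])
X_all_tsne = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_all)

# Split the projection back in the original order.
X_train_tsne = X_all_tsne[: len(X_train)]
X_test_tsne = X_all_tsne[len(X_train):]
```

The result depends strongly on perplexity, so it is worth trying several values and keeping more than one projection as features.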