How to win a data science competition (Week 1)
My personal notes from Coursera's 'How to win a data science competition'. I will continue to improve and update them. This part gives detailed information about feature preprocessing, generation and extraction.
- Comparison
- 1. Review
- 2. Preprocessing
- 2.1 Numeric Features
- 2.2 Categorical Features
- 3. Datetime
- 4. Coordinates
- 5. Missing Values
- 6. Feature Extraction in Text and Images
- 7. Geographic Data
Comparison
Real World ML Pipeline
- Understanding of business problem
- Problem Formalization
- Data Collection
- Data Preprocessing
- Modelling
- Way to evaluate model in real life
- Way to deploy model
Data Science Competitions
- Problem formalization (might be necessary sometimes)
- Data Collection (possible in few competitions)
- Data Preprocessing
- Modelling
Differences
- Usually the hardest parts, problem formalisation and choosing evaluation metrics, are already done.
- Deployment is out of scope.
- Model complexity, speed & memory consumption usually don't matter much
1. Review
1.1 Linear Model
Especially good for sparse, high-dimensional data. A Support Vector Machine (SVM) is a linear model with a special loss function. Even with the “kernel trick”, it is still linear in the new, extended space.
Libraries: Scikit-learn and Vowpal Wabbit
1.2 Tree Based Model
Decision trees use a divide-and-conquer technique to recursively divide the feature space into sub-spaces. Tree-based models can work very well with tabular data.
However, tree-based approaches cannot easily capture linear dependencies, since approximating a linear boundary requires many splits.
The ExtraTrees classifier always tests random splits over a fraction of the features (in contrast to RandomForest, which tests all possible splits over a fraction of the features)
Libraries: Scikit-learn, XGBoost, LightGBM
1.3 No Free Lunch Theorem
“There is no method which outperforms all others for all tasks“
“For every method we can construct a task for which this particular method will not be the best“
- SVMs have linear boundaries
- Decision trees have horizontal and vertical splits
- Random Forests combine many more of these horizontal and vertical splits, so their boundaries are relatively smoother
- Naive Bayes has a smoother boundary compared to Neural Networks
2. Preprocessing
2.1 Numeric Features
Preprocessing depends on the type of model we use.
- Tree based models (Decision Trees)
- Non-tree based models (KNNs, Neural Nets, Linear Models)
2.1.1 Scaling
Non-tree based models are affected by the scale of the features: different feature scalings result in models of different quality. Decision Trees, on the other hand, try to find the best split for a feature regardless of its scale.
Explanation
- Nearest Neighbours
- The scale of features impacts the distance between samples.
- With different scaling of the features nearest neighbours for a selected object can be very different.
- Linear Models / Neural Networks
- Amount of regularization applied to a feature depends on the feature’s scale.
- Optimization methods can perform differently depending on relative scale of features.
- Others
- KMeans, SVMs, LDA, PCA, etc
MinMax & Standard Scaler
- MinMax Scaler - Preserves the shape of the feature's distribution; only the value range changes, to [0, 1].
sklearn.preprocessing.MinMaxScaler
- Standard Scaler - Transforms the feature to zero mean and unit variance. The shape of the distribution is preserved; it does not become normal.
sklearn.preprocessing.StandardScaler
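A minimal sketch applying both scalers (the toy matrix X is made up; in practice, fit on the train set and apply the same transform to the test set):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy feature matrix
X_minmax = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column to zero mean, unit variance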
2.1.2 Winsorization
Treat outliers by clipping feature values to a lower and an upper bound, say the 1st and 99th percentiles of the feature values.
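A minimal sketch of clipping at the 1st and 99th percentiles (the data and column name are invented):
import numpy as np
import pandas as pd
df = pd.DataFrame({'price': [1, 2, 3, 1000, 5, 4, 2, -500, 3, 2]})  # toy data with outliers
lower, upper = np.percentile(df['price'], [1, 99])    # bounds estimated on training data
df['price_clipped'] = df['price'].clip(lower, upper)  # winsorized feature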
2.1.3 Rank Transformation
Replaces each value with its index (rank) in the sorted array. Somewhat similar to binning, since the spacing between values is discarded.
Example:
rank([-100, 1, 1e-5]) = [0, 2, 1]
rank([1000, 1, 10]) = [2, 0, 1]
Linear Models, KNN and Neural Networks usually benefit from this transformation.
Use scipy.stats.rankdata to create ranks. For test data, either store the mapping from train values to ranks, or concatenate train & test data before applying the rank transformation.
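A minimal sketch using scipy.stats.rankdata on concatenated train and test values (column names are invented; note that rankdata returns 1-based ranks and averages ties):
import numpy as np
import pandas as pd
from scipy.stats import rankdata
train = pd.DataFrame({'age': [25, 40, 31]})
test = pd.DataFrame({'age': [29, 52]})
all_values = np.concatenate([train['age'], test['age']])
ranks = rankdata(all_values)
train['age_rank'] = ranks[:len(train)]
test['age_rank'] = ranks[len(train):]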
2.1.4 Log Transformation
Helps all non-tree based models, especially neural nets.
# log transformation; np.log1p(x) is equivalent and numerically more stable near zero
np.log(1 + x)
- Effect on skewness
- The transformation is non-linear: it moves outliers relatively closer to the other samples, and values near zero become more distinguishable.
2.2 Categorical Features
2.2.1 Label Encoding
Works well with tree-based methods. Linear methods struggle to extract meaning from label-encoded features: categories encoded with numbers that are close to each other are (usually) not more related than categories encoded with numbers that are far apart. Label encoding is suitable when:
- the categorical feature is ordinal in nature
- the number of categorical features (or categories) in the dataset is huge
Methods -
- Alphabetically sorted: [S, C, Q] -> [2, 0, 1]
sklearn.preprocessing.LabelEncoder
- Order of appearance: [S, C, Q] -> [0, 1, 2]
pandas.factorize
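A minimal sketch showing both encoders on a toy column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
s = pd.Series(['S', 'C', 'Q', 'S'])
alphabetical = LabelEncoder().fit_transform(s)  # C->0, Q->1, S->2  =>  [2, 0, 1, 2]
by_appearance, uniques = pd.factorize(s)        # S->0, C->1, Q->2  =>  [0, 1, 2, 0]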
2.2.2 Frequency Encoding
Encode categories by the frequency of their occurrence. Preserves information about the value distribution.
Example: For a given list [S, C, Q, S, S, C, Q, S, C, S]
the feature column can be encoded by replacing [S, C, Q] -> [0.5, 0.3, 0.2]
Can be useful for both tree based & non-tree based models.
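A minimal sketch of frequency encoding with pandas (toy data matching the example above):
import pandas as pd
df = pd.DataFrame({'embarked': list('SCQSSCQSCS')})
freq = df['embarked'].value_counts(normalize=True)  # S: 0.5, C: 0.3, Q: 0.2
df['embarked_freq'] = df['embarked'].map(freq)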
2.2.3 One hot Encoding
Works best for non-tree based models. One-hot encoded features are already scaled, since the minimum value is 0 and the maximum value is 1. Hence they suit non-tree based models, which suffer from scaling issues.
For tree-based models it is less convenient: at each split a tree can only separate one category from the others, so the greater the number of categories, the greater the number of splits needed.
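A minimal sketch with pandas (sklearn.preprocessing.OneHotEncoder is an alternative):
import pandas as pd
df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
one_hot = pd.get_dummies(df['embarked'], prefix='embarked')  # columns embarked_C, embarked_Q, embarked_S
df = pd.concat([df, one_hot], axis=1)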
2.2.4 Mean Target Encoding
Encoding categorical variables with the mean target value (or other target statistics) is a popular way of handling high-cardinality features, especially useful for tree-based models. Mean encodings let models converge faster, but they are easy to overfit.
# mean of the target y per category of X, computed on the training set
means = df.groupby('X')['y'].mean()
df['X'] = df['X'].map(means)
Mean encoding is reliable only when each category has a reasonable number of samples for each target value. For example, if y takes only the values 0 and 1, and some category a of feature x_0 has no samples with target 1, its encoding will be 0 no matter how few samples it contains.
For a continuous target, each category is replaced by the average target value.
Example - If profession=teacher is the categorical value and salary is the target, then teacher is replaced by the average salary of teachers in the training set.
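Because mean encoding overfits easily, it is usually regularized, for example by computing the encoding out-of-fold. A minimal cross-validation sketch (toy data; column names and the number of folds are arbitrary):
import pandas as pd
from sklearn.model_selection import KFold
df = pd.DataFrame({'X': list('aabbbccaab'), 'y': [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]})
df['X_mean_enc'] = df['y'].mean()  # global mean as a fallback
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[train_idx].groupby('X')['y'].mean()  # means from the other folds only
    val = df.iloc[val_idx]
    df.loc[val.index, 'X_mean_enc'] = val['X'].map(fold_means).fillna(df['y'].mean())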
2.2.5 Binary Encoding
Binary encoding for categorical variables is similar to one-hot encoding, but stores categories as binary bit strings.
First the categories are encoded as ordinal integers, those integers are converted into binary code, and the digits of the binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot, but with some distortion of the distances between categories.
Different categories may share some of the same features.
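A minimal sketch using the third-party category_encoders package (my choice of library, not prescribed by the course):
import pandas as pd
import category_encoders as ce  # pip install category_encoders
df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
binary = ce.BinaryEncoder(cols=['embarked']).fit_transform(df)  # 3 categories fit into 2 binary columns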
3. Datetime
3.1 Periodicity
Useful to capture repetitive patterns in data. We can add features like
- Day number in week
- Month
- Season
- Year
- Second
- Minute
- Hour
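A minimal sketch using the pandas dt accessor (the column name and dates are invented):
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01 10:30:00', '2020-07-15 22:05:10'])})
df['weekday'] = df['date'].dt.dayofweek  # day number in week, Monday = 0
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['second'] = df['date'].dt.second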
3.2 Time passed since a particular event
Case 1 - Time since a fixed moment
In this case, all samples become comparable on one common time scale, for example the number of days passed since January 1, 2000.
Case 2 - Time since a sample-dependent event
In this case, the reference date depends on the sample itself. For example - number of days since the last holiday, number of days since the last weekend, number of days to the next holiday, etc.
3.3 Difference between Dates
Number of days between two events. Example - end_date - start_date gives the length of the period over which a loan was repaid.
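A minimal sketch of both kinds of features (the dates and column names are invented):
import pandas as pd
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2015-03-01', '2018-06-10']),
    'end_date': pd.to_datetime(['2020-03-01', '2021-06-10']),
})
reference = pd.Timestamp('2000-01-01')
df['days_since_2000'] = (df['start_date'] - reference).dt.days           # time since a fixed moment
df['loan_duration_days'] = (df['end_date'] - df['start_date']).dt.days   # difference between dates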
4. Coordinates
4.1 Distance based Features
Find the most interesting points on the map, say the best school or hospital in town, a museum, etc., and calculate the distance from each sample to these points. Add the distances as features in the training and test datasets.
4.2 Aggregated Statistics
Calculate aggregated statistics for the area surrounding an object, such as the total number of flats in the vicinity, which can be interpreted as a measure of the area's popularity.
Or you could calculate the average flat price grouped by, say, pin code or area, which would indicate how expensive the neighbourhood is.
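A minimal sketch of both ideas (the coordinates, the reference point and the column names are invented; plain Euclidean distance is only a rough proxy for geographic distance):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'lat': [55.75, 55.70, 55.80],
    'lon': [37.62, 37.50, 37.70],
    'pin_code': ['101', '101', '102'],
    'price': [300, 250, 500],
})
centre = (55.75, 37.62)  # an "interesting" point, e.g. the city centre
df['dist_to_centre'] = np.sqrt((df['lat'] - centre[0]) ** 2 + (df['lon'] - centre[1]) ** 2)
df['avg_price_in_pin'] = df.groupby('pin_code')['price'].transform('mean')  # aggregated statistic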
5. Missing Values
- Missing values are not necessarily represented as NaN. Some examples are -1, 999, an empty string, NaN, etc.
- Sometimes missing values can contain information by themselves.
5.1 Imputation
- Use a value outside the range of the normal values for the variable, like -1, -9999, etc.
- Replace with a likelihood – e.g. something that relates to the target variable.
- Replace with something which makes sense. For example - sometimes null may mean zero
- You may consider removing rows with many null values
- Try to predict missing values based on subsets of known values. For example - in time series data, where rows are not independent of each other, a missing value can be replaced by averaging values within a window
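A minimal sketch of some of the options above (the column name and fill values are arbitrary):
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [22.0, np.nan, 35.0, np.nan, 41.0]})
df['age_out_of_range'] = df['age'].fillna(-999)              # value outside the normal range
df['age_mean_filled'] = df['age'].fillna(df['age'].mean())   # mean imputation
df['age_interpolated'] = df['age'].interpolate()             # e.g. for time-series-like data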
5.2 Adding isNull column
For each numerical feature we can add an isNull column indicating whether the value in that row is missing (NaN, NaT, None). This is especially useful for tree-based methods and neural nets.
The downside is that we double the number of columns.
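A minimal sketch (the column name is invented):
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [22.0, np.nan, 35.0]})
df['age_isnull'] = df['age'].isnull().astype(int)  # 1 if the value is missing, else 0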
5.3 General
- Some models like XGBoost and CatBoost can deal with missing values out of the box. These models have special methods to treat them, and model quality can benefit from it
- Do not fill NaNs before feature generation
- Impute with a feature mean/median
- Remove rows with missing values
- Reconstruct the missing values
5.4 Issues when generating new features from a feature that has null values
We should be very careful about replacing missing values before feature generation: imputed values (for example -999 or the feature mean) can distort generated features such as differences, counts or group statistics. The safer order is -
Feature Generation -> Missing Value Imputation
6. Feature Extraction in Text and Images
6.1 Bag of Words
- N-grams can help utilise local context around each word because n-grams encode sequences of words.
- It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
- N-gram features are typically sparse. N-grams deal with counts of word occurrences, and not every word can be found in every document.
- Bag of words usually produces longer vectors than Word2Vec. The number of features in the bag-of-words approach equals the number of unique words, while the number of features in w2v is restricted to a constant, like 300 or so.
- A value in a BOW matrix is the number of occurrences of a word in a document and is directly interpretable; the individual values in Word2Vec vectors cannot be interpreted.
- Word2Vec and Bag of words give different results. We can combine them to get more variety of results.
6.1.1 CountVectorizer
Counts the occurrences of each word in a given document.
Needs scaling before use in linear models, as the most frequent words get the highest counts and therefore assume more importance. This is overcome by the tf-idf vectorizer.
sklearn.feature_extraction.text.CountVectorizer
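A minimal sketch (the documents are invented):
from sklearn.feature_extraction.text import CountVectorizer
docs = ['the cat sat on the mat', 'the dog sat']
vectorizer = CountVectorizer(ngram_range=(1, 1))  # use (1, 2) to also include bigrams
counts = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.vocabulary_)                     # word -> column index mapping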
6.1.2 Tf-idf vectorizer
TF-IDF is applied to a matrix where each column represents a word, each row represents a document, and each value shows the number of times a particular word occurred in a particular document.
Making sequences/documents of different lengths more comparable
Term Frequency (TF) measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a form of normalization. After TF normalization, the values in each row (document) sum to 1.
Boost more important features
Normalize each feature by the inverse fraction of documents in which the word occurs. In this case, features corresponding to frequent words are scaled down compared to features corresponding to rarer words.
IDF scales features inversely proportionally to the number of documents in which the word occurs
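A minimal sketch (the documents are invented):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ['the cat sat on the mat', 'the dog sat']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # sparse matrix; frequent words like 'the' get lower weights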
7. Geographic Data
7.1 ZIP Codes
- Leaving zip code as a numeric variable is not a good idea, since some models might treat the numeric ordering or distances between codes as something to learn.
- Look up demographic variables based on zipcode. For example - With City Data you can look up income distribution, age ranges, etc.
- Get latitude and longitude based on zip codes, which can be useful, especially for tree based models