1. EDA

  • Better understand the data
  • Build an intuition about the data
  • Generate hypotheses
  • Gain insights

1.1 Building intuition about data

  1. Get domain knowledge
  • Understand the data & how people usually tackle the problem
  • Understand the different features, how they relate to each other & how they affect the target feature
  • It helps to understand the problem more deeply
  2. Check if the data is intuitive/satisfactory
  • And agrees with domain knowledge
  • Looking at quantiles is useful (see the sketch below)
  3. Understand how the data was generated
  • Crucial to set up proper validation
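
A minimal sketch of a quantile sanity check. The file name train.csv, the column name age, and the 0-120 range are illustrative assumptions only.

import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

# look at quantiles of a feature and check they agree with domain knowledge
print(df["age"].quantile([0.01, 0.25, 0.5, 0.75, 0.99]))
print(df["age"].describe())

# flag rows that violate an obvious domain constraint
suspicious = df[(df["age"] < 0) | (df["age"] > 120)]
print(len(suspicious), "rows look implausible")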

1.2 Anonymised Data

1.2.1 Explore Individual Features

  • Guess the meaning of the features
  • Guess the types of the columns

Play around with the data; you never know what you might find.

pd.Series.value_counts
pd.DataFrame.dtypes
mean and standard deviation
pd.DataFrame.sort_values(by="")
np.diff
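
A short sketch of how these calls can be used together, assuming df is a loaded DataFrame (the file name train.csv and the anonymised column "x1" are hypothetical).

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

print(df.dtypes)                          # guess column types
print(df["x1"].value_counts().head(10))   # most frequent values
print(df["x1"].mean(), df["x1"].std())    # scale of the feature

# sorting and differencing can reveal a hidden step size, e.g. a constant
# np.diff suggests the feature was scaled/shifted from integers
sorted_vals = np.sort(df["x1"].dropna().unique())
print(np.diff(sorted_vals)[:20])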

1.3 Visualization Tools

1.3.1 Visualization of Individual Features

Misleading Histograms

import numpy as np
import matplotlib.pyplot as plt

plt.hist(x)

# if the histogram looks degenerate, e.g. almost all mass piled up at 0,
# try plotting on a log scale
plt.hist(np.log1p(x))

# index-vs-value plot: horizontal lines imply repeated values,
# and the absence of vertical patterns implies well-shuffled data
plt.plot(x, '.')

df.describe()     # per-column statistics

x.value_counts()  # frequency of each value

x.isnull()        # locate missing values

1.3.2 Visualization of feature relations

  1. Scatter plots
  • Can be used to check if the data distributions of the train and test sets are the same (see the sketch below)
  2. Scatter matrix
  3. Correlation matrix
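
A minimal sketch of these three tools, assuming train and test DataFrames with hypothetical numeric columns f1 and f2.

import matplotlib.pyplot as plt
import pandas as pd

# scatter plot of two features, colored by train/test origin,
# to eyeball whether the two sets follow the same distribution
plt.scatter(train["f1"], train["f2"], s=3, label="train")
plt.scatter(test["f1"], test["f2"], s=3, label="test")
plt.legend()

# scatter matrix of all pairwise feature relations
pd.plotting.scatter_matrix(train, figsize=(10, 10))

# correlation matrix shown as an image
plt.matshow(train.select_dtypes(include="number").corr())
plt.colorbar()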

2. Data Cleaning

Concatenate the train and test data first.
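
For example (train.csv and test.csv are assumed file names):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# stack train and test so the cleaning steps below see every value of a feature
traintest = pd.concat([train, test], ignore_index=True)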

2.1 Remove constant columns

Remove features that take only a single value across all rows.

traintest.nunique(axis=0) == 1

Sometimes a feature is constant in the train data but takes multiple values in the test data. In that case you can either (see the sketch below):

  • Remove that feature
  • Train multiple models, one for each value of that feature
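
A minimal sketch of both checks, assuming the concatenated traintest DataFrame from above with the train rows first.

# columns that are constant over train + test can be dropped outright
constant_cols = traintest.columns[traintest.nunique(axis=0) == 1]
traintest = traintest.drop(columns=constant_cols)

# columns that are constant in train but not in test deserve a closer look
n_train = len(train)
train_part = traintest.iloc[:n_train]
test_part = traintest.iloc[n_train:]
suspect_cols = [c for c in traintest.columns
                if train_part[c].nunique() == 1 and test_part[c].nunique() > 1]
print(suspect_cols)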

2.2 Remove duplicated columns

2.2.1 Numerical

traintest.T.drop_duplicates()

2.2.2 Categorical

# label encode each categorical feature as numeric & repeat as above
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]

traintest.T.drop_duplicates()

I usually look at the value_counts of each feature. If a feature has low cardinality, the category-wise counts of two identical features will also be identical, which makes duplicates easy to spot.
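
A small sketch of that value_counts screen. Matching count distributions are a necessary but not sufficient condition, so candidate pairs are confirmed with a direct label-encoded comparison; categorical_feats and traintest are assumed from above.

from itertools import combinations

# collect column pairs whose value-count distributions match exactly
candidates = []
for a, b in combinations(categorical_feats, 2):
    ca = traintest[a].value_counts().sort_values().values
    cb = traintest[b].value_counts().sort_values().values
    if len(ca) == len(cb) and (ca == cb).all():
        candidates.append((a, b))

# confirm real duplicates with a direct factorized comparison
duplicates = [(a, b) for a, b in candidates
              if (traintest[a].factorize()[0] == traintest[b].factorize()[0]).all()]
print(duplicates)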

2.3 Others

  • Find duplicated rows and understand why they are duplicated
  • Check if the dataset is shuffled

3. Validation types

  • Holdout - Divide the train set into train and validation parts such that each sample belongs to either train or validation, with no repetition. If the data contains duplicated samples, one copy may end up in train and another in validation, which leads to inflated scores and improper hyperparameter selection.
sklearn.model_selection.ShuffleSplit
  • K-Fold - K times slower than Holdout. It differs from repeating Holdout K times: with repeated Holdout a sample may never appear in a validation set, or may appear several times, whereas K-Fold guarantees that each sample appears in the validation set exactly once (see the sketch after this list).
sklearn.model_selection.KFold
  • Leave-one-out
    • Iterate over samples: retrain the model on all samples except current sample, predict for the current sample. You will need to retrain the model N times (if N is the number of samples in the dataset).
    • In the end you will get LOO predictions for every sample in the trainset and can calculate loss.

Leave-one-out is useful when there is only a small amount of data and the model is fast to train.

  • Stratification - Preserves the same target distribution over different folds. Useful for:
    • Small training data
    • Unbalanced datasets
    • Multiclass classification problems
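
A minimal sketch of these schemes with scikit-learn, assuming X and y are NumPy arrays of features and targets.

from sklearn.model_selection import ShuffleSplit, KFold, StratifiedKFold

# Holdout: a single random train/validation split
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# K-Fold: every sample lands in the validation set exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Stratified K-Fold: additionally preserves the target distribution per fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, valid_idx in skf.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
    # fit the model on (X_train, y_train) and score on (X_valid, y_valid)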

4. Data splitting strategies

We should make train/validation split to mimic the train/test split.

  • Random / row-wise split
  • Time-based split - a moving-window validation split is a related idea (see the sketch after this list)
  • ID-based split
  • Combined split
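
A small sketch of a time-based split, assuming a DataFrame df with a datetime column "date"; the cutoff date is illustrative. scikit-learn's TimeSeriesSplit implements the moving-window variant.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# simple time-based holdout: train on the past, validate on the most recent period
cutoff = pd.Timestamp("2017-01-01")  # illustrative cutoff
train_part = df[df["date"] < cutoff]
valid_part = df[df["date"] >= cutoff]

# moving-window variant: successive folds always validate on later data
df = df.sort_values("date")
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(df):
    fold_train = df.iloc[train_idx]
    fold_valid = df.iloc[valid_idx]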

Quizzes

  1. Suppose we are given a huge dataset. We ran K-Fold validation once and noticed that the scores on each fold are roughly the same. Which validation type is most practical to use?

Use the holdout validation scheme, as the data is homogeneous.

  2. If validation scores differ noticeably across folds in K-Fold, we should stick with K-Fold in order to detect statistically significant changes in scores while tuning a model.

  3. The features we generate depend on the train-test data splitting method. Is this true?

True

  4. Performance increase on a fixed cross-validation split guarantees performance increase on any cross-validation split. Incorrect: you can overfit to the specific CV split, so you should change your split from time to time to reduce the chance of overfitting.

  5. On Kaggle, make submissions which are

  • Performing best on validation data - assuming that the train and test data distributions are the same.
  • Performing best on the public LB - assuming that the train and test data distributions are different.

5. Data Leaks

5.1 Basic Data Leaks

  • Row based - Data may sometimes be ordered by the target variable, so simply adding the row number as a feature might improve the score (see the sketch below).
  • ID based - Sometimes the ID may contain traces of information connected to the target variable.
  • Metadata - e.g., image creation date, camera resolution, etc. in a classic cats-vs-dogs contest, where cat images were taken before dog images or taken with a specific camera.
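
A small sketch of a check for the row-based leak, assuming train has a numeric or binary "target" column; a clearly non-zero correlation suggests the rows were ordered by target.

import numpy as np

row_number = np.arange(len(train))
corr = np.corrcoef(row_number, train["target"])[0, 1]
print("correlation between row number and target:", corr)

# if |corr| is clearly non-zero, adding the row number as a feature
# would exploit this leak
train["row_number"] = row_number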