1. EDA

  • Better understand the data
  • Build an intuition about the data
  • Generate hypotheses
  • Gain insights

1.1 Building intuition about data

  1. Get domain knowledge
  • Understand the data & how people usually tackle the problem
  • Understand the different features, how they relate to each other & how they affect the target feature
  • It helps to understand the problem more deeply
  2. Check if the data is intuitive/satisfactory
  • And agrees with domain knowledge
  • Looking at quantiles is useful (see the sketch below)
  3. Understand how the data was generated
  • Crucial to set up proper validation
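
A minimal sketch of a quantile sanity check. The file name train.csv, the column name age, and the 0-120 range are illustrative assumptions only.

import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

# look at quantiles of a feature and check they agree with domain knowledge
print(df["age"].quantile([0.01, 0.25, 0.5, 0.75, 0.99]))
print(df["age"].describe())

# flag rows that violate an obvious domain constraint
suspicious = df[(df["age"] < 0) | (df["age"] > 120)]
print(len(suspicious), "rows look implausible")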

1.2 Anonymised Data

1.2.1 Explore Individual Features

  • Guess the meaning of the features
  • Guess the types of the columns

Play around with the data; you never know what you might find.

pd.Series.value_counts
pd.DataFrame.dtypes
mean and standard deviation
pd.DataFrame.sort_values(by="")
np.diff
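
A short sketch of how these calls can be used together, assuming df is a loaded DataFrame (the file name train.csv and the anonymised column "x1" are hypothetical).

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

print(df.dtypes)                          # guess column types
print(df["x1"].value_counts().head(10))   # most frequent values
print(df["x1"].mean(), df["x1"].std())    # scale of the feature

# sorting and differencing can reveal a hidden step size, e.g. a constant
# np.diff suggests the feature was scaled/shifted from integers
sorted_vals = np.sort(df["x1"].dropna().unique())
print(np.diff(sorted_vals)[:20])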

1.3 Visualization Tools

1.3.1 Visualization of Individual Features

Misleading Histograms

import numpy as np
import matplotlib.pyplot as plt

plt.hist(x)

# if the histogram looks degenerate, e.g. almost all mass piled up at 0,
# try plotting on a log scale
plt.hist(np.log1p(x))

# index-vs-value plot: horizontal lines imply repeated values,
# and the absence of vertical patterns implies well-shuffled data
plt.plot(x, '.')

df.describe()     # per-column statistics

x.value_counts()  # frequency of each value

x.isnull()        # locate missing values

1.3.2 Visualization of feature relations

  1. Scatter plots
  • Can be used to check if the data distributions of the train and test sets are the same (see the sketch below)
  2. Scatter matrix
  3. Correlation matrix
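
A minimal sketch of these three tools, assuming train and test DataFrames with hypothetical numeric columns f1 and f2.

import matplotlib.pyplot as plt
import pandas as pd

# scatter plot of two features, colored by train/test origin,
# to eyeball whether the two sets follow the same distribution
plt.scatter(train["f1"], train["f2"], s=3, label="train")
plt.scatter(test["f1"], test["f2"], s=3, label="test")
plt.legend()

# scatter matrix of all pairwise feature relations
pd.plotting.scatter_matrix(train, figsize=(10, 10))

# correlation matrix shown as an image
plt.matshow(train.select_dtypes(include="number").corr())
plt.colorbar()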

2. Data Cleaning

Concatenate the train and test data first.
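
For example (train.csv and test.csv are assumed file names):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# stack train and test so the cleaning steps below see every value of a feature
traintest = pd.concat([train, test], ignore_index=True)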

2.1 Remove constant columns

Remove features that take only a single value across all rows.

traintest.nunique(axis=0) == 1

Sometimes a feature is constant in the train data but takes multiple values in the test data. In that case you can either (see the sketch below):

  • Remove that feature
  • Train multiple models, one for each value of that feature
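
A minimal sketch of both checks, assuming the concatenated traintest DataFrame from above with the train rows first.

# columns that are constant over train + test can be dropped outright
constant_cols = traintest.columns[traintest.nunique(axis=0) == 1]
traintest = traintest.drop(columns=constant_cols)

# columns that are constant in train but not in test deserve a closer look
n_train = len(train)
train_part = traintest.iloc[:n_train]
test_part = traintest.iloc[n_train:]
suspect_cols = [c for c in traintest.columns
                if train_part[c].nunique() == 1 and test_part[c].nunique() > 1]
print(suspect_cols)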

2.2 Remove duplicated columns

2.2.1 Numerical

traintest.T.drop_duplicates()

2.2.2 Categorical

# label encode each categorical feature as numeric & repeat as above
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]

traintest.T.drop_duplicates()

I usually look at the value_counts of each feature. If a feature has low cardinality, the category-wise counts of two identical features will also be identical, which makes duplicates easy to spot.
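
A small sketch of that value_counts screen. Matching count distributions are a necessary but not sufficient condition, so candidate pairs are confirmed with a direct label-encoded comparison; categorical_feats and traintest are assumed from above.

from itertools import combinations

# collect column pairs whose value-count distributions match exactly
candidates = []
for a, b in combinations(categorical_feats, 2):
    ca = traintest[a].value_counts().sort_values().values
    cb = traintest[b].value_counts().sort_values().values
    if len(ca) == len(cb) and (ca == cb).all():
        candidates.append((a, b))

# confirm real duplicates with a direct factorized comparison
duplicates = [(a, b) for a, b in candidates
              if (traintest[a].factorize()[0] == traintest[b].factorize()[0]).all()]
print(duplicates)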

2.3 Others

  • Find duplicated rows and understand why they are duplicated
  • Check if the dataset is shuffled

3. Validation types

  • Holdout - Divide the train set into train and validation parts such that each sample belongs to either train or validation, with no repetition. If the data contains duplicated samples, one copy may end up in train and another in validation, which leads to inflated scores and improper hyperparameter selection.
sklearn.model_selection.ShuffleSplit
  • K-Fold - K times slower than Holdout. It differs from repeating Holdout K times: with repeated Holdout a sample may never appear in a validation set, or may appear several times, whereas K-Fold guarantees that each sample appears in the validation set exactly once (see the sketch after this list).
sklearn.model_selection.KFold
  • Leave-one-out
    • Iterate over samples: retrain the model on all samples except current sample, predict for the current sample. You will need to retrain the model N times (if N is the number of samples in the dataset).
    • In the end you will get LOO predictions for every sample in the trainset and can calculate loss.

Leave-one-out is useful when there is only a small amount of data and the model is fast to train.

  • Stratification - Preserves the same target distribution over different folds. Useful for:
    • Small training data
    • Unbalanced datasets
    • Multiclass classification problems
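
A minimal sketch of these schemes with scikit-learn, assuming X and y are NumPy arrays of features and targets.

from sklearn.model_selection import ShuffleSplit, KFold, StratifiedKFold

# Holdout: a single random train/validation split
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# K-Fold: every sample lands in the validation set exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Stratified K-Fold: additionally preserves the target distribution per fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, valid_idx in skf.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
    # fit the model on (X_train, y_train) and score on (X_valid, y_valid)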

4. Data splitting strategies

We should make train/validation split to mimic the train/test split.

  • Random / row-wise split
  • Time-based split - a moving-window validation split is a related idea (see the sketch after this list)
  • ID-based split
  • Combined split
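
A small sketch of a time-based split, assuming a DataFrame df with a datetime column "date"; the cutoff date is illustrative. scikit-learn's TimeSeriesSplit implements the moving-window variant.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# simple time-based holdout: train on the past, validate on the most recent period
cutoff = pd.Timestamp("2017-01-01")  # illustrative cutoff
train_part = df[df["date"] < cutoff]
valid_part = df[df["date"] >= cutoff]

# moving-window variant: successive folds always validate on later data
df = df.sort_values("date")
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(df):
    fold_train = df.iloc[train_idx]
    fold_valid = df.iloc[valid_idx]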

Quizzes

  1. Suppose we are given a huge dataset. We ran K-Fold validation once and noticed that the scores on each fold are roughly the same. Which validation type is most practical to use?

Use the holdout validation scheme, as the data is homogeneous.

  2. If validation scores differ noticeably across folds in K-Fold, we should stick with K-Fold in order to detect statistically significant changes in scores while tuning a model.

  3. The features we generate depend on the train-test data splitting method. Is this true?

True

  4. Performance increase on a fixed cross-validation split guarantees performance increase on any cross-validation split. Incorrect: you can overfit to the specific CV split, so you should change your split from time to time to reduce the chance of overfitting.

  5. On Kaggle, make submissions which are

  • Performing best on validation data - assuming that the train and test data distributions are the same.
  • Performing best on the public LB - assuming that the train and test data distributions are different.

5. Data Leaks

5.1 Basic Data Leaks

  • Row based - Data may sometimes be ordered by the target variable, so simply adding the row number as a feature might improve the score (see the sketch below).
  • ID based - Sometimes the ID may contain traces of information connected to the target variable.
  • Metadata - e.g., image creation date, camera resolution, etc. in a classic cats-vs-dogs contest, where cat images were taken before dog images or taken with a specific camera.
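
A small sketch of a check for the row-based leak, assuming train has a numeric or binary "target" column; a clearly non-zero correlation suggests the rows were ordered by target.

import numpy as np

row_number = np.arange(len(train))
corr = np.corrcoef(row_number, train["target"])[0, 1]
print("correlation between row number and target:", corr)

# if |corr| is clearly non-zero, adding the row number as a feature
# would exploit this leak
train["row_number"] = row_number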