A Gentle Customer Segmentation Approach for Mail-Order Industries

Arvato Financial Solutions

[Image: Ideation architecture]

Introduction

Preamble

Data engineering

Customer base segmentation (Unsupervised learning)

Supervised learning

Methods

Data Engineering

  • Unsupervised data description

In this category, we have the German demographic population dataset (‘Udacity_AZDIAS_052018.csv’) with 891,211 persons (rows) by 366 features (columns). The second dataset (‘Udacity_CUSTOMERS_052018.csv’) contains customer information from the mail-order company, with 191,652 customers (rows) by 366 features (columns).

[Image: Overview of the population (AZDIAS) dataset]
[Image: Overview of the customers dataset]
  • Supervised data description

Here, both datasets are similar to those used in the unsupervised category, with the exception of a response feature added to the training dataset (‘Udacity_MAILOUT_052018_TRAIN.csv’ — 42,982 persons (rows) by 367 columns) for training and validation purposes. The second dataset (‘Udacity_MAILOUT_052018_TEST.csv’ — 42,833 persons by 366 columns) excludes just the response variable and is used in the Kaggle competition to test our supervised learning model.

  • Missing Data Analysis

From the attributes metadata, we notice that missing values are not only described by numerical NaNs but also by assigned special characters. The ‘CAMEO_INTL_2015’ attribute, for instance, describes missing values as ‘XX’. The ‘AGER_TYP’ attribute likewise takes ‘-1’ for missing values, as do 80 other columns of numerical and ordinal type in which 0, -1, or 9 are considered null entries. All attributes with missing-data codes other than the numerical null representation are identified and replaced accordingly.

[Image: Column codes for missing values]
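A minimal sketch of this recoding step in pandas; the column-to-code mapping shown here is only an illustrative excerpt (the full dictionary is read from the attributes metadata), and the semicolon separator reflects the original project files:

import numpy as np
import pandas as pd

# The original project files are semicolon-separated.
azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';')

# Illustrative excerpt of per-column 'unknown' codes from the attributes
# metadata; the real mapping covers every affected column.
unknown_codes = {
    'CAMEO_INTL_2015': ['XX'],
    'AGER_TYP': [-1],
    'KBA05_BAUMAX': [-1, 0],
}

# Replace each column's special codes with numerical NaN.
for col, codes in unknown_codes.items():
    azdias[col] = azdias[col].replace(codes, np.nan)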

With every null entry now encoded as the numerical null representation (NaN), we identify and eliminate the columns and rows whose proportion of null entries is far greater than the average.

[Image: Bar plot of missing-data ratio per column]

From the bar plot, we notice six columns (“ALTER_HH”, “GEBURTSJAHR”, “KBA05_BAUMAX”, “KK_KUNDENTYP”, “AGER_TYP”, “TITEL_KZ”) with more than 30% NaN values each, yielding little or no insight.

On the other hand, a horizontal analysis of the dataset reveals up to 111,068 rows in the ‘AZDIAS’ dataset with more than 70% NaN values. After comparing these rows to the remaining rows with fewer NaN values, we find some similarity between them. This gives us a basis for dropping all rows with more than 70% NaN values, thereby obtaining a data frame prepared for more accurate findings.
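A minimal sketch of this pruning step, assuming the recoded azdias frame from the sketch above and the 30% / 70% thresholds mentioned in the text:

# Drop the columns with more than 30% missing values ...
col_nan_ratio = azdias.isnull().mean()
azdias = azdias.drop(columns=col_nan_ratio[col_nan_ratio > 0.30].index)

# ... then drop the rows with more than 70% missing values.
row_nan_ratio = azdias.isnull().mean(axis=1)
azdias = azdias[row_nan_ratio <= 0.70]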

The set of bar plots below compares the feature distributions of four randomly selected columns in the population dataset (on the left) against the distributions of the same columns in the customers dataset (on the right).

[Image: Distribution of features in the population dataset (left) vs. the customers dataset (right) for the first two random columns]
[Image: Distribution of features in the population dataset (left) vs. the customers dataset (right) for the last two random columns]
  • Inspect and re-encode required features

Upon analysis of our feature types, we notice that a total of 28 features (7 mixed-type and 21 categorical) require some engineering. Binary categorical features are left unchanged, while multi-level categorical ones are one-hot encoded into binary counterparts. To enhance computational efficiency and keep the feature-to-sample ratio in check, the ‘CAMEO_DEUG’ feature, with up to 44 categories, is dropped.

Among the mixed-type features, ‘PRAEGENDE_JUGENDJAHRE’ (described as the dominating movement in the person’s youth) is engineered into two important features — ‘decade’ and ‘movement’ (mainstream vs. avantgarde). ‘CAMEO_INTL_2015’ (international typology) is likewise engineered into ‘wealth’ and ‘life stage’ features, as sketched below.
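A sketch of how these two mixed-type features can be split; the decade/movement lookup follows the attribute documentation but is reproduced here from memory, so treat its exact values as an assumption:

import numpy as np
import pandas as pd

# Decade of youth encoded by PRAEGENDE_JUGENDJAHRE (values 1-15).
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
avantgarde = {2, 4, 6, 7, 9, 11, 13, 15}   # the rest are mainstream

azdias['DECADE'] = azdias['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
azdias['MOVEMENT'] = azdias['PRAEGENDE_JUGENDJAHRE'].map(
    lambda x: 1 if x in avantgarde else (0 if pd.notnull(x) else np.nan))

# CAMEO_INTL_2015 is a two-digit code: tens digit = wealth, units digit = life stage.
cameo = pd.to_numeric(azdias['CAMEO_INTL_2015'], errors='coerce')
azdias['WEALTH'] = cameo // 10
azdias['LIFE_STAGE'] = cameo % 10

azdias = azdias.drop(columns=['PRAEGENDE_JUGENDJAHRE', 'CAMEO_INTL_2015'])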

As for ‘WOHNLAGE’ (residential area), the two flags (7 and 8) are replaced with NaNs, which are later imputed with the feature’s mean, given that we don’t know whether they describe a locality of high quality or not.

Three other mixed type features (‘LP_FAMILIE_GROB’,’LP_LEBENSPHASE_GROB’ and ‘LP_STATUS_GROB’) are dropped, while their corresponding detailed versions are retained.

‘LNR’ is a unique identifier without null values, but it carries no insight beyond identifying each sample; thus, it is dropped.

Out of the remaining numerical features, ANZ_HH_TITEL, ANZ_TITEL, and ANZ_KINDER are dropped because they are dominated by zero values (33,486, 35,674, and 33,821 zero entries respectively), rendering them uninformative. Though numeric, ‘AKT_DAT_KL’, ‘PLZ8_BAUMAX’, and ‘ARBEIT’ are one-hot encoded, given that their numeric values represent defined categories.

The following table summarizes all our engineered features with their transformed status, and justifications.

[Image: Table of selected features to process, with justifications]

Feature Transformation

[Image: Engineered dataframe before normalization]
[Image: Engineered dataframe after normalization]
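A minimal sketch of the transformation step implied by the two frames above, assuming mean imputation for the remaining NaNs and standard scaling ahead of PCA and clustering:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fill remaining NaNs with each column's mean, then standardize every feature
# to zero mean and unit variance.
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

azdias_scaled = scaler.fit_transform(imputer.fit_transform(azdias))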

Feature Reduction

  • Principal component analysis

We use Sklearn’s PCA for feature reduction, with a scree plot to determine the number of components that best represent the entire dataset.
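A sketch of this step, assuming the scaled matrix azdias_scaled from the transformation sketch above; the choice of 260 components is justified by the scree plot discussed below:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Full decomposition for the scree / cumulative-variance plot.
pca_full = PCA().fit(azdias_scaled)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

# Refit, keeping the 260 components that explain just over 90% of the variance.
pca = PCA(n_components=260)
azdias_pca = pca.fit_transform(azdias_scaled)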

[Image: PCA scree plot]

From the scree plot, we notice that the first 260 components (out of 430) provide more than 90% of the information in the whole dataset. The remaining components add very little information that cannot otherwise be explained by the first 260. The table below depicts the 10 components (out of 260) carrying the most information:

[Image: The 10 PCA components carrying the most information]

It is worth noting from the table that tangible assets owned by individuals, such as cars, vans, trailers, and motorcycles, happen to be critical factors to consider when predicting the class of a sample.

[Image: Summary of the data engineering flow process]

Unsupervised learning model

KMeans clustering is used as the predictive model for the unlabeled dataset. To decide the appropriate number of clusters for this dataset, the elbow method is adopted, using Sklearn’s MiniBatchKMeans to improve computational efficiency. Cluster counts ranging from 5 to 30 are evaluated, looking for the point where the mean within-cluster distance starts decreasing by only minimal amounts.
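A sketch of this elbow search, assuming the PCA-reduced matrix azdias_pca from the previous step:

import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# Within-cluster sum of squared distances (negative of the MiniBatchKMeans
# score) for each candidate cluster count.
ks = range(5, 31)
sse = []
for k in ks:
    km = MiniBatchKMeans(n_clusters=k, random_state=42).fit(azdias_pca)
    sse.append(-km.score(azdias_pca))

plt.plot(list(ks), sse, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squared distances')
plt.show()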

[Image: Resulting curve from the elbow method]

The point most resembling an elbow on the graph is at 15, which qualifies 15 as the desired number of clusters for accurate segmentation.

  • Comparing Customer data to demographics data

After applying the data cleaning, imputation, and scaling transformations fitted on the demographic dataset to the customers dataset, we predict the cluster to which each customer belongs using the clustering model fitted on the demographic dataset. This process results in one set of demographic clusters overrepresented by customers and another set underrepresented by customers, as the sketch and the bar chart below illustrate.
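A sketch of this comparison, assuming customers_clean is the customers frame after the same cleaning and encoding steps, and reusing the imputer, scaler, and PCA already fitted on the demographics:

import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Final clustering model fitted on the demographics (k = 15 from the elbow plot).
kmeans = MiniBatchKMeans(n_clusters=15, random_state=42).fit(azdias_pca)

# Transform the customers with the *fitted* imputer/scaler/PCA; never refit.
customers_pca = pca.transform(scaler.transform(imputer.transform(customers_clean)))

population_share = pd.Series(kmeans.predict(azdias_pca)).value_counts(normalize=True)
customer_share = pd.Series(kmeans.predict(customers_pca)).value_counts(normalize=True)

# Ratio > 1 marks clusters overrepresented by customers, < 1 underrepresented.
comparison = pd.DataFrame({'population': population_share,
                           'customers': customer_share}).sort_index()
print(comparison.assign(ratio=comparison['customers'] / comparison['population']))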

[Image: Customer (green bars) cluster distribution bar chart]

Overrepresented vs Underrepresented Clusters

From the chart, we clearly identify clusters 8, 2, and 14 as having the most represented customers, while clusters 6, 7, and 9 have the fewest. The following charts describe a few selected features in the overrepresented clusters against the underrepresented ones:

[Image: Variation of ‘KBA05_ANHANG’ (share of trailers) between both clusters]

From the chart, we see that potential customers are likely to hold more trailer shares (the ‘KBA05_ANHANG’ feature) than the underrepresented cluster (chart on the right), which shows an overall lower number of trailer shares.

[Image: Variation of ‘KBA05_KRSOBER’ (share of upper-class cars) between both clusters]

The above chart also informs us of the difference in individuals’ share of upper-class cars. The cluster overrepresented by customers has a minimal number of persons with upper-class car shares below the average, unlike the underrepresented cluster with a larger number of persons with upper-class car shares below the average. The green bars in both bar charts clearly illustrate the difference.

In terms of age distribution, people aged 45 or younger are much more likely to become customers of the mail-order company than the elderly, aged 46 and above. The bar chart below visually shows the difference as well:

[Image: Variation in age distribution (<30 yrs = 1, 30–45 = 2, 46–60 = 3, >60 yrs = 4)]

These are just three out of the 260 components retained during principal component analysis. Each of these features captures a difference that helps explain why customers are concentrated in some clusters rather than others.

Supervised Learning

To achieve high-performance metrics, we employ four main models in this section with the use of Sklearn’s Pipeline. The first is Gradient Boosting Classifier (GBC), the second is Light Gradient Boost Machine (Light GBM), then we have the XGBoost machine (GBM implementation designed for speed and performance) and lastly, we implement a Neural Network model with Keras.

[Image: Adopted classification models]
  • Keras Neural Network

For a prepared data frame of 85,795 samples, 421 features, and a binary response variable (is a customer = 1, is not a customer = 0), our neural network architecture consists of one input layer, two hidden layers, and one output layer, as described below:

[Image: Keras neural network architecture]
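A sketch of such an architecture in Keras; the hidden-layer widths and dropout rates shown here are illustrative placeholders rather than the tuned values from the project:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(421,)),  # input + first hidden layer
    Dropout(0.3),
    Dense(64, activation='relu'),                       # second hidden layer
    Dropout(0.3),
    Dense(1, activation='sigmoid'),                     # P(is a customer)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, validation_split=0.2, epochs=20, batch_size=64)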

Metric performance

Upon tuning a set of hyperparameters such as the learning rate, batch size, number of epochs, and dropout, the model turns out to be largely biased towards predicting class 0 (is not a customer), even though it achieves a training accuracy of 0.98 and a validation accuracy of 0.98.

Due to this bias, the model produces a poor ROC AUC score of just 0.54. This tells us that the small number of people who ought to be predicted as customers (1) are instead predicted as non-customers (0). It correlates fully with the imbalanced nature of our MAILOUT dataset, in which just 1.2% of the population is identified as customers while the remaining bulk is not. A solution to this problem would be either to upsample the customer class or downsample the large bulk of non-customers, as sketched below.
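A sketch of the upsampling option with sklearn's resample, assuming X_train and y_train hold the prepared MAILOUT training features and the binary RESPONSE target:

import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train.rename('RESPONSE')], axis=1)
majority = train[train['RESPONSE'] == 0]
minority = train[train['RESPONSE'] == 1]

# Sample the rare customer class with replacement until it matches the majority.
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)
balanced = pd.concat([majority, minority_up])
X_bal, y_bal = balanced.drop(columns='RESPONSE'), balanced['RESPONSE']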

  • Gradient Boosting Classifier (GBC)

As one of the most widely used machine learning techniques for regression and classification problems, GBC produces predictions from an ensemble of weak learners. GBC is used here as a classifier within a grid-search pipeline, with the learning rate, number of estimators, maximum depth, and minimum samples per split as tuning hyperparameters.

Metric Performance

Producing an optimized ROC score of 0.4999, the best hyperparameter combination for this estimator is:

{‘learning_rate’: 0.1,
‘max_depth’: 3,
‘min_samples_split’: 4,
‘n_estimators’: 100}

Though a considerable range of parameters was taken into account, GBC produces poor results with a huge execution wall time. We therefore turn to other boosting approaches to save time and improve performance.

  • Light Gradient Boost Machine (Light GBM)

Unlike GBC, Light GBM can handle categorical features natively when given the feature names: it does not need them converted to one-hot encodings, which makes it much faster, and it uses a dedicated procedure to find the split values of categorical features [link].

The grid search for Light GBM is built on four main parameters with the following quantities:

{‘learning_rate’ : [0.01,0.001,0.16,0.1],
‘max_depth’: [3, 5, 10,30],
‘n_estimators’ : [100,200,300,400, 500,1000,2000],
‘min_samples_split’: [2, 4]
}
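A sketch of this grid search scored on ROC AUC; note that ‘min_samples_split’ is a scikit-learn parameter name, so the closest LightGBM equivalent, min_child_samples, is used here to keep the sketch runnable:

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.01, 0.001, 0.16, 0.1],
    'max_depth': [3, 5, 10, 30],
    'n_estimators': [100, 200, 300, 400, 500, 1000, 2000],
    'min_child_samples': [2, 4],
}
search = GridSearchCV(LGBMClassifier(random_state=42), param_grid,
                      scoring='roc_auc', cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)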

Metric Performance

Unlike GBC, the grid search on Light GBM obtains a better ROC performance of 0.7503 on test data, within a wall time of 54 minutes 13 seconds, with the following best parameters:

{‘learning_rate’: 0.001,
‘max_depth’: 30,
‘min_samples_split’: 2,
‘n_estimators’: 1000}
Wall time: 54min 13s

[Image: ROC-AUC curve for Light GBM]
  • XGBoost (GBM implementation designed for speed and performance)

The main difference between XGBoost and Light GBM is that XGBoost cannot handle categorical features by itself; like Random Forest, it only accepts numerical values. That, however, is already taken care of in the data cleaning section. The following grid search parameters are adopted for this approach:

{‘learning_rate’ : [0.01,0.001,0.16,0.1],
‘max_depth’: [2,3, 5, 10,30,40],
‘n_estimators’ : [50,100,200,300,400],
‘min_child_weight’ : [1,3,6]}

Metric performance

Out of all four approaches applied, XGBoost produces the best metric performance, with an ROC score of 0.7741 on test data at a wall time of 2 hours, 3 minutes, and 56 seconds (probably long because of the large number of hyperparameter combinations), with the following best parameters:

{‘learning_rate’: 0.01,
‘max_depth’: 3,
‘n_estimators’: 100}
Wall time: 2h 3min 56s

[Image: ROC-AUC curve for XGBoost]

NB: The model obtained from this framework was used in the project’s Kaggle competition, achieving an ROC score of 0.79781 upon prediction, within a margin of 0.00268 of the first-place score. [link]
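A sketch of how such a submission can be produced, refitting the best XGBoost configuration and scoring the Kaggle test file; preprocess is a hypothetical helper standing in for the same cleaning and encoding pipeline applied to the training data:

import pandas as pd
from xgboost import XGBClassifier

# Best configuration from the grid search above.
best_xgb = XGBClassifier(learning_rate=0.01, max_depth=3, n_estimators=100)
best_xgb.fit(X_train, y_train)

mailout_test = pd.read_csv('Udacity_MAILOUT_052018_TEST.csv', sep=';')
X_test = preprocess(mailout_test)  # hypothetical: same cleaning/encoding as training

submission = pd.DataFrame({'LNR': mailout_test['LNR'],
                           'RESPONSE': best_xgb.predict_proba(X_test)[:, 1]})
submission.to_csv('kaggle_submission.csv', index=False)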

Conclusion

  • From the broad range of features provided, we extract 260 components that carry the most information necessary for accurate analysis.
  • We study and segment the German population (Azdias.csv) into 15 clusters, based on the similarities between each person with respect to their features.
  • We partition the current company’s customers across the 15 population segments (clusters), to see which segments contain the most customers.
  • We take a deeper look at the segments with the most customers and analyze what features these segments have in common. These features tell us what properties make an individual likely to become a customer; three of them are the age range, the share of upper-class cars, and the share of trailers.
  • Using a set of supervised learning frameworks, we predict in real time whether a person from the German population will become a customer, with an ROC score of 0.79781.

Considering these accomplishments, we can affirm that we have fulfilled our purpose: to segment the German population and predict which segments are most likely to reach out to the mail-order company after receiving a corresponding mail-order offer.
