A Gentle Customer Segmentation Approach for Mail-Order Industries
Arvato Financial Solutions

Introduction
In this project, we apply unsupervised learning techniques to identify the segments of the population that form the core customer base of a mail-order sales company in Germany. We then apply supervised learning techniques to partitioned, labeled data to predict whether a given person from the population is a customer or not. The real-life data is provided by Bertelsmann Arvato Analytics, and all insights and metrics reported here are derived from it. The goal of this project is to predict which segments of the population are most likely to respond to the company after receiving a mail-order offer.
Preamble
Split into three main parts, the project achieves its objectives as follows:
Data engineering
Here, we perform data preparation techniques such as imputation, feature selection, and one-hot encoding, resulting in a numerical representation of all important features, ready for ingestion.
Customer base segmentation (Unsupervised learning)
In this section, we perform feature extraction with Principal Component Analysis (PCA), then study how the contribution of each feature varies across the data. After choosing the most descriptive features, we implement a clustering algorithm and tune it iteratively.
Supervised learning
With a target variable describing each person's response to the mail-order proposal, we predict customer response by developing a set of supervised learning models, each tuned through a pipelined grid search to produce consistent metrics.
Methods
Data Engineering
Like most real-world data, the data provided here is quite dirty and needs to be cleaned and structured before it is processed. Four datasets exist for this project: two for unsupervised learning and the remaining two for supervised learning.
- Unsupervised data description
In this category, we have the German demographic population dataset (‘Udacity_AZDIAS_052018.csv’) with 891,211 persons (rows) by 366 features (columns). The second dataset (‘Udacity_CUSTOMERS_052018.csv’) contains customers’ information from the mail-order company with 191,652 customers (rows) by 366 features (columns).


- Supervised data description
Here, both datasets are similar to those in the unsupervised category, except that a response feature is added to the training dataset (‘Udacity_MAILOUT_052018_TRAIN.csv’, 42,982 persons (rows) by 367 columns) for training and validation purposes. The second dataset (‘Udacity_MAILOUT_052018_TEST.csv’, 42,833 persons by 366 columns) omits only the response variable and is used in the Kaggle competition to test our supervised learning model.
- Missing Data Analysis
From the attributes metadata, we notice that missing values are represented not only by numerical NaNs but also by special codes. The ‘CAMEO_INTL_2015’ attribute, for instance, encodes missing values as ‘XX’. The ‘AGER_TYP’ attribute uses ‘-1’ for missing values, as do 80 other columns of numerical and ordinal type, in which 0, -1 or 9 are treated as null entries. All attributes that use such codes are identified, and the codes are replaced with NaN.
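The replacement step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the `missing_codes` dictionary and the `replace_missing_codes` helper are hypothetical, and in practice the mapping would be built from the attributes metadata rather than typed by hand.

```python
import pandas as pd

# Hypothetical mapping from column name to the codes that mean "unknown".
# Only the two examples named in the text are listed here.
missing_codes = {
    "CAMEO_INTL_2015": ["XX"],
    "AGER_TYP": [-1],
}


def replace_missing_codes(df, codes):
    """Return a copy of df in which each column's missing-value codes are NaN."""
    out = df.copy()
    for column, values in codes.items():
        if column in out.columns:
            # mask() sets entries to NaN wherever the condition is True
            out[column] = out[column].mask(out[column].isin(values))
    return out


toy = pd.DataFrame({
    "CAMEO_INTL_2015": ["51", "XX", "24"],
    "AGER_TYP": [-1, 2, 1],
})
cleaned = replace_missing_codes(toy, missing_codes)
print(cleaned.isna().sum())
```

Keeping the codes per column matters because the same value (for example 0 or 9) can be a valid category in one attribute and a missing-value code in another.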

With every null entry now represented as NaN, we identify and eliminate the columns and rows whose share of null entries is far greater than the average.

From the bar plot, we notice six columns (“ALTER_HH”, “GEBURTSJAHR”, “KBA05_BAUMAX”, “KK_KUNDENTYP”, “AGER_TYP”, “TITEL_KZ”) with more than 30% NaN values, which therefore offer little or no insight.
On the other hand, a row-wise analysis of the dataset shows that up to 111,068 rows in the ‘AZDIAS’ dataset have more than 70% NaN values. After comparing these rows with the remaining rows that have fewer NaN values, we find some similarity between them. This gives us a basis for dropping all rows with more than 70% NaN values, leaving a data frame better suited for analysis.
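A minimal sketch of this column and row filtering is shown here. The `drop_sparse` helper is hypothetical, and whether the row share is computed before or after the sparse columns are removed is an assumption (here: after).

```python
import numpy as np
import pandas as pd


def drop_sparse(df, col_threshold=0.30, row_threshold=0.70):
    """Drop columns, then rows, whose share of NaN values exceeds a threshold."""
    col_nan_share = df.isna().mean(axis=0)           # NaN share per column
    kept_cols = df.loc[:, col_nan_share <= col_threshold]
    row_nan_share = kept_cols.isna().mean(axis=1)    # NaN share per row
    return kept_cols.loc[row_nan_share <= row_threshold]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [np.nan, np.nan, 3.0, np.nan],   # 75% NaN, so this column is dropped
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [1.0, 2.0, 3.0, np.nan],
})
result = drop_sparse(toy)                 # the last row is all NaN and is dropped
print(result)
```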
The set of bar plots below compares the distributions of four randomly selected columns in the population dataset (on the left) with the distributions of the same columns in the customers' dataset (on the right).


- Inspect and re-encode required features
Upon analysis of our feature types, we notice that a total of 28 features (7 of mixed type and 21 categorical) require some engineering. Binary categorical features are left unchanged, while multi-level categorical ones are one-hot encoded into binary columns. To improve computational efficiency and avoid too high a ratio of features to samples, the ‘CAMEO_DEUG’ feature, with up to 44 categories, is dropped.
Out of the mixed-type features, ‘PRAEGENDE_JUGENDJAHRE’ (described as the dominating movement in the person’s youth) is engineered into two features, ‘decades’ and ‘movement’ (mainstream vs. avantgarde). ‘CAMEO_INTL_2015’ (international typology) is likewise engineered into ‘wealth’ and ‘life stage’ features.
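As an illustration, the ‘CAMEO_INTL_2015’ split could look like the sketch below. It assumes the attribute is a two-digit code whose tens digit encodes wealth and whose ones digit encodes life stage; the `split_cameo_intl` helper and the output column names are hypothetical.

```python
import pandas as pd


def split_cameo_intl(series):
    """Split a two-digit typology code into wealth and life-stage columns.

    Assumption: tens digit = wealth, ones digit = life stage.
    Values that cannot be parsed as numbers become NaN.
    """
    codes = pd.to_numeric(series, errors="coerce")
    return pd.DataFrame({
        "wealth": codes // 10,
        "life_stage": codes % 10,
    })


toy = pd.Series(["51", "24", None, "13"], name="CAMEO_INTL_2015")
parts = split_cameo_intl(toy)
print(parts)
```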
As for ‘WOHNLAGE’ (residential area), the two flags (7 and 8) are replaced with NaNs, which are later imputed with the feature’s mean, given that we don’t know whether they describe a locality of high quality or not.
Three other mixed-type features (‘LP_FAMILIE_GROB’, ‘LP_LEBENSPHASE_GROB’ and ‘LP_STATUS_GROB’) are dropped, while their corresponding detailed versions are retained.
‘LNR’ is a unique identifier without null values. It carries no information beyond identifying each sample, so it is dropped.
Out of the remaining numerical features, ‘ANZ_HH_TITEL’, ‘ANZ_TITEL’, and ‘ANZ_KINDER’ are dropped because they consist mostly of zeros (33486, 35674 and 33821 zero entries respectively), which makes them uninformative. Though numeric, ‘AKT_DAT_KL’, ‘PLZ8_BAUMAX’ and ‘ARBEIT’ are one-hot encoded, given that their numeric values represent defined categories.
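One-hot encoding of numerically coded categories can be sketched with pandas as follows. The toy values and the `some_numeric_feature` column are made up for illustration; only the two encoded column names come from the text.

```python
import pandas as pd

toy = pd.DataFrame({
    "AKT_DAT_KL": [1, 9, 1, 5],            # numeric codes that are really categories
    "PLZ8_BAUMAX": [1, 5, 2, 1],
    "some_numeric_feature": [2, 1, 4, 3],  # hypothetical column left untouched
})

multi_level = ["AKT_DAT_KL", "PLZ8_BAUMAX"]

# Passing `columns` forces get_dummies to encode these columns
# even though their dtype is numeric.
encoded = pd.get_dummies(toy, columns=multi_level, dtype=int)
print(encoded.columns.tolist())
```

Each listed column is replaced by one binary column per observed category, which is why the feature count grows after this step.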
The following table summarizes all our engineered features with their transformed status, and justifications.

Feature Transformation
Before dimensionality reduction, we impute and standardize the data so that the principal component vectors are not influenced by differences in feature scales. To this end, we use Sklearn’s Imputer (replacing null entries with each feature’s mean) and its StandardScaler.
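A minimal sketch of this step is given here. It uses `SimpleImputer`, which is the mean-imputation class in current scikit-learn releases (the older `Imputer` class has been removed); the toy array is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Mean imputation followed by standardization (zero mean, unit variance)
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])
X_scaled = preprocess.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Fitting the pipeline once and reusing it with `preprocess.transform(...)` on another dataset applies the same means and scales to that dataset.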


Feature Reduction
After data engineering, we notice a large increase (from 79 to 132) in the number of features. Even with a large number of samples, we still need to perform some feature reduction, not just to improve computational efficiency but also to explain feature importance.
- Principal component analysis
We use Sklearn’s PCA for feature reduction and a scree plot to determine how many components are needed to represent the entire dataset.

From the scree plot, we notice that the first 260 components (out of 430) provide more than 90% of the information (explained variance) in the whole dataset. The remaining components carry little additional information. The table below depicts the first 10 features (out of 260) with the most information:

It is worth noting from the table that tangible assets owned by individuals, such as cars, vans, trailers, and motorcycles, are critical factors to consider when predicting the class of a sample.
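The choice of the number of components from the cumulative explained variance can be sketched as follows. The data is synthetic, the `components_for_variance` helper is hypothetical, and the input is assumed to be already imputed and scaled.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, target=0.90):
    """Return the smallest number of components whose cumulative
    explained variance ratio reaches `target`, plus the cumulative curve."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= target, converted to a count
    n_components = int(np.searchsorted(cumulative, target)) + 1
    return min(n_components, len(cumulative)), cumulative


# Synthetic data: 20 observed features driven by 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

n, cumulative = components_for_variance(X, target=0.90)
print(n, cumulative[n - 1])
```

Plotting `cumulative` against the component index gives the curve usually read off alongside a scree plot.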

Unsupervised learning model
- Clustering
KMeans clustering is used to segment the unlabeled dataset. To decide on an appropriate number of clusters, we adopt the elbow method, using Sklearn’s MiniBatchKMeans to improve computational efficiency. Cluster counts from 5 to 30 are evaluated, and we look for the point beyond which the mean distance of samples to their cluster centers decreases only marginally.

The point that most resembles an elbow in the graph is at 15, which makes 15 our chosen number of clusters for segmentation.
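A minimal sketch of the elbow computation is shown here on synthetic data. The `elbow_scores` helper is hypothetical, and using the mean squared distance to the nearest center (inertia divided by the number of samples) as the score is an assumption about how the curve was built.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs


def elbow_scores(X, cluster_counts, seed=0):
    """Fit MiniBatchKMeans for each k and return the mean squared
    distance of the samples to their nearest cluster center."""
    scores = {}
    for k in cluster_counts:
        model = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
        model.fit(X)
        scores[k] = model.inertia_ / len(X)
    return scores


# Synthetic data with four well-separated groups
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

scores = elbow_scores(X, cluster_counts=[2, 4, 8])
for k, score in scores.items():
    print(k, round(score, 3))
```

On the real data the same loop would run over the range 5 to 30, and the elbow is read off a plot of the scores against k.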
- Comparing Customer data to demographics data
After applying the data cleaning, imputation, and scaling transformations fitted on the demographic dataset to the customers’ dataset, we predict the cluster to which each customer belongs using the clustering model fitted on the demographic dataset. This process reveals one set of demographic clusters in which customers are overrepresented and another set in which they are underrepresented. The bar chart below depicts the customer distribution across the demographic clusters.

Overrepresented vs Underrepresented Clusters
From the chart, we clearly identify clusters 8, 2, and 14 as those in which customers are most represented, while clusters 6, 7, and 9 are those in which customers are least represented. The following charts compare a few selected features between the overrepresented clusters and the underrepresented ones:

From the chart, we see that potential customers tend to have higher trailer shares (‘KBA05_ANHANG’ feature) than persons in the underrepresented cluster (chart on the right), which shows lower trailer shares overall.

The above chart also shows the difference in individuals’ share of upper-class cars. The cluster overrepresented by customers has few persons with below-average upper-class car shares, whereas the underrepresented cluster has many more. The green bars in both bar charts clearly illustrate the difference.
In terms of age distribution, people aged 45 or younger are much more likely to become customers than people aged 46 and above. The bar chart below shows the difference as well:

These are just three of the 260 features selected during principal component analysis. Each of them shows a difference that helps explain why customers are more concentrated in some clusters than in others.
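The comparison of cluster shares described in this section can be sketched as follows. The `cluster_share_ratio` helper and the toy labels are hypothetical; in practice the labels would come from calling `predict` on the fitted clustering model for each preprocessed dataset.

```python
import numpy as np
import pandas as pd


def cluster_share_ratio(population_labels, customer_labels, n_clusters):
    """Compare the share of each cluster among customers with its share
    in the general population. A ratio above 1 means customers are
    overrepresented in that cluster."""
    pop_share = np.bincount(population_labels, minlength=n_clusters) / len(population_labels)
    cust_share = np.bincount(customer_labels, minlength=n_clusters) / len(customer_labels)
    table = pd.DataFrame({
        "population_share": pop_share,
        "customer_share": cust_share,
    })
    table["ratio"] = table["customer_share"] / table["population_share"]
    return table.sort_values("ratio", ascending=False)


population_labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])
customer_labels = np.array([0, 0, 0, 1])

table = cluster_share_ratio(population_labels, customer_labels, n_clusters=3)
print(table)
```

Comparing shares rather than raw counts matters because the two datasets differ greatly in size.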
Supervised Learning
As mentioned in the preamble, we deal with two supervised datasets in this section: the MAILOUT training dataset, which contains a response feature, and the MAILOUT test dataset, which is used in the Kaggle competition to test our model’s performance. The same data preprocessing and feature engineering is applied to these datasets, except that no rows are dropped, in order to prevent information loss. The response variable is separated and kept for evaluation purposes, while the same imputation and scaling techniques are applied to the remaining data. The resulting data is then split into a training set and a validation set, ready for classification.
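A minimal sketch of such a split is shown here on synthetic data. Stratifying on the response and the 80/20 ratio are assumptions, since the split settings are not stated; stratification simply keeps the share of positive responses similar in both parts when positives are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared MAILOUT features and response
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:12] = 1  # a small minority of positive responses

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_val.mean())
```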
To achieve high-performance metrics, we employ four main models in this section, using Sklearn’s Pipeline. The first is the Gradient Boosting Classifier (GBC), the second is the Light Gradient Boosting Machine (Light GBM), the third is XGBoost (a GBM implementation designed for speed and performance), and lastly we implement a neural network model with Keras.

- Keras Neural Network
For a prepared data frame of 85,795 samples, 421 features and a binary response variable (1 for a customer, 0 for a non-customer), our neural network architecture consists of one input layer, one output layer, and two hidden layers, as described below:

Metric performance
After tuning a set of hyperparameters such as learning rate, batch size, number of epochs, and dropout, the model remains largely biased towards predicting class 0 (not a customer), even though it reaches a training accuracy of 0.98 and a validation accuracy of 0.98.
Because of this bias, the model produces a poor ROC AUC score of just 0.54. This tells us that the few persons who ought to be predicted as customers (1) are instead predicted as non-customers (0). This is consistent with the imbalance of our MAILOUT dataset, in which just 1.2% of the persons are identified as customers while the remaining bulk are not. One solution to this problem is either to upsample the customers or to downsample the much larger group of non-customers.
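Why high accuracy and a near-chance ROC AUC can coexist is easy to reproduce on synthetic labels. The sketch below uses a made-up label vector with 1.2% positives and a model that scores every sample identically, then shows one possible way to upsample the minority class with `sklearn.utils.resample`. It is an illustration, not the project's code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.utils import resample

n_samples = 10_000
y_true = np.zeros(n_samples, dtype=int)
y_true[:120] = 1  # 1.2% positives, as in the MAILOUT data

# A "model" that gives every sample the same score, i.e. always predicts 0
constant_scores = np.zeros(n_samples)
accuracy = accuracy_score(y_true, (constant_scores >= 0.5).astype(int))
auc = roc_auc_score(y_true, constant_scores)
print(accuracy, auc)  # high accuracy, chance-level AUC

# Upsampling: draw minority indices with replacement until classes are balanced
minority_idx = np.flatnonzero(y_true == 1)
majority_idx = np.flatnonzero(y_true == 0)
upsampled_minority = resample(
    minority_idx, replace=True, n_samples=len(majority_idx), random_state=0
)
balanced_idx = np.concatenate([majority_idx, upsampled_minority])
balanced_share = y_true[balanced_idx].mean()
```

Resampling is normally applied to the training split only, so that the validation metrics still reflect the original class balance.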
- Gradient Boosting Classifier (GBC)
As one of the most widely used machine learning techniques for regression and classification problems, gradient boosting produces predictions from an ensemble of weak learners. GBC is used here as the classifier in a grid-searched Pipeline, with learning rate, number of estimators, max_depth and min_samples_split as the tuned hyperparameters.
Metric Performance
Producing an optimized ROC AUC score of 0.4999, the best hyperparameter combination for this estimator is:
{'learning_rate': 0.1,
'max_depth': 3,
'min_samples_split': 4,
'n_estimators': 100}
Though a considerable range of parameters was taken into account, GBC produces poor results, with a very long execution wall time. We therefore turn to other boosting approaches to improve both run time and performance.
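A grid search over a Pipeline of this kind can be sketched as follows. The data is synthetic, the grid is deliberately tiny so that the example runs quickly, and the step names (`impute`, `scale`, `clf`) and the three-fold cross-validation are assumptions rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small imbalanced synthetic problem standing in for the MAILOUT data
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "clf__learning_rate": [0.1, 0.01],
    "clf__max_depth": [3],
    "clf__min_samples_split": [2, 4],
    "clf__n_estimators": [50],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `roc_auc` instead of the default accuracy keeps the search from rewarding models that simply predict the majority class.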
- Light Gradient Boost Machine (Light GBM)
Unlike GBC, Light GBM can handle categorical features directly when given the names of those features. It does not need them to be one-hot encoded, and this is much faster than one-hot encoding. Light GBM uses a particular process to find the split value of categorical features [link].
The grid search for Light GBM is built on four main parameters with the following values:
{'learning_rate': [0.01, 0.001, 0.16, 0.1],
'max_depth': [3, 5, 10, 30],
'n_estimators': [100, 200, 300, 400, 500, 1000, 2000],
'min_samples_split': [2, 4]
}
Metric Performance
Unlike GBC, the grid search on Light GBM obtains a better ROC AUC score of 0.7503 on test data, within a wall time of 54 minutes 13 seconds, with the following best parameters:
{'learning_rate': 0.001,
'max_depth': 30,
'min_samples_split': 2,
'n_estimators': 1000}
Wall time: 54min 13s

- XGBoost (GBM implementation designed for speed and performance)
Unlike Light GBM, XGBoost cannot handle categorical features by itself. It only accepts numerical values, similar to Random Forest. However, all categorical features were already encoded in the data cleaning section. The following grid search parameters are adopted for this approach:
{'learning_rate': [0.01, 0.001, 0.16, 0.1],
'max_depth': [2, 3, 5, 10, 30, 40],
'n_estimators': [50, 100, 200, 300, 400],
'min_child_weight': [1, 3, 6]}
Metric performance
Out of all four approaches, XGBoost produces the best performance, with an ROC AUC score of 0.7741 on test data at a wall time of 2 hours 3 minutes and 56 seconds (probably long because of the large number of hyperparameter combinations), with the following best parameters:
{'learning_rate': 0.01,
'max_depth': 3,
'n_estimators': 100}
Wall time: 2h 3min 56s
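To put the wall times in context, the number of hyperparameter combinations in the two grids listed above can be counted with scikit-learn's `ParameterGrid`. This is a small worked example; the number of cross-validation folds is not stated, so only the number of combinations is computed.

```python
from sklearn.model_selection import ParameterGrid

light_gbm_grid = {
    "learning_rate": [0.01, 0.001, 0.16, 0.1],
    "max_depth": [3, 5, 10, 30],
    "n_estimators": [100, 200, 300, 400, 500, 1000, 2000],
    "min_samples_split": [2, 4],
}
xgboost_grid = {
    "learning_rate": [0.01, 0.001, 0.16, 0.1],
    "max_depth": [2, 3, 5, 10, 30, 40],
    "n_estimators": [50, 100, 200, 300, 400],
    "min_child_weight": [1, 3, 6],
}

n_light_gbm = len(ParameterGrid(light_gbm_grid))  # 4 * 4 * 7 * 2 = 224
n_xgboost = len(ParameterGrid(xgboost_grid))      # 4 * 6 * 5 * 3 = 360
print(n_light_gbm, n_xgboost)
```

Each combination is fitted once per cross-validation fold, so the total number of model fits is the combination count multiplied by the number of folds.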

NB: The weights obtained from this framework were collected and used in the project’s Kaggle competition, obtaining an ROC AUC score of 0.79781 upon prediction, a margin of 0.00268 from the first score. [link]
Conclusion
With an in-depth tour of data science methodologies, we provide a set of insightful results to support the Arvato Industries decision-making process:
- We extract, from the broad range of features provided, the 260 features that carry the most information for analysis.
- We study and segment the German population (Azdias.csv) into 15 clusters, based on the similarities between persons with respect to their features.
- We distribute the company’s current customers across the 15 population segments (clusters) to see which segments contain the most customers.
- We take a deeper look at the segments with the most customers and analyze what features these segments have in common. These features tell us which properties make an individual likely to become a customer. Three of these features are the age range, the upper-class car shares, and the trailer shares.
- Using a set of supervised learning frameworks, we predict in real time whether a person belonging to the German population will become a customer or not, with an ROC AUC score of 0.79781.
Considering these results, we can affirm that we have accomplished our purpose, which is to segment the German population and predict which segments are most likely to respond to the mail-order company after receiving a mail-order offer.