Improve customer experience in using credit cards. For that we will divide customers into groups and adapt bank services based on a group behavior
Public datasets available on Kaggle: https://www.kaggle.com/arjunbhasin2013/ccdata
Licence: CC0: Public Domain
Dataset includes characteristics (18 variables) of 9000 active credit card holders during the last 6 months. Variables:
In our analysis we will assume that data is not biased, which means that it was randomly generated and includes information about different clients in terms of gender, age, income, marital status, etc
data <- read.csv('/Users/korol/Documents/CC GENERAL.csv')
glimpse(data)
## Rows: 8,950
## Columns: 18
## $ CUST_ID <chr> "C10001", "C10002", "C10003", "C10004~
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, ~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333~
## $ CASH_ADVANCE_TRX <int> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <int> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
print(paste("duplicates in the file", sum(duplicated(data))))
## [1] "duplicates in the file 0"
print(paste("NA values in the file", sum(is.na(data))))
## [1] "NA values in the file 314"
summary(data)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000
## Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000
##
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20246 Mean :0.3644
## 3rd Qu.:0.30000 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
##
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
## 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## NA's :1
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 383.3 1st Qu.: 169.12 1st Qu.:0.0000 1st Qu.:12.00
## Median : 856.9 Median : 312.34 Median :0.0000 Median :12.00
## Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
## 3rd Qu.: 1901.1 3rd Qu.: 825.49 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
## NA's :313
No duplicates in our dataset. However, we have 1 missing value in credit limit and 313 in minimum payments
data_t <- drop_na(data,CREDIT_LIMIT)
Find a median and mean values:
data_calcul <- drop_na(data_t,MINIMUM_PAYMENTS)
mean_limit <- mean(data_calcul$MINIMUM_PAYMENTS)
median_limit <- median(data_calcul$MINIMUM_PAYMENTS)
Density graph wthout NA values:
ggplot(data=data_t,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.9)+
labs(title = "Minimum payment distribution")
## Warning: Removed 313 rows containing non-finite values (stat_density).
Density graph with NA filled with mean:
data_r_mean <- mutate(data_t, MINIMUM_PAYMENTS = ifelse(is.na(MINIMUM_PAYMENTS), mean_limit, MINIMUM_PAYMENTS))
summary(data_r_mean)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8949 Min. : 0.0 Min. :0.0000 Min. : 0.0
## Class :character 1st Qu.: 128.4 1st Qu.:0.8889 1st Qu.: 39.8
## Mode :character Median : 873.7 Median :1.0000 Median : 361.5
## Mean : 1564.6 Mean :0.8774 Mean : 1003.3
## 3rd Qu.: 2054.4 3rd Qu.:1.0000 3rd Qu.: 1110.2
## Max. :19043.1 Max. :1.0000 Max. :49039.6
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0 Median :0.50000
## Mean : 592.5 Mean : 411.1 Mean : 979 Mean :0.49041
## 3rd Qu.: 577.8 3rd Qu.: 468.6 3rd Qu.: 1114 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137 Max. :1.00000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20248 Mean :0.3645
## 3rd Qu.:0.30000 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
## 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 383.3 1st Qu.: 170.88 1st Qu.:0.0000 1st Qu.:12.00
## Median : 857.1 Median : 335.66 Median :0.0000 Median :12.00
## Mean : 1733.3 Mean : 864.30 Mean :0.1537 Mean :11.52
## 3rd Qu.: 1901.3 3rd Qu.: 864.30 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
ggplot(data=data_r_mean,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Minimum payment distribution")
Density graph with NA filled with median:
data_r_median <- mutate(data_t, MINIMUM_PAYMENTS = ifelse(is.na(MINIMUM_PAYMENTS), median_limit, MINIMUM_PAYMENTS))
summary(data_r_median)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8949 Min. : 0.0 Min. :0.0000 Min. : 0.0
## Class :character 1st Qu.: 128.4 1st Qu.:0.8889 1st Qu.: 39.8
## Mode :character Median : 873.7 Median :1.0000 Median : 361.5
## Mean : 1564.6 Mean :0.8774 Mean : 1003.3
## 3rd Qu.: 2054.4 3rd Qu.:1.0000 3rd Qu.: 1110.2
## Max. :19043.1 Max. :1.0000 Max. :49039.6
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. :0.00000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.:0.08333
## Median : 38.0 Median : 89.0 Median : 0 Median :0.50000
## Mean : 592.5 Mean : 411.1 Mean : 979 Mean :0.49041
## 3rd Qu.: 577.8 3rd Qu.: 468.6 3rd Qu.: 1114 3rd Qu.:0.91667
## Max. :40761.2 Max. :22500.0 Max. :47137 Max. :1.00000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20248 Mean :0.3645
## 3rd Qu.:0.30000 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
## 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 383.3 1st Qu.: 170.88 1st Qu.:0.0000 1st Qu.:12.00
## Median : 857.1 Median : 312.45 Median :0.0000 Median :12.00
## Mean : 1733.3 Mean : 845.00 Mean :0.1537 Mean :11.52
## 3rd Qu.: 1901.3 3rd Qu.: 788.72 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
ggplot(data=data_r_median,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Minimum payment distribution")
As we can see median affects less the distribution, so to avoid misleading results we will replace missing values with mean values
In this section we will review the variables and remove outliers
BALANCE - Balance amount left on the bank account:
ggplot(data=data_r_mean,aes(x=BALANCE))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Balance distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$BALANCE)
We can see significat number of outliers. This can negatively affect our analysis, so we will reduce its number
def <- function(x){
quantiles <- quantile( x, c(.05, .95 ) )
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
data_r_mean$BALANCE <- def( data_r_mean$BALANCE )
Overall, we can see that the distribution of BALANCE variable is skewed right. Which give us an idea that most of owners of the credit cards have not so high balance on the card.
BALANCE_FREQUENCY - How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
ggplot(data=data_r_mean,aes(x=BALANCE_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.2)+
labs(title = "Balance frequency update distribution")+
ylab("Number of accounts")
Majority of accounts are being updated quite often. And taken into consideration the previous result, we can make an aasumption that users operate with small amounts.
PURCHASES - Amount of purchases made from account
ggplot(data=data_r_mean,aes(x=PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Amount of purchases distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$PURCHASES)
We can also notice outliers, which should be removed.
data_r_mean$PURCHASES <- def( data_r_mean$PURCHASES )
The distribution on purchases amounts shows us that the assumption we made make sence. Clients do not do big purchases (or do but not so often).
ONEOFF_PURCHASES - Maximum purchase amount done in one-go
ggplot(data=data_r_mean,aes(x=ONEOFF_PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Amount of purchases in one-go distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$ONEOFF_PURCHASES)
data_r_mean$ONEOFF_PURCHASES <- def( data_r_mean$ONEOFF_PURCHASES )
We can see a quite similar situation as we saw on purchase distribution graph. However, this can mean different users’ behavior: some people make purchases rare, but spend a lot, others prefer to do a lot of small purchases. It will be clear when we will check a heatmap
INSTALLMENTS_PURCHASES - Amount of purchase done in installment
ggplot(data=data_r_mean,aes(x=INSTALLMENTS_PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Amount of purchases in installment distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$INSTALLMENTS_PURCHASES)
data_r_mean$INSTALLMENTS_PURCHASES <- def( data_r_mean$INSTALLMENTS_PURCHASES )
According to the distribution, not too many our clients use this way of payment. However, we can make a hypothesis that clients who spend a lot in one-go, use this option more often than the one who buys more often, but on a small amount.
CASH_ADVANCE - Cash in advance given to the user
ggplot(data=data_r_mean,aes(x=CASH_ADVANCE))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Amount of purchases (loan) distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$CASH_ADVANCE)
data_r_mean$CASH_ADVANCE <- def( data_r_mean$CASH_ADVANCE )
As we can see some of our clients use the option of “payment by the given cash”. Plus, we can notice that the amount spent are quite high. It can depend on the fact that loan money people use mostly on expensive purchases.
PURCHASES_FREQUENCY - How frequently the Purchases are being made
ggplot(data=data_r_mean,aes(x=PURCHASES_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
labs(title = "Frequency of purchases distribution")+
ylab("Number of accounts")
According to this distribution frequency of purchases is different from one client to another. Some of them make purchases quite often, other ones very rare. It can depend on Purchase amount. We will check this hypothesis on the heatmap.
ONEOFF_PURCHASES_FREQUENCY - How frequently Purchases are happening in one-go
ggplot(data=data_r_mean,aes(x=ONEOFF_PURCHASES_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
labs(title = "One-go purchases frequency distribution")+
ylab("Number of accounts")
As we can see majority of card owners do not make purchases in one-go. This can be a good insight in terms of what clients are buying. For example, if clients would use credit cards mostly for purchases of something valuable we could see that frequency would be quite high. However, our clients use credit cards for ordinary purchases as well.
PURCHASES_INSTALLMENTS_FREQUENCY - How frequently purchases in installments are being done
ggplot(data=data_r_mean,aes(x=PURCHASES_INSTALLMENTS_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
labs(title = "Purchases in installement frequency distribution")+
ylab("Number of accounts")
Clearly there are different types of credit owners can be spotted: majority does not buy in installments, however, there is clearly a group who is using this option
CASH_ADVANCE_FREQUENCY - How frequently the cash in advance being paid
ggplot(data=data_r_mean,aes(x=CASH_ADVANCE_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
labs(title = "Cash in advance frequency payment distribution")+
ylab("Number of accounts")
As we can see our clients do not use often the cash borrowed against the credit limit. Which can give us an idea that most customers use credit cards as a debit one. This can be improved with a new customer relationship strategy
PURCHASESTRX - Numbe of purchase transactions made
ggplot(data=data_r_mean,aes(x=PURCHASES_TRX ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Number of transactions distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can notice that some accounts stay inactive. Probably the bank should “wake up” the users by organizing a marketing campaign
CREDIT_LIMIT - Limit of Credit Card for user
ggplot(data=data_r_mean,aes(x=CREDIT_LIMIT ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Credit card limit distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$CREDIT_LIMIT)
data_r_mean$CREDIT_LIMIT <- def( data_r_mean$CREDIT_LIMIT )
We can see a quite common picture: there a lot of users with quite low credit card limit, as well as the one with quite a high one. The credit limit usually depends on the income of the user and card type
PRC_FULL_PAYMENT - Percent of full payment paid by user
ggplot(data=data_r_mean,aes(x=PRC_FULL_PAYMENT ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, binwidth =0.15 )+
labs(title = "Percent of full payment paid distribution")+
ylab("Number of accounts")
As we can see there are still a lot of payments which were not payed by users. Probably they would need some notification or any other reminders about their debt. This can be also one of the reasons why customers avoid cash in advance option.
PAYMENTS - Amount of Payment done by user
ggplot(data=data_r_mean,aes(x=PAYMENTS ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
labs(title = "Amount of payment distribution")+
ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_r_mean$PAYMENTS)
data_r_mean$PAYMENTS <- def( data_r_mean$PAYMENTS )
According to payment distribution, some users stay inactive, what proves us the necessity of changing of the marketing campaign.
All connections can be displayed in the correlation matrix:
Correlation matrix
df<- select (data_r_mean,-c(CUST_ID))
correlation<-cor(df)
corrplot(correlation, method="circle",tl.cex = 0.5,tl.srt = 30)
Now, after we analyzed the variables and removed all outliers, we can move to the next step - division of users into the groups. We will try 2 methods: k-means and agglomerative clustering (hierarchy)
In order our clustering method to work better, we should reduce number of dimensions. For that we will use PCA method.
pca <- prcomp(df, scale = TRUE)
fviz_eig(pca)
According to the graph first 2 components describe around 50% of data, however this is not enough for us. We will use 9 components in order to reach 90% of data description.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2344 1.8842 1.21793 1.12493 1.02469 0.96892 0.85881
## Proportion of Variance 0.2937 0.2088 0.08726 0.07444 0.06176 0.05522 0.04339
## Cumulative Proportion 0.2937 0.5025 0.58976 0.66420 0.72597 0.78119 0.82457
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.85135 0.74729 0.62905 0.58829 0.52817 0.48473 0.4555
## Proportion of Variance 0.04264 0.03285 0.02328 0.02036 0.01641 0.01382 0.0122
## Cumulative Proportion 0.86721 0.90006 0.92334 0.94369 0.96010 0.97392 0.9861
## PC15 PC16 PC17
## Standard deviation 0.41181 0.21374 0.14331
## Proportion of Variance 0.00998 0.00269 0.00121
## Cumulative Proportion 0.99610 0.99879 1.00000
pca$rotation[,1:9]
## PC1 PC2 PC3
## BALANCE 0.02958569 0.42843002 -0.08665421
## BALANCE_FREQUENCY 0.11459608 0.17547432 -0.24540345
## PURCHASES 0.40870226 0.06646847 0.14572747
## ONEOFF_PURCHASES 0.33392459 0.09726801 0.41990106
## INSTALLMENTS_PURCHASES 0.34552565 -0.01437961 -0.33065026
## CASH_ADVANCE -0.10627769 0.43490122 -0.07356351
## PURCHASES_FREQUENCY 0.35860753 -0.11217420 -0.28190157
## ONEOFF_PURCHASES_FREQUENCY 0.30999017 0.06051366 0.38525283
## PURCHASES_INSTALLMENTS_FREQUENCY 0.30383274 -0.11769475 -0.50867778
## CASH_ADVANCE_FREQUENCY -0.13349292 0.41787969 -0.11367215
## CASH_ADVANCE_TRX -0.09047753 0.39608952 -0.13016314
## PURCHASES_TRX 0.36541988 0.05419991 -0.03910502
## CREDIT_LIMIT 0.16294067 0.27666802 0.16524186
## PAYMENTS 0.21124716 0.29082922 0.14572193
## MINIMUM_PAYMENTS 0.02764953 0.16591386 -0.22422978
## PRC_FULL_PAYMENT 0.14617060 -0.16996727 0.04798094
## TENURE 0.08315733 0.01853191 0.02860733
## PC4 PC5 PC6
## BALANCE 0.287517965 -0.01044361 0.108597100
## BALANCE_FREQUENCY 0.250172811 -0.47719641 -0.399208698
## PURCHASES -0.021785638 -0.04127086 0.086890402
## ONEOFF_PURCHASES 0.015615830 -0.19118024 0.065722438
## INSTALLMENTS_PURCHASES -0.080323472 0.15946612 0.053141717
## CASH_ADVANCE -0.216008913 0.09251977 -0.001677022
## PURCHASES_FREQUENCY -0.091647515 -0.13550677 -0.042988119
## ONEOFF_PURCHASES_FREQUENCY 0.006082614 -0.31068704 -0.040095971
## PURCHASES_INSTALLMENTS_FREQUENCY -0.102260169 0.03290521 -0.024489524
## CASH_ADVANCE_FREQUENCY -0.270425918 -0.14981032 -0.097540912
## CASH_ADVANCE_TRX -0.307613144 -0.05737734 -0.150742599
## PURCHASES_TRX -0.001652891 -0.04558888 0.074885832
## CREDIT_LIMIT 0.085728061 0.35509056 0.038800278
## PAYMENTS -0.139215061 0.32402291 -0.021733652
## MINIMUM_PAYMENTS 0.429796532 0.15597438 0.569865480
## PRC_FULL_PAYMENT -0.440579896 0.34898908 -0.101455843
## TENURE 0.456835644 0.42301956 -0.659715267
## PC7 PC8 PC9
## BALANCE -0.07060779 0.28815489 0.018607651
## BALANCE_FREQUENCY -0.47751179 0.04058945 0.379415236
## PURCHASES 0.12968382 -0.08182988 0.164570992
## ONEOFF_PURCHASES 0.05610442 -0.12261122 0.014011698
## INSTALLMENTS_PURCHASES 0.20863180 0.04512768 0.240743724
## CASH_ADVANCE 0.01420721 0.06672741 0.022060925
## PURCHASES_FREQUENCY -0.06057857 0.06909679 -0.347048314
## ONEOFF_PURCHASES_FREQUENCY -0.11802041 -0.09250240 -0.355368845
## PURCHASES_INSTALLMENTS_FREQUENCY 0.10290834 0.11589627 -0.149705334
## CASH_ADVANCE_FREQUENCY 0.05998574 -0.15635027 -0.132416312
## CASH_ADVANCE_TRX 0.10364445 -0.28860330 -0.298544176
## PURCHASES_TRX 0.20116944 -0.08784955 -0.009832214
## CREDIT_LIMIT -0.20664994 0.61243691 -0.272834897
## PAYMENTS 0.06731767 -0.14723918 0.509407446
## MINIMUM_PAYMENTS -0.29383628 -0.49074710 -0.140167060
## PRC_FULL_PAYMENT -0.68114083 -0.17502701 -0.002395611
## TENURE 0.15091584 -0.27687042 -0.201210553
colMain <- colorRampPalette(brewer.pal(8, "Blues"))(25)
heatmap(pca$rotation[,1:9],Colv=NA, Rowv=NA, scale='none',col=colMain)
On the graph we can see composition of each principal component.
pca_df <- cbind(df[0],pca$x[,1:9])
We will use Elbow method to define optimal number of clusters:
fviz_nbclust(pca_df, kmeans, method = "wss")
As we can see 5 clusters looks like an optimal value for k-means method
fviz_nbclust(pca_df, hcut, method = "wss")
We will use also 5 clusters.
K-means:
km.res <- eclust(pca_df, "kmeans", k = 5,
nstart = 25, graph = FALSE)
Agglomerative clustering (single-linkage method)
res.hc <- eclust(pca_df, "hclust", k = 5,
method = "single", graph = FALSE)
print(res.hc)
##
## Call:
## stats::hclust(d = x, method = hc_method)
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 8949
We will use silhouette score:
Check model quality for K-means:
fviz_silhouette(km.res)
## cluster size ave.sil.width
## 1 1 2938 0.32
## 2 2 2332 0.20
## 3 3 1300 0.19
## 4 4 1218 0.12
## 5 5 1161 0.18
Check model quality for Agglomerative clustering:
fviz_silhouette(res.hc)
## cluster size ave.sil.width
## 1 1 3543 0.26
## 2 2 840 0.25
## 3 3 3178 0.10
## 4 4 521 0.25
## 5 5 867 0.08
According to the silhouette score K-means does clustering better than Agglomerative clustering method.
We will use for result interpretation better model: K-means
First we will see at the composition of each cluster:
cluster <- c(1: 5)
center <- km.res$centers
center_df <- data.frame(cluster, center)
center_reshape <- gather(center_df, features, values,PC1:PC9)
center_reshape
## cluster features values
## 1 1 PC1 -1.37218255
## 2 2 PC1 0.93332431
## 3 3 PC1 -1.31529294
## 4 4 PC1 -1.20772795
## 5 5 PC1 4.33751379
## 6 1 PC2 -0.10812685
## 7 2 PC2 -1.23157150
## 8 3 PC2 -1.48652162
## 9 4 PC2 3.46664149
## 10 5 PC2 0.77503032
## 11 1 PC3 0.31287389
## 12 2 PC3 -1.13956679
## 13 3 PC3 0.86660116
## 14 4 PC3 -0.35364147
## 15 5 PC3 0.89784675
## 16 1 PC4 0.54949104
## 17 2 PC4 -0.15820480
## 18 3 PC4 -0.56841332
## 19 4 PC4 -0.43654619
## 20 5 PC4 0.02168776
## 21 1 PC5 -0.45763662
## 22 2 PC5 -0.07317632
## 23 3 PC5 1.07031026
## 24 4 PC5 0.25284707
## 25 5 PC5 -0.15864558
## 26 1 PC6 -0.21539288
## 27 2 PC6 -0.20714160
## 28 3 PC6 0.74975832
## 29 4 PC6 -0.04873688
## 30 5 PC6 0.17274264
## 31 1 PC7 -0.11325811
## 32 2 PC7 -0.21512317
## 33 3 PC7 0.67620975
## 34 4 PC7 -0.17590096
## 35 5 PC7 0.14607602
## 36 1 PC8 -0.08897819
## 37 2 PC8 0.12712013
## 38 3 PC8 -0.05840418
## 39 4 PC8 0.15204442
## 40 5 PC8 -0.12428156
## 41 1 PC9 0.30043477
## 42 2 PC9 -0.09793344
## 43 3 PC9 -0.48926991
## 44 4 PC9 -0.14123239
## 45 5 PC9 0.13245079
ggplot(data = center_reshape, aes(x = features, y = cluster, fill = values)) +
scale_y_continuous(breaks = seq(1, 7, by = 1)) +
geom_tile() +
coord_equal() +
theme_classic()
On the graph we can see that overall components PC4, PC5, PC6, PC7, PC8, PC9 have quite similar values for each cluster. The main difference among them is restricted in components PC1, PC2, PC3. That is why we will focus on them.
Cluster 1: Economical
These clients have low income. They buy rarely and little. However, when they want to buy something, they use installment or cash in advance payment methods. They do not borrow too much. However, the bank should pay attention to them, as they often do not pay off the debt on the credit card.
Cluster 2: Restricted
Clients of this group also earn a little, but they make a lot of purchases. They rarely use loans and mostly spend their own funds. They do not often buy something in one go, usually they spend often but a small amounts
Cluster 3: Opportunists
This cluster has the poorest customers. However, they shop quite often. They often pay for purchases with borrowed funds. They are somewhere between cluster 1 and 2.
Cluster 4: Middle class
Clients of this cluster have an average income and make purchases quite rarely. However, if they do decide to buy, they spend a lot. They often pay with borrowed funds. It should be noted separately that these clients also have a high level of debt.
Cluster 5: Wealthy clients
These clients are very wealthy and spend a lot. They buy both in one go and in the usual way. These clients rarely use borrowed funds, but if they do, they regularly pay off the debt.
For Economical: Economical class does not use often the credit cards. Our main goal is to wake them up. For that the bank should send them more often the marketing propositions from their partners: promotions, discounts, etc. in order to encourage them to buy. Another way to wake them up is to propose a cashback. For someone who does not earn much, it could be a great opportunity to save money
For Restricted: Clients of this group buy a lot, but pay with their own money. One of the possible reasons can be a fear to have a loan. Probably for them the mechanism of repayment is not clear enough. One of potential solutions can be an integration of instructions in the bank application and user friendly interface, where they can see clearly how the payments will be taken. Plus bank workers can call these clients from time to time to check if they have all necessary information
For Opportunists: Opportunists are already buying a lot. Their main problem is paying for borrowing money. As we can see it is very hard for them to return back the loan. That is why one of the potential solutions can be a proposition of a loan in order to cover the amount borrowed with the card. At the same time, these clients would appreciate if the payment period could be increased.
For Middle class: These clients are using the card only for buying something expensive and do not use it on a dayily basis. Strategy of communication with these clients should be based on two things: convince them that the credit card can be used as a debit one and help them to pay their debts on time. As for the first problem, the bank should provide them a clear explanation how the card works (in the application, by phone, in the office). The system should be clear and prevent customers when they are about to exceed the credit limit, in order to avoid misunderstanding. As for the second problem, the bank can propose clients to set up automatic payments to cover their debt from their salary and notification can also serve in this situation
For Wealthy clients: Wealthy clients use credit card as a debit one. Due to the high credit limit they are rarely exceeding it. The bank can encourage them to make costly purchases by sending propositions from their partners. However, the proposition should be adapted to the client profile. Here we would need to dive deeper and analyze their preferences. For example, we can check where they usually eat, what they like to do, etc. Analysis should help us to identify their needs. After that clients can be classified into groups and have their own newsletter with recommendations