Business task

Improve customer experience in using credit cards. For that we will divide customers into groups and adapt bank services based on a group behavior

Datasource description

Public datasets available on Kaggle: https://www.kaggle.com/arjunbhasin2013/ccdata

Licence: CC0: Public Domain

Dataset includes characteristics (18 variables) of 9000 active credit card holders during the last 6 months. Variables:

In our analysis we will assume that data is not biased, which means that it was randomly generated and includes information about different clients in terms of gender, age, income, marital status, etc

Data cleaning and organization

1. Import of libraries and files

data <- read.csv('/Users/korol/Documents/CC GENERAL.csv')

2. Data errors check

glimpse(data)
## Rows: 8,950
## Columns: 18
## $ CUST_ID                          <chr> "C10001", "C10002", "C10003", "C10004~
## $ BALANCE                          <dbl> 40.90075, 3202.46742, 2495.14886, 166~
## $ BALANCE_FREQUENCY                <dbl> 0.818182, 0.909091, 1.000000, 0.63636~
## $ PURCHASES                        <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, ~
## $ ONEOFF_PURCHASES                 <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0~
## $ INSTALLMENTS_PURCHASES           <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2~
## $ CASH_ADVANCE                     <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, ~
## $ PURCHASES_FREQUENCY              <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY       <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY           <dbl> 0.000000, 0.250000, 0.000000, 0.08333~
## $ CASH_ADVANCE_TRX                 <int> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX                    <int> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, ~
## $ CREDIT_LIMIT                     <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1~
## $ PAYMENTS                         <dbl> 201.8021, 4103.0326, 622.0667, 0.0000~
## $ MINIMUM_PAYMENTS                 <dbl> 139.50979, 1072.34022, 627.28479, NA,~
## $ PRC_FULL_PAYMENT                 <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE                           <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
print(paste("duplicates in the file", sum(duplicated(data))))
## [1] "duplicates in the file 0"
print(paste("NA values in the file", sum(is.na(data))))
## [1] "NA values in the file 314"
summary(data)
##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES       
##  Length:8950        Min.   :    0.0   Min.   :0.0000    Min.   :    0.00  
##  Class :character   1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63  
##  Mode  :character   Median :  873.4   Median :1.0000    Median :  361.28  
##                     Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20  
##                     3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.57  
##                                                                           
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE     PURCHASES_FREQUENCY
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0.0   Min.   :0.00000    
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0.0   1st Qu.:0.08333    
##  Median :   38.0   Median :   89.0        Median :    0.0   Median :0.50000    
##  Mean   :  592.4   Mean   :  411.1        Mean   :  978.9   Mean   :0.49035    
##  3rd Qu.:  577.4   3rd Qu.:  468.6        3rd Qu.: 1113.8   3rd Qu.:0.91667    
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137.2   Max.   :1.00000    
##                                                                                
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :0.00000            Min.   :0.0000                  
##  1st Qu.:0.00000            1st Qu.:0.0000                  
##  Median :0.08333            Median :0.1667                  
##  Mean   :0.20246            Mean   :0.3644                  
##  3rd Qu.:0.30000            3rd Qu.:0.7500                  
##  Max.   :1.00000            Max.   :1.0000                  
##                                                             
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
##  Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
##  1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
##  Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
##  Mean   :0.1351         Mean   :  3.249   Mean   : 14.71   Mean   : 4494  
##  3rd Qu.:0.2222         3rd Qu.:  4.000   3rd Qu.: 17.00   3rd Qu.: 6500  
##  Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
##                                                            NA's   :1      
##     PAYMENTS       MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
##  Min.   :    0.0   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:  383.3   1st Qu.:  169.12   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  856.9   Median :  312.34   Median :0.0000   Median :12.00  
##  Mean   : 1733.1   Mean   :  864.21   Mean   :0.1537   Mean   :11.52  
##  3rd Qu.: 1901.1   3rd Qu.:  825.49   3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :50721.5   Max.   :76406.21   Max.   :1.0000   Max.   :12.00  
##                    NA's   :313

No duplicates in our dataset. However, we have 1 missing value in credit limit and 313 in minimum payments

3. Data cleaning

Credit limit: delete the line with NA value

data_t <- drop_na(data,CREDIT_LIMIT)

Filling NA values for MINIMUM_PAYMENTS:

Find a median and mean values:

data_calcul <- drop_na(data_t,MINIMUM_PAYMENTS)
mean_limit <- mean(data_calcul$MINIMUM_PAYMENTS)
median_limit <- median(data_calcul$MINIMUM_PAYMENTS)

Density graph wthout NA values:

ggplot(data=data_t,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.9)+
  labs(title = "Minimum payment distribution")
## Warning: Removed 313 rows containing non-finite values (stat_density).

Density graph with NA filled with mean:

data_r_mean <- mutate(data_t, MINIMUM_PAYMENTS = ifelse(is.na(MINIMUM_PAYMENTS), mean_limit, MINIMUM_PAYMENTS))
summary(data_r_mean) 
##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES      
##  Length:8949        Min.   :    0.0   Min.   :0.0000    Min.   :    0.0  
##  Class :character   1st Qu.:  128.4   1st Qu.:0.8889    1st Qu.:   39.8  
##  Mode  :character   Median :  873.7   Median :1.0000    Median :  361.5  
##                     Mean   : 1564.6   Mean   :0.8774    Mean   : 1003.3  
##                     3rd Qu.: 2054.4   3rd Qu.:1.0000    3rd Qu.: 1110.2  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.6  
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE   PURCHASES_FREQUENCY
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0   Min.   :0.00000    
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0   1st Qu.:0.08333    
##  Median :   38.0   Median :   89.0        Median :    0   Median :0.50000    
##  Mean   :  592.5   Mean   :  411.1        Mean   :  979   Mean   :0.49041    
##  3rd Qu.:  577.8   3rd Qu.:  468.6        3rd Qu.: 1114   3rd Qu.:0.91667    
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137   Max.   :1.00000    
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :0.00000            Min.   :0.0000                  
##  1st Qu.:0.00000            1st Qu.:0.0000                  
##  Median :0.08333            Median :0.1667                  
##  Mean   :0.20248            Mean   :0.3645                  
##  3rd Qu.:0.30000            3rd Qu.:0.7500                  
##  Max.   :1.00000            Max.   :1.0000                  
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
##  Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
##  1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
##  Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
##  Mean   :0.1351         Mean   :  3.249   Mean   : 14.71   Mean   : 4494  
##  3rd Qu.:0.2222         3rd Qu.:  4.000   3rd Qu.: 17.00   3rd Qu.: 6500  
##  Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
##     PAYMENTS       MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
##  Min.   :    0.0   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:  383.3   1st Qu.:  170.88   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  857.1   Median :  335.66   Median :0.0000   Median :12.00  
##  Mean   : 1733.3   Mean   :  864.30   Mean   :0.1537   Mean   :11.52  
##  3rd Qu.: 1901.3   3rd Qu.:  864.30   3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :50721.5   Max.   :76406.21   Max.   :1.0000   Max.   :12.00
ggplot(data=data_r_mean,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Minimum payment distribution")

Density graph with NA filled with median:

data_r_median <- mutate(data_t, MINIMUM_PAYMENTS = ifelse(is.na(MINIMUM_PAYMENTS), median_limit, MINIMUM_PAYMENTS))
summary(data_r_median) 
##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES      
##  Length:8949        Min.   :    0.0   Min.   :0.0000    Min.   :    0.0  
##  Class :character   1st Qu.:  128.4   1st Qu.:0.8889    1st Qu.:   39.8  
##  Mode  :character   Median :  873.7   Median :1.0000    Median :  361.5  
##                     Mean   : 1564.6   Mean   :0.8774    Mean   : 1003.3  
##                     3rd Qu.: 2054.4   3rd Qu.:1.0000    3rd Qu.: 1110.2  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.6  
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE   PURCHASES_FREQUENCY
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0   Min.   :0.00000    
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0   1st Qu.:0.08333    
##  Median :   38.0   Median :   89.0        Median :    0   Median :0.50000    
##  Mean   :  592.5   Mean   :  411.1        Mean   :  979   Mean   :0.49041    
##  3rd Qu.:  577.8   3rd Qu.:  468.6        3rd Qu.: 1114   3rd Qu.:0.91667    
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137   Max.   :1.00000    
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :0.00000            Min.   :0.0000                  
##  1st Qu.:0.00000            1st Qu.:0.0000                  
##  Median :0.08333            Median :0.1667                  
##  Mean   :0.20248            Mean   :0.3645                  
##  3rd Qu.:0.30000            3rd Qu.:0.7500                  
##  Max.   :1.00000            Max.   :1.0000                  
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
##  Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
##  1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
##  Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
##  Mean   :0.1351         Mean   :  3.249   Mean   : 14.71   Mean   : 4494  
##  3rd Qu.:0.2222         3rd Qu.:  4.000   3rd Qu.: 17.00   3rd Qu.: 6500  
##  Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
##     PAYMENTS       MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
##  Min.   :    0.0   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:  383.3   1st Qu.:  170.88   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  857.1   Median :  312.45   Median :0.0000   Median :12.00  
##  Mean   : 1733.3   Mean   :  845.00   Mean   :0.1537   Mean   :11.52  
##  3rd Qu.: 1901.3   3rd Qu.:  788.72   3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :50721.5   Max.   :76406.21   Max.   :1.0000   Max.   :12.00
ggplot(data=data_r_median,aes(x=MINIMUM_PAYMENTS))+geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Minimum payment distribution")

As we can see median affects less the distribution, so to avoid misleading results we will replace missing values with mean values

Analysis of variable and outliers detection:

In this section we will review the variables and remove outliers

BALANCE - Balance amount left on the bank account:

ggplot(data=data_r_mean,aes(x=BALANCE))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Balance distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$BALANCE)

We can see significat number of outliers. This can negatively affect our analysis, so we will reduce its number

def <- function(x){
  quantiles <- quantile( x, c(.05, .95 ) )
  x[ x < quantiles[1] ] <- quantiles[1]
  x[ x > quantiles[2] ] <- quantiles[2]
  x
}
data_r_mean$BALANCE <- def( data_r_mean$BALANCE )

Overall, we can see that the distribution of BALANCE variable is skewed right. Which give us an idea that most of owners of the credit cards have not so high balance on the card.

BALANCE_FREQUENCY - How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

ggplot(data=data_r_mean,aes(x=BALANCE_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.2)+
  labs(title = "Balance frequency update distribution")+
  ylab("Number of accounts")

Majority of accounts are being updated quite often. And taken into consideration the previous result, we can make an aasumption that users operate with small amounts.

PURCHASES - Amount of purchases made from account

ggplot(data=data_r_mean,aes(x=PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Amount of purchases distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$PURCHASES)

We can also notice outliers, which should be removed.

data_r_mean$PURCHASES <- def( data_r_mean$PURCHASES )

The distribution on purchases amounts shows us that the assumption we made make sence. Clients do not do big purchases (or do but not so often).

ONEOFF_PURCHASES - Maximum purchase amount done in one-go

ggplot(data=data_r_mean,aes(x=ONEOFF_PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Amount of purchases in one-go distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$ONEOFF_PURCHASES)

data_r_mean$ONEOFF_PURCHASES <- def( data_r_mean$ONEOFF_PURCHASES )

We can see a quite similar situation as we saw on purchase distribution graph. However, this can mean different users’ behavior: some people make purchases rare, but spend a lot, others prefer to do a lot of small purchases. It will be clear when we will check a heatmap

INSTALLMENTS_PURCHASES - Amount of purchase done in installment

ggplot(data=data_r_mean,aes(x=INSTALLMENTS_PURCHASES))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Amount of purchases in installment distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$INSTALLMENTS_PURCHASES)

data_r_mean$INSTALLMENTS_PURCHASES <- def( data_r_mean$INSTALLMENTS_PURCHASES )

According to the distribution, not too many our clients use this way of payment. However, we can make a hypothesis that clients who spend a lot in one-go, use this option more often than the one who buys more often, but on a small amount.

CASH_ADVANCE - Cash in advance given to the user

ggplot(data=data_r_mean,aes(x=CASH_ADVANCE))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Amount of purchases (loan) distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$CASH_ADVANCE)

data_r_mean$CASH_ADVANCE <- def( data_r_mean$CASH_ADVANCE )

As we can see some of our clients use the option of “payment by the given cash”. Plus, we can notice that the amount spent are quite high. It can depend on the fact that loan money people use mostly on expensive purchases.

PURCHASES_FREQUENCY - How frequently the Purchases are being made

ggplot(data=data_r_mean,aes(x=PURCHASES_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
  labs(title = "Frequency of purchases distribution")+
  ylab("Number of accounts")

According to this distribution frequency of purchases is different from one client to another. Some of them make purchases quite often, other ones very rare. It can depend on Purchase amount. We will check this hypothesis on the heatmap.

ONEOFF_PURCHASES_FREQUENCY - How frequently Purchases are happening in one-go

ggplot(data=data_r_mean,aes(x=ONEOFF_PURCHASES_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
  labs(title = "One-go purchases frequency distribution")+
  ylab("Number of accounts")

As we can see majority of card owners do not make purchases in one-go. This can be a good insight in terms of what clients are buying. For example, if clients would use credit cards mostly for purchases of something valuable we could see that frequency would be quite high. However, our clients use credit cards for ordinary purchases as well.

PURCHASES_INSTALLMENTS_FREQUENCY - How frequently purchases in installments are being done

ggplot(data=data_r_mean,aes(x=PURCHASES_INSTALLMENTS_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
  labs(title = "Purchases in installement frequency distribution")+
  ylab("Number of accounts")

Clearly there are different types of credit owners can be spotted: majority does not buy in installments, however, there is clearly a group who is using this option

CASH_ADVANCE_FREQUENCY - How frequently the cash in advance being paid

ggplot(data=data_r_mean,aes(x=CASH_ADVANCE_FREQUENCY))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8,binwidth =0.15)+
  labs(title = "Cash in advance frequency payment distribution")+
  ylab("Number of accounts")

As we can see our clients do not use often the cash borrowed against the credit limit. Which can give us an idea that most customers use credit cards as a debit one. This can be improved with a new customer relationship strategy

PURCHASESTRX - Numbe of purchase transactions made

ggplot(data=data_r_mean,aes(x=PURCHASES_TRX ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Number of transactions distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can notice that some accounts stay inactive. Probably the bank should “wake up” the users by organizing a marketing campaign

CREDIT_LIMIT - Limit of Credit Card for user

ggplot(data=data_r_mean,aes(x=CREDIT_LIMIT ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Credit card limit distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$CREDIT_LIMIT)

data_r_mean$CREDIT_LIMIT <- def( data_r_mean$CREDIT_LIMIT )

We can see a quite common picture: there a lot of users with quite low credit card limit, as well as the one with quite a high one. The credit limit usually depends on the income of the user and card type

PRC_FULL_PAYMENT - Percent of full payment paid by user

ggplot(data=data_r_mean,aes(x=PRC_FULL_PAYMENT ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8, binwidth =0.15 )+
  labs(title = "Percent of full payment paid distribution")+
  ylab("Number of accounts")

As we can see there are still a lot of payments which were not payed by users. Probably they would need some notification or any other reminders about their debt. This can be also one of the reasons why customers avoid cash in advance option.

PAYMENTS - Amount of Payment done by user

ggplot(data=data_r_mean,aes(x=PAYMENTS ))+geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  labs(title = "Amount of payment distribution")+
  ylab("Number of accounts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_r_mean$PAYMENTS)

data_r_mean$PAYMENTS <- def( data_r_mean$PAYMENTS )

According to payment distribution, some users stay inactive, what proves us the necessity of changing of the marketing campaign.

All connections can be displayed in the correlation matrix:

Correlation matrix

df<- select (data_r_mean,-c(CUST_ID))
correlation<-cor(df)
corrplot(correlation, method="circle",tl.cex = 0.5,tl.srt = 30)

Data modeling: Clustering of users

Now, after we analyzed the variables and removed all outliers, we can move to the next step - division of users into the groups. We will try 2 methods: k-means and agglomerative clustering (hierarchy)

First step: decrease number of dimensions

In order our clustering method to work better, we should reduce number of dimensions. For that we will use PCA method.

pca <- prcomp(df, scale = TRUE)
fviz_eig(pca)

According to the graph first 2 components describe around 50% of data, however this is not enough for us. We will use 9 components in order to reach 90% of data description.

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2344 1.8842 1.21793 1.12493 1.02469 0.96892 0.85881
## Proportion of Variance 0.2937 0.2088 0.08726 0.07444 0.06176 0.05522 0.04339
## Cumulative Proportion  0.2937 0.5025 0.58976 0.66420 0.72597 0.78119 0.82457
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     0.85135 0.74729 0.62905 0.58829 0.52817 0.48473 0.4555
## Proportion of Variance 0.04264 0.03285 0.02328 0.02036 0.01641 0.01382 0.0122
## Cumulative Proportion  0.86721 0.90006 0.92334 0.94369 0.96010 0.97392 0.9861
##                           PC15    PC16    PC17
## Standard deviation     0.41181 0.21374 0.14331
## Proportion of Variance 0.00998 0.00269 0.00121
## Cumulative Proportion  0.99610 0.99879 1.00000
pca$rotation[,1:9]
##                                          PC1         PC2         PC3
## BALANCE                           0.02958569  0.42843002 -0.08665421
## BALANCE_FREQUENCY                 0.11459608  0.17547432 -0.24540345
## PURCHASES                         0.40870226  0.06646847  0.14572747
## ONEOFF_PURCHASES                  0.33392459  0.09726801  0.41990106
## INSTALLMENTS_PURCHASES            0.34552565 -0.01437961 -0.33065026
## CASH_ADVANCE                     -0.10627769  0.43490122 -0.07356351
## PURCHASES_FREQUENCY               0.35860753 -0.11217420 -0.28190157
## ONEOFF_PURCHASES_FREQUENCY        0.30999017  0.06051366  0.38525283
## PURCHASES_INSTALLMENTS_FREQUENCY  0.30383274 -0.11769475 -0.50867778
## CASH_ADVANCE_FREQUENCY           -0.13349292  0.41787969 -0.11367215
## CASH_ADVANCE_TRX                 -0.09047753  0.39608952 -0.13016314
## PURCHASES_TRX                     0.36541988  0.05419991 -0.03910502
## CREDIT_LIMIT                      0.16294067  0.27666802  0.16524186
## PAYMENTS                          0.21124716  0.29082922  0.14572193
## MINIMUM_PAYMENTS                  0.02764953  0.16591386 -0.22422978
## PRC_FULL_PAYMENT                  0.14617060 -0.16996727  0.04798094
## TENURE                            0.08315733  0.01853191  0.02860733
##                                           PC4         PC5          PC6
## BALANCE                           0.287517965 -0.01044361  0.108597100
## BALANCE_FREQUENCY                 0.250172811 -0.47719641 -0.399208698
## PURCHASES                        -0.021785638 -0.04127086  0.086890402
## ONEOFF_PURCHASES                  0.015615830 -0.19118024  0.065722438
## INSTALLMENTS_PURCHASES           -0.080323472  0.15946612  0.053141717
## CASH_ADVANCE                     -0.216008913  0.09251977 -0.001677022
## PURCHASES_FREQUENCY              -0.091647515 -0.13550677 -0.042988119
## ONEOFF_PURCHASES_FREQUENCY        0.006082614 -0.31068704 -0.040095971
## PURCHASES_INSTALLMENTS_FREQUENCY -0.102260169  0.03290521 -0.024489524
## CASH_ADVANCE_FREQUENCY           -0.270425918 -0.14981032 -0.097540912
## CASH_ADVANCE_TRX                 -0.307613144 -0.05737734 -0.150742599
## PURCHASES_TRX                    -0.001652891 -0.04558888  0.074885832
## CREDIT_LIMIT                      0.085728061  0.35509056  0.038800278
## PAYMENTS                         -0.139215061  0.32402291 -0.021733652
## MINIMUM_PAYMENTS                  0.429796532  0.15597438  0.569865480
## PRC_FULL_PAYMENT                 -0.440579896  0.34898908 -0.101455843
## TENURE                            0.456835644  0.42301956 -0.659715267
##                                          PC7         PC8          PC9
## BALANCE                          -0.07060779  0.28815489  0.018607651
## BALANCE_FREQUENCY                -0.47751179  0.04058945  0.379415236
## PURCHASES                         0.12968382 -0.08182988  0.164570992
## ONEOFF_PURCHASES                  0.05610442 -0.12261122  0.014011698
## INSTALLMENTS_PURCHASES            0.20863180  0.04512768  0.240743724
## CASH_ADVANCE                      0.01420721  0.06672741  0.022060925
## PURCHASES_FREQUENCY              -0.06057857  0.06909679 -0.347048314
## ONEOFF_PURCHASES_FREQUENCY       -0.11802041 -0.09250240 -0.355368845
## PURCHASES_INSTALLMENTS_FREQUENCY  0.10290834  0.11589627 -0.149705334
## CASH_ADVANCE_FREQUENCY            0.05998574 -0.15635027 -0.132416312
## CASH_ADVANCE_TRX                  0.10364445 -0.28860330 -0.298544176
## PURCHASES_TRX                     0.20116944 -0.08784955 -0.009832214
## CREDIT_LIMIT                     -0.20664994  0.61243691 -0.272834897
## PAYMENTS                          0.06731767 -0.14723918  0.509407446
## MINIMUM_PAYMENTS                 -0.29383628 -0.49074710 -0.140167060
## PRC_FULL_PAYMENT                 -0.68114083 -0.17502701 -0.002395611
## TENURE                            0.15091584 -0.27687042 -0.201210553
colMain <- colorRampPalette(brewer.pal(8, "Blues"))(25)
heatmap(pca$rotation[,1:9],Colv=NA, Rowv=NA, scale='none',col=colMain)

On the graph we can see composition of each principal component.

pca_df <- cbind(df[0],pca$x[,1:9])

Second step: indentification of optimal number of clusters for each model

We will use Elbow method to define optimal number of clusters:

fviz_nbclust(pca_df, kmeans, method = "wss")

As we can see 5 clusters looks like an optimal value for k-means method

fviz_nbclust(pca_df, hcut, method = "wss")

We will use also 5 clusters.

Third step: clustering implementation

K-means:

km.res <- eclust(pca_df, "kmeans", k = 5,
                 nstart = 25, graph = FALSE)

Agglomerative clustering (single-linkage method)

res.hc <- eclust(pca_df, "hclust", k = 5,
                 method = "single", graph = FALSE) 
print(res.hc)
## 
## Call:
## stats::hclust(d = x, method = hc_method)
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 8949

Fourth step: model quality check

We will use silhouette score:

Check model quality for K-means:

fviz_silhouette(km.res)
##   cluster size ave.sil.width
## 1       1 2938          0.32
## 2       2 2332          0.20
## 3       3 1300          0.19
## 4       4 1218          0.12
## 5       5 1161          0.18

Check model quality for Agglomerative clustering:

fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1 3543          0.26
## 2       2  840          0.25
## 3       3 3178          0.10
## 4       4  521          0.25
## 5       5  867          0.08

According to the silhouette score K-means does clustering better than Agglomerative clustering method.

Interpretation of the results:

We will use for result interpretation better model: K-means

First we will see at the composition of each cluster:

cluster <- c(1: 5)
center <- km.res$centers
center_df <- data.frame(cluster, center)
center_reshape <- gather(center_df, features, values,PC1:PC9)
center_reshape
##    cluster features      values
## 1        1      PC1 -1.37218255
## 2        2      PC1  0.93332431
## 3        3      PC1 -1.31529294
## 4        4      PC1 -1.20772795
## 5        5      PC1  4.33751379
## 6        1      PC2 -0.10812685
## 7        2      PC2 -1.23157150
## 8        3      PC2 -1.48652162
## 9        4      PC2  3.46664149
## 10       5      PC2  0.77503032
## 11       1      PC3  0.31287389
## 12       2      PC3 -1.13956679
## 13       3      PC3  0.86660116
## 14       4      PC3 -0.35364147
## 15       5      PC3  0.89784675
## 16       1      PC4  0.54949104
## 17       2      PC4 -0.15820480
## 18       3      PC4 -0.56841332
## 19       4      PC4 -0.43654619
## 20       5      PC4  0.02168776
## 21       1      PC5 -0.45763662
## 22       2      PC5 -0.07317632
## 23       3      PC5  1.07031026
## 24       4      PC5  0.25284707
## 25       5      PC5 -0.15864558
## 26       1      PC6 -0.21539288
## 27       2      PC6 -0.20714160
## 28       3      PC6  0.74975832
## 29       4      PC6 -0.04873688
## 30       5      PC6  0.17274264
## 31       1      PC7 -0.11325811
## 32       2      PC7 -0.21512317
## 33       3      PC7  0.67620975
## 34       4      PC7 -0.17590096
## 35       5      PC7  0.14607602
## 36       1      PC8 -0.08897819
## 37       2      PC8  0.12712013
## 38       3      PC8 -0.05840418
## 39       4      PC8  0.15204442
## 40       5      PC8 -0.12428156
## 41       1      PC9  0.30043477
## 42       2      PC9 -0.09793344
## 43       3      PC9 -0.48926991
## 44       4      PC9 -0.14123239
## 45       5      PC9  0.13245079
ggplot(data = center_reshape, aes(x = features, y = cluster, fill = values)) +
  scale_y_continuous(breaks = seq(1, 7, by = 1)) +
  geom_tile() +
  coord_equal() +
  theme_classic()

On the graph we can see that overall components PC4, PC5, PC6, PC7, PC8, PC9 have quite similar values for each cluster. The main difference among them is restricted in components PC1, PC2, PC3. That is why we will focus on them.

Cluster 1: Economical

These clients have low income. They buy rarely and little. However, when they want to buy something, they use installment or cash in advance payment methods. They do not borrow too much. However, the bank should pay attention to them, as they often do not pay off the debt on the credit card.

Cluster 2: Restricted

Clients of this group also earn a little, but they make a lot of purchases. They rarely use loans and mostly spend their own funds. They do not often buy something in one go, usually they spend often but a small amounts

Cluster 3: Opportunists

This cluster has the poorest customers. However, they shop quite often. They often pay for purchases with borrowed funds. They are somewhere between cluster 1 and 2.

Cluster 4: Middle class

Clients of this cluster have an average income and make purchases quite rarely. However, if they do decide to buy, they spend a lot. They often pay with borrowed funds. It should be noted separately that these clients also have a high level of debt.

Cluster 5: Wealthy clients

These clients are very wealthy and spend a lot. They buy both in one go and in the usual way. These clients rarely use borrowed funds, but if they do, they regularly pay off the debt.

Recommendations: