5. Correlation Tests with R

Batur Şeker
4 min read · Jan 30, 2021

Dataset used

This story is the continuation of this article.

#Get working directory
getwd()

#Set working directory
setwd("C:\\Users\\batur\\Desktop\\R Tutorial")

#Read csv data file and store as data frame
bankChurnersData <- read.csv(file = "BankChurners.csv")

#Drop columns 22 and 23
df <- bankChurnersData[-c(22:23)]

#Encode Attrition_Flag column of df as a factor (binary variable)
df$Attrition_Flag <- factor(df$Attrition_Flag, levels = c("Attrited Customer", "Existing Customer"))

#Encode Gender column of df as a factor (binary variable)
df$Gender <- factor(df$Gender, levels = c("M", "F"))

#Encode Education_Level column of df as an ordered factor (ordinal variable)
df$Education_Level <- factor(df$Education_Level, ordered = TRUE, levels = c("Unknown", "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate"))

#Encode Marital_Status column of df as a factor (nominal variable)
df$Marital_Status <- factor(df$Marital_Status, levels = c("Married", "Single", "Unknown", "Divorced"))

#Encode Income_Category column of df as an ordered factor (ordinal variable)
#The level labels must match the values in the CSV exactly, including the plain hyphens
df$Income_Category <- factor(df$Income_Category, ordered = TRUE, levels = c("Unknown", "Less than $40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", "$120K +"))

#Encode Card_Category column of df as an ordered factor (ordinal variable)
df$Card_Category <- factor(df$Card_Category, ordered = TRUE, levels = c("Blue", "Silver", "Gold", "Platinum"))

#Keep the first 100 rows
df_4 <- head(df, 100)
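
One thing worth checking after the encoding step: factor() turns any value that is not listed in levels into NA without a warning, so a label that differs by a single character (for example a typographic dash instead of a hyphen) silently empties that category. A minimal sketch on a made-up vector; `income` and its values are illustrative stand-ins, not the real column:

```r
# Illustrative stand-in for an income column (not the real data)
income <- c("Less than $40K", "$40K - $60K", "$40K - $60K", "Unknown")

# Levels typed with a plain hyphen match the data
ok <- factor(income, ordered = TRUE,
             levels = c("Unknown", "Less than $40K", "$40K - $60K"))

# Same levels, but one label uses a typographic dash (\u2014)
bad <- factor(income, ordered = TRUE,
              levels = c("Unknown", "Less than $40K", "$40K \u2014 $60K"))

sum(is.na(ok))   # 0: every value matched a level
sum(is.na(bad))  # 2: both "$40K - $60K" rows became NA
```

Running sum(is.na(...)) on each encoded column of df is a cheap way to catch this kind of mismatch.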

Note 1: Use the Pearson correlation test (a parametric test) when the data are normally distributed and the group variances are equal.
Note 2: Use the Spearman or Kendall correlation test (non-parametric tests) when the data are NOT normally distributed and/or the group variances are NOT equal.
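
The two notes can be sketched as a small helper. This is only an illustration: `choose_cor_method` is a hypothetical function, not part of the original workflow, it checks normality only (not variances), and the toy vectors are built from normal quantiles so the result is deterministic.

```r
# Hypothetical helper (not from the original article): returns "pearson"
# when Shapiro-Wilk does not reject normality for either variable,
# otherwise "spearman".
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}

# Deterministic toy data: normal quantiles look normal,
# their exponentials are strongly right-skewed
z      <- qnorm(ppoints(100))
bell_a <- z
bell_b <- z[order(sin(1:100))]   # same values in a scrambled order
skewed <- exp(z)

method_1 <- choose_cor_method(bell_a, bell_b)   # "pearson"
method_2 <- choose_cor_method(skewed, bell_b)   # "spearman"

cor.test(bell_a, bell_b, method = method_1)
cor.test(skewed, bell_b, method = method_2)
```

With the article's data, the same call would look like choose_cor_method(df_4$Credit_Limit, df_4$Months_on_book).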

1. Pearson Correlation Test

1.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Normality Test with Shapiro
model_8<-lm(Customer_Age ~ Education_Level, data = df_4)
shapiro.test(residuals(model_8))

data: residuals(model_8)
W = 0.98895, p-value = 0.5805

4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 0.5805
Decision: p > 0.05, so H0 is not rejected

5. Interpret
There is no evidence against normality, so the residuals are treated as normally distributed.
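
The decision rule in step 4 is the same in every section, so it can be written once. A minimal sketch; `decide` is a hypothetical helper name, and the p-values passed to it are the ones reported in this article:

```r
# Hypothetical helper (not from the original article) for step 4
decide <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) "reject H0" else "fail to reject H0"
}

decide(0.5805)      # Shapiro-Wilk on residuals(model_8): "fail to reject H0"
decide(2.745e-10)   # Shapiro-Wilk on Credit_Limit:       "reject H0"
```

In practice the p-value can be taken straight from the test object, for example decide(shapiro.test(residuals(model_8))$p.value).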

1.2. Homogeneity of Variance with Bartlett
1. Determine Hypothesis
H0: The variances of all groups are equal.
HA: At least two group variances differ.

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Homogeneity of Variance with Bartlett
bartlett.test(Customer_Age ~ Education_Level, data = df_4)

data: Customer_Age by Education_Level
Bartlett's K-squared = 5.0845, df = 6, p-value = 0.533

4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 0.533
Decision: p > 0.05, so H0 is not rejected

5. Interpret
There is no evidence that the group variances differ, so they are treated as equal.
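
To see how the Bartlett test reacts to equal and unequal spreads, here is a small sketch on toy data. The vectors are assumptions made up for illustration (not the BankChurners data), and it uses the bartlett.test(x, g) form instead of the formula interface.

```r
# Toy data, assumed for illustration only
z     <- qnorm(ppoints(30))
group <- factor(rep(c("a", "b"), each = 30))

# Group b is shifted but has the same variance as group a
res_equal <- bartlett.test(c(z, z + 5), group)

# Group b has 16 times the variance of group a
res_unequal <- bartlett.test(c(z, 4 * z), group)

res_equal$p.value   > 0.05   # TRUE: H0 (equal variances) is not rejected
res_unequal$p.value <= 0.05  # TRUE: H0 is rejected
```

A shift in the mean does not affect the result; only a difference in spread does.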

1.3. Pearson Correlation Test
1. Determine Hypothesis
H0: There is no correlation between the two variables (true correlation = 0).
HA: There is a correlation between the two variables (true correlation ≠ 0).

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Pearson Correlation Test
cor.test(~ Customer_Age + Total_Trans_Amt, method = "pearson", data = df_4)

data: Customer_Age and Total_Trans_Amt
t = 0.86922, df = 98, p-value = 0.3869
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1108551 0.2790908
sample estimates:
cor
0.08746755

4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 0.3869
Decision: p > 0.05, so H0 is not rejected

5. Interpret
The correlation between Customer_Age and Total_Trans_Amt (r ≈ 0.087) is not statistically significant.
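
The t statistic and p-value in the output above follow from the correlation coefficient and the sample size through t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. A short check using the reported values (n = 100 because df_4 holds the first 100 rows):

```r
# Values reported by cor.test() above
r <- 0.08746755
n <- 100   # rows in df_4, so df = n - 2 = 98

t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

t_stat    # approx. 0.8692, in line with the reported t
p_value   # approx. 0.387, in line with the reported p-value
```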

2. Spearman Correlation Test

2.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Normality Test with Shapiro
shapiro.test(df_4$Credit_Limit)

data: df_4$Credit_Limit
W = 0.80145, p-value = 2.745e-10

#Normality Test with Shapiro
shapiro.test(df_4$Months_on_book)

data: df_4$Months_on_book
W = 0.95657, p-value = 0.002308

4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 2.745e-10 (Credit_Limit), p-value = 0.002308 (Months_on_book)
Decision: For both tests p <= 0.05, so H0 is rejected

5. Interpret
Neither Credit_Limit nor Months_on_book is normally distributed.

Note: Because normality is already rejected, there is no need to run the Bartlett test.

2.2. Spearman Correlation Test
1. Determine Hypothesis
H0: There is no correlation between the two variables (true rho = 0).
HA: There is a correlation between the two variables (true rho ≠ 0).

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Spearman Correlation Test
cor.test(~ Credit_Limit + Months_on_book, method = "spearman", data = df_4)

data: Credit_Limit and Months_on_book
S = 186064, p-value = 0.2484
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1164937

4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 0.2484
Decision: p > 0.05, so H0 is not rejected

5. Interpret
The correlation between Credit_Limit and Months_on_book (rho ≈ -0.116) is not statistically significant.
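
Spearman's rho is the Pearson correlation computed on the ranks of the data, which is why it does not need the normality assumption. The sketch below shows this on toy vectors (assumed for illustration, not the BankChurners data), then recovers the reported S statistic from rho and n using S = (n^3 - n)(1 - rho) / 6, a relation that is consistent with the numbers printed above.

```r
# Toy data, assumed for illustration only
x <- c(12, 7, 3, 25, 18, 9, 30, 1)
y <- c(4, 9, 2, 20, 11, 15, 17, 8)

rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(rank(x), rank(y), method = "pearson")
all.equal(rho_direct, rho_ranks)   # TRUE

# Recover the S statistic from the rho and n reported in the article
rho <- -0.1164937
n   <- 100
S   <- (n^3 - n) * (1 - rho) / 6
S   # about 186064, the value printed by cor.test()
```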

3. Kendall Correlation Test

3.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Normality Test with Shapiro
shapiro.test(df_4$Total_Relationship_Count)

data: df_4$Total_Relationship_Count
W = 0.90375, p-value = 2.131e-06

#Normality Test with Shapiro
shapiro.test(df_4$Contacts_Count_12_mon)

data: df_4$Contacts_Count_12_mon
W = 0.8477, p-value = 9.764e-09

#Normality Test with Shapiro
shapiro.test(df_4$Avg_Open_To_Buy)

data: df_4$Avg_Open_To_Buy
W = 0.8125, p-value = 6.105e-10

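4. Make Decision
p <= 0.05: H0 is rejected
p > 0.05: H0 is not rejected
p-value = 2.131e-06 (Total_Relationship_Count), p-value = 9.764e-09 (Contacts_Count_12_mon), p-value = 6.105e-10 (Avg_Open_To_Buy)
Decision: For all three tests p <= 0.05, so H0 is rejected

5. Interpret
None of the three variables is normally distributed.

Note: Because normality is already rejected, there is no need to run the Bartlett test.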

3.2. Kendall Correlation Test
1. Determine Hypothesis
H0: There is no correlation between the two variables (true tau = 0).
HA: There is a correlation between the two variables (true tau ≠ 0).

2. Select α (Significance Level)
α: 0.05

3. Test Statistics
#Create df_7 data frame to use in the Kendall correlation
df_7 <- data.frame(Total_Relationship_Count = df_4$Total_Relationship_Count,
                   Contacts_Count_12_mon = df_4$Contacts_Count_12_mon,
                   Avg_Open_To_Buy = df_4$Avg_Open_To_Buy)

#Kendall correlation matrix (cor() returns coefficients only, no p-values)
cor(df_7, method = "kendall")

                         Total_Relationship_Count Contacts_Count_12_mon Avg_Open_To_Buy
Total_Relationship_Count              1.000000000           -0.04659116    -0.009917818
Contacts_Count_12_mon                -0.046591165            1.00000000     0.079211334
Avg_Open_To_Buy                      -0.009917818            0.07921133     1.000000000

4. Make Decision
correlation coefficient = -1: perfect negative correlation
correlation coefficient = 0: no association
correlation coefficient = 1: perfect positive correlation

0 < |correlation coefficient| < 0.3: negligible monotonic relation
0.3 <= |correlation coefficient| < 0.7: weak monotonic relation
|correlation coefficient| >= 0.7: strong monotonic relation

Total_Relationship_Count and Total_Relationship_Count (a variable with itself):
correlation coefficient: 1
Contacts_Count_12_mon and Total_Relationship_Count:
correlation coefficient: -0.046591165
Avg_Open_To_Buy and Contacts_Count_12_mon:
correlation coefficient: 0.07921133

5. Interpret
Total_Relationship_Count is perfectly correlated with itself, as every variable is, so the diagonal of the matrix carries no information.
Contacts_Count_12_mon and Total_Relationship_Count have a negative correlation and a negligible monotonic relation.
Avg_Open_To_Buy and Contacts_Count_12_mon have a positive correlation and a negligible monotonic relation.
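
Because cor() only returns coefficients, the hypotheses in step 1 can only be tested by running cor.test() with method = "kendall" on one pair at a time. A minimal sketch on toy count data; the vectors are assumptions made up for illustration, not the BankChurners columns:

```r
# Toy count-like data with ties, assumed for illustration only
relationship_count <- c(5, 6, 4, 3, 5, 3, 6, 2, 5, 6, 4, 1)
contacts_count     <- c(3, 2, 0, 1, 0, 2, 3, 2, 0, 3, 2, 4)

# exact = FALSE asks for the normal approximation, since an exact
# p-value cannot be computed when the data contain ties
res <- cor.test(relationship_count, contacts_count,
                method = "kendall", exact = FALSE)

res$estimate   # Kendall's tau, the same value cor(..., method = "kendall") gives
res$p.value    # p-value for H0: tau = 0, to compare against alpha
```

With the article's data the call would be, for example, cor.test(~ Total_Relationship_Count + Contacts_Count_12_mon, method = "kendall", exact = FALSE, data = df_4).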
