3. Parametric Tests with R

Batur Şeker
Jan 30, 2021

Used dataset

This story is the continuation of this article.

#Get working directory
getwd()

#Set working directory
setwd("C:\\Users\\batur\\Desktop\\R Tutorial")

#Read csv data file and store as data frame
bankChurnersData=read.csv(file="BankChurners.csv")

#Drop columns 22 and 23
df <- bankChurnersData[-c(22:23)]

#Encode Attrition_Flag column of df as a factor - Binary variable
df$Attrition_Flag=factor(df$Attrition_Flag,levels=c("Attrited Customer","Existing Customer"))

#Encode Gender column of df as a factor - Binary variable
df$Gender=factor(df$Gender,levels=c("M","F"))

#Encode Education_Level column of df as an ordered factor - Ordinal variable
df$Education_Level=factor(df$Education_Level,ordered=TRUE,levels=c("Unknown","Uneducated","High School","College","Graduate","Post-Graduate","Doctorate"))

#Encode Marital_Status column of df as a factor - Nominal variable
df$Marital_Status=factor(df$Marital_Status,levels=c("Married","Single","Unknown","Divorced"))

#Encode Income_Category column of df as an ordered factor - Ordinal variable
df$Income_Category=factor(df$Income_Category,ordered=TRUE,levels=c("Unknown","Less than $40K","$40K - $60K","$60K - $80K","$80K - $120K","$120K +"))

#Encode Card_Category column of df as an ordered factor - Ordinal variable
df$Card_Category<-factor(df$Card_Category,ordered=TRUE,levels=c("Blue","Silver","Gold","Platinum"))

#first 100 rows
df_4=head(df,100)

#second 100 rows (index with 101:200 directly; seq(101:200) evaluates to 1:100)
df_5=df[101:200,]
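The difference between the two index expressions is easy to miss, so here is a minimal check. It uses only base R; the variable names idx_seq and idx_colon are made up for this illustration.

```r
# seq() called on a vector of length 100 behaves like seq_along(),
# so it returns the positions 1..100, not the values 101..200
idx_seq <- seq(101:200)
idx_colon <- 101:200

range(idx_seq)     # smallest and largest index produced by seq(101:200)
range(idx_colon)   # smallest and largest index produced by 101:200

# TRUE: seq(101:200) selects the same rows as head(df, 100)
identical(idx_seq, 1:100)
```

Indexing a data frame with idx_seq therefore returns the first 100 rows again, while idx_colon returns rows 101 to 200.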

#Check whether Customer_Age is normally distributed
qqnorm(df_4$Customer_Age)
qqline(df_4$Customer_Age)

Result: In the Q-Q plot, the points of df_4$Customer_Age fall close to the reference line, so the sample looks approximately normally distributed.
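To make the visual check a little more concrete, the sketch below looks at the numbers behind a Q-Q plot. It runs on simulated data rather than the article's dataset, and the names x and qq are assumptions for this example.

```r
set.seed(10)                        # simulated data, not BankChurners.csv
x <- rnorm(100, mean = 50, sd = 7)  # 100 draws from a normal distribution

# plot.it = FALSE returns the plotted coordinates without drawing:
# qq$x holds the theoretical normal quantiles, qq$y the sample values
qq <- qqnorm(x, plot.it = FALSE)

# For a normal sample the points are close to a straight line,
# so the correlation between the two sets of quantiles is close to 1
cor(qq$x, qq$y)
```

By default, qqline() draws the line through the first and third quartiles, so what matters is how closely the points follow that line rather than the angle of the line itself.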

1. One Sample T-Test
To use a one-sample t-test, the sample should be normally distributed. For this reason, first test whether the data are normally distributed.

1.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Normality Test with Shapiro
shapiro.test(df_4$Customer_Age)

data: df_4$Customer_Age
W = 0.99006, p-value = 0.6691

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.6691
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the sample departs from a normal distribution, so parametric tests can be used.
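The same decision rule is applied in every test of this article, so it can be sketched as a small function. The helper decide() is hypothetical (it is not part of R), and the data are simulated.

```r
# Hypothetical helper: the p-value decision rule used throughout this article
decide <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) "H0 is rejected" else "H0 is not rejected"
}

set.seed(1)                          # simulated data, not BankChurners.csv
x <- rnorm(100, mean = 50, sd = 7)

res <- shapiro.test(x)               # returns an object of class "htest"
res$p.value                          # the p-value as a plain number
decide(res$p.value)                  # decision for the simulated sample

decide(0.6691)                       # decision for the p-value reported above
```

Reading the p-value from res$p.value avoids copying it by hand from the printed output.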

1.2. One sample t-test
1. Determine Hypothesis
H0: μ=30
HA: μ≠30

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#One sample t-test
t.test(df_4$Customer_Age,mu=30)

data: df_4$Customer_Age
t = 27.918, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
48.22554 51.01446
sample estimates:
mean of x
49.62

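4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value < 2.2e-16
decision: p<=0.05, so H0 is rejected.

5. Interpret
The mean of Customer_Age differs from 30. The sample mean is 49.62, with a 95 percent confidence interval of 48.23 to 51.01.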
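To show where the t value and p-value come from, the sketch below recomputes them by hand and compares them with t.test(). It uses simulated data, and the names mu0, t_manual and p_manual are assumptions for this example.

```r
set.seed(42)                         # simulated data, not BankChurners.csv
x <- rnorm(100, mean = 50, sd = 7)
mu0 <- 30                            # hypothesised mean under H0
n <- length(x)

# t = (sample mean - mu0) / (sample sd / sqrt(n)), with n - 1 degrees of freedom
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value: probability of a value at least this extreme in both tails
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(x, mu = mu0)

all.equal(unname(res$statistic), t_manual)   # manual t matches t.test()
all.equal(res$p.value, p_manual)             # manual p-value matches t.test()
```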

2. Independent 2-group T-Test
To use an independent two-group t-test, the data in each group should be normally distributed and the groups should have equal variances.

2.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Normality Test with Shapiro
model<-lm(Customer_Age ~ Gender, data = df_4)
shapiro.test(residuals(model))

data: residuals(model)
W = 0.99073, p-value = 0.7236

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.7236
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the residuals depart from a normal distribution.

2.2. Homogeneity of Variance with Bartlett
1. Determine Hypothesis
H0: The variances in each of the groups are the same.
HA: At least two of them differ.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Homogeneity of Variance with Bartlett
bartlett.test(Customer_Age ~ Gender, data = df_4)

data: Customer_Age by Gender
Bartlett's K-squared = 0.71698, df = 1, p-value = 0.3971

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.3971
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the group variances differ, so they can be treated as equal.

2.3. Independent 2-group t-test
1. Determine Hypothesis
H0: The means of the two independent groups are the same.
HA: The means of the two independent groups are different.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
t.test(Customer_Age ~ Gender, data = df_4)

data: Customer_Age by Gender
t = -0.28499, df = 51.65, p-value = 0.7768
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.677098 2.762652
sample estimates:
mean in group M mean in group F
49.47826 49.93548

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.7768
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the mean Customer_Age differs between the M and F groups.
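The fractional df = 51.65 in the output above shows that t.test() ran Welch's test, which is its default (var.equal = FALSE) and does not assume equal variances. When a variance test supports equal variances, the classic Student version is also an option. The sketch below compares the two on simulated data; the data frame d and its columns are assumptions for this example.

```r
set.seed(7)                                   # simulated data, not BankChurners.csv
d <- data.frame(
  age = c(rnorm(60, mean = 50, sd = 7), rnorm(40, mean = 50, sd = 7)),
  gender = factor(rep(c("M", "F"), times = c(60, 40)), levels = c("M", "F"))
)

# Default: Welch's t-test, degrees of freedom are estimated and usually fractional
welch <- t.test(age ~ gender, data = d)

# Student's t-test: pooled variance, degrees of freedom = n1 + n2 - 2
student <- t.test(age ~ gender, data = d, var.equal = TRUE)

welch$method
welch$parameter      # Welch-Satterthwaite degrees of freedom
student$parameter    # 60 + 40 - 2 = 98
```

With similar variances the two versions usually lead to the same decision; Welch's test is the safer default when the variances are in doubt.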

3. Paired T-Test
To use a paired t-test, the data should be normally distributed (strictly, the assumption applies to the differences between the paired values). Normally, this test is used for two related groups of samples. However, my dataset has no related measurements, so I use two separate parts of the Customer_Age column to demonstrate the paired t-test.

3.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Normality Test with Shapiro
shapiro.test(df_4$Customer_Age)

data: df_4$Customer_Age
W = 0.99006, p-value = 0.6691

#Normality Test with Shapiro
shapiro.test(df_5$Customer_Age)

data: df_5$Customer_Age
W = 0.99006, p-value = 0.6691

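4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.6691 for both tests
decision: For both tests p>0.05, so H0 is not rejected.

5. Interpret
Neither sample shows evidence of departing from a normal distribution. The two outputs are identical because seq(101:200) evaluates to 1:100, so the run printed here used the first 100 rows for both df_4 and df_5.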

3.2. Paired T-Test
1. Determine Hypothesis
H0: The means of the two related groups are the same.
HA: The means of the two related groups are different.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
t.test(df_4$Customer_Age,df_5$Customer_Age,paired=TRUE)

data: df_4$Customer_Age and df_5$Customer_Age
t = NaN, df = 99, p-value = NA
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
NaN NaN
sample estimates:
mean of the differences
0

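4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = NA, mean of the differences = 0
decision: No decision can be made, because the test statistic is NaN.

5. Interpret
Every paired difference is 0, so the standard error is 0 and t cannot be computed. This happens because the printed run compared the first 100 rows with themselves: seq(101:200) evaluates to 1:100, not 101:200. The test needs two vectors that actually differ, such as the Customer_Age values of df[101:200,].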
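A paired t-test is the same calculation as a one-sample t-test on the paired differences, which is why identical vectors give NaN and why the normality assumption concerns the differences. The sketch below shows the equivalence on simulated data; the vectors before and after are assumptions for this example.

```r
set.seed(3)                                      # simulated paired measurements
before <- rnorm(30, mean = 50, sd = 7)
after <- before + rnorm(30, mean = 1, sd = 2)    # related to 'before', plus noise

paired <- t.test(before, after, paired = TRUE)

# Same test expressed as a one-sample t-test on the differences
one_sample <- t.test(before - after, mu = 0)

all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)

# The normality check that matches the paired design is on the differences
shapiro.test(before - after)
```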

4. One-way ANOVA
To use one-way ANOVA, the data in each group should be normally distributed and the groups should have equal variances.

4.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Normality Test with Shapiro
model_2<-lm(Customer_Age ~ Education_Level, data = df_4)
shapiro.test(residuals(model_2))

data: residuals(model_2)
W = 0.98895, p-value = 0.5805

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.5805
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the residuals depart from a normal distribution.

4.2. Homogeneity of Variance with Bartlett
1. Determine Hypothesis
H0: The variances in each of the groups are the same.
HA: At least two of them differ.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Homogeneity of Variance with Bartlett
bartlett.test(Customer_Age ~ Education_Level, data = df_4)

data: Customer_Age by Education_Level
Bartlett's K-squared = 5.0845, df = 6, p-value = 0.533

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.533
decision: p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the group variances differ, so they can be treated as equal.

4.3. One-way ANOVA
1. Determine Hypothesis
H0: The means of the different groups are the same (at least 3 groups).
HA: At least one group mean is not equal to the others.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#One way ANOVA
one_way_anova_result= aov(Customer_Age ~ Education_Level, data=df_4)
summary(one_way_anova_result)

                Df Sum Sq Mean Sq F value Pr(>F)
Education_Level  6    689  114.85   2.543 0.0252 *
Residuals       93   4200   45.17
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Range test
TukeyHSD(one_way_anova_result, conf.level = 0.95)

Fit: aov(formula = Customer_Age ~ Education_Level, data = df_4)
$Education_Level
diff lwr upr p adj
Uneducated-Unknown -6.519608 -14.1564021 1.117186 0.1464268
High School-Unknown -3.630719 -10.4808753 3.219437 0.6842875
College-Unknown 1.522059 -7.1620993 10.206217 0.9983645
Graduate-Unknown -1.461049 -7.3957487 4.473650 0.9895000
Post-Graduate-Unknown -5.352941 -20.4942883 9.788406 0.9365187
Doctorate-Unknown 3.813725 -5.8044125 13.431864 0.8943682
High School-Uneducated 2.888889 -4.6596157 10.437393 0.9093828
College-Uneducated 8.041667 -1.2033256 17.286659 0.1314214
Graduate-Uneducated 5.058559 -1.6701811 11.787298 0.2717978
Post-Graduate-Uneducated 1.166667 -14.3031643 16.636498 0.9999879
Doctorate-Uneducated 10.333333 0.2059517 20.460715 0.0423859
College-High School 5.152778 -3.4538417 13.759397 0.5485500
Graduate-High School 2.169670 -3.6509792 7.990319 0.9193396
Post-Graduate-High School -1.722222 -16.8192314 13.374787 0.9998615
Doctorate-High School 7.444444 -2.1037426 16.992631 0.2321179
Graduate-College -2.983108 -10.8805706 4.914354 0.9144544
Post-Graduate-College -6.875000 -22.8877964 9.137796 0.8530571
Doctorate-College 2.291667 -8.6471557 13.230489 0.9955946
Post-Graduate-Graduate -3.891892 -18.5961669 10.812383 0.9846652
Doctorate-Graduate 5.274775 -3.6394774 14.189027 0.5625175
Doctorate-Post-Graduate 9.166667 -7.3712783 25.704612 0.6371410

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected

One-way ANOVA Result:
p-value = 0.0252
decision: p<=0.05, so H0 is rejected.

TukeyHSD Results:
For Uneducated-Unknown: p adj = 0.1464268, decision: p>0.05, so H0 is not rejected. (first line)
For High School-Unknown: p adj = 0.6842875, decision: p>0.05, so H0 is not rejected. (second line)
For Doctorate-Uneducated: p adj = 0.0423859, decision: p<=0.05, so H0 is rejected. (eleventh line)

5. Interpret
One-way ANOVA Result:
At least one group mean is not equal to the others.

TukeyHSD Results:
The p adj column gives the decision for each pair. According to the first two lines of the result, there is no evidence that the means differ for the Uneducated-Unknown and High School-Unknown pairs. Doctorate-Uneducated is the only pair with p adj<=0.05, so it is the only pair whose means differ significantly.
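With many pairs, scanning the p adj column by eye is error-prone, so the significant pairs can be filtered in code instead. The sketch below uses R's built-in PlantGrowth dataset (columns weight and group) as a stand-in for the article's data; the names fit, tukey and significant are assumptions for this example.

```r
# Built-in dataset with one numeric response and a 3-level factor
fit <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# tukey$group is a matrix with the columns diff, lwr, upr and "p adj",
# one row per pair of groups
pairs <- tukey$group

# Keep only the pairs whose adjusted p-value is at or below alpha = 0.05;
# drop = FALSE keeps the result a matrix even if a single row remains
significant <- pairs[pairs[, "p adj"] <= 0.05, , drop = FALSE]
significant
```

For the model in this article, the matrix to filter would be the $Education_Level element of the TukeyHSD result.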

5. Two-way ANOVA
To use two-way ANOVA, the data in each group should be normally distributed and the groups should have equal variances.

5.1. Normality Test with Shapiro
1. Determine Hypothesis
H0: The sample is normally distributed.
HA: The sample is not normally distributed.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Normality Test with Shapiro
model_3<-lm(Customer_Age ~ Education_Level, data = df_4)
shapiro.test(residuals(model_3))

data: residuals(model_3)
W = 0.98895, p-value = 0.5805

#Normality Test with Shapiro
model_4<-lm(Customer_Age ~ Gender, data = df_4)
shapiro.test(residuals(model_4))

data: residuals(model_4)
W = 0.99073, p-value = 0.7236

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.5805, p-value = 0.7236
decision: For both tests p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the residuals of either model depart from a normal distribution.

5.2. Homogeneity of Variance with Bartlett
1. Determine Hypothesis
H0: The variances in each of the groups are the same.
HA: At least two of them differ.

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Homogeneity of Variance with Bartlett
bartlett.test(Customer_Age~Education_Level, data = df_4)

data: Customer_Age by Education_Level
Bartlett's K-squared = 5.0845, df = 6, p-value = 0.533

#Homogeneity of Variance with Bartlett
bartlett.test(Customer_Age~Gender, data = df_4)

data: Customer_Age by Gender
Bartlett's K-squared = 0.71698, df = 1, p-value = 0.3971

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected
p-value = 0.533, p-value = 0.3971
decision: For both tests p>0.05, so H0 is not rejected.

5. Interpret
There is no evidence that the group variances differ for either factor, so they can be treated as equal.

5.3. Two-way ANOVA
1. Determine Hypothesis
H0-1: There is no difference in the means of the first factor
H0-2: There is no difference in the means of the second factor
H0-3: There is no interaction between the first factor and the second factor

HA-1: There is a difference in the means of the first factor
HA-2: There is a difference in the means of the second factor
HA-3: There is an interaction between the first factor and the second factor

2. Select α (Significance Level)
α : 0.05

3. Test Statistics
#Two-way ANOVA
two_way_anova_result=aov(Customer_Age~Gender*Education_Level,data= df_4)
summary(two_way_anova_result)

                       Df Sum Sq Mean Sq F value Pr(>F)
Gender                  1      4    4.47   0.098 0.7549
Education_Level         6    690  114.96   2.522 0.0268 *
Gender:Education_Level  5    229   45.82   1.005 0.4197
Residuals              87   3966   45.59
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Range test
TukeyHSD(two_way_anova_result)

Result:
Fit: aov(formula = Customer_Age ~ Gender * Education_Level, data = df_4)

$Gender
diff lwr upr p adj
F-M 0.457223 -2.444503 3.358949 0.7548889

$Education_Level
diff lwr upr p adj
Uneducated-Unknown -6.564434 -14.248577 1.119709 0.1454421
High School-Unknown -3.624742 -10.517370 3.267886 0.6906486

4. Make Decision
p<=0.05: H0 is rejected
p>0.05: H0 is not rejected

Two-way ANOVA Result:
H0-1: p-value = 0.7549, decision: p>0.05, so H0-1 is not rejected
H0-2: p-value = 0.0268, decision: p<=0.05, so H0-2 is rejected
H0-3: p-value = 0.4197, decision: p>0.05, so H0-3 is not rejected

TukeyHSD Results:
For Gender: p adj = 0.7548889, decision: p>0.05, so H0 is not rejected. (first line)

5. Interpret
Two-way ANOVA Result:
H0-1: There is no evidence of a difference in the means of the Gender factor
HA-2: There is a difference in the means of the Education_Level factor
H0-3: There is no evidence of an interaction between the Gender factor and the Education_Level factor

TukeyHSD Results:
The p adj column gives the decision for each pair. According to the first line of the result, for the F-M pair, there is no evidence that the means of the F and M groups differ.
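The three p-values can also be read from the fitted model instead of the printed table, and normality can be checked on the residuals of the two-way model itself. The sketch below uses R's built-in ToothGrowth dataset (columns len, supp and dose) as a stand-in for the article's data; the names tg, fit, tab and p_values are assumptions for this example.

```r
tg <- ToothGrowth                  # built-in dataset: len, supp, dose
tg$dose <- factor(tg$dose)         # treat dose as a categorical factor

# Two main effects plus their interaction, as in Gender*Education_Level
fit <- aov(len ~ supp * dose, data = tg)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]

# One p-value per term; the Residuals row has no p-value (NA)
p_values <- tab[["Pr(>F)"]]
names(p_values) <- trimws(rownames(tab))
p_values

# Normality check on the residuals of the full two-way model
shapiro.test(residuals(fit))
```

Note that aov() computes sequential sums of squares, so with unequal group sizes the p-values of the main effects can depend on the order of the terms in the formula.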

Next article
