if (!"pacman" %in% rownames(installed.packages())) install.packages("pacman")
library(pacman)
pacman::p_load(abdiv, car, cv, datawizard, dendextend, descr, effectsize, fpc, ggstatsplot, haven, knitr, labelled, lmtest, lm.beta, naniar, NbClust, nhstplot, nomclust, patchwork, PMCMRplus, psych, pwr, pwrss, qgraph, rstatix, sandwich, tidyverse, vegan)
Connect with me on Open Science Framework | Contact me via LinkedIn
It might be necessary to use right-click -> open in a new browser window, depending on your machine.
If this exercise grabs your attention, please check out our master study programs at the Otto-von-Guericke-University in Magdeburg (Germany) by clicking on the logo!
R analysis script presenting the examples from the lecture slides.
Do not modify this script (in whole or in part) without appropriate permission! See the license information at the end of the document.
Goal: students should practice analyzing data with R and RStudio. A further goal is to familiarize students with functions provided by some modern packages that are routinely used in market research tasks. These skills will serve them well throughout their studies and beyond. The internet provides countless further examples. In addition, there are very good textbooks (Chapman & Feit, 2019; Field, Miles, & Field, 2012).
1 Loading packages that we will use
R is a case-sensitive language. Thus, ‘data’ is not interpreted in the same way as ‘Data’.
In R most functionality is provided by additional packages.
Most of the packages are well documented; see https://cran.r-project.org/.
The code chunk below first evaluates whether the package pacman (Rinker & Kurkiewicz, 2018) is already installed on your machine. If yes, the corresponding package will be loaded. If not, R will install the package (as long as the package is still maintained on the Comprehensive R Archive Network, CRAN).
Alternatively, you can do this manually by first executing install.packages("pacman") and then library(pacman).
The second line then loads the package pacman.
The third line uses the function p_load() from the pacman package to install (if necessary) and load all packages that we provide as arguments (e.g., tidyverse (Wickham et al., 2019)). This only works if all packages are still available on CRAN.
Here is the R session info, which gives you information on my machine, all loaded packages, and their version:
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
[3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
[5] LC_TIME=German_Germany.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] nomclust_2.8.0 clValid_0.7 cluster_2.1.6 fpc_2.2-13
[5] NbClust_3.0.1 descr_1.1.8 patchwork_1.3.0 dendextend_1.19.0
[9] abdiv_0.2.0 vegan_2.6-8 lattice_0.22-6 permute_0.9-7
[13] qgraph_1.9.8 downlit_0.4.4 xml2_1.3.6 quarto_1.4.4
[17] pwr_1.3-0 cv_2.0.3 doParallel_1.0.17 iterators_1.0.14
[21] foreach_1.5.2 lm.beta_1.7-2 sandwich_3.1-1 lmtest_0.9-40
[25] zoo_1.8-12 pwrss_0.3.1 knitr_1.49 ggstatsplot_0.12.5
[29] PMCMRplus_1.9.12 downloadthis_0.4.1 car_3.1-3 carData_3.0-5
[33] rstatix_0.7.2 effectsize_1.0.0 psych_2.4.12 ggpubr_0.6.0
[37] nhstplot_1.3.0 naniar_1.1.0 haven_2.5.4 labelled_2.13.0
[41] datawizard_1.0.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[45] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[49] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 pacman_0.5.1
loaded via a namespace (and not attached):
[1] splines_4.4.2 later_1.4.1 rpart_4.1.23
[4] lifecycle_1.0.4 prabclus_2.3-4 processx_3.8.4
[7] MASS_7.3-61 insight_1.0.1 backports_1.5.0
[10] magrittr_2.0.3 Hmisc_5.2-2 rmarkdown_2.29
[13] yaml_2.3.10 flexmix_2.3-19 pbapply_1.7-2
[16] minqa_1.2.8 multcomp_1.4-28 abind_1.4-8
[19] quadprog_1.5-8 nnet_7.3-19 TH.data_1.1-3
[22] correlation_0.8.6 codetools_0.2-20 tidyselect_1.2.1
[25] SuppDists_1.1-9.8 farver_2.1.2 lme4_1.1-35.5
[28] gmp_0.7-5 viridis_0.6.5 stats4_4.4.2
[31] base64enc_0.1-3 jsonlite_1.8.9 Formula_1.2-5
[34] survival_3.7-0 emmeans_1.10.6 BWStest_0.2.3
[37] tools_4.4.2 Rcpp_1.0.13-1 glue_1.8.0
[40] mnormt_2.1.1 gridExtra_2.3 xfun_0.50
[43] mgcv_1.9-1 kSamples_1.2-10 withr_3.0.2
[46] fastmap_1.2.0 boot_1.3-31 digest_0.6.37
[49] timechange_0.3.0 R6_2.5.1 estimability_1.5.1
[52] visdat_0.6.0 colorspace_2.1-1 gtools_3.9.5
[55] jpeg_0.1-10 diptest_0.77-1 generics_0.1.3
[58] data.table_1.16.2 corpcor_1.6.10 robustbase_0.99-4-1
[61] class_7.3-22 htmlwidgets_1.6.4 parameters_0.24.1
[64] pkgconfig_2.0.3 gtable_0.3.6 modeltools_0.2-23
[67] Rmpfr_0.9-5 statsExpressions_1.6.1 htmltools_0.5.8.1
[70] lavaan_0.6-19 multcompView_0.1-10 scales_1.3.0
[73] png_0.1-8 rstudioapi_0.17.1 tzdb_0.4.0
[76] reshape2_1.4.4 coda_0.19-4.1 checkmate_2.3.2
[79] nlme_3.1-166 nloptr_2.1.1 cachem_1.1.0
[82] foreign_0.8-87 pillar_1.10.1 grid_4.4.2
[85] vctrs_0.6.5 xtable_1.8-4 htmlTable_2.4.3
[88] paletteer_1.6.0 evaluate_1.0.3 pbivnorm_0.6.0
[91] zeallot_0.1.0 mvtnorm_1.3-2 cli_3.6.3
[94] compiler_4.4.2 rlang_1.1.4 ggsignif_0.6.4
[97] fdrtool_1.2.18 mclust_6.1.1 rematch2_2.1.2
[100] ps_1.8.1 plyr_1.8.9 stringi_1.8.4
[103] viridisLite_0.4.2 munsell_0.5.1 bayestestR_0.15.1
[106] Matrix_1.7-1 hms_1.1.3 glasso_1.11
[109] kernlab_0.9-33 igraph_2.1.4 broom_1.0.7
[112] memoise_2.0.1 DEoptimR_1.1-3-1
2 Cross Tabulation - Example: Gender and Mobile Phone Design.
2.1 Testing for Independence
200 respondents (100 males, 100 females) were surveyed regarding their preference for smartphone colors. The observed frequencies are reported in the following table:
| | Male customers | Female customers | \(\sum\) |
|---|---|---|---|
| Black | 28 | 12 | 40 |
| Silver | 48 | 36 | 84 |
| White | 24 | 52 | 76 |
| \(\sum\) | 100 | 100 | 200 |
In our example, the analysis draws on a (Pearson) \(\chi^{2}\)-test (without Yates's continuity correction (Yates, 1934)). Furthermore, we know from lectures in statistics that the degrees of freedom (df) in a \(\chi^{2}\)-test equal (number of rows - 1) * (number of columns - 1). Thus, df = (3-1)*(2-1) = 2.
We first create the table within R:
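The code chunk that builds the table is not shown in this output. A possible reconstruction (as a matrix named smartphone_table, the object name used in the chunks below) is:
smartphone_table <- matrix(
  c(28, 12,
    48, 36,
    24, 52),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    color  = c("Black", "Silver", "White"),
    gender = c("Male customers", "Female customers")
  )
)
smartphone_table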
Next, we ask for a \(\chi^{2}\)-test (without Yates's continuity correction). Before doing so, we ask for the critical value of the \(\chi^{2}\) test statistic at an \(\alpha\)=5% error rate and df=2:
critical_value <- qchisq(p=0.95, df=2)
critical_value
[1] 5.991465
Note that this value perfectly mirrors the one discussed during the lecture. Usually, market researchers interpret a statistical test result as significant when the empirical value of the test statistic exceeds the critical value. As seen in the lecture, however, researchers most often prefer to calculate the empirical p-value, which represents the probability of finding a value of the corresponding test statistic that is at least as extreme as the observed one when \(H_{0}\) is true (Field et al., 2012, p. 54).
Let us calculate both the test statistic (\(\chi^{2}\)) and the corresponding p-value.
chi2_test <- smartphone_table %>% chisq.test(correct = FALSE)
chi2_test
Pearson's Chi-squared test
data: .
X-squared = 18.43, df = 2, p-value = 9.953e-05
The test results align with the lecture calculation. The \(\chi^{2}\)-value of 18.43 leads to a significant p-value of roughly 0. Thus, the data support the hypothesis that the two genders differ in their smartphone color preferences.
Let's visualize the actual test statistic value, the critical value (green line), and the theoretical distribution.
chi2_test %>% plotchisqtest() + geom_vline(xintercept = critical_value,
col = "green")
We have seen in the lecture that we can also switch to Fisher's exact test (Fisher, 1922) or Yates's correction for the \(\chi^{2}\)-test. These methods are also commonly used in market research problems (Lichters, Bengart, Sarstedt, & Vogt, 2015; Lichters, Brunnlieb, Nave, Sarstedt, & Vogt, 2016). We will look at Fisher's exact test next. This test can almost always be applied as long as contingency tables are not too big.
Fishers_test <- smartphone_table %>% fisher.test()
Fishers_test
Fisher's Exact Test for Count Data
data: .
p-value = 0.0001077
alternative hypothesis: two.sided
Fisher's exact test is also significant based on the given data (the empirical p-value is close to 0). Note that the test does not come with a test statistic since it is an exact test that does not make assumptions about parameter distributions.
2.2 Effect Sizes for Crosstabs tests
In the lecture, we also discussed a series of effect size measures that market researchers routinely apply when analyzing contingency tables (e.g., Sawyer & Ball, 1981). Here, we have distinguished between 2x2 tables and bigger tables.
2.2.1 2x2 tables
Some effect sizes are only applicable in the case of 2x2 tables. In the lecture, we have discussed the Phi coefficient (Yule, 1912): \(\phi= \sqrt{\frac{\chi^{2}}{N}}\)
We first construct a small 2x2 problem to illustrate its calculation:
| | MMA exam passed | MMA exam failed | \(\sum\) |
|---|---|---|---|
| Regular beer drinker | 28 | 14 | 42 |
| Not drinking beer | 67 | 94 | 161 |
| \(\sum\) | 95 | 108 | 203 |
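As before, the chunk creating the object is not displayed. A possible reconstruction (as a matrix named MMA_exam, the name used below) is:
MMA_exam <- matrix(
  c(28, 14,
    67, 94),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    drinking = c("Regular beer drinker", "Not drinking beer"),
    exam     = c("MMA exam passed", "MMA exam failed")
  )
)
MMA_exam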
The corresponding \(\chi^{2}\)-test (with Yates's continuity correction) would result in:
chi2_test_MMA_exam <- MMA_exam %>% chisq.test(correct = TRUE)
chi2_test_MMA_exam
Pearson's Chi-squared test with Yates' continuity correction
data: .
X-squared = 7.4205, df = 1, p-value = 0.006449
Now, we calculate \(\phi\). Here, we use the effectsize package (Ben-Shachar, Lüdecke, & Makowski, 2020), which by default also corrects phi for small sample sizes. It also provides a function (interpret()) that allows us to interpret an effect's strength according to established guidelines (Cohen, 1988).
Note that below we explicitly use the notation effectsize::phi. This clarifies that we want to use the function phi() from the package effectsize, because other packages provide functions with the same name.
MMA_exam %>%
effectsize::phi(adjust = TRUE) %>%
interpret(rules = "cohen1988")
Phi (adj.) | 95% CI | Interpretation
------------------------------------------
0.19 | [0.05, 1.00] | small
- One-sided CIs: upper bound fixed at [1.00].
- Interpretation rule: cohen1988
2.2.2 Larger tables
Whereas the Phi coefficient is only useful for 2x2 tables, the following effect size measures could additionally be applied to larger tables.
Next, we go for:
- Pearson’s Contingency Coefficient: \(C= \sqrt{\frac{\chi^{2}}{\chi^{2}+N}}\)
- Cohen’s omega (which is not bounded by +1): \(\omega= \sqrt{\frac{C^{2}}{1-C^{2}}}\)
- Cramer’s V: \(V= \sqrt{\frac{\chi^{2}}{n*(min(k, l)-1)}}\)
We verify these formulas by hand after the package-based calculations below.
smartphone_table %>%
effectsize::pearsons_c() %>%
interpret(rules = "cohen1988")
Pearson's C | 95% CI | Interpretation
-------------------------------------------
0.29 | [0.17, 1.00] | small
- One-sided CIs: upper bound fixed at [1.00].
- Interpretation rule: cohen1988
smartphone_table %>%
effectsize::cohens_w() %>%
interpret(rules = "cohen1988")
Cohen's w | 95% CI | Interpretation
-----------------------------------------
0.30 | [0.18, 1.00] | moderate
- One-sided CIs: upper bound fixed at [1.00].
- Interpretation rule: cohen1988
smartphone_table %>%
effectsize::cramers_v() %>%
interpret(rules = "cohen1988")
Cramer's V (adj.) | 95% CI | Interpretation
-------------------------------------------------
0.29 | [0.15, 1.00] | small
- One-sided CIs: upper bound fixed at [1.00].
- Interpretation rule: cohen1988
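For completeness, here is a small sketch that recomputes the three formulas by hand from the objects created above (chi2_test and smartphone_table). Note that the package reports a small-sample-adjusted Cramer's V by default, so the unadjusted manual value can differ slightly from the package output:
chi2  <- unname(chi2_test$statistic)        # 18.43 from the Pearson chi-square test above
N     <- sum(smartphone_table)              # 200 respondents
C     <- sqrt(chi2 / (chi2 + N))            # Pearson's contingency coefficient, approx. 0.29
omega <- sqrt(C^2 / (1 - C^2))              # Cohen's omega (w), equivalent to sqrt(chi2 / N), approx. 0.30
V     <- sqrt(chi2 / (N * (min(dim(smartphone_table)) - 1)))  # unadjusted Cramer's V, approx. 0.30
c(C = C, omega = omega, V = V)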
We see that there is little difference among the different effect size measures for our introductory lecture example. However, even small differences can lead to noticeable variation in the interpretation of effect strength according to Cohen (1988).
3 ANOVA - Example One-way ANOVA on effectiveness of three different in-store promotion campaigns
3.1 Import data from lecturer’s homepage
First, we will download the data from our Chair’s homepage at the Otto-von-Guericke-University.
The dataset is stored as a .rds file within the following directory: https://marketing.ovgu.de/marketing_media/Downloads/MMA/table_6_1_sales.rds
The dataset’s name is table_6_1_sales.rds.
We first set the url (Uniform Resource Locator) for the download.
url_of_file <- "https://marketing.ovgu.de/marketing_media/Downloads/MMA/table_6_1_sales.rds"
We then download the file to our current R project.
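The download chunk itself is not displayed here. A minimal sketch (assuming the file is saved under its original name in the project's working directory) could be:
download.file(url_of_file, destfile = "table_6_1_sales.rds", mode = "wb")  # mode = "wb" for a binary-safe download on Windows
sales_data <- readRDS("table_6_1_sales.rds")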
If anything does not work out as expected on your machine, you can download the data by clicking on the following button.
Let’s have a look at the dataset:
print(sales_data)
# A tibble: 10 × 4
service_type point_of_sale_display free_tasting_stand in_store_announcements
<chr> <dbl> <dbl> <dbl>
1 Personal 50 55 45
2 Personal 52 55 50
3 Personal 43 49 45
4 Personal 48 57 46
5 Personal 47 55 42
6 Self-service 45 49 43
7 Self-service 44 48 42
8 Self-service 49 54 45
9 Self-service 51 54 47
10 Self-service 44 44 42
3.2 Check assumptions
Of course, the sample size is tiny. It would be better to have at least n=30 per group.
3.2.1 Test for normality via Shapiro-Wilk
For testing normality, we rely on the shapiro_test() function provided by the package rstatix (Kassambara, 2023), which conducts the Shapiro-Wilk test (Shapiro & Wilk, 1965).
shapiro_test(sales_data$free_tasting_stand)
# A tibble: 1 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 sales_data$free_tasting_stand 0.879 0.127
shapiro_test(sales_data$point_of_sale_display)
# A tibble: 1 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 sales_data$point_of_sale_display 0.936 0.507
shapiro_test(sales_data$in_store_announcements)
# A tibble: 1 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 sales_data$in_store_announcements 0.896 0.197
As the empirical p-values for all three Shapiro-Wilk tests are clearly above 5%, we conclude that the observed group-wise sales data are not unlikely (enough) under a population in which \(H_{0}\) is true. Thus, we see support for the normality of the sales data within each group.
3.2.2 Check for Equality of Variances Assumption via Levene
To compute Levene’s test for homogeneity of variance across groups, we rely on the function leveneTest() provided by the package car (Fox & Weisberg, 2019). One drawback of doing so is that we first have to flip the dataset from the wide into the long (sometimes also called stacked) format. The next code chunk does this job with the help of the function pivot_longer(), provided by the tidyr package (Wickham, Vaughan, & Girlich, 2024).
Transform to long format
sales_data_long <- sales_data %>% pivot_longer(
cols = point_of_sale_display:in_store_announcements,
names_to = "promotion_type",
values_to = "sales"
)
## arrange factor levels
sales_data_long$promotion_type <- factor(sales_data_long$promotion_type,
levels = c("point_of_sale_display", "free_tasting_stand", "in_store_announcements")
)
sales_data_long$promotion_type <- to_labelled(sales_data_long$promotion_type)
sales_data_long$service_type <- factor(sales_data_long$service_type,
levels = c("Personal", "Self-service")
)
sales_data_long$service_type <- to_labelled(sales_data_long$service_type)
## sort
sales_data_long <- sales_data_long %>% arrange(promotion_type, service_type)
##see data in long format
print(sales_data_long, n = 30)
# A tibble: 30 × 3
service_type promotion_type sales
<dbl+lbl> <dbl+lbl> <dbl>
1 1 [Personal] 1 [point_of_sale_display] 50
2 1 [Personal] 1 [point_of_sale_display] 52
3 1 [Personal] 1 [point_of_sale_display] 43
4 1 [Personal] 1 [point_of_sale_display] 48
5 1 [Personal] 1 [point_of_sale_display] 47
6 2 [Self-service] 1 [point_of_sale_display] 45
7 2 [Self-service] 1 [point_of_sale_display] 44
8 2 [Self-service] 1 [point_of_sale_display] 49
9 2 [Self-service] 1 [point_of_sale_display] 51
10 2 [Self-service] 1 [point_of_sale_display] 44
11 1 [Personal] 2 [free_tasting_stand] 55
12 1 [Personal] 2 [free_tasting_stand] 55
13 1 [Personal] 2 [free_tasting_stand] 49
14 1 [Personal] 2 [free_tasting_stand] 57
15 1 [Personal] 2 [free_tasting_stand] 55
16 2 [Self-service] 2 [free_tasting_stand] 49
17 2 [Self-service] 2 [free_tasting_stand] 48
18 2 [Self-service] 2 [free_tasting_stand] 54
19 2 [Self-service] 2 [free_tasting_stand] 54
20 2 [Self-service] 2 [free_tasting_stand] 44
21 1 [Personal] 3 [in_store_announcements] 45
22 1 [Personal] 3 [in_store_announcements] 50
23 1 [Personal] 3 [in_store_announcements] 45
24 1 [Personal] 3 [in_store_announcements] 46
25 1 [Personal] 3 [in_store_announcements] 42
26 2 [Self-service] 3 [in_store_announcements] 43
27 2 [Self-service] 3 [in_store_announcements] 42
28 2 [Self-service] 3 [in_store_announcements] 45
29 2 [Self-service] 3 [in_store_announcements] 47
30 2 [Self-service] 3 [in_store_announcements] 42
See value labels
sales_data_long$promotion_type %>% val_labels()
point_of_sale_display free_tasting_stand in_store_announcements
1 2 3
sales_data_long$service_type %>% val_labels()
Personal Self-service
1 2
Levene’s test
sales_data_long %>%
leveneTest(sales ~ labelled::to_factor(promotion_type), data = .)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.8287 0.4474
27
As the empirical p-value is not significant, we conclude that we have little support for a violation of the assumption of homoscedasticity. Thus, we can proceed with the one-way ANOVA.
3.3 Conduct the One-way ANOVA
Note that below, we apply the function to_factor() from the labelled package (Larmarange, 2023) to the variable promotion_type. We must do so because otherwise, the aov() function will not recognize that we have a categorical variable. Thus, we have to explicitly tell R to handle this variable as a categorical factor.
one_way_ANOVA_results <- aov(
formula = sales ~ labelled::to_factor(promotion_type),
data = sales_data_long
) %>% anova_summary(detailed = TRUE)
one_way_ANOVA_results
Effect DFn DFd SSn SSd F p p<.05 ges
1 1 2 27 273.8 310.2 11.916 0.000195 1 0.469
The results are identical to the lecture example. In particular, the between-sum-of-squares (numerator) is 273.8, whereas the within-sum-of-squares (denominator) is 310.2. As a result, the F-ratio is 11.92, with a corresponding p-value of approx. 0. Because the empirical p-value is below 0.05, we conclude that the data are very unlikely under a population in which \(H_{0}\) holds true. Thus, we find support for \(H_{A}\): the three promotion types result in different sales numbers.
Keep in mind that from the global F-test in ANOVA, we aren’t able to tell which levels of the factor variable (i.e., which advertising groups) are different from each other. Thus, in the absence of hypotheses, we have to apply post-hoc tests (Sarstedt & Mooi, 2019, Chapter 6). For an application in market research, see Girard et al. (2019).
3.3.1 Post-hoc Test
One option, when the assumptions of normality and homoscedasticity are met, is to use Tukey’s procedure (Tukey, 1949).
We will do so next, by using the function tukeyTest(), provided by the PMCMRplus package (Pohlert, 2023). We present the results next to group-wise descriptive statistics.
sales_data_long %>%
group_by(promotion_type) %>%
summarise(
mean = mean(sales),
SD = sd(sales),
n = n()
)
# A tibble: 3 × 4
promotion_type mean SD n
<dbl+lbl> <dbl> <dbl> <int>
1 1 [point_of_sale_display] 47.3 3.20 10
2 2 [free_tasting_stand] 52 4.19 10
3 3 [in_store_announcements] 44.7 2.58 10
PMCMRplus::tukeyTest(
formula = sales ~ labelled::to_factor(promotion_type),
data = sales_data_long
)
Pairwise comparisons using Tukey's test
data: sales by labelled::to_factor(promotion_type)
point_of_sale_display free_tasting_stand
free_tasting_stand 0.01207 -
in_store_announcements 0.21797 0.00014
P value adjustment method: single-step
alternative hypothesis: two.sided
We see that the difference between the advertising types POS and free tasting stand is statistically significant. Likewise, the difference between the free tasting stand and in-store announcements is significant. However, the test doesn’t support a significant difference between POS and in-store announcements. As the free tasting stand results in the highest sales numbers (significantly higher than the other two types), we would recommend implementing this type of advertising in the future (given that all three types cost the same).
3.3.2 Doing One-way ANOVA graphically
As an alternative to the test procedures above, we can visualize the ANOVA results with the help of the function ggbetweenstats() provided by the package ggstatsplot (Patil, 2021).
sales_data_long %>% ggstatsplot::ggbetweenstats(
y = sales,
x = promotion_type,
var.equal = TRUE,
p.adjust.method = "bonferroni",
bf.message = FALSE
)
The results mirror those seen before. Note, however, that the post-hoc procedure used here is a different one based on the well-known Bonferroni correction (Bonferroni, 1936).
4 ANOVA - Example Two-way ANOVA on effectiveness of three different in-store promotion campaigns and two service types
This example uses the same data set as the one-way ANOVA.
Let’s have a look at the data set:
print(sales_data)
# A tibble: 10 × 4
service_type point_of_sale_display free_tasting_stand in_store_announcements
<chr> <dbl> <dbl> <dbl>
1 Personal 50 55 45
2 Personal 52 55 50
3 Personal 43 49 45
4 Personal 48 57 46
5 Personal 47 55 42
6 Self-service 45 49 43
7 Self-service 44 48 42
8 Self-service 49 54 45
9 Self-service 51 54 47
10 Self-service 44 44 42
4.1 Check assumptions
Of course, the sample size is tiny. It would be better to have at least n=30 per group.
4.1.1 Test for normality (Shapiro-Wilk)
For testing normality, we rely on the shapiro_test() function provided by the package rstatix (Kassambara, 2023), which conducts the Shapiro-Wilk test (Shapiro & Wilk, 1965).
sales_data_long %>%
group_by(service_type, promotion_type) %>%
rstatix::shapiro_test(sales)
# A tibble: 6 × 5
service_type promotion_type variable statistic p
<dbl+lbl> <dbl+lbl> <chr> <dbl> <dbl>
1 1 [Personal] 1 [point_of_sale_display] sales 0.978 0.921
2 1 [Personal] 2 [free_tasting_stand] sales 0.768 0.0435
3 1 [Personal] 3 [in_store_announcements] sales 0.931 0.601
4 2 [Self-service] 1 [point_of_sale_display] sales 0.833 0.147
5 2 [Self-service] 2 [free_tasting_stand] sales 0.900 0.410
6 2 [Self-service] 3 [in_store_announcements] sales 0.871 0.272
The p-value for the cell personal service x Free tasting stand is lower than 5%. Thus, the assumption of normality is not met in our example. Because of the small sample size, we usually would prefer a non-parametric alternative to Two-way ANOVA. However, for this introductory example, we will ignore this.
4.1.2 Check for Equality of Variances Assumption (Levene)
Levene’s test
sales_data_long %>%
car::leveneTest(data = .,
sales ~ labelled::to_factor(service_type) * labelled::to_factor(promotion_type))
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 0.383 0.8554
24
As the empirical p-value is not significant, we conclude that we have little support for a violation of the assumption of homoscedasticity.
4.2 Conduct the Two-way ANOVA
We first use the aov() function (stats package) and then the anova_test() function provided by rstatix (Kassambara, 2023), which preserves the factor names in the output.
two_way_ANOVA_results <- aov(
formula = sales ~ labelled::to_factor(service_type) * labelled::to_factor(promotion_type),
data = sales_data_long
) %>%
anova_summary(detailed = TRUE)
two_way_ANOVA_results
Effect DFn DFd SSn SSd F p p<.05 ges
1 2 1 24 48.133 248.8 4.643 0.04100 2 0.162
2 1 2 24 273.800 248.8 13.206 0.00014 2 0.524
3 3 2 24 13.267 248.8 0.640 0.53600 1 0.051
two_way_ANOVA_results <- sales_data_long %>%
anova_test(sales ~ labelled::to_factor(service_type) * labelled::to_factor(promotion_type),
type = 3,
detailed = TRUE
)
two_way_ANOVA_results
ANOVA Table (type III tests)
Effect
1 (Intercept)
2 labelled::to_factor(service_type)
3 labelled::to_factor(promotion_type)
4 labelled::to_factor(service_type):labelled::to_factor(promotion_type)
SSn SSd DFn DFd F p p<.05 ges
1 69120.000 248.8 1 24 6667.524 7.32e-31 * 0.996
2 48.133 248.8 1 24 4.643 4.10e-02 * 0.162
3 273.800 248.8 2 24 13.206 1.36e-04 * 0.524
4 13.267 248.8 2 24 0.640 5.36e-01 0.051
The results are identical to the lecture example. In particular, the F-ratio is significant for the promotion-type factor and for the service-type factor. In contrast, the interaction between both factors is not significant, as its p-value clearly exceeds 5%. Thus, the observed data do not support the claim that the influence that the different promotion types have on sales depends on the service type provided (and vice versa).
To complete the picture, we can visualize the means of each condition in an interaction plot (although we have found out that the interaction in our example is not significant).
sales_data_long %>%
ggplot(aes(x = labelled::to_factor(promotion_type),
color = labelled::to_factor(service_type),
group = labelled::to_factor(service_type),
y = sales,
linetype = labelled::to_factor(service_type),
shape = labelled::to_factor(service_type))) +
stat_summary(fun = mean, geom = "path", size = 1.1) +
stat_summary(fun = mean, geom = "point", size = 3, color = "black") +
ggtitle("Interaction plot of means") +
labs(x = "Type of promotion campaign",
y = "mean of sales") +
theme(plot.title = element_text(hjust = 0.5))
4.3 Post-hoc Tests in two-way ANOVA
In the case of two-way ANOVA, we first need to establish a new variable indicating the cell membership for each case (i.e., 1, 2, …, or 6).
sales_data_long <- sales_data_long %>%
mutate(
cell_membership = case_when(
service_type == 1 & promotion_type == 1 ~ 1,
service_type == 1 & promotion_type == 2 ~ 2,
service_type == 1 & promotion_type == 3 ~ 3,
service_type == 2 & promotion_type == 1 ~ 4,
service_type == 2 & promotion_type == 2 ~ 5,
service_type == 2 & promotion_type == 3 ~ 6,
TRUE ~ NA
)
)
val_labels(sales_data_long$cell_membership) <- c(
"Personal | point_of_sale_display" = 1,
"Personal | free_tasting_stand" = 2,
"Personal | in_store_announcements" = 3,
"Self-service | point_of_sale_display" = 4,
"Self-service | free_tasting_stand" = 5,
"Self-service | in_store_announcements" = 6
)
print(sales_data_long$cell_membership)
<labelled<double>[30]>
[1] 1 1 1 1 1 4 4 4 4 4 2 2 2 2 2 5 5 5 5 5 3 3 3 3 3 6 6 6 6 6
Labels:
value label
1 Personal | point_of_sale_display
2 Personal | free_tasting_stand
3 Personal | in_store_announcements
4 Self-service | point_of_sale_display
5 Self-service | free_tasting_stand
6 Self-service | in_store_announcements
One option, when the assumptions of normality and homoscedasticity are met, is to use Tukey’s procedure (Tukey, 1949).
We present the results next to group-wise descriptive statistics.
sales_data_long %>%
group_by(service_type, promotion_type) %>%
summarise(
mean = mean(sales),
SD = sd(sales),
n = n()
)
`summarise()` has grouped output by 'service_type'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups: service_type [2]
service_type promotion_type mean SD n
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <int>
1 1 [Personal] 1 [point_of_sale_display] 48 3.39 5
2 1 [Personal] 2 [free_tasting_stand] 54.2 3.03 5
3 1 [Personal] 3 [in_store_announcements] 45.6 2.88 5
4 2 [Self-service] 1 [point_of_sale_display] 46.6 3.21 5
5 2 [Self-service] 2 [free_tasting_stand] 49.8 4.27 5
6 2 [Self-service] 3 [in_store_announcements] 43.8 2.17 5
post_hoc_results <- PMCMRplus::tukeyTest(
formula = sales ~ labelled::to_factor(cell_membership),
data = sales_data_long
)
post_hoc_results %>% summary()
Pairwise comparisons using Tukey's test
data: sales by labelled::to_factor(cell_membership)
alternative hypothesis: two.sided
P value adjustment method: single-step
H0
q value
Personal | free_tasting_stand - Personal | point_of_sale_display == 0 4.306
Personal | in_store_announcements - Personal | point_of_sale_display == 0 -1.667
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0 -0.972
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0 1.250
Self-service | in_store_announcements - Personal | point_of_sale_display == 0 -2.917
Personal | in_store_announcements - Personal | free_tasting_stand == 0 -5.973
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0 -5.278
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0 -3.056
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 -7.223
Self-service | point_of_sale_display - Personal | in_store_announcements == 0 0.694
Self-service | free_tasting_stand - Personal | in_store_announcements == 0 2.917
Self-service | in_store_announcements - Personal | in_store_announcements == 0 -1.250
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0 2.222
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0 -1.945
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0 -4.167
Pr(>|q|)
Personal | free_tasting_stand - Personal | point_of_sale_display == 0 0.05532680
Personal | in_store_announcements - Personal | point_of_sale_display == 0 0.84251870
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0 0.98165855
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0 0.94674723
Self-service | in_store_announcements - Personal | point_of_sale_display == 0 0.33899903
Personal | in_store_announcements - Personal | free_tasting_stand == 0 0.00360819
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0 0.01172691
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0 0.29180266
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 0.00040876
Self-service | point_of_sale_display - Personal | in_store_announcements == 0 0.99604508
Self-service | free_tasting_stand - Personal | in_store_announcements == 0 0.33899903
Self-service | in_store_announcements - Personal | in_store_announcements == 0 0.94674723
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0 0.62393032
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0 0.74090587
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0 0.06806693
Personal | free_tasting_stand - Personal | point_of_sale_display == 0 .
Personal | in_store_announcements - Personal | point_of_sale_display == 0
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0
Self-service | in_store_announcements - Personal | point_of_sale_display == 0
Personal | in_store_announcements - Personal | free_tasting_stand == 0 **
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0 *
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 ***
Self-service | point_of_sale_display - Personal | in_store_announcements == 0
Self-service | free_tasting_stand - Personal | in_store_announcements == 0
Self-service | in_store_announcements - Personal | in_store_announcements == 0
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
post_hoc_results$p.value %>% kable(digits = 3,
caption = "Pair-wise adjusted p-values (undirected)")
| | Personal \| point_of_sale_display | Personal \| free_tasting_stand | Personal \| in_store_announcements | Self-service \| point_of_sale_display | Self-service \| free_tasting_stand |
|---|---|---|---|---|---|
| Personal \| free_tasting_stand | 0.055 | NA | NA | NA | NA |
| Personal \| in_store_announcements | 0.843 | 0.004 | NA | NA | NA |
| Self-service \| point_of_sale_display | 0.982 | 0.012 | 0.996 | NA | NA |
| Self-service \| free_tasting_stand | 0.947 | 0.292 | 0.339 | 0.624 | NA |
| Self-service \| in_store_announcements | 0.339 | 0.000 | 0.947 | 0.741 | 0.068 |
From the table above, we can see which pair-wise group comparisons are significant (p-value below 5%) after applying a correction for multiple tests.
From the Shapiro-Wilk test, however, we have seen that the assumption of normality is violated, and we have a small sample size that isn’t able to compensate for the lack of normality. Usually, we would, therefore, use a post-hoc procedure that is robust against violations of normality, such as the Dwass-Steel-Critchlow-Fligner procedure (Douglas & Michael, 1991). We can always use this procedure, but its downside is that it is conservative (i.e., it is less likely to find significant differences across groups).
Next, we apply this variant, which is also provided by the PMCMRplus package (Pohlert, 2023). The function is named dscfAllPairsTest().
post_hoc_results <- PMCMRplus::dscfAllPairsTest(
formula = sales ~ labelled::to_factor(cell_membership),
data = sales_data_long
)
post_hoc_results %>% summary()
Pairwise comparisons using Dwass-Steele-Critchlow-Fligner all-pairs test
data: sales by labelled::to_factor(cell_membership)
P value adjustment method: single-step
H0
q value
Personal | free_tasting_stand - Personal | point_of_sale_display == 0 -3.140
Personal | in_store_announcements - Personal | point_of_sale_display == 0 -1.783
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0 -0.741
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0 -1.189
Self-service | in_store_announcements - Personal | point_of_sale_display == 0 -2.832
Personal | in_store_announcements - Personal | free_tasting_stand == 0 -3.450
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0 -3.310
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0 -3.009
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 -3.750
Self-service | point_of_sale_display - Personal | in_store_announcements == 0 -0.150
Self-service | free_tasting_stand - Personal | in_store_announcements == 0 -1.932
Self-service | in_store_announcements - Personal | in_store_announcements == 0 -1.363
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0 -1.505
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0 -2.087
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0 -3.121
Pr(>|q|)
Personal | free_tasting_stand - Personal | point_of_sale_display == 0 0.228177
Personal | in_store_announcements - Personal | point_of_sale_display == 0 0.806241
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0 0.995252
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0 0.959911
Self-service | in_store_announcements - Personal | point_of_sale_display == 0 0.340583
Personal | in_store_announcements - Personal | free_tasting_stand == 0 0.142739
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0 0.177777
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0 0.272703
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 0.085254
Self-service | point_of_sale_display - Personal | in_store_announcements == 0 0.999998
Self-service | free_tasting_stand - Personal | in_store_announcements == 0 0.747418
Self-service | in_store_announcements - Personal | in_store_announcements == 0 0.929369
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0 0.895657
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0 0.679885
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0 0.234442
Personal | free_tasting_stand - Personal | point_of_sale_display == 0
Personal | in_store_announcements - Personal | point_of_sale_display == 0
Self-service | point_of_sale_display - Personal | point_of_sale_display == 0
Self-service | free_tasting_stand - Personal | point_of_sale_display == 0
Self-service | in_store_announcements - Personal | point_of_sale_display == 0
Personal | in_store_announcements - Personal | free_tasting_stand == 0
Self-service | point_of_sale_display - Personal | free_tasting_stand == 0
Self-service | free_tasting_stand - Personal | free_tasting_stand == 0
Self-service | in_store_announcements - Personal | free_tasting_stand == 0 .
Self-service | point_of_sale_display - Personal | in_store_announcements == 0
Self-service | free_tasting_stand - Personal | in_store_announcements == 0
Self-service | in_store_announcements - Personal | in_store_announcements == 0
Self-service | free_tasting_stand - Self-service | point_of_sale_display == 0
Self-service | in_store_announcements - Self-service | point_of_sale_display == 0
Self-service | in_store_announcements - Self-service | free_tasting_stand == 0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
post_hoc_results$p.value %>% kable(digits = 3,
caption = "Pair-wise adjusted p-values (undirected)")
| | Personal \| point_of_sale_display | Personal \| free_tasting_stand | Personal \| in_store_announcements | Self-service \| point_of_sale_display | Self-service \| free_tasting_stand |
|---|---|---|---|---|---|
| Personal \| free_tasting_stand | 0.228 | NA | NA | NA | NA |
| Personal \| in_store_announcements | 0.806 | 0.143 | NA | NA | NA |
| Self-service \| point_of_sale_display | 0.995 | 0.178 | 1.000 | NA | NA |
| Self-service \| free_tasting_stand | 0.960 | 0.273 | 0.747 | 0.896 | NA |
| Self-service \| in_store_announcements | 0.341 | 0.085 | 0.929 | 0.680 | 0.234 |
4.4 Effect sizes in ANOVA
As learned during the lecture, we can generally recommend using the global and partial versions of epsilon² (\(\epsilon\)²) (Iacobucci, Popovich, Moon, & Román, 2022). In a situation where the total sample size n is at a minimum of 200, and we have only a few levels in the largest factor, we could rely on eta² (\(\eta\)²). In market research, we are most of the time interested in the explanatory power of the factors (the partial effect sizes). Let us compute both (partial \(\epsilon\)² and \(\eta\)²). We again draw on the effectsize package (Ben-Shachar et al., 2020).
two_way_ANOVA_results <- aov(
formula = sales ~ labelled::to_factor(service_type) * labelled::to_factor(promotion_type),
data = sales_data_long
)
two_way_ANOVA_results %>% effectsize::eta_squared() %>% kable(digits = 3)
Parameter | Eta2_partial | CI | CI_low | CI_high |
---|---|---|---|---|
labelled::to_factor(service_type) | 0.162 | 0.95 | 0.004 | 1 |
labelled::to_factor(promotion_type) | 0.524 | 0.95 | 0.261 | 1 |
labelled::to_factor(service_type):labelled::to_factor(promotion_type) | 0.051 | 0.95 | 0.000 | 1 |
two_way_ANOVA_results %>% effectsize::epsilon_squared() %>% kable(digits = 3)
Parameter | Epsilon2_partial | CI | CI_low | CI_high |
---|---|---|---|---|
labelled::to_factor(service_type) | 0.127 | 0.95 | 0.000 | 1 |
labelled::to_factor(promotion_type) | 0.484 | 0.95 | 0.214 | 1 |
labelled::to_factor(service_type):labelled::to_factor(promotion_type) | 0.000 | 0.95 | 0.000 | 1 |
From the tables above, we see that the values for partial \(\epsilon\)² and \(\eta\)² perfectly mirror those we calculated during the lecture. We conclude that the influence of the promotion type on sales far exceeds the influence of the service type.
5 OLS Regression
In the following section, we reproduce the regression example from the lecture within R. The example is modified from Table 7 of Chapter 7 in Sarstedt & Mooi (2019).
30 randomly sampled supermarkets of comparable size that sell “Organic Valley Milk” reported their weekly sales of Organic Whole Milk (0.25-gallon cartons), its price, and an index of promotional activities.
Question: Does the sample’s data support the idea that variance in price and promotion activities is related to different sales figures?
In total, we will go through the steps that we have discussed during the lecture:
- Checking the regression analysis data requirements
- Specifying and estimating the regression model
- Testing the regression analysis assumptions
- Interpreting the regression results and Goodness of Fit (GoF)
- Validating the regression model
- Using the regression model for predictions
However, we will only focus on the steps that involve applications in R. Therefore, we will not repeat rules of thumb (for example, regarding sample sizes, etc.).
First, we will download the data from our Chair’s homepage at the Otto-von-Guericke-University.
5.1 Import the regression data
The dataset is stored as a .rds file within the following directory: https://marketing.ovgu.de/marketing_media/Downloads/MMA/sales_data_regression.rds
The dataset’s name is sales_data_regression.rds.
We first set the url (Uniform Resource Locator) for the download.
url_of_file <- "https://marketing.ovgu.de/marketing_media/Downloads/MMA/sales_data_regression.rds"
Then, we download the file to our current R project.
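As before, the download chunk is not displayed. A minimal sketch (assuming the file is saved under its original name in the project's working directory) could be:
download.file(url_of_file, destfile = "sales_data_regression.rds", mode = "wb")  # mode = "wb" for a binary-safe download on Windows
sales_data_regression <- readRDS("sales_data_regression.rds")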
If anything does not work out as expected on your machine, you can download the data by clicking on the following button.
Let’s have a look at the dataset:
print(sales_data_regression, n=30)
# A tibble: 30 × 4
supermarket promotion own_price sales
<dbl> <dbl> <dbl> <dbl>
1 1 12.0 1.1 3454
2 2 22.0 1.08 3966
3 3 22.0 1.08 2952
4 4 22.0 1.08 3576
5 5 21.4 1.08 3692
6 6 22.0 1.08 3226
7 7 47.0 1.09 3776
8 8 117. 1.05 14134
9 9 92.0 1.1 5114
10 10 62.0 1.08 4022
11 11 87.0 1.12 4492
12 12 92.0 1.02 10186
13 13 61.4 1.08 7010
14 14 72.0 1.06 4162
15 15 47.0 1.13 3446
16 16 37.0 1.05 3690
17 17 12.0 1.1 3742
18 18 62.0 1.08 7512
19 19 62.0 1.08 9476
20 20 22.0 1.08 3178
21 21 2.04 1.12 2920
22 22 42.0 1.04 8212
23 23 17.0 1.09 3272
24 24 7.04 1.11 2808
25 25 2.04 1.12 2648
26 26 7.04 1.11 3786
27 27 2.04 1.12 2908
28 28 62.0 1.08 3395
29 29 82.0 1.04 4106
30 30 132. 1.02 8754
5.2 Step 1: Checking the regression analysis data requirements
5.2.1 Sample size (detectable R²)
In addition to the things we have already discussed during the lecture, we could use a Power Analysis to evaluate which size of effect (in terms of overall R² value) we can detect at \(\alpha\)=5% with a certain level of statistical power (1-\(\beta\)), based on our n=30 sample size.
In essence, we are looking for the statistical power to detect an R² of a certain level as statistically significant in the global F-test of the regression. Think back to the general form of multiple regression: \(Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{n}X_{n} + e, \quad e \sim N(0,\sigma^{2})\)
Recall: the global F-test becomes significant when it is improbable to observe the sample’s estimated \(\beta_{n}\)s \(\neq0\) if, in reality, they are all 0 (i.e., if \(H_{0}\) holds true in the underlying population). Further recall that in the lecture, we used the following rules of thumb (Cohen, 1988, p. 79):
| | Interpretation |
|---|---|
| R² < 0.10 | very weak model fit |
| 0.10 \(\leq\) R² < 0.30 | weak model fit |
| 0.30 \(\leq\) R² < 0.50 | moderate model fit |
| R² \(\geq\) 0.50 | substantial fit |
Thus, we could ask what chance we have of ending up with a significant F-test when the underlying population’s true model fit is weak (0.10) or moderate (0.30).
We use the pwrss.f.reg() function from the pwrss package (Bulus, 2023).
pwrss.f.reg(n = 30, k = 2, alpha = 0.05, r2 = 0.10)
Linear Regression (F test)
R-squared Deviation from 0 (zero)
H0: r2 = 0
HA: r2 > 0
------------------------------
Statistical power = 0.32
n = 30
------------------------------
Numerator degrees of freedom = 2
Denominator degrees of freedom = 27
Non-centrality parameter = 3.333
Type I error rate = 0.05
Type II error rate = 0.68
pwrss.f.reg(n = 30, k = 2, alpha = 0.05, r2 = 0.30)
Linear Regression (F test)
R-squared Deviation from 0 (zero)
H0: r2 = 0
HA: r2 > 0
------------------------------
Statistical power = 0.868
n = 30
------------------------------
Numerator degrees of freedom = 2
Denominator degrees of freedom = 27
Non-centrality parameter = 12.857
Type I error rate = 0.05
Type II error rate = 0.132
From the outputs above, we see that the corresponding values for the Power are 0.32 and 0.868. Therefore, we conclude that we have a fair chance of finding a significant F-test when the underlying R² represents a moderate level. However, we do not have a fair chance when the underlying true model fit is only weak.
5.2.2 Multicollinearity among independent variables
To assess whether our independent variables (IVs) are strongly correlated with each other, we will create a correlation matrix. Recall that in linear regression, we want each IV to be highly correlated with the dependent variable (DV) but not with other IVs.
We can visualize the correlation matrix with the help of the function ggcorrmat() provided by the package ggstatsplot (Patil, 2021).
sales_data_regression %>%
select(sales, own_price, promotion) %>%
ggstatsplot::ggcorrmat(p.adjust.method = "bonferroni")
From the outputs above, we see that the corresponding predictor variables promotion and own_price are substantially (and statistically significantly) correlated with each other. This indicates the presence of collinearity.
To assess whether the extent of collinearity among predictors is a problem, we further call for the variance inflation factors (VIF). The package car (Fox & Weisberg, 2019) provides the variance inflation factor as the function vif() for this purpose. Below, lm() is a generic function that helps us to estimate linear regression models.
lm(formula = sales ~ own_price + promotion, data = sales_data_regression) %>% vif()
own_price promotion
1.566884 1.566884
We see that the VIF values for both predictors are around 1.567 (note that they will always be of the same size for models with only two independent variables). As this value is far below the threshold of 5, we conclude that the extent of multicollinearity is acceptable.
5.3 Step 2: Specifying and estimating the regression model
With regard to Step 2, we have discussed during the lecture how to estimate the \(\beta_{n}\)s mathematically through the application of basic matrix manipulations.
For the sake of completeness, we can subsequently let R do this job for us. Recall that:
\[ \hat{\beta}=(X^{T}X)^{-1}X^{T}y \]
Thus, we need to extract several things from our data set sales_data_regression. First, we create the design matrix X, which contains the independent variables as columns. Keep in mind that if we want to estimate the intercept \(\alpha\), we also need to include a first column consisting only of 1s.
X <- cbind(1, sales_data_regression$own_price, sales_data_regression$promotion)
X
[,1] [,2] [,3]
[1,] 1 1.10 12.04
[2,] 1 1.08 22.04
[3,] 1 1.08 22.04
[4,] 1 1.08 22.04
[5,] 1 1.08 21.42
[6,] 1 1.08 22.04
[7,] 1 1.09 47.04
[8,] 1 1.05 117.04
[9,] 1 1.10 92.04
[10,] 1 1.08 62.04
[11,] 1 1.12 87.04
[12,] 1 1.02 92.04
[13,] 1 1.08 61.42
[14,] 1 1.06 72.04
[15,] 1 1.13 47.04
[16,] 1 1.05 37.04
[17,] 1 1.10 12.04
[18,] 1 1.08 62.04
[19,] 1 1.08 62.04
[20,] 1 1.08 22.04
[21,] 1 1.12 2.04
[22,] 1 1.04 42.04
[23,] 1 1.09 17.04
[24,] 1 1.11 7.04
[25,] 1 1.12 2.04
[26,] 1 1.11 7.04
[27,] 1 1.12 2.04
[28,] 1 1.08 62.04
[29,] 1 1.04 82.04
[30,] 1 1.02 132.04
Finding the transposed form of X (\(X^{T}\)) is simple:
t(X)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
[2,] 1.10 1.08 1.08 1.08 1.08 1.08 1.09 1.05 1.10 1.08 1.12 1.02
[3,] 12.04 22.04 22.04 22.04 21.42 22.04 47.04 117.04 92.04 62.04 87.04 92.04
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
[2,] 1.08 1.06 1.13 1.05 1.10 1.08 1.08 1.08 1.12 1.04 1.09 1.11
[3,] 61.42 72.04 47.04 37.04 12.04 62.04 62.04 22.04 2.04 42.04 17.04 7.04
[,25] [,26] [,27] [,28] [,29] [,30]
[1,] 1.00 1.00 1.00 1.00 1.00 1.00
[2,] 1.12 1.11 1.12 1.08 1.04 1.02
[3,] 2.04 7.04 2.04 62.04 82.04 132.04
We also have to extract the vector y containing the values for the dependent variable:
y <- cbind(sales_data_regression$sales)
Now, we are ready to multiply the elements. In R, multiplying a matrix A by a matrix B from the right is expressed through the operator %*% (i.e., A %*% B). The function solve() is helpful for finding the inverse of a matrix (i.e., for solving the equation system).
To find \((X^{T}X)^{-1}\), we apply:
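The corresponding code chunk is not displayed in this output; presumably it amounts to the following line, whose result is printed below:
solve(t(X) %*% X)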
[,1] [,2] [,3]
[1,] 77.96711074 -70.52149974 -3.569011e-02
[2,] -70.52149974 63.85887945 3.122106e-02
[3,] -0.03569011 0.03122106 4.219068e-05
Similarly, to find the vector projection \(X^{T}y\), we apply:
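Again, the chunk itself is hidden; a minimal version of this step would be:
t(X) %*% y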
Combining both pieces, we find the estimated regression coefficients by:
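A minimal sketch of the hidden chunk (the object name beta_hat is an illustrative choice, not from the original script):
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat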
Note that the first element is the intercept \(\alpha\), whereas the second and the third elements represent the regression coefficients \(\beta_{1}\) and \(\beta_{2}\) for the independent variables own_price and promotion. Also note that the values perfectly mirror those that we have calculated manually during the lecture.
We double-check via the function lm(), which helps us to estimate linear models.
lm(formula = sales ~ own_price + promotion, data=sales_data_regression)
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Coefficients:
(Intercept) own_price promotion
30304.05 -25209.86 42.27
5.4 Step 3: Testing the regression analysis assumptions
Gauss-Markov assumptions
We have to work on a list of assumptions:
- linearity
- bias-free errors
- homoscedasticity
- absence of autocorrelation
- normally distributed errors (optional)
Let us start with linearity
5.4.1 Linearity in parameters
Note that this assumption is met, since we have formulated the regression equation in a linear way: \(Y=\alpha + \beta_{1}*price + \beta_{2}*promotion + e\)
However, researchers often use this step to additionally evaluate whether it makes sense to transform the independent (IV) or the dependent (DV) variables. This is useful when the relationship between the DV and an IV is not well approximated by a linear one. To get further insight into this issue, we can build a scatter plot of each IV against the DV. Or, we can ask for the correlation matrix of the variables involved (if the Pearson correlations are high, a linear relationship seems plausible). Above, we have already created such a correlation matrix.
It seems that a linear relationship between both IVs and the DV fits well, as both absolute correlations are substantial and significantly different from 0.
In addition, we can apply Ramsey’s RESET test (Cook & Weisberg, 1983; Ramsey, 1969) to evaluate whether the relationship between the IVs and the DV is linear.
Let us first save the calibrated regression model:
regression_model <- lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
The package lmtest (Zeileis & Hothorn, 2002) provides Ramsey’s RESET test in the form of the function resettest().
regression_model %>% lmtest::resettest()
RESET test
data: .
RESET = 1.0532, df1 = 2, df2 = 25, p-value = 0.3638
As the correlations are substantial and significant, whereas Ramsey’s RESET test is not, it is plausible to assume that a linear relationship between both IVs and the DV fits well.
5.4.2 Bias-free errors
In the absence of systematic errors (e), their expected value should be 0. We can only observe the empirical residuals for each observation in our regression model. Usually, one tests whether the mean of the observed residuals differs significantly from 0.
We can assess the observation-wise errors through:
regression_model$residuals
1 2 3 4 5 6
371.90670 -42.95093 -1056.95093 -432.95093 -290.74598 -782.95093
7 8 9 10 11 12
-1037.50351 5353.47890 -1349.37703 -1677.59279 -1255.84964 1705.83433
13 14 15 16 17 18
1336.61216 -2464.45042 -359.10919 -1709.23737 659.90670 1812.40721
19 20 21 22 23 24
3776.40721 -830.95093 764.76433 2349.33382 -273.52211 189.33551
25 26 27 28 29 30
492.76433 1167.33551 752.76433 -2304.59279 -3447.30804 -1416.80754
attr(,"label")
[1] "Weekly sales in USD"
Next, we use the function gghistostats() (also from ggstatsplot (Patil, 2021)) to evaluate the residuals graphically:
regression_model$residuals %>% as_tibble() %>%
ggstatsplot::gghistostats(
data = .,
title = "Histogram of model residuals together with normal curve \nand formal test of mean against the value of 0",
x = value,
type = "parametric",
test.value = 0,
bf.message = FALSE,
k = 3L, #number of digits
binwidth = 500,
subtitle = FALSE
)
We see that the mean of the residuals almost perfectly approaches the value of 0. Likewise, a t-test against the constant 0 indicates that the observed mean is very likely to come from an underlying population in which the true mean is also 0 (the empirical p-value approaches 1). Thus, we do not have problems with biased errors.
5.4.3 Homoscedasticity
Recall, homoscedasticity describes a constant error variance across the range of predicted values of the DV. A violation of this principle is called heteroscedasticity. Usually, market researchers first plot the residuals of the regression model \(e_{i}\) against the fitted values of the dependent variable \(\hat{y}_{i}\) and then search for eye-catching patterns such as funnels, diabolos, or diamonds (see below).
Next, we do so for our example. Below, we first create a new tibble housing the residuals and the fitted values of the regression model. Afterward, we create a scatterplot using ggplot() provided by the package ggplot2 (Wickham, 2016).
fitted_and_residuals <- tibble(y_hat = regression_model$fitted.values, residuals = regression_model$residuals)
regression_heteroscedasticity_plot <- fitted_and_residuals %>%
ggplot(aes(y = residuals, x = y_hat)) +
geom_vline(xintercept = 0, col = "darkgrey") +
geom_hline(yintercept = 0, col = "darkgrey", ) +
geom_point(alpha = 0.5, shape = 21, size = 3.2, colour = "black", fill = "#A4A4A4", stroke = 1) +
labs(y = "Error (residuals)", x = "Fitted values y_hat (weekly sales in $)") +
ggtitle("Scatter plot residuals (e) by fitted values (y_hat)") +
scale_x_continuous(breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
plot.background = element_rect(fill = "lightgrey"),
panel.background = element_rect(fill = "grey98"),
panel.border = element_rect(colour = "black", fill = NA, size = 0.7),
text = element_text(size = 12),
axis.title = element_text(size = 13))
regression_heteroscedasticity_plot
By tendency, we see a funnel shape which is open to the right. This indicates heteroscedasticity.
A formal test is the Breusch-Pagan test (Breusch & Pagan, 1979) in its studentized version (Koenker, 1981), which the package lmtest (Zeileis & Hothorn, 2002) provides as the function bptest().
lmtest::bptest(regression_model)
studentized Breusch-Pagan test
data: regression_model
BP = 9.5337, df = 2, p-value = 0.008507
Drawing on both the visual inspection and the formal test, we conclude that the above regression model violates the assumption of equal error variance (i.e., the presence of heteroscedasticity).
As discussed during the lecture, we should, therefore, not trust the above regression model when interpreting significance tests for its regression coefficients. Potential solutions include (see later applications in this script):
- estimation of robust standard errors (e.g., White, 1980)
- bootstrapping (e.g., Efron & Gong, 1983)
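As a small preview of the later applications mentioned above, re-testing the coefficients with heteroscedasticity-consistent (robust) standard errors could be sketched as follows, using vcovHC() from sandwich and coeftest() from lmtest (both loaded at the beginning); the HC3 variant is one common choice and not necessarily the one used later in this script:
lmtest::coeftest(regression_model, vcov. = sandwich::vcovHC(regression_model, type = "HC3"))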
5.4.4 Absence of autocorrelation
We test for the presence of autocorrelation by means of a Durbin-Watson test. The package car (Fox & Weisberg, 2019) provides the D-W test as the function durbinWatsonTest().
regression_model %>% car::durbinWatsonTest()
lag Autocorrelation D-W Statistic p-value
1 0.08628019 1.805584 0.496
Alternative hypothesis: rho != 0
As the empirical p-value far exceeds the \(\alpha\)-level 5%, the data supports the notion that we have no problems with autocorrelation.
5.4.5 Normally distributed errors (optional)
First, we can review the histogram of residuals again (for code see above):
Next, we build a second one, this time including a normal curve:
regression_model$residuals %>%
as_tibble() %>%
ggplot(data = ., aes(value)) +
geom_histogram(aes(y = after_stat(density)),
fill = "lightgray",
col = "black",
binwidth = 500
) +
stat_function(
fun = dnorm,
args = list(
mean = mean(regression_model$residuals),
sd = sd(regression_model$residuals)
)
)
We already know that the residuals’ mean is not significantly different from 0. The black density curve in the plot furthermore represents a superimposed normal distribution. We see that the histogram follows the curve reasonably well (take into account that our teaching example only has 30 observations). Eye-catching, however, is one observation with a relatively high absolute residual (right side of the plot).
To investigate this further, we apply a Shapiro-Wilk-test (Shapiro & Wilk, 1965) for normality to the obtained residuals of the regression model. For example, the package rstatix (Kassambara, 2023) provides the Shapiro-Wilk-test as the function shapiro_test().
regression_model$residuals %>% rstatix::shapiro_test()
# A tibble: 1 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 . 0.957 0.257
As the empirical p-value in the Shapiro-Wilk-test exceeds the \(\alpha\)-level 5%, the data supports the notion that we have no problems with non-normal errors.
5.5 Step 4: Interpreting the regression results and Goodness of Fit (GoF)
The process of interpreting the regression model’s results typically covers:
Evaluating the statistical significance of the overall model based on the global F-test (lack of significance means that there is little support for the notion that at least one of the independent variables has a linear relationship with the dependent variable)
Evaluating global goodness of fit (GoF) drawing on R² and the adjusted R² (how much information of the dependent variable is explainable by the independent variables)
Evaluating local statistical significance of the regression coefficients through e.g., t-tests (and potentially: dropping of insignificant independent variables)
Interpreting the signs of the regression coefficients (i.e., the direction of the linear association between independent and dependent variable)
Comparing the independent variables’ strength of association with the dependent variable (involving standardized regression coefficients when the independent variables are observed on a different scale)
5.5.1 Global F-test, R², and adjusted R²
Think back to our lecture: when interpreting the regression model, we usually begin with the global F-test and proceed with the calculation of R² as well as the adjusted R².
We can review all of these figures by simply calling:
regression_model %>% summary()
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Residuals:
Min 1Q Median 3Q Max
-3447.3 -1206.1 -282.1 761.8 5353.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30304.05 16837.17 1.800 0.08307 .
own_price -25209.86 15237.86 -1.654 0.10962
promotion 42.27 12.39 3.412 0.00204 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1907 on 27 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5181
F-statistic: 16.59 on 2 and 27 DF, p-value: 2e-05
Usually, we first look at the global F-test to evaluate how probable it is to obtain the observed model when, in the underlying population, all \(\beta_{n}\)s are 0. Based on the empirical p-value, which is almost 0, this is very unlikely. Thus, we continue with the R² value, which is 55.13 percent. According to the common classification (see above), this indicates a substantial model fit. The same is true for the adjusted R²: 51.81 percent. Note that these values perfectly mirror those seen in the lecture.
Before continuing with interpreting the regression model’s coefficients, let us calculate the global F-value, the R², and the adjusted R² manually from the data (we do so to reinforce the calculations we familiarized ourselves with during the lecture). Recall:
\[R^{2}=\frac{SS_{R}}{SS_{E}+SS_{R}}=\frac{SS_{R}}{SS_{T}} \] \[R^{2}_{adj}=1-(1-R^{2})*\frac{n-1}{n-m-1} \] \[F=\frac{R^{2}}{1-R^{2}}*\frac{n-m-1}{m} \] where m is the number of predictor variables (2) and n is the sample size (30). We need to calculate \(SS_{T}\) and \(SS_{R}\) while keeping in mind: \[SS_{T}=SS_{E}+SS_{R}\] These components are calculated in the following way: \[\sum_{i=1}^{n}(y_{i}-\overline{y})^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}+\sum_{i=1}^{n}(\hat{y}_{i}-\overline{y})^{2}\]
First, we calculate \(SS_{T}=\sum_{i=1}^{n}(y_{i}-\overline{y})^{2}\) (the total variation of the observations around the mean)
manual_calculation_tibble <- sales_data_regression %>% select(sales) # new tibble only containing y
#mean of dependent variable y
mean_y <- sales_data_regression %>%
summarise(
mean = mean(sales),
) %>% as.numeric()
#Addends for SST
manual_calculation_tibble <- manual_calculation_tibble %>%
mutate(
y_minus_mean_squared = (sales- mean_y)^2
)
SST <- sum(manual_calculation_tibble$y_minus_mean_squared)
SST
[1] 218804418
Second, we calculate \(SS_{R}=\sum_{i=1}^{n}(\hat{y}_{i}-\overline{y})^{2}\) (the information explained by the regression model)
manual_calculation_tibble$y_hat <- regression_model$fitted.values # add fitted values from the regression model to the tibble
#Addends for SSR
manual_calculation_tibble <- manual_calculation_tibble %>%
mutate(
y_hat_minus_mean_squared = (y_hat - mean_y)^2
)
SSR <- sum(manual_calculation_tibble$y_hat_minus_mean_squared)
SSR
[1] 120631815
To complete the picture, \(SS_{E}\) is given by:
SSE <- SST - SSR
SSE
[1] 98172602
Now, that we have \(SS_{T}\) and \(SS_{R}\), we can calculate R² as:
R_squared <- SSR/SST
R_squared
[1] 0.5513226
Perfectly in line with the value that the summary of lm() provides us with (see above). The adjusted R² is easily obtained by:
adjusted_R_squared <- 1-(1-R_squared)*((30-1)/(30-2-1))
adjusted_R_squared
[1] 0.5180872
Finally, F is given by:
F_value <- ((R_squared)/(1-R_squared))*((30-2-1)/(2))
F_value
[1] 16.58843
5.5.2 Interpretation of regression coefficients
Let us return to the summary of our regression model:
regression_model %>% summary()
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Residuals:
Min 1Q Median 3Q Max
-3447.3 -1206.1 -282.1 761.8 5353.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30304.05 16837.17 1.800 0.08307 .
own_price -25209.86 15237.86 -1.654 0.10962
promotion 42.27 12.39 3.412 0.00204 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1907 on 27 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5181
F-statistic: 16.59 on 2 and 27 DF, p-value: 2e-05
Now that we have evaluated the global F-test as significant and found a substantial model fit, we proceed with testing of the individual regression coefficients.
We see that among the two independent variables, only the variable promotion is significantly different from 0 (column Pr(>|t|), which shows the corresponding p-values from t-tests against the value of 0). Thus, we usually would drop the insignificant variable own_price from the model to estimate a new, more parsimonious model. However, we will keep both independent variables for illustrative purposes. Remember, we have issues with heteroscedasticity (see above). Due to the biased standard errors (column Std. Error), we should not trust the conventional t-tests for the regression coefficients. One way to tackle this problem is to implement robust standard errors (e.g., White, 1980). Another would be to use bootstrapping (e.g., Efron & Gong, 1983).
For example, the package lmtest (Zeileis & Hothorn, 2002) provides the function coeftest(). In combination with the function vcovHC() provided by the package sandwich (Zeileis, Koell, & Graham, 2020), we can implement many different approaches for the estimation of significance tests relying on robust standard errors.
Let us show White’s estimator (White, 1980), which is robust against violations of homoscedasticity (HC in the code below):
regression_model %>% lmtest::coeftest(vcov = vcovHC(x=., type = 'HC'))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30304.054 13825.120 2.1920 0.03718 *
own_price -25209.858 12472.071 -2.0213 0.05326 .
promotion 42.266 13.594 3.1091 0.00439 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that the signs and absolute values of the regression coefficients did not change! In contrast, the standard errors and the associated t-test results did. In the corrected test, the coefficient for promotion remains significant at p < 0.05. Interestingly, the corrected p-value for the coefficient of own_price is now close to \(\alpha\)=5%. If one argues that the directed hypothesis “price has a negative relationship with sales” is justified from an economic perspective (Sablotny-Wackershauser, Lichters, Guhl, Bengart, & Vogt, 2024), one could also argue that we may divide the empirical p-value by 2 to mimic a directed (one-sided) test. Against this background, the price variable also has a significant regression coefficient. It seems as if even the limited extent of heteroscedasticity in our example is enough to produce a suppression effect (see lecture), where the correlation between two IVs suppresses the significant relationship of one of them with the DV in the regression model.
Even if we had concluded that own_price is not a significant predictor of sales, this would not mean that price is unrelated to sales. It only means that price shows no correlation with sales within the range of prices recorded in the sample. More extreme price changes might nevertheless influence sales.
For the sake of completeness, below we repeat this analysis step once with robust confidence intervals (\(\alpha=0.05\) and \(\alpha=0.10\)) instead of t-tests, and once with tests based on standard errors that are robust against heteroscedasticity and autocorrelation simultaneously (Newey & West, 1987).
regression_model %>% lmtest::coefci(level = 0.95, vcov = vcovHC(x=., type = 'HC')) # undirected
2.5 % 97.5 %
(Intercept) 1937.25062 58670.85720
own_price -50800.43323 380.71721
promotion 14.37263 70.15947
regression_model %>% lmtest::coefci(level = 0.90, vcov = vcovHC(x=., type = 'HC')) # directed
5 % 95 %
(Intercept) 6755.88661 53852.22122
own_price -46453.39189 -3966.32413
promotion 19.11085 65.42124
regression_model %>% lmtest::coeftest(vcov = NeweyWest)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30304.054 15821.742 1.9153 0.066097 .
own_price -25209.858 14438.704 -1.7460 0.092183 .
promotion 42.266 12.252 3.4498 0.001858 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
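The second remedy mentioned above, bootstrapping, is not demonstrated in the original output. A minimal sketch of a case (nonparametric) bootstrap for the coefficients could look as follows; the number of replications and all object names are illustrative:
# Case bootstrap: resample stores with replacement and re-estimate the model each time
set.seed(1234)                              # for reproducibility
n_boot <- 2000                              # number of bootstrap replications (illustrative)
boot_coefs <- replicate(n_boot, {
  boot_sample <- sales_data_regression %>%
    dplyr::slice_sample(prop = 1, replace = TRUE)
  coef(lm(sales ~ own_price + promotion, data = boot_sample))
})
# Percentile-based 95% bootstrap confidence intervals per coefficient
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))
If the resulting percentile intervals exclude 0, this corroborates the conclusions drawn from the robust standard errors above.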
Drawing on the above discussion, we leave both IVs in the model. Next, we interpret their direction of association with the DV as well as their relative contribution to explaining variation in the DV. The IV own_price is measured in USD (as is the DV sales). The IV promotion, however, is a complex index combining the level of promotional activities (e.g., product placement, advertising, and coupons) for the milk product. Here are the coefficients again:
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Coefficients:
(Intercept) own_price promotion
30304.05 -25209.86 42.27
We see that in the complete absence of promotional activities and at a price of 0 USD (unrealistic!), supermarkets are estimated to generate 30304.05 USD of weekly sales for the milk product. Further, with each increase of the product price by 1 USD, the model estimates that a supermarket generates 25209.86 USD less. Keep in mind that the average sales in our dataset are 4920.5 USD, whereas the mean price is around 1.08 USD. Therefore, it seems plausible that supermarkets are estimated to generate approximately 30304.05-25209.86 = 5094.19 USD in milk sales when the price increases from 0 to 1 USD. Furthermore, the model estimates that an increase in promotional activities by 1 index point will lead to an increase in sales of 42.27 USD.
When comparing the independent variables’ strength of association with the dependent variable, we should never base our conclusions on the relative differences between p-values! Rather, we should take the differences in the regression coefficients’ absolute values into account. If different IVs are measured in different units/scales, we should compare the standardized regression coefficients instead.
Next, we will evaluate the IVs’ relative contribution to explaining variation in the DV. For this purpose, we need the standardized regression coefficients. Fortunately, many packages, such as lm.beta (Behrendt, 2023), provide suitable functions such as lm.beta(), which we will use below:
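The code chunk producing the following output is not displayed in the rendered document; presumably, it is simply the summary of the lm.beta object, e.g.:
regression_model %>% lm.beta::lm.beta() %>% summary()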
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Residuals:
Min 1Q Median 3Q Max
-3447.3 -1206.1 -282.1 761.8 5353.5
Coefficients:
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) 3.030e+04 NA 1.684e+04 1.800 0.08307 .
own_price -2.521e+04 -2.670e-01 1.524e+04 -1.654 0.10962
promotion 4.227e+01 5.506e-01 1.239e+01 3.412 0.00204 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1907 on 27 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5181
F-statistic: 16.59 on 2 and 27 DF, p-value: 2e-05
regression_model %>% lm.beta::lm.beta()
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression)
Standardized Coefficients::
(Intercept) own_price promotion
NA -0.2669624 0.5506476
Since the absolute value of standardized regression coefficient for promotion is higher than those for own_price, we conclude that promotion activities have a higher contribution in explaining variation in milk sales as compared to price changes.
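As a quick cross-check (not part of the original script), a standardized coefficient can be reproduced manually as the unstandardized coefficient multiplied by the ratio of the standard deviations, \(b \cdot s_{x}/s_{y}\):
# Standardized beta for promotion: b * sd(x) / sd(y)
# as.numeric() strips potential variable labels before computing the SDs
coef(regression_model)["promotion"] *
  sd(as.numeric(sales_data_regression$promotion)) / sd(as.numeric(sales_data_regression$sales))
# should reproduce approximately 0.55 (see the lm.beta output above)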
5.6 Step 5: Validating the regression model
During the lecture, we have learned that validating the regression model includes an assessment of the model’s stability (i.e., do the model coefficients remain stable if one changes the data basis or adds/removes predictor variables?) and predictive validity (does the model predict values of the DV well for observations not used for calibration?). While stability assures us that our interpretation of the regression results for the drawn sample of stores does not hinge on a misspecified model, predictive validity is important when the goal of the market research study is to predict sales for stores not included in the sample. Sometimes, market researchers just try to understand relationships among variables within a given sample; other times, predictive validity is more important because the goal of the study is to find a suitable prediction model.
5.6.1 Evaluating the regression model’s stability
For illustrative purposes, we apply a split-sample validation with a 50:50 split into two subsamples. Remember from the lecture that such an approach involves splitting the data into two groups and re-running the model on each subset. Results are stable if both samples produce similar results (i.e., the signs of the individual parameters should at least be consistent, and significant variables should remain so).
Keep in mind that our model has a great chance of not being stable since each subsample only contains n=15 observations. Further, below, we use the function sample_frac(.50) (dplyr package (Wickham, François, Henry, Müller, & Vaughan, 2023)) to create a random subsample. Thus, results might differ depending on your machine.
sales_data_regression
# A tibble: 30 × 4
supermarket promotion own_price sales
<dbl> <dbl> <dbl> <dbl>
1 1 12.0 1.1 3454
2 2 22.0 1.08 3966
3 3 22.0 1.08 2952
4 4 22.0 1.08 3576
5 5 21.4 1.08 3692
6 6 22.0 1.08 3226
7 7 47.0 1.09 3776
8 8 117. 1.05 14134
9 9 92.0 1.1 5114
10 10 62.0 1.08 4022
# ℹ 20 more rows
# create id variable
sales_data_regression <- sales_data_regression %>% mutate(id = row_number())
# create estimation set
sales_data_regression_subsample_1 <- sales_data_regression %>% sample_frac(.50)
# create validation set
sales_data_regression_subsample_2 <- anti_join(sales_data_regression, sales_data_regression_subsample_1, by = 'id')
# estimate both regression models
model_subsample_1 <- lm(formula = sales ~ own_price + promotion, data = sales_data_regression_subsample_1)
model_subsample_2 <- lm(formula = sales ~ own_price + promotion, data = sales_data_regression_subsample_2)
#compare results
list(model_subsample_1, model_subsample_2) %>%
map(~ summary(object = .))
[[1]]
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression_subsample_1)
Residuals:
Min 1Q Median 3Q Max
-2369.9 -1146.1 -256.3 415.1 4372.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23616.72 24215.64 0.975 0.349
own_price -18911.29 21821.01 -0.867 0.403
promotion 30.80 18.86 1.633 0.128
Residual standard error: 1933 on 12 degrees of freedom
Multiple R-squared: 0.4848, Adjusted R-squared: 0.3989
F-statistic: 5.645 on 2 and 12 DF, p-value: 0.01871
[[2]]
Call:
lm(formula = sales ~ own_price + promotion, data = sales_data_regression_subsample_2)
Residuals:
Min 1Q Median 3Q Max
-2402.6 -943.9 245.3 894.2 2257.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63992.94 18851.56 3.395 0.005323 **
own_price -56483.68 17167.49 -3.290 0.006457 **
promotion 61.45 12.48 4.924 0.000351 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1399 on 12 degrees of freedom
Multiple R-squared: 0.8208, Adjusted R-squared: 0.7909
F-statistic: 27.48 on 2 and 12 DF, p-value: 3.31e-05
5.6.2 Evaluating the regression model’s predictive validity
During the lecture, we have learned about multiple approaches to evaluate predictive validity in linear regression. Below, we will exemplarily apply k-fold cross-validation and leave-one-out validation (i.e., n-fold cross-validation). For example, the package cv (Fox & Monette, 2024) provides a function carrying the package’s name. We will use this function for a simple 5-fold cross-validation first (see lecture slides). However, the package comes with one drawback: it cannot handle variables that carry variable labels. Therefore, below we re-estimate the regression model based on the data without variable labels (function remove_labels() from labelled (Larmarange, 2023)).
Remember, when we judged the within-sample GoF above (e.g., R²), we based our conclusions in great part on \(\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\) (the sum of squares between the fitted values of the regression model (\(\hat{y}\)) and the observed values of the dependent variable (y)). Thus, the lower the sum of squared model residuals, the higher the global GoF.
Below, we apply a similar logic when using 5-fold cross-validation. More precisely, we have to tell the cv() function which measure of misfit to use for out-of-sample predictions when predicting each hold-out fold’s values of the dependent variable. The default is the mean squared error (MSE) of prediction, calculated as \(MSE=\frac{\sum_{j=1}^{k}\sum_{i=1}^{d}(y_{j,i}-\hat{y}_{j,i})^{2}}{n}\), where k is the number of folds and 1…d indexes the cases within a fold. The lower the MSE, the higher the predictive validity of the model. Instead of the MSE, below we use the root mean squared error of prediction (RMSE), given by \(RMSE=\sqrt{\frac{\sum_{j=1}^{k}\sum_{i=1}^{d}(y_{j,i}-\hat{y}_{j,i})^{2}}{n}}\). This measure has the nice property of possessing the same measurement unit as the dependent variable (sales in USD). Also, we set reps = 3, that is, we rerun the 5-fold cross-validation three times to get an impression of how strongly the results depend on the concrete allocation of the data into the five folds.
Note that due to the random elements in this procedure, your results might look slightly different from those below.
5-fold cross-validation
lm(formula = sales ~ own_price + promotion, data = remove_labels(sales_data_regression)) %>% cv::cv(k = 5, criterion = rmse, reps = 3) %>% summary()
R RNG seed set to 569804
R RNG seed set to 669719
R RNG seed set to 919378
Replicate 1:
5-Fold Cross Validation
method: Woodbury
criterion: rmse
cross-validation criterion = 1980.91
full-sample criterion = 1808.983
Replicate 2:
5-Fold Cross Validation
method: Woodbury
criterion: rmse
cross-validation criterion = 2175
full-sample criterion = 1808.983
Replicate 3:
5-Fold Cross Validation
method: Woodbury
criterion: rmse
cross-validation criterion = 2062.092
full-sample criterion = 1808.983
Average:
5-Fold Cross Validation
method: Woodbury
criterion: rmse
cross-validation criterion = 2049.728 (91.8657)
full-sample criterion = 1808.983
Above, we see that, on average, the model predicts the sales of a new store not included in the calibration rather badly: the RMSE is around 2050 USD. Given that the mean sales within the sample are 4920.5 USD, this is a huge number, but not surprising given our tiny sample size. The output portrays the results for each repetition and then the average across repetitions, with the standard deviation of the CV criterion in parentheses. Based on these results, we should not trust the model when making predictions. We also see that the RMSE within the complete sample is much lower. Such an adverse situation, where the out-of-sample error of prediction is much higher than the in-sample error based on the whole sample, is typically called overfitting.
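As a side note (not part of the original output), the full-sample criterion reported above can be reproduced from the sum of squared errors computed in Section 5.5.1, since the in-sample RMSE is simply \(\sqrt{SS_{E}/n}\):
# In-sample RMSE from the residual sum of squares (n = 30)
sqrt(SSE / 30)
# approximately 1808.98, matching the full-sample criterion above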
Leave-one-out (LOO) n-fold cross-validation
lm(formula = sales ~ own_price + promotion, data = remove_labels(sales_data_regression)) %>% cv::cv(k = "loo", criterion = rmse)
cross-validation criterion (rmse) = 2065.1
The results of the LOO cross-validation closely mirror those of the 5-fold cross-validation.
Note that when the main goal of a study is not significance testing but rather prediction, we should rather use cross-validation also to decide which independent variables should remain in the regression model (Hastie, Tibshirani, & Friedman, 2009, p. 246).
For the sake of completeness, below we apply leave-one-out (LOO) cross-validation to compare a model with, and another without, the price included as an independent variable. We first use cv’s models() function to create a list of regression models to compare. Then, we use the cv() function to present the cross-validation results for each list element.
model_list <- cv::models(
regression_price_promotion = lm(formula = sales ~ own_price + promotion, data = remove_labels(sales_data_regression)),
regression_promotion = lm(formula = sales ~ promotion, data = remove_labels(sales_data_regression))
)
model_list %>% cv::cv(data = remove_labels(sales_data_regression),
k = "loo",
criterion = rmse)
Model regression_price_promotion:
cross-validation criterion = 2065.1
Model regression_promotion:
cross-validation criterion = 2093.559
The out-of-sample prediction error (RMSE) is slightly lower for the model with both predictors included (promotion and price). This result delivers additional support for the decision to leave the independent variable own_price in the regression model.
5.7 Step 6: Using the regression model
Using the regression model means prediction! This is easy in R. Let us take the lecture example.
Prediction for the price set at 1.10 USD and promotional activities at 50
predictions <- regression_model %>% predict.lm(
newdata = data.frame(promotion = 50, own_price = 1.10),
interval = "prediction",
level = 0.95
)
predictions
fit lwr upr
1 4686.512 658.6533 8714.372
Given the above values for the independent variables, the regression model estimates that the supermarket’s weekly milk sales will be approximately 4686.51 USD. However, please pay attention to the extraordinarily wide 95% prediction interval for this estimate. This showcases how much uncertainty we face when using regression models based on tiny sample sizes for predictions.
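As a plausibility check (not in the original script), the point prediction can be reproduced manually from the estimated coefficients (the object name coefs is illustrative):
# y_hat = b0 + b1 * own_price + b2 * promotion
coefs <- coef(regression_model)
unname(coefs["(Intercept)"] + coefs["own_price"] * 1.10 + coefs["promotion"] * 50)
# approximately 4686.5, matching the fit column above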
6 Factor Analysis and Principal Component Analysis
In the following section, we reproduce the principal component analysis (PCA) example from the lecture within R. The dataset stems from Sarstedt, Ringle, Raithel, & Gudergan (2014) and is also used as an introductory example in Chapter 8 in Sarstedt & Mooi (2019).
Fan satisfaction represents a vital management goal of any (professional) soccer club. Empirical studies often consider only the role of individual aspects of fan satisfaction for target variables. Therefore, a soccer club commissioned a questionnaire comprising 99 items previously identified through literature databases and focus groups of fans (high-dimensional data), aiming at the development of the FANSAT scale for the measurement and prediction of soccer fan behavior. We use only 7 of the items (as was done during the lecture). As in the lecture, we also use the term factor to refer to principal components (when conducting PCA).
Question (PCA): Which umbrella terms can we use to summarize the sets of variables that load highly on the extracted principal components? Question (factor analysis, aka principal axis analysis, FA): What is the underlying construct (common reason) for the strong correlations between the variables and their corresponding factors?
In total, we will go through the steps that we have discussed during the lecture:
- Checking the data requirements and conducting preliminary analyses
- Extracting the factors
- Determining the number of factors to retain
- Interpreting the (rotated) factor solution
- Evaluating the goodness-of-fit of the factor solution
- Computing the factor scores (optional)
However, we will only focus on the things that incorporate applications in R. Therefore, we will not repeat rules-of-thumb (for example, regarding sample sizes, etc.).
Let us download the data from our Chair’s homepage at the Otto-von-Guericke-University.
6.1 Import the data for Principal Component Analysis / Factor Analysis
The dataset is stored as a .rds file within the following directory: https://marketing.ovgu.de/marketing_media/Downloads/MMA/Soccer.rds
The dataset’s name is Soccer.rds.
We first set the url (Uniform Resource Locator) for the download.
url_of_file <- "https://marketing.ovgu.de/marketing_media/Downloads/MMA/Soccer.rds"
Then, we download the file to our current R project.
If anything does not work out as expected on your machine, you can download the data by clicking on the following button.
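The download and import chunk itself is not displayed; a minimal sketch, assuming the file is read into an object named Soccer_data_PCA_FA (the name used below), could be:
# Download the .rds file into the project folder and read it into R
download.file(url = url_of_file, destfile = "Soccer.rds", mode = "wb")
Soccer_data_PCA_FA <- readRDS("Soccer.rds")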
Let’s have a look at the dataset:
print(Soccer_data_PCA_FA, n=25)
# A tibble: 251 × 7
x_1 x_2 x_3 x_4 x_5 x_6 x_7
<dbl+lbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l> <dbl+lb> <dbl+l>
1 5 5 5 6 5 5 5
2 7 [very dissatisfied] 7 [very dissa… 7 [ver… 7 [ver… 7 [ver… NA 1 [ver…
3 5 5 4 2 1 [ver… NA 6
4 6 6 4 3 2 NA 6
5 7 [very dissatisfied] 7 [very dissa… 7 [ver… 5 7 [ver… 4 5
6 3 3 3 2 2 4 6
7 7 [very dissatisfied] 7 [very dissa… 6 5 5 4 4
8 7 [very dissatisfied] 7 [very dissa… 4 1 [ver… 1 [ver… NA 7 [ver…
9 3 4 4 7 [ver… 6 NA 4
10 7 [very dissatisfied] 5 4 4 6 4 3
11 7 [very dissatisfied] 7 [very dissa… 7 [ver… 5 5 NA 4
12 7 [very dissatisfied] 7 [very dissa… 6 4 6 6 4
13 7 [very dissatisfied] 7 [very dissa… 7 [ver… 7 [ver… 7 [ver… 6 6
14 7 [very dissatisfied] 7 [very dissa… 7 [ver… 6 6 6 6
15 7 [very dissatisfied] 7 [very dissa… 6 6 6 7 [ver… 7 [ver…
16 6 6 5 4 4 3 4
17 7 [very dissatisfied] 6 5 3 1 [ver… 4 4
18 7 [very dissatisfied] 7 [very dissa… 7 [ver… 7 [ver… 7 [ver… 2 3
19 7 [very dissatisfied] 7 [very dissa… 6 5 6 6 5
20 7 [very dissatisfied] 3 3 5 6 4 4
21 4 4 4 5 7 [ver… 4 6
22 6 6 6 2 4 NA 5
23 6 6 5 1 [ver… 1 [ver… 4 3
24 7 [very dissatisfied] 7 [very dissa… 7 [ver… 3 2 NA 5
25 7 [very dissatisfied] 7 [very dissa… 7 [ver… 5 6 4 3
# ℹ 226 more rows
We see that we have 251 cases and 7 variables (x_1 to x_7). Unfortunately, we also spot missing values (NA) in the data. These missing values are not non-responses in the questionnaire. Rather, they represent respondents’ “don’t know” reactions. We can briefly summarize the variable and value labels, as well as the number of missing values for each variable, by using the look_for() function from the package labelled (Larmarange, 2023).
Soccer_data_PCA_FA %>% labelled::look_for()
pos variable label col_type missing
1 x_1 Condition of the stadium dbl+lbl 2
2 x_2 Outer appearance of the sta~ dbl+lbl 2
3 x_3 Interior design of the stad~ dbl+lbl 3
4 x_4 Quality of the team composi~ dbl+lbl 5
5 x_5 Number of stars in the team dbl+lbl 12
6 x_6 Price of annual season tick~ dbl+lbl 47
7 x_7 Entry fees dbl+lbl 3
values
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
[1] very satisfied
[7] very dissatisfied
In addition to the labels, we see that the variable x_6 (satisfaction with the price of the annual season ticket) has 47 missing values. After all, respondents cannot judge this aspect of fan satisfaction if they do not buy the annual ticket. Why is this important? PCA/FA uses only the complete cases for its calculations. Thus, the effective sample size in the following steps will not be n=251. More importantly, it will also not be 251-47=204, since a case with no missing values in x_6 can have missing values elsewhere. To get an overview, we subsequently analyze the structure of missing data using functions provided by the package naniar (Tierney & Cook, 2023).
Soccer_data_PCA_FA %>%
miss_case_table()
# A tibble: 6 × 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 195 77.7
2 1 47 18.7
3 2 5 1.99
4 3 1 0.398
5 4 1 0.398
6 5 2 0.797
The effective sample size in the PCA/FA will be n=195.
6.2 Step 1: Checking the data requirements and conducting preliminary analyses
Question: Are the variables sufficiently correlated with each other?
We will approach the answer to this question by using multiple analyses.
6.2.1 Correlation matrix and tests for significance
We begin with the pairwise Pearson correlations (r) and the corresponding tests of whether these pairwise correlations differ significantly from 0. As in other sections of this script, we visualize the correlation matrix with the help of the function ggcorrmat() provided by the package ggstatsplot (Patil, 2021). Note that for the following analysis steps, we only use the complete cases, dropping the incomplete ones. We do so by applying the function drop_na() (rstatix (Kassambara, 2023)).
Soccer_data_PCA_FA %>% rstatix::drop_na() %>% ggcorrmat(k=3L)
Note that the values perfectly mirror those in the lecture (IBM SPSS outputs). All pairwise correlations differ significantly from 0 (none of the matrix elements is crossed out). Furthermore, the colors reveal some interesting patterns: x_1, x_2, and x_3 are highly correlated with each other. The same is true for the pairs x_4 & x_5 and x_6 & x_7.
6.2.2 Kaiser-Meyer-Olkin (KMO) criterion/ Measure of Sampling Adequacy (MSA)
Next, we use the Kaiser–Meyer–Olkin (KMO) statistic. The KMO statistic, also called the measure of sampling adequacy (MSA), indicates whether the other variables in the dataset can explain the correlations between the variables. Kaiser (1974) introduced the statistic and recommended the following threshold values for the KMO/MSA:
KMO/MSA | Interpretation |
---|---|
\(\geq\) 0.90 | excellent |
\(\geq\) 0.80 | deserving |
\(\geq\) 0.70 | quite good |
\(\geq\) 0.60 | moderate |
\(\geq\) 0.50 | miserable |
< 0.50 | unbearable |
Thus, the global KMO/MSA value should exceed 0.5. We should consider deleting variables with local KMO/MSA values below 0.5.
Below, we use the function KMO() provided by the psych package to ask for the MSA values. In general, the psych package provides many very useful functions for conducting PCAs/FAs.
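The corresponding code chunk is not displayed in the rendered output; judging from the Call line below, it presumably looks like this (assuming the same drop_na() pattern as above):
Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::KMO()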
Kaiser-Meyer-Olkin factor adequacy
Call: psych::KMO(r = .)
Overall MSA = 0.72
MSA for each item =
x_1 x_2 x_3 x_4 x_5 x_6 x_7
0.79 0.80 0.82 0.67 0.64 0.65 0.61
With a global KMO/MSA value of 0.72, the data shows ‘quite good’ adequacy for PCA/FA. Also, when looking at the local values, the minimum value is 0.61 (x_7), indicating at least ‘moderate’ adequacy for each of the seven questions.
6.2.3 Bartlett’s test of sphericity
Finally, we conduct Bartlett’s test of sphericity (Bartlett, 1950). Keep in mind that if the empirical p-value is not significant (i.e., the data are likely to come from a population with zero correlations between all variables), the appropriateness of factor analysis should be questioned.
The psych package also provides a function for the Bartlett test:
Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::cortest.bartlett()
R was not square, finding R from data
$chisq
[1] 809.3528
$p.value
[1] 9.435409e-158
$df
[1] 21
Since the empirical p-value is almost 0, the data is very unlikely to stem from a population where all pairwise correlations are 0. Thus, the data seems to be appropriate for PCA/FA.
Furthermore, the results of the KMO statistic and the Bartlett test are well-aligned with the IBM SPSS outputs we saw during our lecture.
6.3 Step 2: Extracting the factors and Step 3: Determining the number of factors to retain
We will complete both steps together. First, we decide how many factors to extract. Second, we extract only the recommended number of factors.
Kaiser criterion and Scree plot
The function scree() is also provided by the psych package. It delivers both pieces of information: (a) how many eigenvalues are > 1 in the PCA (Kaiser’s criterion (Kaiser & Dickman, 1959)), and (b) the scree plot, hopefully indicating an “elbow”. Note that the function, by default, provides the information for principal component analysis (PCA) and factor analysis (FA). In the latter case, the eigenvalues are generally lower since in FA (vs. PCA), not all information contained in the original variables is assumed to be reproduced by the factor solution (i.e., the sum of communalities does not equal the number of variables).
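The code chunk producing the scree plot is not shown in the rendered output; presumably, it is simply (again dropping incomplete cases first):
Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::scree()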
We will interpret the PCA results in the figure (PC). Here, we see that according to the Kaiser criterion, we should retain three factors. The first pronounced elbow appears at four factors. Thus, with the elbow criterion, we should extract one factor less (three factors in total) because the eigenvalue to the left of the elbow is much higher than the eigenvalue of the fourth factor.
Parallel Analysis
Next, we proceed with parallel analysis (Horn, 1965), the recommended approach for determining the number of factors to retain. In parallel analysis, we extract as many factors as there are eigenvalues greater than the corresponding 95th percentile of eigenvalues from randomly generated datasets with the same sample size and number of variables as the original dataset. The psych package comes with a corresponding function (fa.parallel()).
Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::fa.parallel(quant = 0.95, error.bars = TRUE)
Parallel analysis suggests that the number of factors = 3 and the number of components = 3
The randomly generated datasets come in two forms: randomly generated uncorrelated data following a normal distribution, or resampled data (the two separate red lines in the plot above indicate the mean eigenvalues for both random generation strategies). In the present case, the parallel analysis suggests retaining 3 factors (for both PCA and FA).
Note that due to the small dataset and the random elements involved in parallel analysis, your results might look slightly different. Further note that the error bars in the plot represent one standard deviation and not the 95th percentile of the randomly drawn data. To apply the 95th-percentile rule, we set quant = 0.95 in the above code chunk, which is not reflected in the plot.
Extraction of unrotated solution
In the subsequent step, we extract the factor solution. We remember from the lecture that we never interpret the unrotated factor solution. However, for illustration, we will do so next. For factor extraction, we draw on psych’s principal() function.
PCA_results_unrotated <- Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::principal(
rotate = "none",
nfactors = 3
)
PCA_results_unrotated
Principal Components Analysis
Call: psych::principal(r = ., nfactors = 3, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 h2 u2 com
x_1 0.79 -0.40 0.25 0.85 0.150 1.7
x_2 0.81 -0.37 0.23 0.85 0.148 1.6
x_3 0.81 -0.33 0.26 0.83 0.168 1.5
x_4 0.75 0.08 -0.59 0.91 0.088 1.9
x_5 0.68 0.14 -0.66 0.92 0.077 2.1
x_6 0.56 0.69 0.25 0.84 0.157 2.2
x_7 0.49 0.72 0.33 0.86 0.142 2.2
PC1 PC2 PC3
SS loadings 3.52 1.42 1.14
Proportion Var 0.50 0.20 0.16
Cumulative Var 0.50 0.71 0.87
Proportion Explained 0.58 0.23 0.19
Cumulative Proportion 0.58 0.81 1.00
Mean item complexity = 1.9
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.05
with the empirical chi square 19.25 with prob < 0.00024
Fit based upon off diagonal values = 0.99
The function’s results provide us with a lot of useful information. We will focus on the essential parts, which we also discussed during the lecture. The first table in the output shows the matrix of factor loadings (columns PC1 to PC3; PC stands for principal component). Keep in mind that we never interpret the unrotated loadings. Also, note that the order of rows (the items) is not the same as in the IBM SPSS output we have seen during the lecture.
The numerical results indicate that the first three components together account for 87% of the variance of the initial items (column PC3, row Cumulative Var), which is clearly above the lower threshold of 0.5 and also above the recommended value of at least 0.75.
If we look at the item-specific communalities (h2), we see that the solution does exceed the common threshold of 50% for each of the items, which is satisfying (h2 is simply the sum of squared factor loadings of a corresponding item). Since the communalities are the item-specific information accounted for by the factor solution, their sum divided by the number of items is again 87% ([0.85+0.85+0.83+0.91+0.92+0.84+0.86]/7=0.866).
The column u2 presents the corresponding uniqueness values. Keep in mind that 1 - communality = uniqueness. Thus, for x_1, 15 percent of the original variable’s information is not accounted for by the three extracted factors.
Finally, the column com lists the items’ complexity (Hofmann, 1977, 1978). This value always lies between 1 and the number of factors extracted. If an item represents only one latent construct, then the value in this column is close to one. Higher values indicate that assigning an item to only one factor is challenging. Values higher than two suggest that an item shares more information with other factors than with the one it is assigned to based on the highest factor loading. We also do not interpret these values for the unrotated factor solution.
6.4 Step 4: Interpreting the (rotated) factor solution
To enhance the interpretability of the factor solution, we next apply a factor rotation scheme to the unrotated matrix of factor loadings. As we have seen during the lecture, rotating a factor solution in an orthogonal way preserves the factors’ independence (they remain uncorrelated). Furthermore, it does not change the total information accounted for by the factor solution. Instead, some information explained by the first few factors in the unrotated solution shifts toward the later factors. Mathematically, rotating is simply a matrix multiplication in which a rotation matrix is multiplied from the right onto the unrotated factor loadings matrix. During the lecture, we have learned that the most often applied orthogonal rotation is the Varimax rotation (Kaiser, 1958).
Below, we will use this rotation:
PCA_results_varimax <- Soccer_data_PCA_FA %>% rstatix::drop_na() %>% psych::principal(
rotate = "varimax",
nfactors = 3
)
PCA_results_varimax
Principal Components Analysis
Call: psych::principal(r = ., nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC3 RC2 h2 u2 com
x_1 0.90 0.17 0.08 0.85 0.150 1.1
x_2 0.89 0.20 0.11 0.85 0.148 1.1
x_3 0.88 0.18 0.15 0.83 0.168 1.1
x_4 0.28 0.90 0.14 0.91 0.088 1.2
x_5 0.17 0.94 0.13 0.92 0.077 1.1
x_6 0.14 0.18 0.89 0.84 0.157 1.1
x_7 0.10 0.08 0.92 0.86 0.142 1.0
RC1 RC3 RC2
SS loadings 2.53 1.83 1.71
Proportion Var 0.36 0.26 0.24
Cumulative Var 0.36 0.62 0.87
Proportion Explained 0.42 0.30 0.28
Cumulative Proportion 0.42 0.72 1.00
Mean item complexity = 1.1
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.05
with the empirical chi square 19.25 with prob < 0.00024
Fit based upon off diagonal values = 0.99
Notably, through the Varimax rotation, many results changed. When focusing on the matrix of factor loadings, we see a much clearer picture (RC now stands for rotated component). The variables x_1 to x_3 only possess high loadings on the first factor. The information in x_4 and x_5 is best summarized by the second factor, whereas the third factor mainly explains x_6 and x_7. Note that the rotation did not change anything in the communalities or the uniqueness values. However, in general, the explained information shifts from the first to the later factors when comparing the unrotated to the rotated solution (row Proportion Var). The first two factors now contain 62% of the original items’ information. Furthermore, since the assignment of items to factors is now clearer, the item-specific complexity (com) values as well as the mean complexity are now approaching 1. Overall, the factor solution explains enough information of the original questions while being easy to interpret.
For completeness, let us take a closer look at the rotation matrix involved in the above Varimax rotation. Remember, rotation in factor analysis means multiplying a rotation matrix from the right onto the unrotated matrix of factor loadings. Therefore, we are searching for the rotation matrix V that satisfies: \[ L'=L\times V \] Here, L’ stands for the rotated matrix of loadings, while L is the unrotated matrix of factor loadings. We know L’ and L from the results above.
L <- PCA_results_unrotated$loadings %>% unclass()
L_rotated <- PCA_results_varimax$loadings %>% unclass()
L
PC1 PC2 PC3
x_1 0.7921634 -0.39808163 0.2531100
x_2 0.8142628 -0.36936331 0.2301943
x_3 0.8127470 -0.32629723 0.2555649
x_4 0.7501428 0.07826233 -0.5855445
x_5 0.6802674 0.13532182 -0.6649990
x_6 0.5552329 0.68795614 0.2477984
x_7 0.4851842 0.71841318 0.3264339
L_rotated
RC1 RC3 RC2
x_1 0.9030880 0.16663668 0.08198102
x_2 0.8948374 0.20117132 0.10601309
x_3 0.8806297 0.18450185 0.15096776
x_4 0.2801160 0.90168908 0.14210380
x_5 0.1658047 0.93731104 0.13136293
x_6 0.1397769 0.17702581 0.88999751
x_7 0.1028291 0.07749895 0.91733332
In R, we can solve such a non-square matrix equation by using the function qr.solve() (base R).
V <- qr.solve(a = L, b = L_rotated)
V
RC1 RC3 RC2
PC1 0.7414698 0.5385021 0.4002975
PC2 -0.5391272 0.1229690 0.8331990
PC3 0.3994552 -0.8336031 0.3814989
V above is the rotation matrix that was used in the Varimax rotation. To complete the picture, we can recalculate the rotated loadings matrix L’ by means of matrix multiplication:
L%*%V
RC1 RC3 RC2
x_1 0.9030880 0.16663668 0.08198102
x_2 0.8948374 0.20117132 0.10601309
x_3 0.8806297 0.18450185 0.15096776
x_4 0.2801160 0.90168908 0.14210380
x_5 0.1658047 0.93731104 0.13136293
x_6 0.1397769 0.17702581 0.88999751
x_7 0.1028291 0.07749895 0.91733332
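As a small additional check (not part of the original script), we can verify that V is indeed an orthogonal rotation matrix, i.e., multiplying V by its transpose yields the identity matrix:
# For an orthogonal rotation matrix, V %*% t(V) equals the identity matrix (up to rounding)
round(V %*% t(V), digits = 10)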
The psych package also provides a graphical path diagram to guide market researchers in interpreting the final factor solution (function fa.diagram()). Usually, when interpreting the factor solution, researchers apply the rule of thumb of suppressing every factor loading (and cross-loading) that falls below an absolute value of 0.5. Therefore, we set cut = 0.5 in the code chunk below.
psych::fa.diagram(PCA_results_varimax,
digits = 3,
simple = FALSE,
cut = 0.5,
sort = TRUE,
adj = 3
)
The above path diagram reinforces the interpretation of the factor solutions. The first factor summarizes fans’ satisfaction with the stadium, the second the satisfaction with the team, and the third the satisfaction with the ticket prices.
In more complex situations, these path diagrams come in very handy to identify unwanted cross-loadings between factors (red dashed lines below). To illustrate this, consider the path diagram for the unrotated solution.
psych::fa.diagram(PCA_results_unrotated,
digits = 3,
simple = FALSE,
cut = 0.50,
sort = TRUE,
adj = 3
)
The 50% cutoff criterion is a rough guideline. Keep in mind that the factor loadings are Pearson correlations between an item and the corresponding factor. Therefore, in an older edition of their textbook, Hair et al. (2006, p. 128) came up with the idea of interpreting loadings (and cross-loadings) only when they are significantly different from 0 at \(\alpha=0.05\) and \(\beta=0.20\) (i.e., a statistical power of 80%).
Below, we use the function pwrss.z.corr() from the pwrss package (Bulus, 2023) to produce a table presenting the sample sizes necessary for different (absolute) loading sizes to be considered significant:
necessary_n <- tibble(sample_size = numeric())
i <- 0.15
while (i <= 0.95) {
temp <- pwrss.z.corr(
n = NULL,
r = i,
alpha = 0.05,
power = 0.80,
alternative = "not equal",
verbose = FALSE
)
values <- list(loading_size = as.character(i), sample_size = round(temp$n, digits = 0))
necessary_n <- rbind(necessary_n, values)
i <- i + 0.05
}
necessary_n
loading_size sample_size
1 0.15 347
2 0.2 194
3 0.25 123
4 0.3 85
5 0.35 62
6 0.4 47
7 0.45 36
8 0.5 29
9 0.55 24
10 0.6 19
11 0.65 16
12 0.7 13
13 0.75 11
14 0.8 10
15 0.85 8
16 0.9 7
Since our effective sample size is n=195, we could have also used 0.20 as the absolute cutoff when interpreting factor loadings and cross-loadings. One sees that these guidelines are most beneficial when operating with relatively small sample sizes. This is because, with large sample sizes, even very low loadings quickly become significant.
Alternatively, we can apply the function pwr.r.test() from the pwr package (Champely, 2020) to calculate the actual threshold for significance given a specific sample size.
pwr::pwr.r.test(
n = 195,
r = NULL,
sig.level = 0.05,
power = 0.80
)
approximate correlation power calculation (arctangh transformation)
n = 195
r = 0.1990192
sig.level = 0.05
power = 0.8
alternative = two.sided
6.5 Step 5: Evaluating the goodness-of-fit of the factor solution
As we have discussed during lecture, we will focus on four sources of information to evaluate the factor solution:
- The total variance explained
- The congruence of the initial and reproduced correlations
- Variables’ communality/ uniqueness
- The measure of item complexity / total complexity
We have seen most of these things in our analyses above:
Principal Components Analysis
Call: psych::principal(r = ., nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC3 RC2 h2 u2 com
x_1 0.90 0.17 0.08 0.85 0.150 1.1
x_2 0.89 0.20 0.11 0.85 0.148 1.1
x_3 0.88 0.18 0.15 0.83 0.168 1.1
x_4 0.28 0.90 0.14 0.91 0.088 1.2
x_5 0.17 0.94 0.13 0.92 0.077 1.1
x_6 0.14 0.18 0.89 0.84 0.157 1.1
x_7 0.10 0.08 0.92 0.86 0.142 1.0
RC1 RC3 RC2
SS loadings 2.53 1.83 1.71
Proportion Var 0.36 0.26 0.24
Cumulative Var 0.36 0.62 0.87
Proportion Explained 0.42 0.30 0.28
Cumulative Proportion 0.42 0.72 1.00
Mean item complexity = 1.1
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.05
with the empirical chi square 19.25 with prob < 0.00024
Fit based upon off diagonal values = 0.99
The total variance explained is clearly above the threshold of 75% (see 0.87 in the row Cumulative Var). The variables’ communality values (h2) clearly exceed 0.50, and the item-specific complexities approach 1 (as does the mean complexity). Note that these values are well aligned with the values we have discussed during the lecture.
Let us focus on the correlation residuals (below, the function lowerMat() also comes from the psych package).
PCA_results_varimax$residual %>% psych::lowerMat(digits = 3)
x_1 x_2 x_3 x_4 x_5 x_6 x_7
x_1 0.150
x_2 -0.067 0.148
x_3 -0.085 -0.080 0.168
x_4 -0.020 -0.002 0.010 0.088
x_5 0.021 -0.001 -0.010 -0.081 0.077
x_6 0.017 -0.017 -0.003 -0.014 -0.003 0.157
x_7 -0.009 0.019 -0.008 0.012 0.004 -0.149 0.142
The output shows us the lower off-diagonal part of the matrix of residuals, i.e., the differences between the original correlations and the correlations reproduced by the factor solution. Note that the complete matrix is symmetric; thus, we only have to focus on the lower part. The diagonal elements again present the item-specific uniqueness values. We now have to count all off-diagonal elements that exceed an absolute value of 0.05 to see whether more than 50% do so.
residuals <- PCA_results_varimax$residual %>% psych::lowerMat(digits = 3)
x_1 x_2 x_3 x_4 x_5 x_6 x_7
x_1 0.150
x_2 -0.067 0.148
x_3 -0.085 -0.080 0.168
x_4 -0.020 -0.002 0.010 0.088
x_5 0.021 -0.001 -0.010 -0.081 0.077
x_6 0.017 -0.017 -0.003 -0.014 -0.003 0.157
x_7 -0.009 0.019 -0.008 0.012 0.004 -0.149 0.142
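Instead of counting by eye, we can compute the share directly (a small helper snippet, not part of the original script; the object name res is illustrative):
# Share of off-diagonal correlation residuals with an absolute value above 0.05
res <- PCA_results_varimax$residual
mean(abs(res[lower.tri(res)]) > 0.05)
# roughly 24% here, i.e., clearly fewer than half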
We see that fewer than 50% of the correlation residuals have an absolute value higher than 0.05. The original correlations are therefore reproduced with sufficient precision.
6.6 Step 6: Computing the factor scores
Computing factor scores is very easy. Actually, they were already calculated when we extracted the factor solution. The scores are saved with the results:
PCA_results_varimax$scores %>% head()
RC1 RC3 RC2
[1,] -0.339646371 0.55154286 0.45810123
[2,] 0.847307756 0.61906876 -0.04454389
[3,] -1.189578985 -1.46718026 0.89315875
[4,] 0.835139521 0.08730112 -0.35566928
[5,] -0.001201816 0.34022781 -0.67542413
[6,] 0.732337861 0.04429829 0.32984885
The factor scores in this example are standardized. Thus, they have a mean of 0 and an SD of 1. The interpretation of the above output is: Fan 1 has a higher-than-average score on factors 2 [team] and 3 [price] (i.e., a factor score > 0) and a lower-than-average score on factor 1 [satisfaction with the stadium] (i.e., a factor score < 0).
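A quick check (not part of the original output) confirms this standardization:
# Column means (approximately 0) and standard deviations (approximately 1) of the factor scores
round(colMeans(PCA_results_varimax$scores), digits = 10)
apply(PCA_results_varimax$scores, 2, sd)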
Keep in mind that in factor analysis, there exist several methods for calculating factor scores. However, this goes beyond the scope of our introduction.
As noted during the lecture, market researchers often use the factor scores to visualize the cases (fans in this case) in low-dimensional space. We will do so next by using psych’s biplot.psych() function.
PCA_results_varimax %>% psych::biplot.psych(x = .)
Since we have a factor solution with 3 factors, the resulting graphics are rather complex. The histograms summarize the distribution of factor scores across all consumers. For example, the histogram in the upper-left position indicates that many consumers possess above-average satisfaction with the stadium (factor 1), whereas others are extremely dissatisfied with this component. The plot in the lower-left position portrays the biplot between the factor scores of factor 1 (stadium, abscissa) and factor 2 (team, ordinate). Each point represents one case in the analysis. The red vectors indicate the variables. For example, one could interpret that few consumers are extremely satisfied with the stadium (high values on the abscissa), independently of their satisfaction level with the team (ordinate).
To get a clearer picture, we could also only ask for the lower-left plot.
PCA_results_varimax %>% psych::biplot.psych(x = .,
main="Biplot from PCA",
choose = c("RC1", "RC2"))
In this larger view, we can also argue that the variables x_4 and x_5 are not well explained by the first two rotated components, as their absolute loadings are rather small (equivalent to saying that their distance from 0 is small).
If the number of consumers in an analysis gets larger, it is often useful to switch to a heatmap-like visualization that helps to assess where the mass of consumers lies in the plot.
PCA_results_varimax %>% psych::biplot.psych(x = ., smoother = TRUE,
main="Biplot from PCA",
choose = c("RC1", "RC2"))
7 Basic Cluster Analysis
In the following section, we start reproducing the cluster example from the lecture within R. The example is modified from Table 9.1 of Chapter 9 in Sarstedt & Mooi (2019).
Two variables were measured for seven customers on a scale from 0 to 100, with higher values denoting a higher degree of price consciousness and brand loyalty.
Question: How can we segment a market based on customers’ price consciousness (x) and brand loyalty (y)?
In total, we will go through the 5 steps that we have discussed during the lecture:
- Selecting the clustering variables (already done)
- Selecting the clustering procedure (we will apply hierarchical clustering and partitioning k-means for illustration)
- Conducting clustering methods
- Deciding on the number of clusters to retain
- Validating and interpreting the cluster solution
However, we will only focus on the things that incorporate applications in R. Therefore, we will not repeat rules-of-thumb (for example, regarding sample sizes, etc.).
First, we will establish the data in R.
7.1 Creating the data in R and doing some basic data analyses to recap lecture content
Brand_Loyalty_Price_Consciousness <- tibble(
customer = LETTERS[1:7],
x = c(33, 82, 66, 30, 79, 50, 10),
y = c(95, 94, 80, 67, 60, 33, 17)
)
labelled::var_label(Brand_Loyalty_Price_Consciousness) <- list(
customer = "Customer",
x = "Price consciousness",
y = "Brand loyalty"
)
print(Brand_Loyalty_Price_Consciousness)
# A tibble: 7 × 3
customer x y
<chr> <dbl> <dbl>
1 A 33 95
2 B 82 94
3 C 66 80
4 D 30 67
5 E 79 60
6 F 50 33
7 G 10 17
Keep in mind that we usually would evaluate whether our clustering variables show high (\(\ge0.90\)) pairwise correlations.
Brand_Loyalty_Price_Consciousness %>% select(x, y) %>% cor()
x y
x 1.0000000 0.4944447
y 0.4944447 1.0000000
In the present case, we have a somewhat “moderate” correlation (r=0.49) between both clustering variables (Cohen, 1988, p. 79). Usually, we handle more than two variables. In that case, we can use ggstatsplot::ggcorrmat() to get an impression of the pairwise correlations (see Section 6.2.1).
With only two variables in the example, it is easy to portray the data in a two-dimensional scatterplot
scatterplot_basic_cluster_analysis <- ggplot(Brand_Loyalty_Price_Consciousness, aes(x = x, y = y)) +
geom_point(alpha = 0.5, shape = 21, size = 2.7, colour = "black", fill = "#A4A4A4", stroke = 1) +
geom_hline(yintercept = 0, col = "darkgrey") +
geom_vline(xintercept = 0, col = "darkgrey") +
labs(y = "Brand loyalty (y)", x = "Price consciousness (x)") +
coord_fixed() +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 100, 10), limits = c(0, 105), minor_breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 105), minor_breaks = NULL) +
geom_text(aes(label = customer), hjust = -0.1, vjust = -0.85) +
theme(
plot.background = element_rect(fill = "lightgrey"),
panel.background = element_rect(fill = "grey98"),
panel.border = element_rect(colour = "black", fill = NA, linewidth = 0.7),
text = element_text(size = 12)
)
scatterplot_basic_cluster_analysis
Keep in mind that with the graphical representation above, it is also easy to identify customer G as a multivariate outlier. Likewise, it appears as if B, C, and E form a relatively homogeneous group, while A and D represent another one.
We will first calculate the distance matrix between the seven customers. Think back to what we have learned during the lecture: there are several distance measures available.
We will use the Euclidean distance (\(L_{2}\)-norm) for illustration purposes, which is defined between two objects A and B over k clustering variables as \[\sqrt{\sum_{i=1}^{k}(A_{i}-B_{i})^{2}}\]
(\(A_{i}\) represents A’s coordinate on the \(i^{th}\) variable).
For this purpose, we use the function dist(), which comes with base R. Since any distance matrix is symmetric, let us focus only on the lower-diagonal part (below, the function lowerMat() comes from the psych package).
distance_matrix <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
dist(method = "euclidean") %>%
as.matrix()
rownames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
colnames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
distance_matrix %>% psych::lowerMat(digits = 2)
A B C D E F G
A 0.00
B 49.01 0.00
C 36.25 21.26 0.00
D 28.16 58.59 38.28 0.00
E 57.80 34.13 23.85 49.50 0.00
F 64.29 68.88 49.65 39.45 39.62 0.00
G 81.32 105.42 84.29 53.85 81.30 43.08 0.00
As with the graphical representation above, we see that G has a relatively high distance to all other objects.
Often, market researchers use data visualization techniques to gain better insights into distance/similarity matrices when research problems become larger. Below, we use the heatmap() function, which comes with base R, to do so.
distance_matrix %>% heatmap(scale="none", revC=TRUE)
With the heatmap representation, understanding the dissimilarity between the seven customers becomes straightforward. The more the color shifts towards red, the more distant the two objects are from each other (e.g., B and G). Conversely, the more yellowish the color, the higher the similarity (proximity in feature space). It’s worth noting that the function automatically arranges columns and rows based on the Euclidean distance. This arrangement groups similar objects together, making it easy to identify patterns. For instance, we can quickly see that B, C, and E form a region of high similarity in the lower-right part of the matrix. A and D, on the other hand, form a separate region. The graphic also includes dendrograms for a corresponding clustering of data, a topic we will discuss in detail later. By default, heatmap() uses complete linkage as a hierarchical clustering method.
Another approach to visualize distances between cases in cluster analysis is to use network graphs. Below, we use the function qgraph(), which the qgraph package (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012) provides.
We first convert the distance matrix into a similarity matrix. Keep in mind that \(similarity=\frac{1}{distance}\), since the similarity of two objects is the reciprocal of their distance/dissimilarity. By doing so, we accept that each object has an infinite similarity with itself.
similarity_matrix <- 1/distance_matrix
similarity_matrix %>% qgraph::qgraph(layout="spring")
In the network graph above, objects that are more similar to each other (e.g., B, C, and E) as compared to others (e.g., A and G) are placed closer together. Similarity between customers is represented by the thickness and the shade of green of the connection lines. This kind of visualization is particularly useful with a manageable number of cases but a high number of clustering variables.
For the sake of completeness, we will subsequently also use the City-block distance/Manhattan metric and the Chebychev distance for compiling the distance matrix. Note that ample other measures exist (e.g., Kassambara, 2017; Lichters, Möslein, Sarstedt, & Scharf, 2021).
The City-block distance/ Manhattan metric is defined between two objects A and B and k clustering variables as \[\sum_{i=1}^{k}|A_{i}-B_{i}|\]
(\(A_{i}\) represents A’s coordinates at the \(i^{th}\) variable).
The largest distance of two objects A and B on any of the k clustering variables defines the Chebychev distance: \[\max^{k}_{i=1}|A_{i}-B_{i}|\]
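Again, a quick hand calculation for customers A (33, 95) and B (82, 94):
abs(33 - 82) + abs(95 - 94)     # City-block/Manhattan distance = 50
max(abs(33 - 82), abs(95 - 94)) # Chebychev distance = 49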
Let us start with the City-block distance/ Manhattan metric.
distance_matrix <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
dist(method="manhattan") %>%
as.matrix()
rownames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
colnames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
distance_matrix %>% lowerMat(digits = 2)
A B C D E F G
A 0
B 50 0
C 48 30 0
D 31 79 49 0
E 81 37 33 56 0
F 79 93 63 54 56 0
G 101 149 119 70 112 56 0
Then, the Chebychev distance.
distance_matrix <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
dist(method="maximum") %>%
as.matrix()
rownames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
colnames(distance_matrix) <- Brand_Loyalty_Price_Consciousness$customer
distance_matrix %>% lowerMat(digits = 2)
A B C D E F G
A 0
B 49 0
C 33 16 0
D 28 52 36 0
E 46 34 20 49 0
F 62 61 47 34 29 0
G 78 77 63 50 69 40 0
7.2 Conducting Hierarchical Cluster Analysis
7.2.1 Single Linkage based on Euclidean distance
We will only illustrate the application of the Single Linkage algorithm based on the Euclidean distance.
There are several good packages in R that help with conducting hierarchical cluster analysis. For the subsequent clustering, we stay with a basic variant which the hclust() function provides (stats package, 2024). We first recreate the distance matrix. Since we have already seen the visualizations of the distance matrix above, we will not repeat these steps here.
distance_matrix_euclidean <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
dist(method="euclidean")
Next, we apply the hclust() function to the distance matrix and subsequently set the labels (i.e., the customers’ names in our example).
h_c_euclidean_single_linkage <- hclust(d=distance_matrix_euclidean, method="single")
h_c_euclidean_single_linkage$labels <- Brand_Loyalty_Price_Consciousness$customer
7.2.2 Decide on the number of clusters
Keep in mind that one should never rely only on statistical considerations when making the final decision on the number of clusters. As discussed during the lecture, expert judgment often has a valid stake in the final decision.
As seen during the lecture, in hierarchical clustering it is particularly useful to look at the corresponding dendrogram to select the final number of clusters to retain.
plot(h_c_euclidean_single_linkage, ylab="Height (i.e., Distances)")
In the dendrogram, we can follow the step-wise approach of clustering. Specifically, we see that B and C are merged first. Afterward, this new cluster fuses with E to {B, C, E}. Then, {A, D} is created. Both clusters are next merged to {A, B, C, D, E}, before F and finally G join. Most interestingly, we see that the main increase in distance (i.e., the main increase in within-cluster heterogeneity) occurs when {B, C, E} is merged with {A, D}. Thus, we should stop before this step, leaving the final 4-cluster solution comprising {A, D}, {B, C, E}, F, and G. As F and G are unmerged single observations in the final solution, one could also think of them as multivariate outliers (recall that Single Linkage is often used because of its chaining tendencies to detect such outliers).
The cluster assignments for each observation are given by:
cutree(h_c_euclidean_single_linkage, k=4)
A B C D E F G
1 2 2 1 2 3 4
We can also add the assignments to the original data:
Brand_Loyalty_Price_Consciousness$h_c_single_linkage <- cutree(h_c_euclidean_single_linkage, k=4)
Brand_Loyalty_Price_Consciousness
# A tibble: 7 × 4
customer x y h_c_single_linkage
<chr> <dbl> <dbl> <int>
1 A 33 95 1
2 B 82 94 2
3 C 66 80 2
4 D 30 67 1
5 E 79 60 2
6 F 50 33 3
7 G 10 17 4
Usually, with larger datasets, one looks at the number of observations in each segment as well as the cluster profiles regarding the clustering variables.
Brand_Loyalty_Price_Consciousness %>%
group_by(h_c_single_linkage) %>%
summarise(
price_consciousness = mean(x),
brand_loyalty = mean(y),
n = n()
)
# A tibble: 4 × 4
h_c_single_linkage price_consciousness brand_loyalty n
<int> <dbl> <dbl> <int>
1 1 31.5 81 2
2 2 75.7 78 3
3 3 50 33 1
4 4 10 17 1
Other packages enable a more appealing visualization of dendrograms. To illustrate this, we use functions included in the dendextend package (Galili, 2015).
dendrogram <- h_c_euclidean_single_linkage %>%
as.dendrogram() %>%
sort() %>% # sort the dendrogram
set("labels_col", value = c("#5d8ea6", "#05a535", "black", "#7a003f"), k = 4) %>% # set color for labels
set("branches_k_color", value = c("#5d8ea6", "#05a535", "black", "#7a003f"), k = 4) %>% # set color of lines in dendrogram
set("branches_lwd", 1.5) %>% # set width of lines
as.ggdend() ## convert to ggplot compatibility
dendrogram %>% ggplot(horiz = TRUE, theme = theme_minimal()) +
ggtitle("Dendrogram") +
xlab("Customers") +
ylab("Height (i.e., Distances)") +
theme(
plot.background = element_rect(fill = "lightgrey"),
panel.background = element_rect(fill = "grey98"),
panel.border = element_rect(colour = "black", fill = NA, linewidth = 0.7),
text = element_text(size = 12))
For the sake of completeness, we display the different dendrograms that result from different fusion algorithms (but always based on the Euclidean distance) below.
Above, we see that only Single Linkage and Centroid Linkage are able to detect F and G as multivariate outliers.
Another way of guiding the selection of the final number of clusters is to look at the invariance of the cluster solution (i.e., the assignment of cluster labels to the observations) with regard to different fusion algorithms and/or distance measures. As a rule of thumb, less than 20% of cluster assignments should vary across different reasonable clustering approaches (Sarstedt & Mooi, 2019, p. 332). Note that only reasonable clustering approaches should be evaluated. For example, it makes no sense to switch to matching coefficients as similarity/distance measures if all clustering variables are interval-scaled.
In the following, we stay with the Euclidean distance, but evaluate Ward’s Linkage (Ward, 1963) and Average Linkage next to the already implemented Single Linkage.
#Distance matrix
distance_matrix_euclidean <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
dist(method = "euclidean")
# Single Linkage
single_linkage <- hclust(d=distance_matrix_euclidean, method="single")
single_linkage$labels <- Brand_Loyalty_Price_Consciousness$customer
# Average Linkage
average_linkage <- hclust(d=distance_matrix_euclidean, method="average")
average_linkage$labels <- Brand_Loyalty_Price_Consciousness$customer
# Ward's Linkage
ward_linkage <- hclust(d=distance_matrix_euclidean, method="ward.D2")
ward_linkage$labels <- Brand_Loyalty_Price_Consciousness$customer
After having conducted these segmentations, we save the segment assignments for solutions with 3 to 5 clusters separately.
assignments <- tibble(.rows = nrow(Brand_Loyalty_Price_Consciousness))
assignments$Single_Linkage_3 <- cutree(single_linkage, k=3)
assignments$Single_Linkage_4 <- cutree(single_linkage, k=4)
assignments$Single_Linkage_5 <- cutree(single_linkage, k=5)
assignments$Average_Linkage_3 <- cutree(average_linkage, k=3)
assignments$Average_Linkage_4 <- cutree(average_linkage, k=4)
assignments$Average_Linkage_5 <- cutree(average_linkage, k=5)
assignments$Ward_Linkage_3 <- cutree(ward_linkage, k=3)
assignments$Ward_Linkage_4 <- cutree(ward_linkage, k=4)
assignments$Ward_Linkage_5 <- cutree(ward_linkage, k=5)
Doing so is helpful to understand how stable the Single Linkage solution is at varying numbers of clusters. The next table shows the assignments in case of 3-cluster solutions.
assignments %>% select(Single_Linkage_3, Average_Linkage_3, Ward_Linkage_3)
# A tibble: 7 × 3
Single_Linkage_3 Average_Linkage_3 Ward_Linkage_3
<int> <int> <int>
1 1 1 1
2 1 2 2
3 1 2 2
4 1 1 1
5 1 2 2
6 2 3 3
7 3 3 3
assignments %>%
group_by(Single_Linkage_3, Average_Linkage_3) %>%
summarise(
n = n(),
percent = n()/nrow(.)*100
)
`summarise()` has grouped output by 'Single_Linkage_3'. You can override using
the `.groups` argument.
# A tibble: 4 × 4
# Groups: Single_Linkage_3 [3]
Single_Linkage_3 Average_Linkage_3 n percent
<int> <int> <int> <dbl>
1 1 1 2 28.6
2 1 2 3 42.9
3 2 3 1 14.3
4 3 3 1 14.3
We see that for the solution with 3 clusters, we only have limited concordance of cluster assignments between Single Linkage and the other linkage algorithms. In contrast, the concordance between Average Linkage and Ward’s Linkage is perfect. This small example provides an essential lesson. First, the less-than-20%-discordance rule mentioned above is misleading when multivariate outliers are present and the fusion algorithms in use are sensitive to outliers to varying degrees. We have 42.9% + 14.3% = 57.2% mismatch between the clustering variants in the above example. This is because fusion algorithms with chaining properties (such as Single Linkage) will seldom match the outcomes of other algorithms disturbed by outliers when the number of clusters is low. At the same time, with multivariate outliers present, we should not trust the concordance of two fusion algorithms that are both sensitive to outliers. These outliers affect the final cluster assignments so strongly that two outlier-sensitive fusion algorithms will almost always end up with the same assignments. Second, the only proper way in the presence of multivariate outliers is to exclude them before clustering the data and then evaluate the concordance of varying solutions. Alternatively, one could collect the outliers in a kind of noise cluster (Vigneau, Qannari, Navez, & Cottet, 2016), as the final steps in Single Linkage somehow do.
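To illustrate the first route, here is a minimal sketch (my own illustration, not part of the lecture example) that excludes F and G before clustering and then compares two solutions again; the number of clusters (k = 2) is an assumption for demonstration only.
no_outliers <- Brand_Loyalty_Price_Consciousness %>%
filter(!customer %in% c("F", "G")) # drop the two suspected multivariate outliers
dist_no_outliers <- no_outliers %>%
select(x, y) %>%
dist(method = "euclidean")
average_no_outliers <- cutree(hclust(dist_no_outliers, method = "average"), k = 2)
ward_no_outliers <- cutree(hclust(dist_no_outliers, method = "ward.D2"), k = 2)
table(average_no_outliers, ward_no_outliers) # concordance of assignments without the outliers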
Often, market researchers also prepare cross tabs of competing cluster solutions. We will do so below with the help of the CrossTable() function provided by the descr package (Enzmann, Schwartz, Jain, & Kraft, 2023).
descr::CrossTable(x = assignments$Single_Linkage_3,
y = assignments$Average_Linkage_3,
digits = 2,
prop.r = FALSE,
prop.c = FALSE,
prop.t = TRUE,
prop.chisq = FALSE,
dnn = c("Single Linkage", "Average Linkage"))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
============================================
Average Linkage
Single Linkage 1 2 3 Total
--------------------------------------------
1 2 3 0 5
0.29 0.43 0.00
--------------------------------------------
2 0 0 1 1
0.00 0.00 0.14
--------------------------------------------
3 0 0 1 1
0.00 0.00 0.14
--------------------------------------------
Total 2 3 2 7
============================================
For illustration, we next focus on the comparison between Single Linkage and Ward’s Linkage for 3 to 5 clusters.
assignments %>%
group_by(Single_Linkage_3, Ward_Linkage_3) %>%
summarise(
n = n(),
percent = n()/nrow(.)*100
)
`summarise()` has grouped output by 'Single_Linkage_3'. You can override using
the `.groups` argument.
# A tibble: 4 × 4
# Groups: Single_Linkage_3 [3]
Single_Linkage_3 Ward_Linkage_3 n percent
<int> <int> <int> <dbl>
1 1 1 2 28.6
2 1 2 3 42.9
3 2 3 1 14.3
4 3 3 1 14.3
assignments %>%
group_by(Single_Linkage_4, Ward_Linkage_4) %>%
summarise(
n = n(),
percent = n()/nrow(.)*100
)
`summarise()` has grouped output by 'Single_Linkage_4'. You can override using
the `.groups` argument.
# A tibble: 4 × 4
# Groups: Single_Linkage_4 [4]
Single_Linkage_4 Ward_Linkage_4 n percent
<int> <int> <int> <dbl>
1 1 1 2 28.6
2 2 2 3 42.9
3 3 3 1 14.3
4 4 4 1 14.3
assignments %>%
group_by(Single_Linkage_5, Ward_Linkage_5) %>%
summarise(
n = n(),
percent = n()/nrow(.)*100
)
`summarise()` has grouped output by 'Single_Linkage_5'. You can override using
the `.groups` argument.
# A tibble: 6 × 4
# Groups: Single_Linkage_5 [5]
Single_Linkage_5 Ward_Linkage_5 n percent
<int> <int> <int> <dbl>
1 1 1 1 14.3
2 2 2 2 28.6
3 2 3 1 14.3
4 3 1 1 14.3
5 4 4 1 14.3
6 5 5 1 14.3
In short, the discordance between both fusion algorithms in the case of 3 clusters is 57.2%. In the case of 4 clusters, it shrinks to 0%. Finally, with 5 clusters, the mismatch is 28.6%. Thus, according to this stability assessment, we would also favor the 4-cluster solution.
Market researchers also often apply cluster quality indices to decide on an appropriate number of clusters in the final solution. Researchers have come up with many of these criteria, all of which have their merits and drawbacks. As a consequence, many good R packages implement a plethora of the criteria. Examples include the NbClust (Charrad, Ghazzali, Boiteau, & Niknafs, 2014b), the fpc (Hennig, 2024), and the clusterCrit (Desgraupes, 2023) packages.
During the lecture, we focused on two classes of criteria: (a) the Variance Ratio Criterion proposed by Calinski & Harabasz (1974), and (b) the Duda-Hart-index (Duda & Hart, 1973). We will use both classes subsequently, because both worked well in the seminal work of Milligan & Cooper (1985).
We start with the Variance Ratio Criterion (VRC) proposed by Calinski & Harabasz (1974). Remember from the lecture that the calculation of the VRC is similar to computations we have seen in ANOVA:
\[VRC_{k}=\frac{\frac{SS_{B}}{k-1}}{\frac{SS_{W}}{n-k}} \] where n is the total sample size and k the number of clusters in a solution. \(SS_{B}\) indicates the between-cluster sum of squares and \(SS_{W}\) the within-group sum of squares. Usually, we search for the k that maximizes the VRC, since at this point the variation in the clustering variables between the groups is high (i.e., the clusters are well separated), while the variation within the groups is small (i.e., the clusters are very compact). However, as discussed during the lecture, the VRC tends to decrease with an increasing number of clusters. Therefore, a more nuanced approach is to search for the minimal omega, defined via the differences in VRC values for successive numbers of clusters: \[ \omega_{k}=(VRC_{k+1}-VRC_{k})- (VRC_{k}-VRC_{k-1}) \] Clearly, \(\omega_{k}\) is not defined for k<3 and k>n-2. Why? Because \(\omega_{k}\), in the case of 2 clusters, would require the calculation of \(VRC_{1}\). However, \(VRC_{1}\) is not defined, since we do not have an \(SS_{B}\) in a 1-cluster solution. Similarly, in the case of k=n-1, \(\omega_{k}\) would require the calculation of \(VRC_{n}\). However, the term \(\frac{SS_{W}}{n-k}\) in the VRC calculation above is not defined for n=k.
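To make the VRC formula concrete, the following minimal sketch (my own illustration, not part of the lecture materials; the variable names are my own) recomputes the VRC by hand for the 4-cluster Single Linkage solution from above.
vrc_vars <- Brand_Loyalty_Price_Consciousness %>% select(x, y)
vrc_assignment <- cutree(single_linkage, k = 4)
n <- nrow(vrc_vars)
k <- 4
# total sum of squares around the overall centroid
ss_total <- sum(scale(vrc_vars, scale = FALSE)^2)
# within-cluster sum of squares: squared deviations from each cluster's centroid
ss_within <- vrc_vars %>%
mutate(cluster = vrc_assignment) %>%
group_by(cluster) %>%
summarise(ss = sum(scale(cbind(x, y), scale = FALSE)^2)) %>%
pull(ss) %>%
sum()
ss_between <- ss_total - ss_within
(ss_between / (k - 1)) / (ss_within / (n - k)) # approx. 7.64, cf. the NbClust output below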
Please also note: applying the VRC and the omega criterion does not make much sense when working with fusion algorithms with chaining properties in the presence of multivariate outliers. This is because such algorithms (e.g., Single Linkage) tend to produce clusters with only one observation per cluster. Thus, the corresponding \(SS_{W}\) components contributed by these clusters will be 0, which artificially inflates the VRC. Both criteria are also only useful when applied to fusion algorithms for which the sum of squares has a valid interpretation. For example, we cannot use them when we have created the clusters in a way that only considered categorical clustering variables, since squared differences between them have no meaningful interpretation.
Below, we start with the VRC criterion. First, we use the function NbClust() from the package with the same name (Charrad, Ghazzali, Boiteau, & Niknafs, 2014a). This function asks us to specify a set of arguments.
- data: is the dataset providing the clustering variables (e.g., a tibble)
- diss: is the distance matrix that we might have calculated before
- distance: set to NULL, because we already have a distance matrix. Otherwise, we would have had to specify a distance measure
- min.nc & max.nc: minimal and maximal number of clusters to consider
- method: a fusion algorithm (Ward’s Linkage in the present case)
- index: a cluster quality index. Below, “ch” stands for Calinski & Harabasz (set to “all” to calculate all available indexes)
clustering_vars <- Brand_Loyalty_Price_Consciousness %>% select(x, y)
suggestion_CH <- NbClust::NbClust(data = clustering_vars,
diss = distance_matrix_euclidean,
distance = NULL,
min.nc = 1,
max.nc = 6,
method = "single",
index = "ch")[1:2]
suggestion_CH
$All.index
1 2 3 4 5 6
NA 4.1559 3.5922 7.6400 6.1707 8.4030
$Best.nc
Number_clusters Value_Index
6.000 8.403
If we followed the VRC, then we would select six clusters in the final solution, as the VRC reaches its maximum value of 8.4 there. However, based on the dendrograms above, we know that this is mainly driven by the multivariate outliers G and F. Calinski & Harabasz (1974, p. 12) knew about this potential problem with the application of the VRC. Therefore, in this situation, they advocated against the use of the global maximum and for the use of the first local maximum, which usually occurs for smaller k. This would be 4 clusters in the present case.
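If you want to locate this first local maximum programmatically, the following small sketch works on the VRC values stored above (assuming that All.index is the plain named vector printed above; the variable names are my own).
vrc_values <- suggestion_CH$All.index # named vector of VRC values, names = number of clusters
# a local maximum is a VRC value that is larger than both of its neighbors
first_local_max <- which(diff(sign(diff(vrc_values))) == -2)[1] + 1
names(vrc_values)[first_local_max] # "4" in the present case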
Next, we will apply the omega criterion instead. Unfortunately, when compiling this script, I was not able to find a convenient package that provides a function for calculating omega (if you know one, just send me an email). Therefore, I have written one on my own, which our chair’s page lists. We use the following code to load the script providing the function named Calinski_Harabasz_omega.
If anything does not work out as expected on your machine, you can download the data by clicking on the following button.
Below, we apply the function. Keep in mind that for k>n-2, omega is undefined. Therefore, we set max_k to 5 (with 6, the function will throw an error).
Calinski_Harabasz_omega(max_k = 5, distance_matrix = distance_matrix_euclidean, hierarchical_cluster_solution = single_linkage)
clusters VRC VRC_omega
1 2 4.155899 NA
2 3 3.592170 4.611549
3 4 7.639990 -5.517113
4 5 6.170697 3.701631
5 6 8.403034 NA
As discussed during the lecture, we are searching for the global minimum of omega to determine the optimal k. Doing so would suggest retaining 4 clusters in the final solution.
We proceed with the Duda-Hart index (Duda & Hart, 1973) applied to the same clustering problem. When using the Duda-Hart index, one has to acknowledge that different literature sources provide different suggestions on how to use the index. For example, Sarstedt & Mooi (2019) suggest choosing the number of clusters k that maximizes the Duda-Hart index. The original work (Duda & Hart, 1973), as well as other authors (Charrad et al., 2014b; Gordon, 1999; Milligan & Cooper, 1985), advise selecting the first k that is accompanied by an index value exceeding a k-specific critical value.
Remember from the lecture that, in the i-th step of a hierarchical clustering procedure, when two clusters A and B are about to be merged, the index is calculated as follows:
\[ DH = \frac{SS_{W_{A}}+SS_{W_{B}}}{SS_{W_{A \cup B}}} \] Here, \(SS_{W_{A}}\) is the within-group sum of squares of cluster A. The critical value is calculated as follows (Charrad et al., 2014b, p. 6):
\[ crit_{DH} = 1- \frac{2}{\pi p}-3.2\sqrt{\frac{2\left(1-\frac{8}{\pi^{2}p}\right)}{n_{A \cup B}\,p}} \] Without getting into too much detail here, the critical value takes into account the number of clustering variables involved (p) and the number of objects in the joined cluster \(A \cup B\), and provides an estimate of what one would expect if the data had no cluster structure but were randomly distributed in a p-dimensional space.
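To get a feel for this formula, here is a quick hand calculation, assuming p = 2 clustering variables and a merged cluster containing all n = 7 customers (my own illustration).
p <- 2    # number of clustering variables
n_AB <- 7 # number of objects in the merged cluster (here: all seven customers)
1 - 2 / (pi * p) - 3.2 * sqrt(2 * (1 - 8 / (pi^2 * p)) / (n_AB * p))
# approx. -0.25, in line with the first critical value NbClust reports below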
Again, we use the function NbClust() from the package with the same name (Charrad et al., 2014a) to calculate Duda-Hart index.
clustering_vars <- Brand_Loyalty_Price_Consciousness %>% select(x, y)
suggestion_DH_index <- NbClust::NbClust(data = clustering_vars,
diss = distance_matrix_euclidean,
distance = NULL,
min.nc = 1,
max.nc = 6,
method = "single",
index = "duda")[1:3]
suggestion_DH_index
$All.index
1 2 3 4 5 6
0.5486 0.6821 0.3236 6.5738 1.2495 0.7522
$All.CriticalValues
1 2 3 4 5 6
-0.2510 -0.3258 -0.4219 -1.0633 -0.7431 -1.0633
$Best.nc
Number_clusters Value_Index
1.0000 0.5486
As discussed during the lecture, we are searching for the first DH index value that exceeds the corresponding critical value to determine the optimal k. Doing so would suggest a final solution with only 1 cluster (DH = 0.55). In contrast, following the maximum DH index rule would lead to the selection of 4 clusters (DH = 6.57).
Finally, we proceed with the Duda-Hart’s Pseudo t² (Duda & Hart, 1973).
\[ Pseudo\; t^2 = \frac{SS_{W_{A \cup B}}-SS_{W_{A}}-SS_{W_{B}}}{\frac{SS_{W_{A}}+SS_{W_{B}}}{n_{A}+n_{B}}} \] The numerator in the formula above (\(SS_{W_{A \cup B}}-SS_{W_{A}}-SS_{W_{B}}\)) represents the part of the within-group sum of squares that merging A and B adds to \(A \cup B\) beyond the sum of squares already present in A and B (\(SS_{W_{A}}+SS_{W_{B}}\)). Thus, the lower the numerator, the better. The denominator \(\frac{SS_{W_{A}}+SS_{W_{B}}}{n_{A}+n_{B}}\) represents the mean within-group sum of squares of A and B. The Pseudo t², therefore, accounts for the number of objects in A and B (\(n_{A}+n_{B}\)) before merging, a feature the Duda-Hart index does not have. The higher the mean within-group sum of squares of A and B (i.e., high heterogeneity within clusters A and B, even before merging), the larger the denominator and, consequently, the lower the Pseudo t². In conclusion, the Pseudo t² tends to be low when merging A and B in the next hierarchical clustering step only marginally increases the within-group sum of squares and when the heterogeneity within A and B is already high before merging to \(A \cup B\).
Similar to the Duda-Hart index, some sources recommend searching for the global minimum of the Pseudo t² to determine the appropriate k (Sarstedt & Mooi, 2019), while others recommend searching for the smallest k at which the Pseudo t² is smaller than or equal to the corresponding critical value (Charrad et al., 2014b; Gordon, 1999).
clustering_vars <- Brand_Loyalty_Price_Consciousness %>% select(x, y)
suggestion_pseudo_t2 <- NbClust::NbClust(data = clustering_vars,
diss = distance_matrix_euclidean,
distance = NULL,
min.nc = 1,
max.nc = 6,
method = "single",
index = "pseudot2")[1:3]
suggestion_pseudo_t2
$All.index
1 2 3 4 5 6
4.1138 1.8640 6.2701 0.0000 -0.1997 0.0000
$All.CriticalValues
1 2 3 4 5 6
-24.9172 -16.2785 -10.1102 0.0000 -2.3458 0.0000
$Best.nc
Number_clusters Value_Index
4 0
We are searching for the smallest k at which the Pseudo t² is smaller than or equal to the corresponding critical value. Doing so would suggest a 4-cluster solution (Pseudo t² = 0).
8 Advanced Cluster Analysis
8.1 Conducting Partitioning Clustering
We will only illustrate the application of the k-means algorithm (e.g., Hartigan & Wong, 1979).
Think back to the lecture. The general procedure in k-means follows a fixed sequence:
- Initialize k cluster centers (e.g., by randomly selecting k coordinates in the space of clustering variables).
- Create a starting partition (i.e., assign each object to the closest cluster in terms of the Euclidean distance between the object and the cluster centers).
- Re-compute the k cluster centers, the cluster centroids (their coordinates on the clustering variables are the means of the assigned objects).
- Apply exchange algorithm(s). That is, check for every object whether reassigning it to another cluster would decrease the Euclidean distance of this object to its corresponding cluster center.
- Repeat Steps 3 and 4. Stop when (a) no reassignments are beneficial, (b) a certain number of iterations is reached, or (c) the decrease in the goal function to be minimized falls below a specific threshold.
The goal in k-means is typically to minimize the total within-group sum of squares (SSW):
\[ min!\quad SS_{W}=\sum^{k}_{i=1}\sum^{n_{i}}_{j=1}\sum^{v}_{m=1}(center(i)_{m}-j_{m})^2 \] where k is the number of clusters, \(n_{i}\) is the number of objects in cluster i, and v is the number of clustering variables, with m being a specific variable. The term \(center(i)_{m}\) represents the value of clustering variable m for the center of cluster i. \(j_{m}\) is the value of clustering variable m for object j in cluster i.
The figure below visualizes these steps using the data we have used before. If you prefer cartoon-like explanations, I strongly recommend Allison Horst’s fantastic Data Science & Statistics Artwork.
There are several good packages in R that help with conducting partitioning cluster analysis such as k-means.
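For instance, base R already provides the kmeans() function in the stats package. Below is a minimal sketch (my own illustration, not part of the lecture example) applied to our two clustering variables; the chosen number of clusters (centers = 4), the number of random starts, and the seed are illustrative assumptions only.
set.seed(123) # k-means starts from random centers, so results can vary without a seed
kmeans_solution <- Brand_Loyalty_Price_Consciousness %>%
select(x, y) %>%
kmeans(centers = 4, nstart = 25) # 25 random starts, the solution with the lowest SS_W is kept
kmeans_solution$cluster      # cluster assignments
kmeans_solution$tot.withinss # total within-cluster sum of squares (the SS_W above)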
References
Reuse
Copyright
Citation
@misc{lichters2025,
author = {Lichters, Marcel},
title = {Marketing {Methods} and {Analysis:} {R} {Examples} Used
During {Market} {Research} {Lectures}},
date = {2025-03-18},
url = {https://pub.ww.ovgu.de/lichters/mma/lecture/OVGU_MMA_Lecture_Examples.html},
doi = {https://doi.org/10.24352/UB.OVGU-2024-085},
langid = {en}
}