MGT4018/MGT4090 Lab 2
1 Introduction
This lab is the same as the SPSS Lab 2 for MGT4018 and MGT4090. We use base R functions as the default. While there are many R packages available, understanding base R operations provides a strong foundation for data analysis.
Alternatives for achieving the same outcomes using different R packages are provided in green boxes. If you want to explore these alternatives, each box will introduce the respective package, how to install and use it, and the outputs it can produce.
2 Exercise D: Correlation and Linear Regression
This lab will guide you through loading a dataset, assigning labels, and conducting analyses (correlation, regression, t-tests, ANOVA, and chi-square) using R, producing output similar to what you would obtain in SPSS.
2.1 Step D1: Setup
You can continue working in the same project (Step A1, Option 1) or working directory (Step A1, Option 2) that you created in the previous lab. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your scripts folder with an appropriate name, e.g., Lab2_Exercise_D_E_F_G.R.
You can either work through the following steps and copy/paste the respective code into the Lab2_Exercise_D_E_F_G.R file that you create in Step D1, or download the R script for Exercises D, E, F, and G, follow the instructions below, and save the downloaded file in your scripts folder.
Now you are ready to begin your work in R and continue with Step D2!
2.2 Step D2: Loading the Dataset
Please download the dataset and save it in the data
folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
survey_data_full <- read.csv("data/lab2-survey.csv")
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(survey_data_full)
# Examine the structure
str(survey_data_full)
# Get basic summary statistics
summary(survey_data_full)
Similar to Exercises B and C in Lab 1, we want to assign labels to the demographic variables in the survey_data_full
data frame using factors with labeled levels. This method ensures the data remains categorical but with human-readable labels for easier interpretation and analysis.
# Convert demographic variables to factors with labeled levels
# Assign labels for "sex"
survey_data_full$sex <- factor(
  survey_data_full$sex,
  levels = c(1, 2),
  labels = c("Male", "Female")
)

# Assign labels for "marital"
survey_data_full$marital <- factor(
  survey_data_full$marital,
  levels = c(1, 2, 3, 4, 5, 6, 7, 8),
  labels = c(
    "Single", "Steady relationship", "Living with partner",
    "Married first time", "Remarried", "Separated",
    "Divorced", "Widowed"
  )
)

# Assign labels for "child"
survey_data_full$child <- factor(
  survey_data_full$child,
  levels = c(1, 2),
  labels = c("Yes", "No")
)

# Assign labels for "educ"
survey_data_full$educ <- factor(
  survey_data_full$educ,
  levels = c(1, 2, 3, 4, 5, 6),
  labels = c(
    "Primary", "Some secondary", "Completed high school",
    "Some additional training", "Completed undergraduate",
    "Postgraduate completed"
  )
)

# Assign labels for "source"
survey_data_full$source <- factor(
  survey_data_full$source,
  levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  labels = c(
    "Work", "Spouse or partner", "Relationships", "Children",
    "Family", "Health/illness", "Life in general",
    "Money/finances", "Lack of time, too much to do"
  )
)

# Assign labels for "smoke"
survey_data_full$smoke <- factor(
  survey_data_full$smoke,
  levels = c(1, 2),
  labels = c("Yes", "No")
)
The summary()
function can be used to confirm that the labels have been applied correctly to the variables.
# Verify changes by printing a summary
summary(survey_data_full)
2.3 Step D3: Correlation
Correlation analysis helps us understand the relationship between two variables. Before we dive into the practical implementation, let’s understand some key concepts.
2.3.1 What is Correlation?
Correlation measures the strength and direction of the relationship between two variables. The Pearson correlation coefficient (PCC), often denoted as r
, is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables. It is one of the most widely used correlation metrics.
The correlation coefficient (r) ranges from -1 to +1:
- r = +1: Perfect positive correlation
- r = 0: No linear correlation
- r = -1: Perfect negative correlation
The magnitude of r (how close it is to ±1) indicates the strength of the relationship, while the sign indicates the direction.
The correlation coefficient is standardized, meaning it’s independent of the units of measurement of the original variables. This makes it useful for comparing relationships between different pairs of variables.
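As a quick illustration of this unit independence, here is a minimal sketch using made-up values (not the lab dataset): rescaling one variable leaves Pearson's r unchanged.

# Two small made-up vectors (hypothetical values, for illustration only)
height_cm <- c(150, 160, 170, 180, 190)
weight_kg <- c(55, 62, 70, 78, 88)

# Correlation is identical whether height is measured in cm or in inches
cor(height_cm, weight_kg)
cor(height_cm / 2.54, weight_kg)   # same value after converting cm to inches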
2.3.2 Assumptions for Pearson’s Correlation
Before calculating correlations, we should check these assumptions:
- Variables are measured at the interval or ratio level
- Linear relationship between variables
- No significant outliers
- Approximate normal distribution of variables
- Homoscedasticity (equal variances)
2.3.3 Visual Inspection: Scatterplots
Before calculating correlations, it’s important to visualize the relationships:
# Install ggplot2 (NOTE: not needed if you completed the first lab)
install.packages("ggplot2")
# Load ggplot2
library(ggplot2)
# Simple scatterplot
scatter_stress_control <- ggplot(data = survey_data_full, aes(x = tpcoiss, y = tpstress)) +
  geom_point(color = "blue", size = 2) +
  labs(
    title = "Scatterplot of Perceived Stress vs Coping Strategies",
    x = "Coping Strategies (tpcoiss)",
    y = "Perceived Stress (tpstress)"
  ) +
  theme_minimal()

# Show plot
scatter_stress_control
Note: If you run the code above, you will get the warning message below. This warning occurs because the variables tpstress
and tpcoiss
have missing values. The warning doesn’t stop the plot from rendering, but it alerts you that data is being removed and you should investigate the reason behind the warning and decide how to handle it appropriately (e.g., filtering, replacing, or imputing missing values if necessary).
Warning message:
Removed 13 rows containing missing values or values outside the scale range (`geom_point()`).
2.3.4 Computing Correlations
We can calculate correlation. In the first instance, we focus on the two variables tpcoiss
and tpstress
. Afterwards, we will explore how to calculate correlations for all variables at once.
Single Correlation
The cor()
function calculates the correlation coefficient:
# Calculate correlation between tpcoiss and tpstress
cor(survey_data_full$tpcoiss, survey_data_full$tpstress, use = "complete.obs")
# Specify the method (default is Pearson, so this will return the same result)
cor(survey_data_full$tpcoiss, survey_data_full$tpstress, use = "complete.obs",
method = "pearson")
By default, the cor() function returns NA if any of the input values are missing, because missing values prevent the pairwise comparisons it needs; the use = "complete.obs" argument tells it to drop incomplete rows first. From the warning when creating the scatterplot, we know that 13 rows contain missing values across the two variables.
Best Practices in Base R
Always visualize your data first
Check assumptions (e.g., normality):
# Test for normality
shapiro.test(x)
shapiro.test(y1)
Consider different correlation methods when appropriate:
# Pearson correlation (default)
cor(x, y1, method = "pearson")

# Spearman correlation (for non-normal data)
cor(x, y1, method = "spearman")

# Kendall correlation (for ordinal data)
cor(x, y1, method = "kendall")
Deal with missing values appropriately.
- Option A: Handle missing values in cor() with different options for use:
# Exclude all rows with one missing value.
# Recommended in most cases and used for pairwise correlation.
cor(x, y1, use = "complete.obs")

# Uses all available pairwise data without dropping entire rows.
# Works well when computing correlation matrices.
cor(x, y1, use = "pairwise.complete.obs")
- Option B: Remove or replace missing values before calculating the correlations (note that replacing missing values will lead to different results):
# You can filter the dataset if you prefer to remove missing values before
# calculation.
clean_data <- na.omit(survey_data_full[, c("tpcoiss", "tpstress")])
cor(clean_data$tpcoiss, clean_data$tpstress)

# Alternatively, you can replace missing values (e.g., using the mean).
modified_data <- survey_data_full
modified_data$tpcoiss[is.na(modified_data$tpcoiss)] <- mean(modified_data$tpcoiss, na.rm = TRUE)
modified_data$tpstress[is.na(modified_data$tpstress)] <- mean(modified_data$tpstress, na.rm = TRUE)
cor(modified_data$tpcoiss, modified_data$tpstress)
Correlation Matrix
We can calculate correlations for multiple variables at once.
First, let’s create a smaller data frame with only continuous variables. We can do this in base R using the following code:
survey_data_small <- survey_data_full[, c("tpcoiss", "tpstress", "toptim",
                                          "tposaff", "tnegaff", "tlifesat",
                                          "tslfest", "tmarlow")]
Note: When subsetting the data above, we use survey_data_full[, c("tpcoiss", ...
. The comma ,
in the square brackets explicitly indicates that you are subsetting columns from a data frame. This clarity can help prevent unintended behaviors when working with more complex subsetting tasks (e.g., selecting rows and columns simultaneously) and makes the code more readable to others. Although survey_data_full[c("tpcoiss", ...
(without the comma ,
) often produces the same output when subsetting columns by name, using the comma notation adheres to standard R conventions for data frame indexing.
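For instance, the following minimal sketch (using a small hypothetical data frame, not the survey data) contrasts the two notations and shows why the comma matters once rows are involved:

# Toy data frame for illustration (hypothetical values)
df <- data.frame(a = 1:5, b = letters[1:5], c = 11:15)

df[, c("a", "c")]     # columns by name, comma notation
df[c("a", "c")]       # same result when only selecting columns
df[1:3, c("a", "c")]  # rows AND columns: here the comma is essential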
You can verify the structure of your new data frame using:
# Check the structure of the new data frame
str(survey_data_small)
# Or see the first few rows
head(survey_data_small)
You can also use the tidyverse
package to accomplish the same result: a new data frame containing only these eight variables:
library(tidyverse)
survey_data_small <- survey_data_full %>%
  select(tpcoiss, tpstress, toptim, tposaff,
         tnegaff, tlifesat, tslfest, tmarlow)
You can verify the structure of your new data frame using:
# Check the structure of the new data frame
str(survey_data_small)
# Or see the first few rows
head(survey_data_small)
# Calculate correlation matrix
correlation_matrix <- cor(survey_data_small, use = "pairwise.complete.obs")
# Round to 3 decimal places for clarity
round(correlation_matrix, 3)
2.3.5 Statistical Testing with cor.test()
The cor.test()
function provides a full statistical test:
# Perform correlation test
correlation_test <- cor.test(survey_data_full$tpcoiss,
                             survey_data_full$tpstress,
                             use = "complete.obs")
# View complete results
print(correlation_test)
This output includes:
- The correlation coefficient
- The test statistic
- The p-value
- The confidence interval
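If you want to reuse these pieces in later calculations or write-ups, they can be extracted from the test object directly (these are standard components of the htest object returned by cor.test()):

# Extract individual components from the cor.test() result
correlation_test$estimate   # the correlation coefficient r
correlation_test$statistic  # the t test statistic
correlation_test$p.value    # the p-value
correlation_test$conf.int   # the 95% confidence interval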
2.3.6 Interpreting Correlation Results
When interpreting correlation results, consider:
Strength: Common guidelines for absolute values:
- 0.00 to 0.19: “very weak”
- 0.20 to 0.39: “weak”
- 0.40 to 0.59: “moderate”
- 0.60 to 0.79: “strong”
- 0.80 to 1.00: “very strong”
Direction: Positive or negative relationship
Statistical Significance: Check the p-value
- p < 0.05 typically indicates statistical significance
- Consider effect size, not just significance
Context: What’s meaningful in your field?
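As a convenience, the strength guidelines above can be wrapped in a small helper function. This is only a sketch; the function name is made up here and the cut-offs simply mirror the table above.

# Map an absolute correlation onto the strength labels listed above
interpret_r_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      right = FALSE, include.lowest = TRUE)
}

interpret_r_strength(-0.45)  # returns "moderate"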
2.3.7 Common Pitfalls and Considerations
Correlation ≠ Causation: Correlation only indicates association, not causation
Outliers: Can strongly influence correlation coefficients
Non-linear Relationships: Pearson’s correlation only measures linear relationships
Missing Data: Handle missing values appropriately
2.4 Step D4: Linear Regression
Regression analysis helps us understand how one variable (the dependent variable) changes when another variable (the independent variable) changes. Before we dive into the practical implementation, let’s understand some key concepts.
2.4.1 What is Linear Regression?
Linear regression is one of the foundational techniques in statistics. Linear regression gives us a mathematical way to describe and predict the relationship between two variables.
Think of linear regression like drawing a “best-fit” line through a cloud of data points. This line helps us:
- Understand the relationship between variables.
- Predict values for new observations.
- Quantify how strong the relationship is.
2.4.2 Creating Our Linear Regression Model
Let’s analyze the relationship between stress (tpstress
) and sense of control (tpcoiss
). We’ll first visualize the data, then create our regression model.
# Create Scatterplot
ggplot(survey_data_full, aes(x = tpcoiss, y = tpstress)) +
  geom_point(color = "darkblue", alpha = 0.7) +                             # Scatterplot points
  geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) +  # Regression line
  labs(
    title = "Relationship between Stress and Control",
    x = "Sense of Control (tpcoiss)",
    y = "Perceived Stress (tpstress)"
  ) +
  theme_minimal()
Note: When running the code above, we get the warnings below. The functions geom_smooth() (regression line) and geom_point() (scatterplot) automatically exclude the rows with missing values, triggering the warnings. Refer back to the correlation section for how to proactively deal with these issues.
`geom_smooth()` using formula = 'y ~ x'
Warning messages:
1: Removed 13 rows containing non-finite outside the scale range (`stat_smooth()`).
2: Removed 13 rows containing missing values or values outside the scale range (`geom_point()`).
Now, let’s create our regression model using base R:
# Create the linear regression model
stress_model <- lm(tpstress ~ tpcoiss, data = survey_data_full)
# View the complete summary
summary(stress_model)
2.4.3 Understanding the Results
Let’s break down each part of the output to understand what it tells us:
1. The Correlation Coefficient (R)
In our case, we can find the correlation coefficient (r) by taking the square root of R-squared. The negative or positive sign comes from the slope coefficient in our regression output.
# Calculate r from our model
r <- sign(coef(stress_model)[2]) * sqrt(summary(stress_model)$r.squared)
cat("Correlation coefficient (r):", round(r, 3))
This value tells us:
- The strength of the relationship (how close to -1 or 1)
- The direction (positive or negative)
2. The ANOVA Table
The ANOVA table helps us assess if our model is statistically significant:
# Display the ANOVA table
anova(stress_model)
Looking at the p-value (Pr(>F)), we can see if our relationship is statistically significant. A value less than 0.05 suggests strong evidence of a real relationship between our variables.
3. The Regression Equation
From our coefficients table, we can write our regression equation:
# Display coefficients
coef(stress_model)
Our regression equation is:
Stress = β₀ + β₁(Control)
Where:
- β₀ is our intercept (constant)
- β₁ is our slope coefficient
Let’s fill in the actual values:
# Extract coefficients
intercept <- coef(stress_model)[1]
slope <- coef(stress_model)[2]
cat("Regression equation:\n")
cat("Stress =", round(intercept, 3), "+", round(slope, 3), "× Control")
This equation means:
- When Control = 0, predicted Stress = intercept
- For each one-unit increase in Control, Stress changes by the slope amount
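To make this concrete, we can plug a value into the equation by hand and confirm that it matches predict(); this is a small sketch using the intercept and slope extracted above and an arbitrary control score of 30.

# Predict stress for a control score of 30, first by hand, then with predict()
control_value <- 30
intercept + slope * control_value
predict(stress_model, newdata = data.frame(tpcoiss = control_value))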
4. R-squared (R²)
R-squared tells us how much of the variation in stress can be explained by control:
# Extract R-squared
r_squared <- summary(stress_model)$r.squared
cat("R-squared:", round(r_squared, 3))
Multiplying r_squared by 100 gives the approximate percentage of the variation in stress levels that can be explained by a person’s sense of control.
2.4.4 Making Predictions
We can use our model to predict stress levels for new values of control:
# Create some example control values
new_control <- data.frame(tpcoiss = c(20, 30, 40))

# Make predictions
predictions <- predict(stress_model, newdata = new_control)
# Display predictions
cbind(Control = new_control, Predicted_Stress = round(predictions, 2))
2.4.5 Checking Model Assumptions
For our regression to be valid, we should check certain assumptions:
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(stress_model)
These plots help us check:
- Linearity (Residuals vs Fitted)
- Normality (Normal Q-Q)
- Homoscedasticity (Scale-Location)
- Influential points (Residuals vs Leverage)
2.4.6 Interpretation Guide
When interpreting your regression results, consider:
- Statistical Significance
- Look at the p-value in the ANOVA table
- A p-value < 0.05 suggests a significant relationship
- Practical Significance
- Look at R-squared to understand how much variance is explained
- Consider the slope coefficient for practical impact
- Direction of Relationship
- A negative slope means as one variable increases, the other decreases
- A positive slope means both variables increase together
- Model Assumptions
- Check diagnostic plots for violations
- Consider transformations if assumptions are violated
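The quantities mentioned in this guide can all be pulled out of the fitted model programmatically; for example, a short sketch based on the stress_model object created above:

# Key quantities for interpreting the regression
anova(stress_model)[1, "Pr(>F)"]   # p-value for the predictor
summary(stress_model)$r.squared    # proportion of variance explained
coef(stress_model)[2]              # slope: direction and size of the effect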
Introduction
While base R provides solid foundations for linear regression, modern R packages offer enhanced capabilities for analysis, visualization, and interpretation. We’ll use several powerful packages that make our analysis more intuitive and visually appealing:
# Install packages if needed
install.packages(c("tidyverse", "broom", "performance", "see", "ggpubr", "sjPlot", "sjmisc", "sjlabelled"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(broom) # For tidying statistical objects
library(performance) # For model performance metrics
library(see) # For model visualization
library(ggpubr) # For publication-ready plots
library(sjPlot) # For model visualization and tables
library(sjmisc) # For model visualization and tables
library(sjlabelled) # For model visualization and tables
Data Preparation
First, let’s prepare our data using tidyverse
functions:
# Create a focused dataset for analysis
analysis_data <- survey_data_full %>%
  select(tpstress, tpcoiss) %>%
  drop_na()  # Remove any missing values
# Quick summary of our variables
summary(analysis_data)
Visual Exploration with ggplot2
The ggplot2
package (part of tidyverse
) creates beautiful visualizations:
# Create an enhanced scatter plot
ggplot(analysis_data, aes(x = tpcoiss, y = tpstress)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Relationship between Stress and Control",
    subtitle = "With linear regression line and 95% confidence interval",
    x = "Sense of Control (tpcoiss)",
    y = "Perceived Stress Level (tpstress)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
Creating the Regression Model
We’ll create our model and use modern packages to examine it:
# Create the model
model <- lm(tpstress ~ tpcoiss, data = analysis_data)

# Get a tidy summary using broom
tidy_model <- tidy(model, conf.int = TRUE)
glance_model <- glance(model)
# Display tidy results
tidy_model
Model Diagnostics with performance
The performance
package provides enhanced diagnostic tools:
# Check model assumptions
check_model(model)
# Model performance metrics
model_performance(model)
Creating Publication-Ready Tables with sjPlot
The sjPlot, sjmisc, and sjlabelled packages create attractive HTML and Word-compatible tables:
# Create regression table
tab_model(model,
title = "Linear Regression Results",
dv.labels = "Perceived Stress",
pred.labels = c("(Intercept)", "Sense of Control"))
# Export tables in html or Word format
tab_model(model, file = "output/tables/regression_table.html")
tab_model(model, file = "output/tables/regression_table.doc")
Visualizing Effects with ggeffects
We can visualize the relationship more clearly:
# Plot predicted values
plot_model(model, type = "pred") +
labs(
title = "Predicted Stress Levels by Control",
x = "Sense of Control",
y = "Predicted Stress Level"
)
Enhanced Regression Diagnostics
Let’s create more informative diagnostic plots using ggplot2
:
# Get augmented data (includes residuals, etc.)
model_data <- augment(model)

# Create diagnostic plots
p1 <- ggplot(model_data, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values",
       y = "Residuals") +
  theme_minimal()

p2 <- ggplot(model_data, aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot") +
  theme_minimal()

# Arrange plots side by side
diagnostic_plots <- ggarrange(p1, p2, ncol = 2)
# Show combined plot
diagnostic_plots
# Save combined plot as a pdf (change file ending for other formats)
ggsave("output/figures/diagnostic_plots.pdf", plot = diagnostic_plots, width = 8, height = 5)
Making Predictions
We can use tidyverse functions for predictions:
# Create new control values
new_control <- tibble(tpcoiss = c(20, 30, 40))

# Make predictions and bind results
predictions <- new_control %>%
  mutate(Predicted_Stress = predict(stress_model, newdata = .) %>% round(2))
# Display predictions
print(predictions)
Interactive Model Summary
We can create an interactive summary of our model:
# Create model summary
summary_stats <- tibble(
  Statistic = c("R-squared", "Adjusted R-squared", "F-statistic", "p-value"),
  Value = c(
    glance_model$r.squared,
    glance_model$adj.r.squared,
    glance_model$statistic,
    glance_model$p.value
  )
) %>%
  mutate(Value = round(Value, 3))

# Display as a formatted table
knitr::kable(summary_stats,
             caption = "Model Summary Statistics")
Advantages of Using Modern Packages
These modern R packages offer several advantages over base R:
- Better Visualization: ggplot2 creates publication-quality graphics
- Tidy Output: broom makes statistical output easier to work with
- Enhanced Diagnostics: performance provides comprehensive model checking
- Publication-Ready Tables: sjPlot creates professional tables
- Consistent Interface: tidyverse provides a coherent framework
3 Exercise E: T-Test
A t-test helps us determine whether there’s a meaningful difference between groups in our data. Imagine you’re wondering whether men and women tend to have different levels of self-esteem. A t-test can tell us if any difference we observe is likely to be real or just due to chance.
Think of a t-test like a referee in a debate: it helps us decide whether we have enough evidence to say two groups are truly different from each other. Just as a referee needs clear rules to make fair decisions, our t-test uses statistical principles to make this determination.
3.1 Step E1: Understanding the Independent Samples t-test
The independent samples t-test is used when we have:
- A continuous dependent variable (like self-esteem scores)
- Two independent groups (like males and females)
- Independent observations (one person’s score doesn’t influence another’s)
Before we dive into our analysis, let’s prepare our data and create some helpful visualizations:
# First, let's see what our data looks like
head(survey_data_full[c("sex", "tslfest", "tpcoiss")])
# Create boxplot with jittered points of self-esteem by gender
ggplot(survey_data_full, aes(x = factor(sex), y = tslfest)) +
  geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +   # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Self-esteem by Gender",
    x = "Gender",
    y = "Total Self-esteem Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
As an alternative to a boxplot with jittered points, you can also create a violin plot. Violin plots are mirrored density plots for displaying continuous distributions, similar to boxplots.
# Create violin plot with a small boxplot inside
ggplot(survey_data_full, aes(x = factor(sex), y = tslfest, fill = factor(sex))) +
  geom_violin(trim = FALSE, alpha = 0.5) +                        # Violin plot with full density curve
  geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) + # Add a small boxplot inside
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Self-esteem by Gender",
    x = "Gender",
    y = "Total Self-esteem Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
3.2 Step E2: Performing the T-Test (Self-esteem by Gender)
Let’s conduct our first t-test examining self-esteem differences between genders:
# First, let's check basic descriptive statistics
tapply(survey_data_full$tslfest,
       survey_data_full$sex,
       function(x) c(mean = mean(x, na.rm = TRUE),
                     sd = sd(x, na.rm = TRUE),
                     n = sum(!is.na(x))))

# Test for equality of variances
# (base R's var.test() performs an F-test; Levene's test itself is in the car package)
var_test <- var.test(tslfest ~ sex, data = survey_data_full)
print("Variance test results:")
print(var_test)

# Perform t-test based on the variance test result
t_test_result <- t.test(tslfest ~ sex,
                        data = survey_data_full,
                        var.equal = var_test$p.value > 0.05)  # TRUE if p > 0.05
print("\nt-test results:")
print(t_test_result)
Let’s break down what these results tell us:
Equality of Variances The p-value from our variance test helps us decide whether to assume equal variances (base R's var.test() is an F-test; the Levene's test reported by SPSS is available via the car package and is shown in the alternative-packages box below). If p > 0.05, we assume equal variances and use the standard t-test. If p < 0.05, we use Welch's t-test, which doesn't assume equal variances.
t-test Results
- The t-statistic tells us how many standard errors the group means are apart
- The degrees of freedom (df) help us determine critical values
- The p-value tells us the probability of seeing such differences by chance
- The confidence interval shows us the likely range of the true difference
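To see how these pieces fit together, here is a small sketch computing Welch's t-statistic by hand from the group means, standard deviations, and sizes; it will only match t_test_result$statistic if the Welch version (var.equal = FALSE) was used above.

# Welch's t-statistic computed from group summaries (by hand)
m <- tapply(survey_data_full$tslfest, survey_data_full$sex, mean, na.rm = TRUE)
s <- tapply(survey_data_full$tslfest, survey_data_full$sex, sd, na.rm = TRUE)
n <- tapply(!is.na(survey_data_full$tslfest), survey_data_full$sex, sum)

(m[1] - m[2]) / sqrt(s[1]^2 / n[1] + s[2]^2 / n[2])  # by hand
t_test_result$statistic                              # from t.test()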
Effect Size Let’s calculate Cohen’s d to understand the practical significance:
# Calculate Cohen's d
group1 <- survey_data_full$tslfest[survey_data_full$sex == "Male"]
group2 <- survey_data_full$tslfest[survey_data_full$sex == "Female"]

calc_cohens_d <- function(x, y) {
  # Remove NA values
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]

  nx <- length(x)
  ny <- length(y)

  # Ensure non-empty groups
  if (nx < 2 || ny < 2) {
    return(NA)  # Return NA if a group has fewer than 2 values
  }

  # Pooled standard deviation and standardized mean difference
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  return(abs(mean(x) - mean(y)) / pooled_sd)
}

effect_size <- calc_cohens_d(group1, group2)
cat("Cohen's d effect size:", round(effect_size, 3))
Note: We purposefully do not call the function cohens_d, as we will use a function of that name from the rstatix package below. If we defined our own function with the same name, R would call our custom function rather than the one from the package.
We can use several R packages that provide enhanced capabilities for conducting and visualizing t-tests and work together to create a comprehensive analysis:
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
Data Preparation
First, let’s prepare our data and ensure it’s in the right format for analysis:
# Create a small dataframe with relevant variables only
analysis_data <- survey_data_full %>% select(sex, tslfest, tpcoiss)
# Display the structure of our prepared data
str(analysis_data)
Visual Exploration
The ggplot2
package creates beautiful and informative visualizations (these are the same as in the base R solution, as we always use ggplot2
for visualizations):
# Create boxplot with jittered points of self-esteem by gender
ggplot(analysis_data, aes(x = factor(sex), y = tslfest)) +
  geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +   # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Self-esteem by Gender",
    x = "Gender",
    y = "Total Self-esteem Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
Here is the code for the violin plot as an alternative to the boxplot with jittered points.
# Alternative: Create violin plot with a small boxplot inside
ggplot(analysis_data, aes(x = factor(sex), y = tslfest, fill = factor(sex))) +
  geom_violin(trim = FALSE, alpha = 0.5) +                        # Violin plot with full density curve
  geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) + # Add a small boxplot inside
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Self-esteem by Gender",
    x = "Gender",
    y = "Total Self-esteem Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
The rstatix
package can help us calculate descriptive statistics.
# Add descriptive statistics
desc_stats <- analysis_data %>%
  group_by(sex) %>%
  get_summary_stats(tslfest, type = "common")

print(desc_stats)
Checking Assumptions
Modern packages provide comprehensive tools for checking t-test assumptions:
- Normality Check
We can check normality both visually and statistically:
# Create Q-Q plots by group
ggqqplot(analysis_data, "tslfest", facet.by = "sex")
# Shapiro-Wilk test for each group
analysis_data %>%
  group_by(sex) %>%
  shapiro_test(tslfest)
- Homogeneity of Variances
The car
package provides a robust Levene’s test:
# Levene's test using car package
leveneTest(tslfest ~ sex, data = analysis_data)
Conducting the T-Test
The rstatix
package provides a comprehensive t-test function:
# Perform t-test
t_test_results <- analysis_data %>%
  t_test(tslfest ~ sex) %>%
  add_significance()
# Display results
t_test_results
Effect Size Calculation
The rstatix package has a function to calculate Cohen’s d:
# Calculate Cohen's d (using the rstatix package)
cohens_result <- cohens_d(tslfest ~ sex, data = analysis_data)
print(cohens_result)
The effectsize
package helps us understand the practical significance:
# Interpret Cohen's d effect size
# (rstatix::cohens_d() stores the estimate in the `effsize` column)
interpretation <- interpret_cohens_d(cohens_result$effsize)
# Print interpretation
print(interpretation)
Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The rstatix package provides:
  - Clearer output formatting
  - Automatic significance indicators
  - Built-in assumption checking
- The ggpubr package creates:
  - Publication-ready plots
  - Easy visualization of assumptions
  - Combined plots with statistical results
- The effectsize package offers:
  - Standardized effect size calculations
  - Effect size interpretation
  - Confidence intervals for effect sizes
Creating a Complete Report
We can create a comprehensive analysis report:
# Function to run complete analysis
analyze_group_difference <- function(data, dv, group_var) {
  # Descriptive statistics
  desc <- data %>%
    group_by(!!sym(group_var)) %>%
    get_summary_stats(!!sym(dv), type = "common")

  # Assumption checks
  normality <- data %>%
    group_by(!!sym(group_var)) %>%
    shapiro_test(!!sym(dv))

  # Levene's test
  homogeneity <- leveneTest(as.formula(paste(dv, "~", group_var)), data = data)

  # T-test
  t_test <- data %>%
    t_test(as.formula(paste(dv, "~", group_var))) %>%
    add_significance()

  # Effect size
  effect <- cohens_d(as.formula(paste(dv, "~", group_var)), data = data)

  list(
    descriptives = desc,
    normality_test = normality,
    variance_test = homogeneity,
    t_test = t_test,
    effect_size = effect
  )
}

# Example usage
self_esteem_analysis <- analyze_group_difference(
  analysis_data,
  "tslfest",
  "sex"
)

# Display results
print(self_esteem_analysis)
Writing Up Results
Here’s how to write up the results in APA format using our comprehensive analysis:
# Create formatted output
report_results <- function(analysis) {
  with(analysis, {
    cat("Results:\n")
    cat("Descriptive Statistics:\n")
    print(descriptives)
    cat("\nAssumption Tests:\n")
    cat("Normality (Shapiro-Wilk):\n")
    print(normality_test)
    cat("\nHomogeneity of Variance (Levene's Test):\n")
    print(variance_test)
    cat("\nT-test Results:\n")
    print(t_test)
    cat("\nEffect Size:\n")
    print(effect_size)
  })
}
report_results(self_esteem_analysis)
3.3 Step E3: Performing the T-Test (Control Levels by Gender)
Now let’s examine our second research question about control levels between genders. First, we want to visually inspect the relationship between gender and perceived control:
# Create boxplot with jittered points of perceived control by gender
ggplot(analysis_data, aes(x = factor(sex), y = tpcoiss)) +
  geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +   # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Perceived Control by Gender",
    x = "Gender",
    y = "Perceived Control"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
We can also check descriptive statistics and perform the analysis.
# Check basic descriptive statistics for perceived control (tpcoiss)
tapply(survey_data_full$tpcoiss,
       survey_data_full$sex,
       function(x) c(mean = mean(x, na.rm = TRUE),
                     sd = sd(x, na.rm = TRUE),
                     n = sum(!is.na(x))))

# Test for equality of variances (var.test(), as in Step E2)
var_test <- var.test(tpcoiss ~ sex, data = survey_data_full)
print("Variance test results:")
print(var_test)

# Perform t-test based on the variance test result
t_test_result <- t.test(tpcoiss ~ sex,
                        data = survey_data_full,
                        var.equal = var_test$p.value > 0.05)  # TRUE if p > 0.05
print("\nt-test results:")
print(t_test_result)
Lastly, we want to estimate the effect size using Cohen’s d. We are using our own function calc_cohens_d()
that we created in Step E2.
# Calculate Cohen's d for perceived control (tpcoiss)
group1 <- survey_data_full$tpcoiss[survey_data_full$sex == "Male"]
group2 <- survey_data_full$tpcoiss[survey_data_full$sex == "Female"]

# We defined our function `calc_cohens_d()` earlier.
effect_size <- calc_cohens_d(group1, group2)
cat("Cohen's d effect size:", round(effect_size, 3))
We can use the same R packages as in the previous example for conducting and visualizing t-tests and creating a comprehensive analysis. Below is the complete analysis for our second research question about control levels.
# Create boxplot with jittered points of perceived control by gender
ggplot(analysis_data, aes(x = factor(sex), y = tpcoiss)) +
  geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +   # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightpink")) +
  labs(
    title = "Distribution of Perceived Control by Gender",
    x = "Gender",
    y = "Perceived Control"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent gender
# Check assumptions
# 1. Normality
analysis_data %>%
  group_by(sex) %>%
  shapiro_test(tpcoiss)
# 2. Homogeneity of variance
leveneTest(tpcoiss ~ sex, data = analysis_data)
# Perform t-test
control_t_test <- analysis_data %>%
  t_test(tpcoiss ~ sex) %>%
  add_significance()
# Calculate effect size
control_effect <- cohens_d(tpcoiss ~ sex, data = analysis_data)
# Display comprehensive results
list(
"T-test Results" = control_t_test,
"Effect Size" = control_effect
)
3.4 Step E4: A Step-by-Step Guide to Interpreting Results
When interpreting t-test results, follow these steps:
- Check Assumptions
- Look at the boxplots for obvious outliers
- Check Levene’s test for equality of variances
- If Levene’s test is significant (p < 0.05), use the Welch’s t-test output (R does this automatically with var.equal = FALSE).
- Consider whether the groups are truly independent
- Check if the data is normally distributed
- If the data is not normally distributed, consider using non-parametric alternatives like the Wilcoxon rank-sum test:
# Example of non-parametric test
wilcox.test(tslfest ~ sex, data = survey_data_full)
- Examine the t-test Results
- Look at the p-value (significance level)
- Consider the confidence interval
- Think about the effect size
- Draw Conclusions
- If p < 0.05, we have evidence of a significant difference
- Consider the practical significance (effect size)
- Think about the real-world implications
3.5 Step E5: Writing Up Results
Here is an example for how to write up t-test results in APA format:
“An independent-samples t-test was conducted to compare [variable] between males and females. There was a [significant/non-significant] difference in scores for males (M = XX.XX, SD = XX.XX) and females (M = XX.XX, SD = XX.XX); t(df) = XX.XX, p = .XXX. The magnitude of the differences in the means (mean difference = XX.XX, 95% CI: XX.XX to XX.XX) was [small/medium/large] (d = X.XX).”
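If you prefer to fill in these numbers programmatically, they can be pulled from the objects created in Step E2. This is only a sketch; it assumes t_test_result was produced by t.test() and effect_size by our calc_cohens_d() function.

# Pull the numbers for the APA sentence out of the t-test object
cat(sprintf(
  "t(%.1f) = %.2f, p = %.3f, mean difference = %.2f, 95%% CI: %.2f to %.2f, d = %.2f",
  t_test_result$parameter, t_test_result$statistic, t_test_result$p.value,
  diff(rev(t_test_result$estimate)),  # group 1 mean minus group 2 mean
  t_test_result$conf.int[1], t_test_result$conf.int[2], effect_size
))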
4 Exercise F: Analysis of Variance (ANOVA)
Analysis of Variance, commonly known as ANOVA, extends our analytical capabilities beyond comparing just two groups. While t-tests are perfect for comparing means between two groups (like comparing males and females), researchers often need to examine differences across multiple groups. For instance, we might want to know how optimism levels differ across various age groups, or how stress levels vary with education.
Think of ANOVA like a sophisticated referee in a multi-team tournament: instead of just comparing two teams, it helps us determine whether any teams in the entire tournament are significantly different from each other. Just as a tournament referee needs clear rules to make fair decisions across multiple teams, ANOVA uses statistical principles to examine differences across multiple groups simultaneously.
4.1 Step F1: Understanding One-way ANOVA
One-way ANOVA is used when we have:
- A continuous dependent variable (like optimism scores).
- One categorical independent variable with three or more groups (like age groups).
- Independent observations (one person’s score doesn’t influence another’s).
Let’s begin by preparing our data and creating some helpful visualizations:
# First, let's prepare our age groups properly
survey_data_full$agegp3 <- factor(survey_data_full$agegp3,
                                  levels = c(1, 2, 3),
                                  labels = c("18-29", "30-44", "45+"))
# Create boxplot with jittered points of optimism levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = toptim)) +
  geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +      # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
  labs(
    title = "Optimism Levels by Age Group",
    x = "Age Group",
    y = "Total Optimism Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent age group
4.2 Step F2: Performing the One-Way ANOVA (Optimism across Age Groups)
Before diving into the ANOVA, let’s examine our data descriptively:
# Create comprehensive descriptive statistics
get_descriptives <- function(dv, group) {
  tapply(dv, group, function(x) {
    c(n = sum(!is.na(x)),
      mean = mean(x, na.rm = TRUE),
      sd = sd(x, na.rm = TRUE),
      se = sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))
  })
}

# Get descriptives for optimism by age group
descriptives <- get_descriptives(survey_data_full$toptim,
                                 survey_data_full$agegp3)
print("Descriptive Statistics:")
print(descriptives)
Just as t-tests have assumptions, ANOVA requires certain conditions to be met. Levene’s test is often performed before an ANOVA to ensure the validity of the ANOVA results by verifying that the variances across groups are similar (the “homogeneity of variance” assumption). However, Levene’s test uses the mean for deviations, making it appropriate if the data are normally distributed. We can perform the Shapiro-Wilk test, and if p > 0.05
, the data are approximately normally distributed and we can use Levene’s test.
# Normality test (Shapiro-Wilk test)
shapiro.test(survey_data_full$toptim)
# Levene's test for homogeneity of variance
levene_test <- function(y, group) {
  group <- factor(group)
  means <- tapply(y, group, mean, na.rm = TRUE)
  abs_dev <- abs(y - means[group])
  anova(lm(abs_dev ~ group))[1, "Pr(>F)"]
}

# Perform Levene's test
levene_p <- levene_test(survey_data_full$toptim, survey_data_full$agegp3)
cat("Levene's test p-value:", levene_p, "\n")
If p < 0.05 from the Shapiro-Wilk test, the data are not normally distributed and the Brown-Forsythe test, which uses the median for deviations, is more appropriate.
# Brown-Forsythe test (more robust to non-normality)
bf_test <- function(y, group) {
  group <- factor(group)
  meds <- tapply(y, group, median, na.rm = TRUE)
  abs_dev <- abs(y - meds[group])
  anova(lm(abs_dev ~ group))[1, "Pr(>F)"]
}

bf_p <- bf_test(survey_data_full$toptim, survey_data_full$agegp3)
cat("Brown-Forsythe test p-value:", bf_p, "\n")
Now we’ll perform both standard ANOVA and its robust alternative:
# Standard one-way ANOVA
anova_result <- aov(toptim ~ agegp3, data = survey_data_full)
print("Standard ANOVA results:")
print(summary(anova_result))

# Welch's ANOVA (robust to unequal variances)
welch_result <- oneway.test(toptim ~ agegp3,
                            data = survey_data_full,
                            var.equal = FALSE)
print("\nWelch's ANOVA results:")
print(welch_result)
When conducting an ANOVA test, a significant result indicates that at least one group mean differs from the others. However, ANOVA does not tell us which specific groups differ. To determine where these differences exist, we use Tukey’s Honest Significant Difference (HSD) test, which compares each pair of groups while controlling for multiple comparisons.
# Tukey's HSD test for pairwise comparisons
tukey_result <- TukeyHSD(anova_result)
print("Tukey's HSD test results:")
print(tukey_result)
# Visualize the Tukey results
plot(tukey_result)
Following the steps below, we can confidently interpret the results of Tukey’s HSD test and understand which groups significantly differ from each other:
- Check the p-value (p adj): If it is less than 0.05, the difference is statistically significant.
- Look at the confidence interval (lwr and upr): If the interval does not include 0, there is a meaningful difference.
- Examine the mean difference (diff): Determine whether one group has a higher or lower mean compared to another.
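Rather than reading these values off the printed output, you can also turn the Tukey result into a data frame and filter it. This is a small sketch based on the tukey_result object above; the element name agegp3 matches the grouping factor used in the model.

# Convert the Tukey results to a data frame and keep only significant comparisons
tukey_df <- as.data.frame(tukey_result$agegp3)
tukey_df[tukey_df$"p adj" < 0.05, c("diff", "lwr", "upr", "p adj")]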
Let’s calculate eta-squared to understand the practical significance. Statistical significance (i.e., a small p-value) does not tell us how important or meaningful the difference is in real-world terms. This is where effect size measures like eta-squared (η²) come in.
# Calculate eta-squared
aov_summary <- summary(anova_result)
eta_squared <- aov_summary[[1]]$"Sum Sq"[1] / sum(aov_summary[[1]]$"Sum Sq")
cat("Eta-squared:", round(eta_squared, 3))
Cohen (1988) provides general guidelines for interpreting eta-squared in the context of ANOVA:
- 0.01 (1%): The independent variable explains a small amount of variance (small effect).
- 0.06 (6%): The independent variable explains a moderate amount of variance (medium effect).
- 0.14+ (14% or more): The independent variable explains a large amount of variance (strong effect).
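Following these thresholds, a small helper can label the eta-squared value we just computed. This is only a sketch; the function name is made up and the cut-offs simply mirror Cohen's guidelines above.

# Label eta-squared using Cohen's (1988) rough benchmarks
interpret_eta2 <- function(eta2) {
  if (eta2 >= 0.14) "large effect"
  else if (eta2 >= 0.06) "medium effect"
  else if (eta2 >= 0.01) "small effect"
  else "negligible effect"
}

interpret_eta2(eta_squared)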
Modern R packages provide enhanced capabilities for conducting and visualizing Analysis of Variance (ANOVA). We’ll use several powerful packages that work together to create a comprehensive analysis:
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize", "emmeans"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
library(emmeans) # For estimated marginal means and post-hoc tests
Data Preparation
First, let’s prepare our data and ensure it’s in the right format for analysis:
# Create a focused dataset; rename agegp3 to age_group so it matches the code below
analysis_data <- survey_data_full %>%
  select(age_group = agegp3, educ, toptim, tpstress)
# Display the structure of our prepared data
glimpse(analysis_data)
Visual Exploration
The ggpubr
package creates beautiful and informative visualizations:
# Create violin plot with boxplot and individual points
ggviolin(analysis_data,
x = "age_group",
y = "toptim",
fill = "age_group",
add = "boxplot",
add.params = list(fill = "white")) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Optimism Levels by Age Group",
y = "Total Optimism Score",
x = "Age Group") +
theme(legend.position = "none")
# Add descriptive statistics
desc_stats <- analysis_data %>%
  group_by(age_group) %>%
  get_summary_stats(toptim, type = "common")

print(desc_stats)
Checking Assumptions
Modern packages provide comprehensive tools for checking ANOVA assumptions:
- Normality Check
We can check normality both visually and statistically:
# Create Q-Q plots by group
ggqqplot(analysis_data, "toptim", facet.by = "age_group")
# Shapiro-Wilk test for each group
analysis_data %>%
  group_by(age_group) %>%
  shapiro_test(toptim)
- Homogeneity of Variances
The car package provides both Levene’s and Brown-Forsythe tests:
# Levene's test using car package
leveneTest(toptim ~ age_group, data = analysis_data)
# Brown-Forsythe test using car package
leveneTest(toptim ~ age_group, data = analysis_data, center = median)
Conducting the ANOVA
The rstatix
package provides comprehensive ANOVA functions:
# Perform one-way ANOVA
anova_results <- analysis_data %>%
  anova_test(toptim ~ age_group) %>%
  add_significance()

# Display results
anova_results

# Welch's ANOVA (robust to heterogeneity of variance)
welch_results <- analysis_data %>%
  welch_anova_test(toptim ~ age_group)

print("Welch's ANOVA results:")
welch_results
Post-hoc Analysis
Modern packages offer various methods for post-hoc analysis:
# Tukey's HSD using emmeans
emmeans_result <- emmeans(aov(toptim ~ age_group, data = analysis_data),
                          "age_group")
pairs(emmeans_result)

# Games-Howell post-hoc test (robust to heterogeneity of variance)
games_howell_result <- analysis_data %>%
  games_howell_test(toptim ~ age_group)

print("Games-Howell test results:")
games_howell_result
Effect Size Calculation
The effectsize
package helps us understand the practical significance:
# Calculate eta squared
eta_squared <- effectsize::eta_squared(aov(toptim ~ age_group, data = analysis_data))
print(eta_squared)
# Interpret effect size
interpret_eta_squared(eta_squared$Eta2)
Creating Powerful Visualizations
We can create publication-ready plots that combine statistical information:
# Create plot with statistical annotations
ggboxplot(analysis_data,
x = "age_group",
y = "toptim",
color = "age_group") +
stat_compare_means(method = "anova") +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Optimism Levels by Age Group with ANOVA Results",
y = "Total Optimism Score",
x = "Age Group") +
theme(legend.position = "none")
Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The rstatix package provides:
  - Tidyverse-compatible statistical functions
  - Automatic assumption checking
  - Built-in effect size calculations
- The ggpubr package creates:
  - Publication-ready plots
  - Easy statistical annotations
  - Combined plots with results
- The emmeans package offers:
  - Sophisticated post-hoc analyses
  - Flexible contrast specifications
  - Multiple comparison adjustments
Writing Up Results
Here’s how to write up the results in APA format:
# Function to create formatted output
report_anova_results <- function(analysis) {
  with(analysis, {
    cat("Results:\n")
    cat("\nDescriptive Statistics:\n")
    print(descriptives)
    cat("\nANOVA Results:\n")
    print(anova_test)
    cat("\nEffect Size:\n")
    print(effect_size)
    cat("\nPost-hoc Analysis:\n")
    print(posthoc)
  })
}

report_anova_results(stress_analysis)
4.3 Step F3: Performing the One-Way ANOVA (Stress Levels across Age Groups)
Let’s apply what we’ve learned to analyze stress levels (tpstress
) across age groups (agegp3
). First, we create some helpful visualizations:
# Create boxplot with jittered points of stress levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = tpstress)) +
  geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +      # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
  labs(
    title = "Stress Levels by Age Group",
    x = "Age Group",
    y = "Total Stress Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent age group
Before diving into the ANOVA, let’s examine our data descriptively:
# Get descriptives for total perceived stress by age group
# The function `get_descriptives()` was created in Step F2
descriptives <- get_descriptives(survey_data_full$tpstress,
                                 survey_data_full$agegp3)
print("Descriptive Statistics:")
print(descriptives)
Before conducting our ANOVA, let’s test whether our data meets the requirements. We can perform the Shapiro-Wilk test, and if p > 0.05
, the data are approximately normally distributed and we can use Levene’s test, otherwise we should use the Brown-Forsythe test.
# Normality test (Shapiro-Wilk test)
shapiro.test(survey_data_full$tpstress)
# Perform Levene's test (if p > 0.05 from Shapiro-Wilk test)
# The function `levene_test()` was created in Step F2
levene_p <- levene_test(survey_data_full$tpstress, survey_data_full$agegp3)
cat("Levene's test p-value:", levene_p, "\n")

# Brown-Forsythe test (if p < 0.05 from Shapiro-Wilk test)
# The function `bf_test()` was created in Step F2
bf_p <- bf_test(survey_data_full$tpstress, survey_data_full$agegp3)
cat("Brown-Forsythe test p-value:", bf_p, "\n")
Now we’ll perform either the standard ANOVA or its robust alternative:
# Standard one-way ANOVA
anova_result <- aov(tpstress ~ agegp3, data = survey_data_full)
print("Standard ANOVA results:")
print(summary(anova_result))

# Welch's ANOVA (robust to unequal variances)
welch_result <- oneway.test(tpstress ~ agegp3,
                            data = survey_data_full,
                            var.equal = FALSE)
print("\nWelch's ANOVA results:")
print(welch_result)
If we find significant differences, we need to know which groups differ from each other:
# Tukey's HSD test for pairwise comparisons
tukey_result <- TukeyHSD(anova_result)
print("Tukey's HSD test results:")
print(tukey_result)
# Visualize the Tukey results
plot(tukey_result)
Let’s calculate eta-squared to understand the practical significance:
# Calculate eta-squared
aov_summary <- summary(anova_result)
eta_squared <- aov_summary[[1]]$"Sum Sq"[1] / sum(aov_summary[[1]]$"Sum Sq")
cat("Eta-squared:", round(eta_squared, 3))
Similar to the one-way ANOVA for total optimism toptim
across age groups agegp3
, we can use R packages to conduct a one-way ANOVA for stress levels tpstress
across age groups agegp3
.
Complete and Comprehensive Analysis
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize", "emmeans"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
library(emmeans) # For estimated marginal means and post-hoc tests
# Create boxplot with jittered points of stress levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = tpstress)) +
  geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) +   # Boxplot without default outliers
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") +      # Individual points (jittered)
  scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
  labs(
    title = "Stress Levels by Age Group",
    x = "Age Group",
    y = "Total Stress Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since colors only represent age group
# Function for comprehensive ANOVA analysis
analyze_group_differences <- function(data, dv, group_var) {
  # Descriptive statistics
  desc <- data %>%
    group_by(!!sym(group_var)) %>%
    get_summary_stats(!!sym(dv), type = "common")

  # Assumption checks
  normality <- data %>%
    group_by(!!sym(group_var)) %>%
    shapiro_test(!!sym(dv))

  # Levene's test
  homogeneity <- leveneTest(as.formula(paste(dv, "~", group_var)), data = data)

  # ANOVA
  anova <- anova_test(data = data, as.formula(paste(dv, "~", group_var)))

  # Welch's ANOVA
  welch <- welch_anova_test(data = data, as.formula(paste(dv, "~", group_var)))

  # Post-hoc tests
  posthoc <- games_howell_test(data = data, as.formula(paste(dv, "~", group_var)))

  # Effect size
  model <- aov(as.formula(paste(dv, "~", group_var)), data = data)
  effect <- effectsize::eta_squared(model)

  # Return results
  list(
    descriptives = desc,
    normality_test = normality,
    variance_test = homogeneity,
    anova_test = anova,
    welch_test = welch,
    posthoc = posthoc,
    effect_size = effect
  )
}

# Run complete analysis
stress_analysis <- analyze_group_differences(
  survey_data_full,
  "tpstress",
  "agegp3"
)

# Display results
print(stress_analysis)
Creating Powerful Visualizations
We can create publication-ready plots that combine statistical information:
# Create plot with statistical annotations
ggboxplot(survey_data_full,
x = "agegp3",
y = "tpstress",
color = "agegp3") +
stat_compare_means(method = "anova") +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Optimism Levels by Age Group with ANOVA Results",
y = "Total Stress Score",
x = "Age Group") +
theme(legend.position = "none")
Writing Up Results
Here’s how to write up the results in APA format:
# Function to create formatted output
report_anova_results <- function(analysis) {
  with(analysis, {
    cat("Results:\n")
    cat("\nDescriptive Statistics:\n")
    print(descriptives)
    cat("\nANOVA Results:\n")
    print(anova_test)
    cat("\nEffect Size:\n")
    print(effect_size)
    cat("\nPost-hoc Analysis:\n")
    print(posthoc)
  })
}

report_anova_results(stress_analysis)
4.4 Step F4: A Step-by-Step Guide to Interpreting Results
When interpreting ANOVA results, follow these steps:
- Check Assumptions
- Review Levene’s and Brown-Forsythe test results.
- Consider whether observations are truly independent.
- ANOVA can handle unequal sample sizes, but you should pay extra attention to the homogeneity of variance assumption.
- Examine the boxplots for outliers and distribution shape. If the data isn’t normally distributed, consider using the non-parametric Kruskal-Wallis test:
# Example of non-parametric alternative
kruskal.test(toptim ~ agegp3, data = survey_data_full)
- Examine the ANOVA Results
- Look at the F-statistic and p-value
- Consider which ANOVA version to trust (standard or Welch’s)
- Examine effect sizes
- Interpret Post-hoc Tests
- Look for significant pairwise differences
- Consider the practical meaning of the differences
- Think about the confidence intervals
4.5 Step F5: Writing Up Results
Here’s how to write up ANOVA results in APA format:
“A one-way between-groups analysis of variance was conducted to explore the impact of age on optimism levels. Participants were divided into three groups according to their age (Group 1: 18-29 years; Group 2: 30-44 years; Group 3: 45+ years). There was a [significant/non-significant] difference in optimism scores for the three age groups: F(2, XXX) = XX.XX, p = .XXX. The effect size, calculated using eta squared, was .XX. Post-hoc comparisons using the Tukey HSD test indicated that the mean score for Group 1 (M = XX.XX, SD = XX.XX) was significantly different from Group 2 (M = XX.XX, SD = XX.XX)…”
5 Exercise G: Chi-Square Test
Chi-Square tests help us analyze categorical data - information that falls into distinct groups or categories. Think of these tests as ways to determine whether the patterns we see in our data are meaningful or just due to chance. We’ll explore two main types of Chi-Square tests:
- Chi-Square Test for Goodness of Fit: Compares our observed data to what we would expect based on some theory or known population values.
- Chi-Square Test for Independence: Examines whether two categorical variables are related to each other.
5.1 Step G1: Chi-Square Test for Goodness of Fit (Smoking Rates)
Imagine you’re wondering whether the proportion of smokers in your sample matches what you’d expect based on national statistics. This is exactly what a goodness of fit test helps us determine.
In this example, we check whether our sample’s smoking rate matches the expected population rate of 20%:
# First, let's look at our data
smoker_table <- table(survey_data_full$smoke)
print("Observed frequencies:")
print(smoker_table)

# Calculate percentages
smoker_props <- prop.table(smoker_table) * 100
print("\nObserved percentages:")
print(round(smoker_props, 1))

# Perform Chi-Square test
# Expected proportions (20% smokers, 80% non-smokers)
expected_props <- c(0.2, 0.8)

chisq_result <- chisq.test(smoker_table, p = expected_props)

# Display detailed results
print("\nChi-Square Test Results:")
print(chisq_result)

# Calculate and display expected frequencies
n_total <- sum(smoker_table)
expected_freq <- n_total * expected_props

print("\nComparison of observed vs expected frequencies:")
comparison_df <- data.frame(
  Category = c("Smokers", "Non-smokers"),
  Observed = as.numeric(smoker_table),
  Expected = expected_freq,
  Difference = as.numeric(smoker_table) - expected_freq
)
print(comparison_df)
5.2 Step G2: Interpreting Goodness of Fit Results
The Chi-Square test gives us several pieces of information:
- The Chi-Square statistic (χ²): Measures how different our observed values are from what we expected
- Degrees of freedom (df): The number of categories minus 1
- p-value: Tells us whether any difference is statistically significant
In our example:
- If p < 0.05: The smoking rate in our sample is significantly different from 20%
- If p ≥ 0.05: We don’t have evidence that our sample differs from the expected 20%
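To make these pieces concrete, here is a minimal sketch (using the smoker_table and expected_props objects created in Step G1) that recomputes the chi-square statistic by hand and turns the p-value into a plain-language conclusion:
# A minimal sketch using objects from Step G1
observed <- as.numeric(smoker_table)
expected <- sum(smoker_table) * expected_props

chi_sq_by_hand <- sum((observed - expected)^2 / expected)
cat("Chi-square computed by hand:", round(chi_sq_by_hand, 3), "\n")

# State a plain-language conclusion from the p-value
p_value <- chisq.test(smoker_table, p = expected_props)$p.value
if (p_value < 0.05) {
  cat("The sample smoking rate differs significantly from 20% (p =", round(p_value, 3), ")\n")
} else {
  cat("No evidence that the sample smoking rate differs from 20% (p =", round(p_value, 3), ")\n")
}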
Modern R packages provide enhanced capabilities for conducting and visualizing Chi-Square analyses. We’ll use several packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "rstatix", "ggstatsplot", "knitr"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(rstatix) # For statistical analysis
library(ggstatsplot) # For statistical visualization
library(knitr) # For dynamic report generation
Enhanced Visualization of Observed vs Expected
The ggstatsplot
package creates informative visualizations that include statistical results:
# Create enhanced pie chart with statistical results
ggpiestats(
data = survey_data_full,
x = smoke,
title = "Observed vs Expected Smoking Rates",
xlab = "Smoking Status",
ylab = "Count",
results.subtitle = TRUE,
subtitle = "Testing against expected 20% smoking rate",
type = "parametric",
paired = FALSE,
ratio = c(0.2, 0.8) # Expected proportions for the goodness-of-fit test
)
Here is an alternative approach to visualizing the data and results in a bar chart using ggplot2
:
# Alternative approach using ggplot2
smoker_counts <- table(survey_data_full$smoke)
smoker_props <- prop.table(smoker_counts)
expected_props <- c(0.2, 0.8)
expected_counts <- sum(smoker_counts) * expected_props

# Create data frame for plotting
plot_data <- data.frame(
  Status = names(smoker_counts),
  Observed = as.vector(smoker_counts),
  Expected = expected_counts
)
# Plot
ggplot(plot_data, aes(x = Status)) +
geom_bar(aes(y = Observed), stat = "identity", fill = "steelblue", alpha = 0.7) +
geom_point(aes(y = Expected), color = "red", size = 3) +
geom_line(aes(y = Expected, group = 1), color = "red", linetype = "dashed") +
theme_minimal() +
labs(title = "Observed vs Expected Smoking Rates",
subtitle = paste("Chi-square test: p =",
round(chisq.test(smoker_counts, p = expected_props)$p.value, 3)),
x = "Smoking Status",
y = "Count")
Comprehensive Statistical Analysis
The rstatix
package provides clear statistical output:
# Filter to only include Yes/No responses
filtered_data <- survey_data_full %>%
  filter(smoke %in% c("Yes", "No"))

# Now proceed with calculating observed frequencies on filtered data
obs_freq <- filtered_data %>%
  count(smoke) %>%
  mutate(
    prop = n / sum(n),                          # Observed proportions
    expected_prop = c(0.2, 0.8),                # Expected proportions
    expected_n = sum(n) * expected_prop,        # Expected counts
    contrib = (n - expected_n)^2 / expected_n   # Contribution to Chi-square
  )

# Create a contingency table (observed vs expected)
contingency_table <- data.frame(
  Category = obs_freq$smoke,
  Observed = obs_freq$n,
  Expected = obs_freq$expected_n
)

# Perform Chi-square goodness-of-fit test using `chisq_test()`
# (the observed counts are tested against the expected proportions)
chisq_result <- chisq_test(contingency_table$Observed, p = c(0.2, 0.8))

# Add significance stars
chisq_result <- chisq_result %>%
  add_significance()

# Display observed vs expected frequencies
obs_freq %>%
  kable(digits = 3, caption = "Observed vs. Expected Smoking Frequencies")

# Display results
chisq_result
Advantages of Modern Packages
- Enhanced Visualization
- One-click visualization of observed vs expected frequencies
- Automated proportion plots with error bars
- Clear representation of deviations from expected values
- Clearer Output
- Formatted tables comparing observed and expected frequencies
- Clear presentation of test statistics
- Automatic computation of standardized residuals
- Additional Tools
- Multiple effect size measures (Cohen’s w, Cramer’s V)
- Power analysis capabilities
- Easy comparison with multiple expected distributions
- Better Integration
- Simple workflow from data to visualization
- Easy export of results for reporting
- Consistent syntax across analyses
Best Practices for Reporting
- Report:
- Sample size and degrees of freedom
- Expected proportions and their source/justification
- Chi-square statistic and exact p-value
- Effect size (typically Cohen's w; see the sketch after this list)
- Visual comparison of observed vs expected frequencies
- Include:
- Clear statement of the null hypothesis
- Table of observed and expected frequencies
- Standardized residuals for each category
- Power analysis if the result is non-significant
- Consider:
- Whether categories should be combined
- If sample size is adequate
- Alternative tests for small samples
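As a rough illustration of the effect size and power points above, here is a minimal sketch. Cohen's w is computed from the chi-square statistic and the sample size; the power calculation assumes the pwr package is installed (it is not loaded elsewhere in this lab), so treat that part as optional:
# A minimal sketch: Cohen's w from the goodness-of-fit test in Step G1
gof_test <- chisq.test(smoker_table, p = expected_props)
n_total  <- sum(smoker_table)
cohens_w <- sqrt(as.numeric(gof_test$statistic) / n_total)
cat("Cohen's w:", round(cohens_w, 3), "\n")

# Optional power check; assumes the pwr package is available
# install.packages("pwr")
library(pwr)
pwr.chisq.test(w = cohens_w, N = n_total, df = 1, sig.level = 0.05)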
5.3 Step G3: Chi-Square Test for Independence (Smoking and Gender)
Now let’s examine whether there’s a relationship between smoking and gender. This test helps us determine if two categorical variables are independent of each other.
First, let’s create a cross-tabulation and examine our data:
# Create cross-tabulation
cross_tab <- table(survey_data_full$sex, survey_data_full$smoke)

# Add row and column names for clarity
dimnames(cross_tab) <- list(
  Gender = c("Male", "Female"),
  Smoking = c("Smoker", "Non-smoker")
)

# Display the cross-tabulation
print("Cross-tabulation of Gender and Smoking:")
print(cross_tab)

# Calculate and display percentages by gender
prop_table <- prop.table(cross_tab, margin = 1) * 100
print("\nPercentages within each gender:")
print(round(prop_table, 1))
Before performing the Chi-Square test of independence, we should check if our data meets the assumptions:
# Calculate expected frequencies
expected <- chisq.test(cross_tab)$expected
print("\nExpected frequencies:")
print(round(expected, 2))

# Check if any expected frequencies are less than 5
min_expected <- min(expected)

# Print result using cat()
cat("\nMinimum expected frequency:", min_expected, "\n")

# Print additional warning if expected frequencies are less than 5
if(min_expected < 5) {
  print("Warning: Some expected frequencies are less than 5!")
}
Now we can perform the Chi-Square test of independence:
# Perform Chi-Square test with continuity correction (for 2x2 tables)
chi_result <- chisq.test(cross_tab, correct = TRUE)
print("\nChi-Square Test Results:")
print(chi_result)

# Calculate effect size (Cramer's V)
n <- sum(cross_tab)
cramer_v <- sqrt(chi_result$statistic / (n * (min(dim(cross_tab)) - 1)))
cat("\nCramer's V:", round(cramer_v, 3))
To understand effect sizes in a 2x2 table, we can interpret Cramer’s V as follows:
- 0.1: Small effect
- 0.3: Medium effect
- 0.5: Large effect
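These cut-offs can be wrapped in a small helper so the interpretation is printed alongside the statistic. A minimal sketch, using the cramer_v value computed above:
# A minimal sketch: label the effect size using the thresholds above
interpret_cramers_v <- function(v) {
  if (v >= 0.5) "large" else if (v >= 0.3) "medium" else if (v >= 0.1) "small" else "negligible"
}

cat("Cramer's V =", round(as.numeric(cramer_v), 3),
    "-", interpret_cramers_v(as.numeric(cramer_v)), "effect\n")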
Standardized residuals help us understand which cells contribute most to any significant results:
# Calculate adjusted standardized residuals
stdres <- chisq.test(cross_tab)$stdres
dimnames(stdres) <- dimnames(cross_tab)
print("\nAdjusted standardized residuals:")
print(round(stdres, 2))
Let’s create some helpful visualizations:
# Create barplot of smoking rates by gender
barplot(prop_table,
beside = TRUE,
main = "Smoking Status by Gender",
xlab = "Smoking Status",
ylab = "Percentage",
col = c("lightblue", "pink"),
legend.text = TRUE)
We can also compute the proportions manually and then plot them with ggplot2
:
# Create contingency table of counts
table_counts <- table(survey_data_full$sex, survey_data_full$smoke)

# Convert counts to proportions (row-wise percentages)
prop_table <- prop.table(table_counts, margin = 1) * 100 # Convert to percentage

# Convert to data frame (since ggplot2 requires a data frame)
gender <- rep(rownames(prop_table), times = ncol(prop_table))
smoke <- rep(colnames(prop_table), each = nrow(prop_table))
percentage <- as.vector(prop_table)

plot_data <- data.frame(gender, smoke, percentage)
# Create bar plot with ggplot2
ggplot(plot_data, aes(x = smoke, y = percentage, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") + # Grouped bars
labs(
title = "Smoking Status by Gender",
x = "Smoking Status",
y = "Percentage"
  ) +
  scale_fill_manual(values = c("lightblue", "pink")) + # Custom colors
  theme_minimal()
Modern R packages provide enhanced capabilities for conducting and visualizing Chi-Square analyses. We’ll use several packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "rstatix", "ggstatsplot", "janitor",
"effectsize"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(rstatix) # For statistical analysis
library(ggstatsplot) # For statistical visualization
library(janitor) # For examining and cleaning data
library(effectsize) # For effect size calculations
Enhanced Cross-Tabulation Visualization with ggstatsplot
# Create an enhanced visualization of the relationship
ggbarstats(
data = survey_data_full,
x = sex,
y = smoke,
title = "Relationship between Gender and Smoking Status",
xlab = "Gender",
ylab = "Proportion",
label = "percentage",
results.subtitle = TRUE
)
Comprehensive Analysis with rstatix
# Create a contingency table (cross-tabulation of sex and smoking status)
contingency_table <- table(survey_data_full$sex, survey_data_full$smoke)

# Perform Chi-square test for independence
independence_test <- chisq_test(contingency_table) %>%
  add_significance()

# Calculate effect size (Cramér's V)
independence_effect <- cramer_v(contingency_table)
# Display results
list(
"Chi-square Test Results" = independence_test,
"Effect Size (Cramér's V)" = independence_effect
)
Post-hoc Analysis
If we find significant associations, we can examine standardized residuals:
# Create contingency table
contingency_table <- table(survey_data_full$sex, survey_data_full$smoke)

# Perform Chi-square test using base R (needed for residuals)
chisq_result <- chisq.test(contingency_table)

# Extract standardized residuals
residuals_matrix <- chisq_result$stdres # Standardized residuals matrix

# Convert matrix to a data frame
residuals_analysis <- as.data.frame(as.table(residuals_matrix))
colnames(residuals_analysis) <- c("sex", "smoke", "residual")
# Create heatmap of standardized residuals
ggplot(residuals_analysis, aes(x = sex, y = smoke, fill = residual)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white") +
geom_text(aes(label = round(residual, 2))) +
theme_minimal() +
labs(
title = "Standardized Residuals from Chi-Square Test",
fill = "Residual"
)
Effect Size Visualization
The effectsize
package helps understand practical significance:
# Calculate Cramer's V effect size
effect_size <- effectsize::cramers_v(
  table(survey_data_full$sex, survey_data_full$smoke)
)
# Print result
effect_size
Creating a Complete Report
We can create a comprehensive analysis function:
analyze_categorical <- function(data, var1, var2 = NULL) {

  if (is.null(var2)) {
    # One-variable analysis
    result <- list(

      # Frequency table
      frequency = data %>%
        count(!!sym(var1)) %>%
        mutate(proportion = n / sum(n)),

      # Bar chart visualization
      visualization = ggbarstats(
        data = data,
        x = !!sym(var1)
      ),

      # Chi-square goodness-of-fit test
      test = data %>%
        count(!!sym(var1)) %>%
        pull(n) %>%
        chisq.test() %>%
        broom::tidy() %>%
        add_significance(),

      # Effect size (not applicable for one variable)
      effect = NA
    )

  } else {
    # Two-variable independence test
    contingency_table <- table(data[[var1]], data[[var2]]) # Create contingency table

    result <- list(

      # Cross-tabulation (contingency table)
      frequency = as.data.frame.matrix(contingency_table),

      # Bar chart visualization
      visualization = ggbarstats(
        data = data,
        x = !!sym(var1),
        y = !!sym(var2)
      ),

      # Chi-square independence test
      test = chisq_test(contingency_table) %>%
        add_significance(),

      # Cramér's V effect size
      effect = cramer_v(contingency_table)
    )
  }

  return(result)
}

# Example usage
smoking_analysis <- analyze_categorical(
  data = survey_data_full,
  var1 = "sex",
  var2 = "smoke"
)

# Display results
smoking_analysis
Advantages of Modern Packages
- Enhanced Visualization
- Mosaic plots with statistical annotations
- Heat maps of associations
- Interactive contingency tables
- Clearer Output
- Well-formatted contingency tables
- Clear presentation of cell percentages
- Highlighted significant associations
- Additional Tools
- Multiple association measures (Phi, Cramer’s V, Lambda)
- Post-hoc cell-wise tests
- Assumption checking for expected frequencies
- Better Integration
- Streamlined workflow for multiple variables
- Easy integration with other categorical analyses
- Consistent reporting formats
Best Practices for Reporting
- Report:
- Complete contingency table with row and column totals
- Chi-square statistic and exact p-value
- Effect size (Cramer’s V or Phi for 2x2 tables)
- Degrees of freedom
- Smallest expected frequency
- Include:
- Cross-tabulation with percentages
- Standardized residuals for significant results
- Visual representation of the relationship
- Confidence intervals for proportions
- Consider:
- Cell size assumptions
- Need for Fisher's exact test (see the sketch after this list)
- Multiple comparison corrections
- Direction and strength of associations
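For the Fisher's exact test mentioned above (useful when expected cell counts are small), base R provides fisher.test(). A minimal sketch, using the cross_tab table created in Step G3:
# A minimal sketch: Fisher's exact test as an alternative for small samples
fisher_result <- fisher.test(cross_tab)
print(fisher_result)

# For a 2x2 table the odds ratio and its confidence interval are often reported alongside p
cat("Odds ratio:", round(fisher_result$estimate, 2),
    "95% CI:", round(fisher_result$conf.int, 2), "\n")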
5.4 Step G4: Writing Up Results
Here’s how to write up the results of a goodness of fit test in a formal report:
“A chi-square goodness of fit test was performed to determine whether the sample’s smoking rate differed from the expected population rate of 20%. The test revealed [insert significance statement based on results]. The observed rate was [X]% compared to the expected 20%.”
And how to write up the results of an independence test in a formal report:
“A chi-square test of independence was conducted to examine the relationship between gender and smoking status. All expected cell frequencies were greater than 5. The test revealed [insert significance statement based on results]. Among males, [X]% were smokers compared to [Y]% of females.”
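The bracketed values in both templates can be filled in from the objects created earlier. A minimal sketch for the independence write-up, assuming chi_result, cross_tab, and prop_table from Step G3:
# A minimal sketch: pull the values needed for the independence write-up
chi_stat <- round(as.numeric(chi_result$statistic), 2)
chi_df   <- as.integer(chi_result$parameter)
chi_p    <- round(chi_result$p.value, 3)

cat(sprintf("chi-square(%d, N = %d) = %.2f, p = %.3f\n",
            chi_df, sum(cross_tab), chi_stat, chi_p))

# Percentages of smokers within each gender come from prop_table
print(round(prop_table, 1))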
6 Exercise H: Exploring the Staff Survey Data
6.1 Step H1: Setup
You can continue working in the same project or working directory that you created in the previous lab and have been working in for the past exercises. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your
scripts
folder with an appropriate name, e.g.,Lab2_Exercise_H.R
.
Try to start with the empty script that you created above (Lab2_Exercise_H.R
) and work through this exercise before looking at the solutions. But you can also download the R script for Exercise H and save the downloaded file in the scripts
folder.
Now you are ready to begin your work in R and continue with Step H2!
6.2 Step H2: Loading the Dataset
Please download the dataset and download the code book and save both in the data
folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
<- read.csv("data/lab2-staff-survey.csv") staff_survey_data
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(staff_survey_data)
# Examine the structure
str(staff_survey_data)
# Get basic summary statistics
summary(staff_survey_data)
In line with the code book, we want to assign labels to the variables in the staff_survey_data
data frame using factors with labeled levels. This method ensures the data remains categorical but with human-readable labels for easier interpretation and analysis.
# First, let's create vectors for our common level labels since they're reused often
<- c("not at all", "to a slight extent", "to a moderate extent",
extent_levels "to a great extent", "to a very great extent")
<- c("not important", "slightly important", "moderately important",
importance_levels "very important", "extremely important")
# Create age groups labels
$age <- factor(staff_survey_data$age,
staff_survey_datalevels = 1:5,
labels = c("under 20", "21 to 30", "31 to 40",
"41 to 50", "over 50"))
# Employment status
$employstatus <- factor(staff_survey_data$employstatus,
staff_survey_datalevels = 1:2,
labels = c("permanent", "casual"))
# Now let's handle the repeated patterns for Q1a through Q10a
# We'll use a loop to avoid repetitive code
for(i in 1:10) {
# Convert "extent" questions (Q1a through Q10a)
paste0("Q", i, "a")]] <- factor(
staff_survey_data[[paste0("Q", i, "a")]],
staff_survey_data[[levels = 1:5,
labels = extent_levels
)
# Convert "importance" questions (Q1imp through Q10imp)
paste0("Q", i, "imp")]] <- factor(
staff_survey_data[[paste0("Q", i, "imp")]],
staff_survey_data[[levels = 1:5,
labels = importance_levels
)
}
# Finally, convert the recommend variable
staff_survey_data$recommend <- factor(staff_survey_data$recommend,
                                      levels = c(0, 1),
                                      labels = c("no", "yes"))

# Let's add a verification step to check our work
# This will print the levels of each variable to confirm they're correctly labeled
verify_factors <- function(data) {
  for(col in names(data)) {
    if(is.factor(data[[col]])) {
      cat("\nLevels for", col, ":\n")
      print(levels(data[[col]]))
    }
  }
}
# Run the verification
verify_factors(staff_survey_data)
The summary()
function can also be used to confirm that the labels have been applied correctly to the variables.
# Verify changes by printing a summary
summary(staff_survey_data)
Try working through the following steps and answer the questions before looking at the solutions.
6.3 Step H3: Frequency Tables
Create frequency tables of the demographic variables city
, service
, and employstatus
.
Note: The solution is provided in the callout notes below. We’ll explore the staff_survey_data
using only base R functions (orange callout note) and also provide solutions using R packages (in the usual green callout notes). While modern packages offer many conveniences, understanding base R approaches provides a solid foundation and helps us appreciate the underlying statistical processes.
Let’s begin by examining the distribution of our key demographic variables. Base R provides several functions for creating and formatting frequency tables:
# Create frequency table for city
city_table <- table(staff_survey_data$city)
city_props <- prop.table(city_table) * 100

# Combine counts and percentages
city_summary <- cbind(
  Frequency = as.vector(city_table),
  Percentage = as.vector(city_props)
)
rownames(city_summary) <- names(city_table)

# Display the results with formatting
print("Distribution of Staff by City:")
print(round(city_summary, 1))

# Create frequency table for employment status
status_table <- table(staff_survey_data$employstatus)
status_props <- prop.table(status_table) * 100

status_summary <- cbind(
  Frequency = as.vector(status_table),
  Percentage = as.vector(status_props)
)
rownames(status_summary) <- names(status_table)

print("\nDistribution of Employment Status:")
print(round(status_summary, 1))

# Create frequency table for service
# First, let's create reasonable bins for years of service
service_breaks <- seq(0, max(staff_survey_data$service, na.rm = TRUE) + 5, by = 5)
service_cats <- cut(staff_survey_data$service,
                    breaks = service_breaks,
                    include.lowest = TRUE)

service_table <- table(service_cats)
service_props <- prop.table(service_table) * 100

service_summary <- cbind(
  Frequency = as.vector(service_table),
  Percentage = as.vector(service_props)
)
rownames(service_summary) <- names(service_table)

print("\nDistribution of Years of Service:")
print(round(service_summary, 1))
When interpreting these results, consider:
- Look for notable imbalances in group sizes
- Consider whether distributions match expectations
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
First, we’ll create clear, informative frequency tables for our key demographic variables:
# Function to create a nicely formatted frequency table
create_freq_table <- function(data, variable) {
  data %>%
    count({{variable}}) %>%
    mutate(Percentage = n/sum(n) * 100) %>%
    kable(digits = 1,
          col.names = c("Category", "Count", "Percentage"),
          caption = paste("Distribution of", deparse(substitute(variable)))) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
}
# Create tables for each demographic variable
create_freq_table(staff_survey_data, city)
create_freq_table(staff_survey_data, service)
create_freq_table(staff_survey_data, employstatus)
6.4 Step H4: Histogram
Create a histogram of the demographic variable service
and look for outliers.
We’ll create a histogram to understand the distribution of service years and identify potential outliers:
# Create histogram
hist(staff_survey_data$service,
main = "Distribution of Years of Service",
xlab = "Years of Service",
ylab = "Frequency",
breaks = 20,
col = "lightblue",
border = "white")
# Add a box plot for outlier detection
boxplot(staff_survey_data$service,
horizontal = TRUE,
main = "Box Plot of Years of Service",
xlab = "Years of Service",
col = "lightblue")
# Reset plotting parameters
par(mfrow = c(1, 1))
# Calculate summary statistics
service_summary_stats <- summary(staff_survey_data$service)
service_sd <- sd(staff_survey_data$service, na.rm = TRUE)

# Print summary statistics
print("\nSummary Statistics for Years of Service:")
print(service_summary_stats)
print(paste("Standard Deviation:", round(service_sd, 2)))

# Identify potential outliers using the 1.5 * IQR rule
Q1 <- quantile(staff_survey_data$service, 0.25, na.rm = TRUE)
Q3 <- quantile(staff_survey_data$service, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
outlier_threshold_upper <- Q3 + 1.5 * IQR
outlier_threshold_lower <- Q1 - 1.5 * IQR

outliers <- staff_survey_data$service[staff_survey_data$service > outlier_threshold_upper |
                                      staff_survey_data$service < outlier_threshold_lower]

if(length(outliers) > 0) {
  print("\nPotential outliers identified:")
  print(sort(outliers))
}
When interpreting these results, consider:
- Note the shape of the distribution
- Identify any unusual patterns or gaps
- Consider whether outliers represent errors or true values
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
Let’s create a detailed histogram to understand the distribution of years of service and identify any potential outliers:
# Create histogram with density curve
ggplot(staff_survey_data, aes(x = service)) +
geom_histogram(aes(y = ..density..),
bins = 30,
fill = "lightblue",
color = "black") +
geom_density(color = "red") +
theme_minimal() +
labs(x = "Years of Service",
y = "Density",
title = "Distribution of Years of Service",
subtitle = "With density curve overlay") +
# Add boxplot for outlier visualization
geom_boxplot(aes(y = -0.02), width = 0.1)
# Calculate summary statistics for service
service_summary <- staff_survey_data %>%
  summarize(
    Mean = mean(service, na.rm = TRUE),
    Median = median(service, na.rm = TRUE),
    SD = sd(service, na.rm = TRUE),
    Q1 = quantile(service, 0.25, na.rm = TRUE),
    Q3 = quantile(service, 0.75, na.rm = TRUE),
    IQR = IQR(service, na.rm = TRUE),
    Min = min(service, na.rm = TRUE),
    Max = max(service, na.rm = TRUE)
  )

# Print summary statistics
kable(service_summary, digits = 1,
      caption = "Summary Statistics for Years of Service") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

# Identify potential outliers
outliers <- staff_survey_data %>%
  filter(service > (service_summary$Q3 + 1.5 * service_summary$IQR) |
         service < (service_summary$Q1 - 1.5 * service_summary$IQR))
Looking at the distribution of years of service, we can observe:
- The shape of the distribution (whether it’s symmetric or skewed)
- Any unusual patterns or gaps
- Potential outliers, which we’ve identified using the 1.5 × IQR rule
6.5 Step H5: Cross-Tabulation
Cross-tabulate and study the age groups (agerecode
) against employstatus
.
Let’s examine the relationship between age groups and employment status using a cross-tabulation:
# Create cross-tabulation
age_employ_table <- table(staff_survey_data$agerecode,
                          staff_survey_data$employstatus)

# Calculate row percentages
age_employ_props <- prop.table(age_employ_table, margin = 1) * 100

# Print the results
print("Counts by Age Group and Employment Status:")
print(age_employ_table)
print("\nPercentages within Age Groups:")
print(round(age_employ_props, 1))

# Create a visual representation using base R
# Set up the plotting area
barplot(t(age_employ_props),
        beside = TRUE,
        col = c("lightblue", "lightgreen"),
        main = "Employment Status by Age Group",
        xlab = "Age Group",
        ylab = "Percentage",
        legend.text = colnames(age_employ_props))

# Perform chi-square test of independence
chi_sq_test <- chisq.test(age_employ_table)
print("\nChi-square test of independence:")
print(chi_sq_test)
When interpreting these results, consider:
- Look for patterns in employment type across age groups
- Consider the chi-square test results and an accompanying effect size (see the sketch after this list)
- Think about implications for workforce planning
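If the chi-square test is significant, an effect size helps gauge how strong the association is. A minimal sketch computing Cramer's V from the age_employ_table and chi_sq_test objects created above:
# A minimal sketch: Cramer's V for the age group by employment status table
n_staff <- sum(age_employ_table)
cramers_v_age <- sqrt(as.numeric(chi_sq_test$statistic) /
                        (n_staff * (min(dim(age_employ_table)) - 1)))
cat("Cramer's V for age group by employment status:", round(cramers_v_age, 3), "\n")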
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
Let’s examine the relationship between age groups and employment status:
# Create cross-tabulation
age_employ_table <- table(staff_survey_data$agerecode,
                          staff_survey_data$employstatus)

# Convert to percentages and format nicely
age_employ_props <- prop.table(age_employ_table, margin = 1) * 100

# Combine counts and percentages in a nice table
age_employ_combined <- cbind(
  as.data.frame.matrix(age_employ_table),
  as.data.frame.matrix(age_employ_props)
) %>%
  setNames(c("Permanent (n)", "Casual (n)",
             "Permanent (%)", "Casual (%)"))

kable(age_employ_combined,
      digits = 1,
      caption = "Age Groups by Employment Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
# Create a visualization
ggplot(staff_survey_data,
aes(x = agerecode, fill = employstatus)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(x = "Age Group",
y = "Proportion",
fill = "Employment Status",
title = "Employment Status Distribution by Age Group") +
coord_flip()
6.6 Step H6: Total Staff Satisfaction
How do you think total staff satisfaction (totsatis
) is calculated? Is there a difference in Total staff satisfaction among the two staff statuses? (Hint: t-test)
To understand how total staff satisfaction (totsatis
) might be calculated, let’s examine its relationship with individual satisfaction items:
# Identify satisfaction-related variables (Q*a)
<- grep("Q.*a$", names(staff_survey_data), value = TRUE)
satisfaction_vars
# Calculate correlations with total satisfaction
<- sapply(staff_survey_data[satisfaction_vars],
correlations function(x) cor(x, staff_survey_data$totsatis,
use = "complete.obs"))
# Print correlations
print("Correlations with Total Satisfaction:")
print(round(correlations, 3))
# Create scatterplot matrix of selected variables
pairs(staff_survey_data[c(satisfaction_vars[1:5], "totsatis")],
main = "Relationships between Satisfaction Measures")
Let’s examine whether there are differences in total satisfaction between permanent and casual staff:
# Calculate descriptive statistics by group
tapply(staff_survey_data$totsatis,
       staff_survey_data$employstatus,
       function(x) c(n = length(x),
                     mean = mean(x, na.rm = TRUE),
                     sd = sd(x, na.rm = TRUE)))
# Create box plot for visual comparison
boxplot(totsatis ~ employstatus,
data = staff_survey_data,
main = "Total Satisfaction by Employment Status",
xlab = "Employment Status",
ylab = "Total Satisfaction Score",
col = "lightblue")
# Add individual points for better visualization
stripchart(totsatis ~ employstatus,
data = staff_survey_data,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = "darkgray")
# Perform t-test
satisfaction_ttest <- t.test(totsatis ~ employstatus,
                             data = staff_survey_data)

# Print t-test results
print("\nt-test Results:")
print(satisfaction_ttest)

# Calculate effect size (Cohen's d)
group1 <- staff_survey_data$totsatis[staff_survey_data$employstatus == "permanent"]
group2 <- staff_survey_data$totsatis[staff_survey_data$employstatus == "casual"]

cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx-1)*var(x) + (ny-1)*var(y))/(nx+ny-2))
  abs(mean(x) - mean(y))/pooled_sd
}

effect_size <- cohens_d(group1, group2)
print(paste("\nEffect size (Cohen's d):", round(effect_size, 3)))
When interpreting these results, consider:
- Consider both statistical significance (p-value) and practical significance (effect size; see the sketch after this list)
- Look at the distribution of scores within each group
- Think about potential confounding variables
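A common rule of thumb for Cohen's d (roughly 0.2 small, 0.5 medium, 0.8 large) can be applied to the effect_size value computed above. A minimal sketch:
# A minimal sketch: label Cohen's d using conventional thresholds
interpret_d <- function(d) {
  if (d >= 0.8) "large" else if (d >= 0.5) "medium" else if (d >= 0.2) "small" else "negligible"
}
cat("Cohen's d =", round(effect_size, 3), "-", interpret_d(effect_size), "effect\n")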
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
The total staff satisfaction score (totsatis
) is likely calculated as a composite measure of various satisfaction-related questions in the survey. Let’s examine its components and distribution:
# Examine the structure of satisfaction-related variables
<- names(staff_survey_data)[grep("Q.*a$", names(staff_survey_data))]
satisfaction_vars
# Look at correlations between these variables and totsatis
satisfaction_correlations <- staff_survey_data %>%
  select(all_of(c(satisfaction_vars, "totsatis"))) %>%
  mutate(across(everything(), as.numeric)) %>%  # factors must be numeric for cor()
  cor(use = "complete.obs")
# Print correlations with total satisfaction
kable(satisfaction_correlations["totsatis", ],
digits = 3,
caption = "Correlations with Total Satisfaction") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
To examine whether there are differences in total satisfaction between permanent and casual staff, we’ll conduct a t-test:
# First, let's look at descriptive statistics
satisfaction_by_status <- staff_survey_data %>%
  group_by(employstatus) %>%
  summarise(
    n = n(),
    mean = mean(totsatis, na.rm = TRUE),
    sd = sd(totsatis, na.rm = TRUE),
    se = sd/sqrt(n)
  )
# Print descriptive statistics
kable(satisfaction_by_status,
digits = 2,
caption = "Satisfaction Scores by Employment Status") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# Create visualization
ggplot(staff_survey_data, aes(x = employstatus, y = totsatis)) +
geom_boxplot(fill = "lightblue") +
geom_jitter(width = 0.2, alpha = 0.2) +
theme_minimal() +
labs(x = "Employment Status",
y = "Total Satisfaction Score",
title = "Distribution of Satisfaction Scores by Employment Status")
# Perform t-test
satisfaction_ttest <- t.test(totsatis ~ employstatus,
                             data = staff_survey_data)

# Calculate effect size (Cohen's d)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx-1)*var(x) + (ny-1)*var(y))/(nx+ny-2))
  abs(mean(x) - mean(y))/pooled_sd
}

effect_size <- with(staff_survey_data,
                    cohens_d(totsatis[employstatus == "permanent"],
                             totsatis[employstatus == "casual"]))

# Print t-test results
print(satisfaction_ttest)
cat("\nEffect size (Cohen's d):", round(effect_size, 3))
7 Exercise I: Logistic Regression
7.1 Step I1: Setup
You can continue working in the same project or working directory that you created in the previous lab and have been working in for the past exercises. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your
scripts
folder with an appropriate name, e.g.,Lab2_Exercise_I.R
.
You can either work through the following steps and copy/paste the respective code into the Lab2_Exercise_I.R
file that you will create in Step I1 or download the R script for Exercise I and follow the instructions below and save the downloaded file in the scripts
folder.
Now you are ready to begin your work in R and continue with Step I2!
7.2 Step I2: Loading the Dataset
Please download the dataset and save the file in the data
folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
<- read.csv("data/lab2-sleep-survey.csv") sleep_survey_data
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(sleep_survey_data)
# Examine the structure
str(sleep_survey_data)
# Get basic summary statistics
summary(sleep_survey_data)
We want to assign labels to some of the categorical variables in the sleep_survey_data
data frame using factors with labeled levels. This method ensures the data remains categorical but with human-readable labels for easier interpretation and analysis.
# Let's start by creating our factor labels in a clear, organized way.
# When working with binary variables (0/1), it's especially important to be
# consistent with our labeling approach.
# First, let's handle the binary (0/1) variables
# We'll create these first since they share the same structure
sleep_survey_data$sex <- factor(sleep_survey_data$sex,
                                levels = c(0, 1),
                                labels = c("female", "male"))

sleep_survey_data$probsleep <- factor(sleep_survey_data$probsleep,
                                      levels = c(0, 1),
                                      labels = c("no", "yes"))

sleep_survey_data$fallsleep <- factor(sleep_survey_data$fallsleep,
                                      levels = c(0, 1),
                                      labels = c("no", "yes"))

sleep_survey_data$staysleep <- factor(sleep_survey_data$staysleep,
                                      levels = c(0, 1),
                                      labels = c("no", "yes"))

# Now let's handle the categorical variables with multiple levels
# For these, we'll use more descriptive labels
sleep_survey_data$marital <- factor(sleep_survey_data$marital,
                                    levels = 1:4,
                                    labels = c("single",
                                               "married/defacto",
                                               "divorced",
                                               "widowed"))

sleep_survey_data$edlevel <- factor(sleep_survey_data$edlevel,
                                    levels = 1:5,
                                    labels = c("primary school",
                                               "secondary school",
                                               "trade training",
                                               "undergraduate degree",
                                               "postgraduate degree"))
# Let's add a verification step to make sure our conversions worked correctly
# This function will print the levels of each factor variable we created
verify_factors <- function(data) {
  cat("Checking factor levels for all converted variables:\n\n")

  # List of variables we converted
  factor_vars <- c("sex", "probsleep", "fallsleep", "staysleep",
                   "marital", "edlevel")

  for(var in factor_vars) {
    cat(paste0("Levels for ", var, ":\n"))
    print(levels(data[[var]]))
    cat("\n")
  }
}
# Run the verification
verify_factors(sleep_survey_data)
7.3 Step I3: Logistic Regression
7.3.1 What is Logistic Regression?
Logistic regression is a statistical method used when we want to predict a categorical outcome (like yes/no, pass/fail) based on one or more predictor variables. Think of it as answering questions like “What factors influence whether someone has sleep problems?” or “What characteristics predict if a student will pass an exam?”
We use logistic regression when:
- Our outcome variable is categorical (usually binary).
- We have multiple predictor variables (can be continuous or categorical).
- We want to understand both prediction and relationships.
In our sleep study, we’re trying to predict whether someone has sleep problems (yes/no) based on various characteristics and behaviors.
Understanding the Math (In Simple Terms)
While regular regression predicts actual values (like height or weight), logistic regression predicts the probability of belonging to a category (like having sleep problems). It uses a special S-shaped curve called a logistic function that keeps predictions between 0 and 1 (or 0% to 100% probability).
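To see what that S-shaped curve looks like, here is a minimal sketch that plots the logistic function p = 1 / (1 + exp(-x)) with base R:
# A minimal sketch of the logistic (S-shaped) curve
x <- seq(-6, 6, by = 0.1)
p <- 1 / (1 + exp(-x))   # probabilities are bounded between 0 and 1

plot(x, p, type = "l",
     main = "The Logistic Function",
     xlab = "Linear predictor (combination of the predictors)",
     ylab = "Predicted probability")
abline(h = c(0, 1), lty = 2, col = "grey")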
7.3.2 Initial Data Preparation
Let’s prepare our data and examine our sample:
# First, let's check our sample size and missing data
initial_sample <- nrow(sleep_survey_data)
complete_cases <- sum(complete.cases(sleep_survey_data[
  c("probsleep", "sex", "age", "fallsleep", "staysleep", "hrswknight")
]))
# Print sample information
cat("Total cases:", initial_sample, "\n")
cat("Complete cases:", complete_cases, "\n")
cat("Missing cases:", initial_sample - complete_cases, "\n")
# Check coding of categorical variables
cat("\nCoding of categorical variables:\n")
cat("\nSleep Problems (probsleep):\n")
print(table(sleep_survey_data$probsleep))
cat("\nSex:\n")
print(table(sleep_survey_data$sex))
cat("\nTrouble Falling Asleep (fallsleep):\n")
print(table(sleep_survey_data$fallsleep))
cat("\nTrouble Staying Asleep (staysleep):\n")
print(table(sleep_survey_data$staysleep))
7.3.3 Building the Logistic Regression Model
Now let’s create our model. We’ll use base R’s glm
(Generalized Linear Model) function with the logit link:
# Create the logistic regression model
sleep_model <- glm(probsleep ~ sex + age + fallsleep + staysleep + hrswknight,
                   family = binomial(link = "logit"),
                   data = sleep_survey_data)

# Display the summary of our model
summary_results <- summary(sleep_model)
print(summary_results)

# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(sleep_model))
conf_int <- exp(confint(sleep_model))

# Combine results in a readable format
results_table <- cbind(
  Estimate = round(coef(sleep_model), 3),
  "Odds Ratio" = round(odds_ratios, 3),
  "CI Lower" = round(conf_int[,1], 3),
  "CI Upper" = round(conf_int[,2], 3),
  "p-value" = round(summary_results$coefficients[,4], 3)
)

# Print results
cat("\nDetailed Results:\n")
print(results_table)

# Calculate model fit statistics
null_deviance <- sleep_model$null.deviance
model_deviance <- sleep_model$deviance
pseudo_r2 <- 1 - (model_deviance / null_deviance)

cat("\nModel Fit Statistics:\n")
cat("Null deviance:", round(null_deviance, 2), "\n")
cat("Model deviance:", round(model_deviance, 2), "\n")
cat("McFadden's Pseudo R²:", round(pseudo_r2, 3), "\n")
7.4 Step I4: Interpreting the Results
Let’s break down what these results tell us.
7.4.1 Model Convergence and Overall Fit
Our model successfully converged, which means it found a stable solution. The reduction in deviance from the null model (model with no predictors) to our final model indicates how much our predictors help explain sleep problems.
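This reduction in deviance can also be tested formally with a likelihood-ratio (chi-square) test against the null model. A minimal sketch using the sleep_model object fitted above:
# A minimal sketch: likelihood-ratio test of the full model against the null model
deviance_reduction <- sleep_model$null.deviance - sleep_model$deviance
df_difference     <- sleep_model$df.null - sleep_model$df.residual

lr_p_value <- pchisq(deviance_reduction, df = df_difference, lower.tail = FALSE)

cat("Deviance reduction:", round(deviance_reduction, 2),
    "on", df_difference, "df, p =", signif(lr_p_value, 3), "\n")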
7.4.2 Individual Predictors
Let’s examine each predictor’s contribution:
# Create a function to interpret odds ratios
interpret_or <- function(or, p_value) {
  direction <- if(or > 1) "increase" else "decrease"
  magnitude <- abs(1 - or) * 100
  significance <- if(p_value < 0.05) "significant" else "not significant"

  sprintf("%.1f%% %s in odds (%s)", magnitude, direction, significance)
}

# Print interpretations for each predictor
cat("Interpretation of Effects:\n\n")
for(i in 2:nrow(results_table)) {
  var_name <- rownames(results_table)[i]
  or <- results_table[i, "Odds Ratio"]
  p_val <- results_table[i, "p-value"]

  cat(var_name, ":", interpret_or(or, p_val), "\n")
}
7.4.3 Key Findings
Our analysis reveals that:
- Staying Asleep (p < 0.001): This is our strongest predictor. People who have trouble staying asleep are significantly more likely to report general sleep problems.
- Falling Asleep (p = 0.035): Difficulty falling asleep is also a significant predictor, though not as strong as - staying asleep.
- Hours of Sleep (p = 0.007): Each additional hour of sleep is associated with a decrease in the odds of reporting sleep problems.
- Age and Sex: Neither demographic variable significantly predicts sleep problems (p > 0.05).
7.4.4 Model Accuracy
We can examine how well our model predicts sleep problems:
# Calculate predicted probabilities
<- predict(sleep_model, type = "response")
predicted_probs <- ifelse(predicted_probs > 0.5, 1, 0)
predicted_class <- as.numeric(sleep_survey_data$probsleep) - 1
actual_class
# Create confusion matrix
<- table(Predicted = predicted_class, Actual = actual_class)
conf_matrix print("\nConfusion Matrix:")
print(conf_matrix)
# Calculate accuracy metrics
<- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy <- conf_matrix[2,2] / sum(conf_matrix[,2])
sensitivity <- conf_matrix[1,1] / sum(conf_matrix[,1])
specificity
cat("\nModel Performance Metrics:\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("Sensitivity:", round(sensitivity, 3), "\n")
cat("Specificity:", round(specificity, 3), "\n")
7.4.5 Practical Implications
Our analysis suggests several important practical implications:
- Sleep Maintenance: The strong effect of staying asleep suggests that interventions focused on sleep maintenance might be most beneficial.
- Sleep Duration: The significant effect of hours of sleep indicates that strategies to increase sleep duration could help reduce sleep problems.
- Universal Impact: The lack of significant age and sex effects suggests that sleep problems affect people regardless of these demographic factors.
7.4.6 Limitations and Considerations
When interpreting these results, consider:
- Self-reported data may be subject to recall bias.
- The binary nature of our outcome variable might oversimplify complex sleep issues.
- There might be other important predictors we haven’t included in the model.
Modern R packages provide enhanced capabilities for conducting and interpreting logistic regression analyses. We’ll use several packages that work together to create a comprehensive analysis while making the process more intuitive and the results more interpretable.
# Install packages if needed
install.packages(c("tidyverse", "broom", "performance", "sjPlot", "marginaleffects"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(broom) # For tidying model output
library(performance) # For model diagnostics and performance metrics
library(sjPlot) # For plotting odds ratios and model summaries
library(marginaleffects) # For calculating and plotting marginal effects
Understanding our Data
The tidyverse
package provides powerful tools for initial data exploration:
# Create our analysis dataset
sleep_data <- sleep_survey_data %>%
  select(probsleep, sex, age, fallsleep, staysleep, hrswknight) %>%
  # Convert to factors with meaningful labels if not already done
  mutate(across(c(probsleep, sex, fallsleep, staysleep),
                ~factor(., levels = c(0, 1),
                        labels = c("No", "Yes"))))

# Examine our data structure
glimpse(sleep_data)

# Check missing data patterns
missing_pattern <- sleep_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Missing_Count")
ggplot(missing_pattern,
aes(x = reorder(Variable, Missing_Count), y = Missing_Count)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Missing Data Pattern",
x = "Variable",
y = "Number of Missing Values")
Building the Logistic Regression Model
The tidyverse
and associated packages make model building and interpretation more straightforward:
# Fit the model
sleep_model <- glm(probsleep ~ sex + age + fallsleep + staysleep + hrswknight,
                   family = binomial(link = "logit"),
                   data = sleep_data)

# Get tidy model summary
model_summary <- tidy(sleep_model,
                      conf.int = TRUE,
                      exponentiate = TRUE)

# Create a nice summary table
model_summary %>%
  mutate(across(where(is.numeric), ~round(., 3))) %>%
  knitr::kable(col.names = c("Predictor", "Odds Ratio", "Std. Error",
                             "z value", "p-value", "CI Lower", "CI Upper"),
               caption = "Logistic Regression Results") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Model Diagnostics and Performance
The performance
package provides comprehensive model diagnostics:
# Check model performance
model_performance <- check_model(sleep_model)

# Get various performance metrics
performance_metrics <- model_performance(sleep_model)

# Plot ROC curve
# (geom_roc() and style_roc() come from the plotROC package; install it separately if needed)
library(plotROC)

pred_data <- sleep_data %>%
  mutate(predicted = predict(sleep_model, type = "response"))

ggplot(pred_data, aes(d = as.numeric(probsleep) - 1, m = predicted)) +
  geom_roc(n.cuts = 0) +
  style_roc() +
  theme_minimal() +
  labs(title = "ROC Curve for Sleep Problems Model")
Understanding Variable Effects
The marginaleffects
package helps us understand how our predictors influence sleep problems:
# Calculate average marginal effects
marg_effects <- avg_slopes(sleep_model)
# Plot marginal effects
plot_slopes(sleep_model,
variables = "hrswknight",
condition = "staysleep") +
theme_minimal() +
labs(title = "Effect of Sleep Hours by Staying Asleep Status",
x = "Hours of Sleep per Weeknight",
y = "Predicted Probability of Sleep Problems")
# Create effects plot for categorical variables
plot_model(sleep_model,
type = "pred",
terms = c("staysleep", "fallsleep")) +
theme_minimal() +
labs(title = "Interaction between Falling and Staying Asleep",
y = "Predicted Probability of Sleep Problems")
Visualizing Model Results
sjPlot
provides excellent visualization tools for model results:
# Plot odds ratios
plot_model(sleep_model,
show.values = TRUE,
title = "Odds Ratios for Sleep Problems") +
theme_minimal()
# Create predicted probabilities plot
plot_model(sleep_model,
type = "pred",
terms = "hrswknight",
title = "Effect of Sleep Hours on Problem Probability") +
theme_minimal()
Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The
tidyverse
provides:
- More intuitive data manipulation
- Enhanced visualization capabilities
- Consistent syntax across operations
broom
offers:
- Standardized model output formats
- Easy extraction of model statistics
- Simple integration with other tidyverse tools
performance
gives us:
- Comprehensive model diagnostics
- Easy-to-interpret visualizations
- Multiple performance metrics
marginaleffects
enables:
- Intuitive interpretation of variable effects
- Beautiful visualization of interactions
- Straightforward prediction scenarios
Creating a Complete Report
We can create a comprehensive analysis report using these tools:
# Function to create a complete model summary
create_model_report <- function(model) {
  # Model summary
  summary_stats <- glance(model)

  # Coefficients and odds ratios
  coef_table <- tidy(model, conf.int = TRUE, exponentiate = TRUE)

  # Predictions
  pred_data <- augment(model, type.predict = "response")

  # Return as list
  list(
    model_stats = summary_stats,
    coefficients = coef_table,
    predictions = pred_data
  )
}

# Generate report
model_report <- create_model_report(sleep_model)
# Print key findings
cat("Model Performance Metrics:\n")
print(model_report$model_stats)
cat("\nKey Predictors:\n")
print(model_report$coefficients)
Practical Implementation
Here’s how to use these results in practice:
# Create a prediction function for new cases
predict_sleep_problems <- function(new_data) {
  predictions <- predict(sleep_model,
                         newdata = new_data,
                         type = "response")

  tibble(
    probability = predictions,
    predicted_outcome = if_else(predictions > 0.5, "Yes", "No")
  )
}

# Example usage
example_cases <- tibble(
  sex = factor(c("Male", "Female"), levels = c("Male", "Female")),
  age = c(30, 45),
  fallsleep = factor(c("Yes", "No"), levels = c("No", "Yes")),
  staysleep = factor(c("No", "Yes"), levels = c("No", "Yes")),
  hrswknight = c(7, 6)
)

predictions <- predict_sleep_problems(example_cases)
print(predictions)
8 Summary
This lab introduced you to performing a variety of statistical tests and analyses in R (similar to the SPSS Lab 2). You can download the respective R scripts for Exercises D+E+F+G, H, and I below for comparison, in case you created your own: