MGT4018/MGT4090 Lab 2
1 Introduction
This lab is the same as the SPSS Lab 2 for MGT4018 and MGT4090. We use base R functions as the default. While there are many R packages available, understanding base R operations provides a strong foundation for data analysis.
Alternatives for achieving the same outcomes using different R packages are provided in green boxes. If you want to explore these alternatives, each box will introduce the respective package, how to install and use it, and the outputs it can produce.
2 Exercise D: Correlation and Linear Regression
This lab will guide you through creating a dataset, assigning labels, and conducting basic analyses using R. You will learn how to create variables, enter data, and generate summary tables similar to those you would produce in SPSS.
2.1 Step D1: Setup
You can continue working in the same project (Step A1, Option 1) or working directory (Step A1, Option 2) that you created in the previous lab. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your scripts folder with an appropriate name, e.g., Lab2_Exercise_D_E_F_G.R.
You can either work through the following steps, copying/pasting the respective code into the Lab2_Exercise_D_E_F_G.R file you created in Step D1, or download the R script for Exercises D, E, F, and G, save it in your scripts folder, and follow the instructions below.
Now you are ready to begin your work in R and continue with Step D2!
2.2 Step D2: Loading the Dataset
Please download the dataset and save it in the data folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
survey_data_full <- read.csv("data/lab2-survey.csv")
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(survey_data_full)
# Examine the structure
str(survey_data_full)
# Get basic summary statistics
summary(survey_data_full)
Similar to Exercises B and C in Lab 1, we want to assign labels to the demographic variables in the survey_data_full data frame using factors with labeled levels. This method keeps the data categorical while adding human-readable labels for easier interpretation and analysis.
# Convert demographic variables to factors with labeled levels
# Assign labels for "sex"
survey_data_full$sex <- factor(
survey_data_full$sex,
levels = c(1, 2),
labels = c("Male", "Female")
)
# Assign labels for "marital"
survey_data_full$marital <- factor(
survey_data_full$marital,
levels = c(1, 2, 3, 4, 5, 6, 7, 8),
labels = c(
"Single", "Steady relationship", "Living with partner",
"Married first time", "Remarried", "Separated",
"Divorced", "Widowed"
)
)
# Assign labels for "child"
survey_data_full$child <- factor(
survey_data_full$child,
levels = c(1, 2),
labels = c("Yes", "No")
)
# Assign labels for "educ"
survey_data_full$educ <- factor(
survey_data_full$educ,
levels = c(1, 2, 3, 4, 5, 6),
labels = c(
"Primary", "Some secondary", "Completed high school",
"Some additional training", "Completed undergraduate",
"Postgraduate completed"
)
)
# Assign labels for "source"
survey_data_full$source <- factor(
survey_data_full$source,
levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
labels = c(
"Work", "Spouse or partner", "Relationships", "Children",
"Family", "Health/illness", "Life in general",
"Money/finances", "Lack of time, too much to do"
)
)
# Assign labels for "smoke"
survey_data_full$smoke <- factor(
survey_data_full$smoke,
levels = c(1, 2),
labels = c("Yes", "No")
)
The summary() function can be used to confirm that the labels have been applied correctly to the variables.
# Verify changes by printing a summary
summary(survey_data_full)
2.3 Step D3: Correlation
Correlation analysis helps us understand the relationship between two variables. Before we dive into the practical implementation, let’s understand some key concepts.
2.3.1 What is Correlation?
Correlation measures the strength and direction of the relationship between two variables. The Pearson correlation coefficient (PCC), often denoted as r, is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables. It is one of the most widely used correlation metrics.
The correlation coefficient (r) ranges from -1 to +1:
- r = +1: Perfect positive correlation
- r = 0: No linear correlation
- r = -1: Perfect negative correlation
The magnitude of r (how close it is to ±1) indicates the strength of the relationship, while the sign indicates the direction.
The correlation coefficient is standardized, meaning it’s independent of the units of measurement of the original variables. This makes it useful for comparing relationships between different pairs of variables.
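To see this standardization concretely, here is a short sketch with made-up vectors (the numbers are purely illustrative): r is just the covariance rescaled by both standard deviations, so changing the units of either variable leaves it unchanged.

```r
# Made-up data purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# r is covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual                 # identical to cor(x, y)

# Rescaling the variables (a change of units) does not change r
cor(10 * x, y + 100)     # same value again
```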
2.3.2 Assumptions for Pearson’s Correlation
Before calculating correlations, we should check these assumptions:
- Variables are measured at the interval or ratio level
- Linear relationship between variables
- No significant outliers
- Approximate normal distribution of variables
- Homoscedasticity (equal variances)
2.3.3 Visual Inspection: Scatterplots
Before calculating correlations, it’s important to visualize the relationships:
# Install ggplot2 (NOTE: not needed if you completed the first lab)
install.packages("ggplot2")
# Load ggplot2
library(ggplot2)
# Simple scatterplot
scatter_stress_control <- ggplot(data = survey_data_full, aes(x = tpcoiss, y = tpstress)) +
geom_point(color = "blue", size = 2) +
labs(
title = "Scatterplot of Perceived Stress vs Coping Strategies",
x = "Coping Strategies (tpcoiss)",
y = "Perceived Stress (tpstress)"
) +
theme_minimal()
# Show plot
scatter_stress_controlNote: If you run the code above, you will get the warning message below. This warning occurs because the variables tpstress and tpcoiss have missing values. The warning doesn’t stop the plot from rendering, but it alerts you that data is being removed and you should investigate the reason behind the warning and decide how to handle it appropriately (e.g., filtering, replacing, or imputing missing values if necessary).
Warning message:
Removed 13 rows containing missing values or values outside the scale range (`geom_point()`).
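One way to understand such warnings is to count the incomplete rows yourself. A minimal sketch with a toy data frame standing in for the survey data (the values below are invented):

```r
# Toy stand-in for survey_data_full with some missing values
toy <- data.frame(tpcoiss  = c(10, NA, 30, 40),
                  tpstress = c(5, 6, NA, 8))

colSums(is.na(toy))          # number of NAs per variable
sum(!complete.cases(toy))    # rows ggplot2 would drop: 2
```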
2.3.4 Computing Correlations
Now we can calculate correlations. First, we focus on the two variables tpcoiss and tpstress. Afterwards, we will explore how to calculate correlations for all variables at once.
Single Correlation
The cor() function calculates the correlation coefficient:
# Calculate correlation between tpcoiss and tpstress
cor(survey_data_full$tpcoiss, survey_data_full$tpstress, use = "complete.obs")
# Specify the method (default is Pearson, so this will return the same result)
cor(survey_data_full$tpcoiss, survey_data_full$tpstress, use = "complete.obs",
method = "pearson")
The cor() function returns NA if any of the input values are missing, because missing values prevent the pairwise comparisons needed to compute the correlation. From the warning when creating the scatterplot, we know that there are 13 missing values across the two variables.
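The NA behavior, and the effect of use = "complete.obs", is easy to reproduce with two toy vectors:

```r
x <- c(1, 2, NA, 4)
y <- c(2, 3, 4, 5)

cor(x, y)                        # NA: the default use = "everything" propagates NAs
cor(x, y, use = "complete.obs")  # 1: the complete pairs (1,2), (2,3), (4,5) lie on a line
```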
Best Practices in Base R
Always visualize your data first.
Check assumptions (e.g., normality):
# Test for normality
shapiro.test(x)
shapiro.test(y1)
Consider different correlation methods when appropriate:
# Pearson correlation (default)
cor(x, y1, method = "pearson")
# Spearman correlation (for non-normal data)
cor(x, y1, method = "spearman")
# Kendall correlation (for ordinal data)
cor(x, y1, method = "kendall")
Deal with missing values appropriately.
- Option A: Handle missing values in cor() with different options for use:
# Exclude all rows with at least one missing value.
# Recommended in most cases and used for pairwise correlation.
cor(x, y1, use = "complete.obs")
# Use all available pairwise data without dropping entire rows.
# Works well when computing correlation matrices.
cor(x, y1, use = "pairwise.complete.obs")
- Option B: Remove or replace missing values before calculating the correlations (note that replacing missing values will lead to different results):
# You can filter the dataset if you prefer to remove missing values before
# calculation.
clean_data <- na.omit(survey_data_full[, c("tpcoiss", "tpstress")])
cor(clean_data$tpcoiss, clean_data$tpstress)
# Alternatively, you can replace missing values (e.g., using the mean).
modified_data <- survey_data_full
modified_data$tpcoiss[is.na(modified_data$tpcoiss)] <- mean(modified_data$tpcoiss, na.rm = TRUE)
modified_data$tpstress[is.na(modified_data$tpstress)] <- mean(modified_data$tpstress, na.rm = TRUE)
cor(modified_data$tpcoiss, modified_data$tpstress)
Correlation Matrix
We can calculate correlations for multiple variables at once.
First, let’s create a smaller data frame with only continuous variables. We can do this in base R using the following code:
survey_data_small <- survey_data_full[, c("tpcoiss", "tpstress", "toptim",
"tposaff", "tnegaff", "tlifesat",
"tslfest", "tmarlow")]
Note: When subsetting the data above, we use survey_data_full[, c("tpcoiss", .... The comma , in the square brackets explicitly indicates that you are subsetting columns from a data frame. This clarity can help prevent unintended behaviors when working with more complex subsetting tasks (e.g., selecting rows and columns simultaneously) and makes the code more readable to others. Although survey_data_full[c("tpcoiss", ... (without the comma ,) often produces the same output when subsetting columns by name, using the comma notation adheres to standard R conventions for data frame indexing.
You can verify the structure of your new data frame using:
# Check the structure of the new data frame
str(survey_data_small)
# Or see the first few rows
head(survey_data_small)
You can also use the tidyverse package to accomplish the same result: a new data frame containing only these eight variables:
library(tidyverse)
survey_data <- survey_data_full %>%
select(tpcoiss, tpstress, toptim, tposaff,
tnegaff, tlifesat, tslfest, tmarlow)
You can verify the structure of your new data frame using:
# Check the structure of the new data frame
str(survey_data)
# Or see the first few rows
head(survey_data)
# Calculate correlation matrix
correlation_matrix <- cor(survey_data_small, use = "pairwise.complete.obs")
# Round to 3 decimal places for clarity
round(correlation_matrix, 3)
2.3.5 Statistical Testing with cor.test()
The cor.test() function provides a full statistical test:
# Perform correlation test (cor.test() has no "use" argument;
# it automatically drops incomplete pairs via complete.cases())
correlation_test <- cor.test(survey_data_full$tpcoiss,
                             survey_data_full$tpstress)
# View complete results
print(correlation_test)
This output includes:
- The correlation coefficient
- The test statistic
- The p-value
- The confidence interval
2.3.6 Interpreting Correlation Results
When interpreting correlation results, consider:
Strength: Common guidelines for absolute values:
- 0.00 to 0.19: “very weak”
- 0.20 to 0.39: “weak”
- 0.40 to 0.59: “moderate”
- 0.60 to 0.79: “strong”
- 0.80 to 1.00: “very strong”
Direction: Positive or negative relationship
Statistical Significance: Check the p-value
- p < 0.05 typically indicates statistical significance
- Consider effect size, not just significance
Context: What’s meaningful in your field?
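If you apply these guidelines often, they can be wrapped in a small helper function (interpret_r is a name we invent here for illustration; it is not a built-in):

```r
# Map |r| onto the verbal labels above (hypothetical helper, not part of base R)
interpret_r <- function(r) {
  as.character(cut(abs(r),
                   breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                   labels = c("very weak", "weak", "moderate", "strong", "very strong"),
                   include.lowest = TRUE, right = FALSE))
}

interpret_r(-0.45)  # "moderate"
interpret_r(0.85)   # "very strong"
```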
2.3.7 Common Pitfalls and Considerations
Correlation ≠ Causation: Correlation only indicates association, not causation
Outliers: Can strongly influence correlation coefficients
Non-linear Relationships: Pearson’s correlation only measures linear relationships
Missing Data: Handle missing values appropriately
2.4 Step D4: Linear Regression
Regression analysis helps us understand how one variable (the dependent variable) changes when another variable (the independent variable) changes. Before we dive into the practical implementation, let’s understand some key concepts.
2.4.1 What is Linear Regression?
Linear regression is one of the foundational techniques in statistics. Linear regression gives us a mathematical way to describe and predict the relationship between two variables.
Think of linear regression like drawing a “best-fit” line through a cloud of data points. This line helps us:
- Understand the relationship between variables.
- Predict values for new observations.
- Quantify how strong the relationship is.
2.4.2 Creating Our Linear Regression Model
Let’s analyze the relationship between stress (tpstress) and sense of control (tpcoiss). We’ll first visualize the data, then create our regression model.
# Create Scatterplot
ggplot(survey_data_full, aes(x = tpcoiss, y = tpstress)) +
geom_point(color = "darkblue", alpha = 0.7) + # Scatterplot points
geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) + # Regression line
labs(
title = "Relationship between Stress and Control",
x = "Sense of Control (tpcoiss)",
y = "Perceived Stress (tpstress)"
) +
theme_minimal()
Note: When running the code above, we get the warnings below. The functions geom_smooth() (regression line) and geom_point() (scatterplot) automatically exclude the rows with missing values, triggering the warnings. Refer back to the correlation section for how to proactively deal with these issues.
`geom_smooth()` using formula = 'y ~ x'
Warning messages:
1: Removed 13 rows containing non-finite outside the scale range (`stat_smooth()`).
2: Removed 13 rows containing missing values or values outside the scale range (`geom_point()`).
Now, let’s create our regression model using base R:
# Create the linear regression model
stress_model <- lm(tpstress ~ tpcoiss, data = survey_data_full)
# View the complete summary
summary(stress_model)
2.4.3 Understanding the Results
Let’s break down each part of the output to understand what it tells us:
1. The Correlation Coefficient (R)
In our case, we can find the correlation coefficient (r) by taking the square root of R-squared. The negative or positive sign comes from the slope coefficient in our regression output.
# Calculate r from our model
r <- sign(coef(stress_model)[2]) * sqrt(summary(stress_model)$r.squared)
cat("Correlation coefficient (r):", round(r, 3))
This value tells us:
- The strength of the relationship (how close to -1 or 1)
- The direction (positive or negative)
2. The ANOVA Table
The ANOVA table helps us assess if our model is statistically significant:
# Display the ANOVA table
anova(stress_model)
Looking at the p-value (Pr(>F)), we can see if our relationship is statistically significant. A value less than 0.05 suggests strong evidence of a real relationship between our variables.
3. The Regression Equation
From our coefficients table, we can write our regression equation:
# Display coefficients
coef(stress_model)
Our regression equation is:
Stress = β₀ + β₁(Control)
Where:
- β₀ is our intercept (constant)
- β₁ is our slope coefficient
Let’s fill in the actual values:
# Extract coefficients
intercept <- coef(stress_model)[1]
slope <- coef(stress_model)[2]
cat("Regression equation:\n")
cat("Stress =", round(intercept, 3), "+", round(slope, 3), "× Control")
This equation means:
- When Control = 0, predicted Stress = intercept
- For each one-unit increase in Control, Stress changes by the slope amount
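With invented coefficient values (β₀ = 48, β₁ = -0.8, chosen only for illustration, not taken from the survey data), the mechanics of a prediction look like this:

```r
# Hypothetical coefficients, for illustration only
b0 <- 48    # intercept: predicted stress when control is 0
b1 <- -0.8  # slope: stress changes by -0.8 per one-unit increase in control

control <- 30
stress_hat <- b0 + b1 * control  # 48 - 24 = 24
stress_hat
```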
4. R-squared (R²)
R-squared tells us how much of the variation in stress can be explained by control:
# Extract R-squared
r_squared <- summary(stress_model)$r.squared
cat("R-squared:", round(r_squared, 3))
This means that the share of variation in stress levels explained by a person's sense of control is R² × 100 percent; you can print it with round(r_squared * 100, 1).
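For a simple regression with one predictor, R-squared is exactly the squared correlation coefficient, which you can verify on simulated data (the simulation below is our own illustration, not part of the lab dataset):

```r
set.seed(42)
x <- rnorm(50)
y <- 2 - 0.5 * x + rnorm(50)   # simulated outcome with a negative slope

m <- lm(y ~ x)
summary(m)$r.squared           # equals cor(x, y)^2 for a single predictor
cor(x, y)^2
```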
2.4.4 Making Predictions
We can use our model to predict stress levels for new values of control:
# Create some example control values
new_control <- data.frame(tpcoiss = c(20, 30, 40))
# Make predictions
predictions <- predict(stress_model, newdata = new_control)
# Display predictions
cbind(Control = new_control, Predicted_Stress = round(predictions, 2))
2.4.5 Checking Model Assumptions
For our regression to be valid, we should check certain assumptions:
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(stress_model)
These plots help us check:
- Linearity (Residuals vs Fitted)
- Normality (Normal Q-Q)
- Homoscedasticity (Scale-Location)
- Influential points (Residuals vs Leverage)
2.4.6 Interpretation Guide
When interpreting your regression results, consider:
- Statistical Significance
- Look at the p-value in the ANOVA table
- A p-value < 0.05 suggests a significant relationship
- Practical Significance
- Look at R-squared to understand how much variance is explained
- Consider the slope coefficient for practical impact
- Direction of Relationship
- A negative slope means as one variable increases, the other decreases
- A positive slope means both variables increase together
- Model Assumptions
- Check diagnostic plots for violations
- Consider transformations if assumptions are violated
Introduction
While base R provides solid foundations for linear regression, modern R packages offer enhanced capabilities for analysis, visualization, and interpretation. We’ll use several powerful packages that make our analysis more intuitive and visually appealing:
# Install packages if needed
install.packages(c("tidyverse", "broom", "performance", "see", "ggpubr", "sjPlot", "sjmisc", "sjlabelled"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(broom) # For tidying statistical objects
library(performance) # For model performance metrics
library(see) # For model visualization
library(ggpubr) # For publication-ready plots
library(sjPlot) # For model visualization and tables
library(sjmisc) # For model visualization and tables
library(sjlabelled) # For model visualization and tables
Data Preparation
First, let’s prepare our data using tidyverse functions:
# Create a focused dataset for analysis
analysis_data <- survey_data_full %>%
select(tpstress, tpcoiss) %>%
drop_na() # Remove any missing values
# Quick summary of our variables
summary(analysis_data)
Visual Exploration with ggplot2
The ggplot2 package (part of tidyverse) creates beautiful visualizations:
# Create an enhanced scatter plot
ggplot(analysis_data, aes(x = tpcoiss, y = tpstress)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
labs(
title = "Relationship between Stress and Control",
subtitle = "With linear regression line and 95% confidence interval",
x = "Sense of Control (tpcoiss)",
y = "Perceived Stress Level (tpstress)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
Creating the Regression Model
We’ll create our model and use modern packages to examine it:
# Create the model
model <- lm(tpstress ~ tpcoiss, data = analysis_data)
# Get a tidy summary using broom
tidy_model <- tidy(model, conf.int = TRUE)
glance_model <- glance(model)
# Display tidy results
tidy_model
Model Diagnostics with performance
The performance package provides enhanced diagnostic tools:
# Check model assumptions
check_model(model)
# Model performance metrics
model_performance(model)
Creating Publication-Ready Tables with sjPlot
The sjPlot, sjmisc, and sjlabelled packages create beautiful HTML and Word-compatible tables:
# Create regression table
tab_model(model,
title = "Linear Regression Results",
dv.labels = "Perceived Stress",
pred.labels = c("(Intercept)", "Sense of Control"))
# Export tables in HTML or Word format
# (the output/tables folder must already exist)
tab_model(model, file = "output/tables/regression_table.html")
tab_model(model, file = "output/tables/regression_table.doc")
Visualizing Effects with ggeffects
We can visualize the relationship more clearly:
# Plot predicted values
plot_model(model, type = "pred") +
labs(
title = "Predicted Stress Levels by Control",
x = "Sense of Control",
y = "Predicted Stress Level"
)
Enhanced Regression Diagnostics
Let’s create more informative diagnostic plots using ggplot2:
# Get augmented data (includes residuals, etc.)
model_data <- augment(model)
# Create diagnostic plots
p1 <- ggplot(model_data, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted",
x = "Fitted values",
y = "Residuals") +
theme_minimal()
p2 <- ggplot(model_data, aes(sample = .std.resid)) +
stat_qq() +
stat_qq_line() +
labs(title = "Normal Q-Q Plot") +
theme_minimal()
# Arrange plots side by side
diagnostic_plots <- ggarrange(p1, p2, ncol = 2)
# Show combined plot
diagnostic_plots
# Save combined plot as a pdf (change file ending for other formats)
ggsave("output/figures/diagnostic_plots.pdf", plot = diagnostic_plots, width = 8, height = 5)
Making Predictions
We can use tidyverse functions for predictions:
# Create new control values
new_control <- tibble(tpcoiss = c(20, 30, 40))
# Make predictions and bind results
predictions <- new_control %>%
mutate(Predicted_Stress = predict(model, newdata = .) %>% round(2))
# Display predictions
print(predictions)
Interactive Model Summary
We can create an interactive summary of our model:
# Create model summary
summary_stats <- tibble(
Statistic = c("R-squared", "Adjusted R-squared", "F-statistic", "p-value"),
Value = c(
glance_model$r.squared,
glance_model$adj.r.squared,
glance_model$statistic,
glance_model$p.value
)
) %>%
mutate(Value = round(Value, 3))
# Display as a formatted table
knitr::kable(summary_stats,
caption = "Model Summary Statistics")
Advantages of Using Modern Packages
These modern R packages offer several advantages over base R:
- Better Visualization: ggplot2 creates publication-quality graphics
- Tidy Output: broom makes statistical output easier to work with
- Enhanced Diagnostics: performance provides comprehensive model checking
- Publication-Ready Tables: sjPlot creates professional tables
- Consistent Interface: tidyverse provides a coherent framework
3 Exercise E: T-Test
A t-test helps us determine whether there’s a meaningful difference between groups in our data. Imagine you’re wondering whether men and women tend to have different levels of self-esteem. A t-test can tell us if any difference we observe is likely to be real or just due to chance.
Think of a t-test like a referee in a debate: it helps us decide whether we have enough evidence to say two groups are truly different from each other. Just as a referee needs clear rules to make fair decisions, our t-test uses statistical principles to make this determination.
3.1 Step E1: Understanding the Independent Samples t-test
The independent samples t-test is used when we have:
- A continuous dependent variable (like self-esteem scores)
- Two independent groups (like males and females)
- Independent observations (one person’s score doesn’t influence another’s)
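Before working with the survey data, you can see the test in action on two simulated groups (the means, standard deviation, and sample sizes below are invented for illustration):

```r
set.seed(123)
group_a <- rnorm(30, mean = 33, sd = 5)  # simulated scores, group A
group_b <- rnorm(30, mean = 36, sd = 5)  # simulated scores, group B

res <- t.test(group_a, group_b)  # Welch's t-test by default
res$estimate   # the two sample means
res$p.value    # probability of a difference this large if H0 were true
```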
Before we dive into our analysis, let’s prepare our data and create some helpful visualizations:
# First, let's see what our data looks like
head(survey_data_full[c("sex", "tslfest", "tpcoiss")])
# Create boxplot with jittered points of self-esteem by gender
ggplot(survey_data_full, aes(x = factor(sex), y = tslfest)) +
geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Self-esteem by Gender",
x = "Gender",
y = "Total Self-esteem Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
As an alternative to a boxplot with jittered points, you can also create a violin plot. Violin plots are mirrored density plots for displaying continuous distributions, similar to a boxplot.
# Create violin plot with a small boxplot inside
ggplot(survey_data_full, aes(x = factor(sex), y = tslfest, fill = factor(sex))) +
geom_violin(trim = FALSE, alpha = 0.5) + # Violin plot with full density curve
geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) + # Add a small boxplot inside
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Self-esteem by Gender",
x = "Gender",
y = "Total Self-esteem Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
3.2 Step E2: Performing the T-Test (Self-esteem by Gender)
Let’s conduct our first t-test examining self-esteem differences between genders:
# First, let's check basic descriptive statistics
tapply(survey_data_full$tslfest,
survey_data_full$sex,
function(x) c(mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
n = sum(!is.na(x))))
# Perform an F test for equality of variances using var.test()
# (note: base R has no Levene's test; car::leveneTest() provides one)
var_test <- var.test(tslfest ~ sex, data = survey_data_full)
print("Variance test results:")
print(var_test)
# Perform t-test based on the variance test result
t_test_result <- t.test(tslfest ~ sex,
data = survey_data_full,
var.equal = var_test$p.value > 0.05) # TRUE if p > 0.05
cat("\nt-test results:\n")
print(t_test_result)
Let’s break down what these results tell us:
Variance Test The p-value from our variance test (an F test via var.test(), which serves the same purpose as Levene's test in SPSS) helps us decide whether to assume equal variances. If p > 0.05, we assume equal variances and use the standard t-test. If p < 0.05, we use Welch's t-test, which doesn't assume equal variances.
t-test Results
- The t-statistic tells us how many standard errors the group means are apart
- The degrees of freedom (df) help us determine critical values
- The p-value tells us the probability of seeing such differences by chance
- The confidence interval shows us the likely range of the true difference
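These quantities can be reproduced by hand. A sketch with two small made-up groups, checking a manual Welch t-statistic against t.test():

```r
x <- c(12, 15, 11, 14, 13)  # made-up scores, group 1
y <- c(10, 9, 12, 11)       # made-up scores, group 2

# Welch's t: difference in means divided by its standard error
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t_manual <- (mean(x) - mean(y)) / se

t_manual
unname(t.test(x, y)$statistic)  # same value
```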
Effect Size
Let’s calculate Cohen’s d to understand the practical significance:
# Calculate Cohen's d
group1 <- survey_data_full$tslfest[survey_data_full$sex == "Male"]
group2 <- survey_data_full$tslfest[survey_data_full$sex == "Female"]
calc_cohens_d <- function(x, y) {
# Remove NA values
x <- x[!is.na(x)]
y <- y[!is.na(y)]
nx <- length(x)
ny <- length(y)
# Ensure non-empty groups
if (nx < 2 || ny < 2) {
return(NA) # Return NA if a group has fewer than 2 values
}
pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
return(abs(mean(x) - mean(y)) / pooled_sd)
}
effect_size <- calc_cohens_d(group1, group2)
cat("Cohen's d effect size:", round(effect_size, 3))
Note: We purposefully do not call the function cohens_d as we will explore a function of the same name below when using the rstatix package. If we defined a function with the same name, R would use our custom function rather than the one from the package.
We can use several R packages that provide enhanced capabilities for conducting and visualizing t-tests and work together to create a comprehensive analysis:
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
Data Preparation
First, let’s prepare our data and ensure it’s in the right format for analysis:
# Create a small dataframe with relevant variables only
analysis_data <- survey_data_full %>% select(sex, tslfest, tpcoiss)
# Display the structure of our prepared data
str(analysis_data)
Visual Exploration
The ggplot2 package creates beautiful and informative visualizations (these are the same as in the base R solution, as we always use ggplot2 for visualizations):
# Create boxplot with jittered points of self-esteem by gender
ggplot(analysis_data, aes(x = factor(sex), y = tslfest)) +
geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Self-esteem by Gender",
x = "Gender",
y = "Total Self-esteem Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
Here is the code for the violin plot as an alternative to the boxplot with jittered points.
# Alternative: Create violin plot with a small boxplot inside
ggplot(analysis_data, aes(x = factor(sex), y = tslfest, fill = factor(sex))) +
geom_violin(trim = FALSE, alpha = 0.5) + # Violin plot with full density curve
geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) + # Add a small boxplot inside
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Self-esteem by Gender",
x = "Gender",
y = "Total Self-esteem Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
The rstatix package can help us calculate descriptive statistics.
# Add descriptive statistics
desc_stats <- analysis_data %>%
group_by(sex) %>%
get_summary_stats(tslfest, type = "common")
print(desc_stats)
Checking Assumptions
Modern packages provide comprehensive tools for checking t-test assumptions:
- Normality Check
We can check normality both visually and statistically:
# Create Q-Q plots by group
ggqqplot(analysis_data, "tslfest", facet.by = "sex")
# Shapiro-Wilk test for each group
analysis_data %>%
group_by(sex) %>%
shapiro_test(tslfest)
- Homogeneity of Variances
The car package provides a robust Levene’s test:
# Levene's test using car package
leveneTest(tslfest ~ sex, data = analysis_data)
Conducting the T-Test
The rstatix package provides a comprehensive t-test function:
# Perform t-test
t_test_results <- analysis_data %>%
t_test(tslfest ~ sex) %>%
add_significance()
# Display results
t_test_results
Effect Size Calculation
The rstatix package has a function to calculate Cohen’s d:
# Calculate Cohen's d (using the `rstatix` package)
cohens_result <- cohens_d(tslfest ~ sex, data = analysis_data)
print(cohens_result)
The effectsize package helps us understand the practical significance:
# Interpret Cohen's d effect size
interpretation <- interpret_cohens_d(cohens_result$Cohens_d)
# Print interpretation
print(interpretation)
Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The rstatix package provides:
  - Clearer output formatting
  - Automatic significance indicators
  - Built-in assumption checking
- The ggpubr package creates:
  - Publication-ready plots
  - Easy visualization of assumptions
  - Combined plots with statistical results
- The effectsize package offers:
  - Standardized effect size calculations
  - Effect size interpretation
  - Confidence intervals for effect sizes
Creating a Complete Report
We can create a comprehensive analysis report:
# Function to run complete analysis
analyze_group_difference <- function(data, dv, group_var) {
# Descriptive statistics
desc <- data %>%
group_by(!!sym(group_var)) %>%
get_summary_stats(!!sym(dv), type = "common")
# Assumption checks
normality <- data %>%
group_by(!!sym(group_var)) %>%
shapiro_test(!!sym(dv))
# Levene's test
homogeneity <- leveneTest(as.formula(paste(dv, "~", group_var)), data = data)
# T-test
t_test <- data %>%
t_test(as.formula(paste(dv, "~", group_var))) %>%
add_significance()
# Effect size
effect <- cohens_d(as.formula(paste(dv, "~", group_var)), data = data)
list(
descriptives = desc,
normality_test = normality,
variance_test = homogeneity,
t_test = t_test,
effect_size = effect
)
}
# Example usage
self_esteem_analysis <- analyze_group_difference(
analysis_data,
"tslfest",
"sex"
)
# Display results
print(self_esteem_analysis)
Writing Up Results
Here’s how to write up the results in APA format using our comprehensive analysis:
# Create formatted output
report_results <- function(analysis) {
with(analysis, {
cat("Results:\n")
cat("Descriptive Statistics:\n")
print(descriptives)
cat("\nAssumption Tests:\n")
cat("Normality (Shapiro-Wilk):\n")
print(normality_test)
cat("\nHomogeneity of Variance (Levene's Test):\n")
print(variance_test)
cat("\nT-test Results:\n")
print(t_test)
cat("\nEffect Size:\n")
print(effect_size)
})
}
report_results(self_esteem_analysis)
3.3 Step E3: Performing the T-Test (Control Levels by Gender)
Now let’s examine our second research question about control levels between genders. First, we want to visually inspect the relationship between gender and perceived control:
# Create boxplot with jittered points of perceived control by gender
ggplot(analysis_data, aes(x = factor(sex), y = tpcoiss)) +
geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Perceived Control by Gender",
x = "Gender",
y = "Perceived Control"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
We can also check descriptive statistics and perform the analysis.
# Check basic descriptive statistics
tapply(survey_data_full$tpcoiss,
       survey_data_full$sex,
       function(x) c(mean = mean(x, na.rm = TRUE),
                     sd = sd(x, na.rm = TRUE),
                     n = sum(!is.na(x))))
# Perform an F test for equality of variances
# (note: var.test() is the F test, not Levene's test)
var_test <- var.test(tpcoiss ~ sex, data = survey_data_full)
print("F test results:")
print(var_test)
# Perform t-test based on the variance test result
t_test_result <- t.test(tpcoiss ~ sex,
                        data = survey_data_full,
                        var.equal = var_test$p.value > 0.05) # TRUE if p > 0.05
cat("\nt-test results:\n")
print(t_test_result)
Lastly, we want to estimate the effect size using Cohen's d. We are using our own function calc_cohens_d() that we created in Step E2.
# Calculate Cohen's d
group1 <- survey_data_full$tpcoiss[survey_data_full$sex == "Male"]
group2 <- survey_data_full$tpcoiss[survey_data_full$sex == "Female"]
# We defined our function `calc_cohens_d()` earlier.
effect_size <- calc_cohens_d(group1, group2)
cat("Cohen's d effect size:", round(effect_size, 3))
We can use the same R packages as in the previous example for conducting and visualizing t-tests and creating a comprehensive analysis. Below is the complete analysis for our second research question about control levels.
# Create boxplot with jittered points of perceived control by gender
ggplot(analysis_data, aes(x = factor(sex), y = tpcoiss)) +
geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightpink")) +
labs(
title = "Distribution of Perceived Control by Gender",
x = "Gender",
y = "Perceived Control"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent gender
# Check assumptions
# 1. Normality
analysis_data %>%
group_by(sex) %>%
shapiro_test(tpcoiss)
# 2. Homogeneity of variance
leveneTest(tpcoiss ~ sex, data = analysis_data)
# Perform t-test
control_t_test <- analysis_data %>%
t_test(tpcoiss ~ sex) %>%
add_significance()
# Calculate effect size
control_effect <- cohens_d(tpcoiss ~ sex, data = analysis_data)
# Display comprehensive results
list(
"T-test Results" = control_t_test,
"Effect Size" = control_effect
)
3.4 Step E4: A Step-by-Step Guide to Interpreting Results
When interpreting t-test results, follow these steps:
- Check Assumptions
- Look at the boxplots for obvious outliers
- Check Levene’s test for equality of variances
- If Levene's test is significant (p < 0.05), use Welch's t-test output (R does this automatically with var.equal = FALSE).
- Consider whether the groups are truly independent
- Check if the data is normally distributed
- If the data is not normally distributed, consider using non-parametric alternatives like the Wilcoxon rank-sum test:
# Example of non-parametric test
wilcox.test(tslfest ~ sex, data = survey_data_full)
- Examine the t-test Results
- Look at the p-value (significance level)
- Consider the confidence interval
- Think about the effect size
- Draw Conclusions
- If p < 0.05, we have evidence of a significant difference
- Consider the practical significance (effect size)
- Think about the real-world implications
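The decision logic in steps 2 and 3 can be sketched in a few lines. The numbers below are made up for illustration, and the effect-size cut-offs are Cohen's conventional 0.2/0.5/0.8 thresholds:

```r
# Illustrative decision logic (made-up results, not from the survey data)
p_value  <- 0.031   # would come from t.test()
cohens_d <- 0.24    # would come from calc_cohens_d()
if (p_value < 0.05) {
  size <- if (abs(cohens_d) >= 0.8) "large" else if (abs(cohens_d) >= 0.5) "medium" else "small"
  cat("Significant difference with a", size, "effect size\n")
} else {
  cat("No evidence of a significant difference\n")
}
```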
3.5 Step E5: Writing Up Results
Here is an example for how to write up t-test results in APA format:
“An independent-samples t-test was conducted to compare [variable] between males and females. There was a [significant/non-significant] difference in scores for males (M = XX.XX, SD = XX.XX) and females (M = XX.XX, SD = XX.XX); t(df) = XX.XX, p = .XXX. The magnitude of the differences in the means (mean difference = XX.XX, 95% CI: XX.XX to XX.XX) was [small/medium/large] (d = X.XX).”
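The placeholders in the template can also be filled programmatically from a t.test() result. This sketch uses simulated scores rather than the survey data, so the variable names and values are illustrative only:

```r
# Fill the APA template from a t.test() result (simulated data)
set.seed(42)
scores <- c(rnorm(30, mean = 34, sd = 5), rnorm(30, mean = 33, sd = 5))
gender <- rep(c("Male", "Female"), each = 30)
tt  <- t.test(scores ~ gender, var.equal = TRUE)
ms  <- tapply(scores, gender, mean)
sds <- tapply(scores, gender, sd)
apa <- sprintf("males (M = %.2f, SD = %.2f) and females (M = %.2f, SD = %.2f); t(%.0f) = %.2f, p = %.3f",
               ms["Male"], sds["Male"], ms["Female"], sds["Female"],
               tt$parameter, tt$statistic, tt$p.value)
apa
```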
4 Exercise F: Analysis of Variance (ANOVA)
Analysis of Variance, commonly known as ANOVA, extends our analytical capabilities beyond comparing just two groups. While t-tests are perfect for comparing means between two groups (like comparing males and females), researchers often need to examine differences across multiple groups. For instance, we might want to know how optimism levels differ across various age groups, or how stress levels vary with education.
Think of ANOVA like a sophisticated referee in a multi-team tournament: instead of just comparing two teams, it helps us determine whether any teams in the entire tournament are significantly different from each other. Just as a tournament referee needs clear rules to make fair decisions across multiple teams, ANOVA uses statistical principles to examine differences across multiple groups simultaneously.
4.1 Step F1: Understanding One-way ANOVA
One-way ANOVA is used when we have:
- A continuous dependent variable (like optimism scores).
- One categorical independent variable with three or more groups (like age groups).
- Independent observations (one person’s score doesn’t influence another’s).
Let’s begin by preparing our data and creating some helpful visualizations:
# First, let's prepare our age groups properly
survey_data_full$agegp3 <- factor(survey_data_full$agegp3,
levels = c(1, 2, 3),
labels = c("18-29", "30-44", "45+"))
# Create boxplot with jittered points of optimism levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = toptim)) +
geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
labs(
title = "Optimism Levels by Age Group",
x = "Age Group",
y = "Total Optimism Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent age group
4.2 Step F2: Performing the One-Way ANOVA (Optimism across Age Groups)
Before diving into the ANOVA, let’s examine our data descriptively:
# Create comprehensive descriptive statistics
get_descriptives <- function(dv, group) {
tapply(dv, group, function(x) {
c(n = sum(!is.na(x)),
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
se = sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))
})
}
# Get descriptives for optimism by age group
descriptives <- get_descriptives(survey_data_full$toptim,
survey_data_full$agegp3)
print("Descriptive Statistics:")
print(descriptives)
Just as t-tests have assumptions, ANOVA requires certain conditions to be met. Levene's test is often performed before an ANOVA to ensure the validity of the ANOVA results by verifying that the variances across groups are similar (the "homogeneity of variance" assumption). However, Levene's test uses the mean for deviations, making it appropriate if the data are normally distributed. We can perform the Shapiro-Wilk test, and if p > 0.05, the data are approximately normally distributed and we can use Levene's test.
# Normality test (Shapiro-Wilk test)
shapiro.test(survey_data_full$toptim)
# Levene's test for homogeneity of variance
levene_test <- function(y, group) {
group <- factor(group)
means <- tapply(y, group, mean, na.rm = TRUE)
abs_dev <- abs(y - means[group])
anova(lm(abs_dev ~ group))[1, "Pr(>F)"]
}
# Perform Levene's test
levene_p <- levene_test(survey_data_full$toptim, survey_data_full$agegp3)
cat("Levene's test p-value:", levene_p, "\n")
If p < 0.05 from the Shapiro-Wilk test, the data are not normally distributed and the Brown-Forsythe test, which uses the median for deviations, is more appropriate.
# Brown-Forsythe test (more robust to non-normality)
bf_test <- function(y, group) {
group <- factor(group)
meds <- tapply(y, group, median, na.rm = TRUE)
abs_dev <- abs(y - meds[group])
anova(lm(abs_dev ~ group))[1, "Pr(>F)"]
}
bf_p <- bf_test(survey_data_full$toptim, survey_data_full$agegp3)
cat("Brown-Forsythe test p-value:", bf_p, "\n")
Now we'll perform both standard ANOVA and its robust alternative:
# Standard one-way ANOVA
anova_result <- aov(toptim ~ agegp3, data = survey_data_full)
print("Standard ANOVA results:")
print(summary(anova_result))
# Welch's ANOVA (robust to unequal variances)
welch_result <- oneway.test(toptim ~ agegp3,
data = survey_data_full,
var.equal = FALSE)
cat("\nWelch's ANOVA results:\n")
print(welch_result)
When conducting an ANOVA test, a significant result indicates that at least one group mean differs from the others. However, ANOVA does not tell us which specific groups differ. To determine where these differences exist, we use Tukey's Honest Significant Difference (HSD) test, which compares each pair of groups while controlling for multiple comparisons.
# Tukey's HSD test for pairwise comparisons
tukey_result <- TukeyHSD(anova_result)
print("Tukey's HSD test results:")
print(tukey_result)
# Visualize the Tukey results
plot(tukey_result)
Following the steps below, we can confidently interpret the results of Tukey's HSD test and understand which groups significantly differ from each other:
- Check the p-value (p adj): if it is less than 0.05, the difference is statistically significant.
- Look at the confidence interval (lwr and upr): if the interval does not include 0, there is a meaningful difference.
- Examine the mean difference (diff): determine whether one group has a higher or lower mean compared to another.
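These steps can also be applied programmatically by filtering the Tukey table on the p adj column. The data below are simulated purely for illustration:

```r
# Flag significant pairs in a TukeyHSD table (simulated data)
set.seed(7)
dat <- data.frame(y = rnorm(60, mean = rep(c(10, 10.5, 14), each = 20)),
                  g = factor(rep(c("A", "B", "C"), each = 20)))
tk <- TukeyHSD(aov(y ~ g, data = dat))$g          # matrix: diff, lwr, upr, p adj
sig_pairs <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]  # rows meeting the p adj criterion
sig_pairs
```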
Let’s calculate eta-squared to understand the practical significance. Statistical significance (i.e., a small p-value) does not tell us how important or meaningful the difference is in real-world terms. This is where effect size measures like eta-squared (η²) come in.
# Calculate eta-squared
aov_summary <- summary(anova_result)
eta_squared <- aov_summary[[1]]$"Sum Sq"[1] / sum(aov_summary[[1]]$"Sum Sq")
cat("Eta-squared:", round(eta_squared, 3))
Cohen (1988) provides general guidelines for interpreting eta-squared in the context of ANOVA:
- 0.01 (1%): The independent variable explains a small amount of variance (small effect).
- 0.06 (6%): The independent variable explains a moderate amount of variance (medium effect).
- 0.14+ (14% or more): The independent variable explains a large amount of variance (strong effect).
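These cut-offs can be wrapped in a small helper of our own (this function is not from any package):

```r
# Our own helper applying Cohen's (1988) eta-squared guidelines
interpret_eta2 <- function(eta2) {
  if (eta2 >= 0.14) "large"
  else if (eta2 >= 0.06) "medium"
  else if (eta2 >= 0.01) "small"
  else "negligible"
}
interpret_eta2(0.02)  # "small"
interpret_eta2(0.09)  # "medium"
```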
Modern R packages provide enhanced capabilities for conducting and visualizing Analysis of Variance (ANOVA). We’ll use several powerful packages that work together to create a comprehensive analysis:
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize", "emmeans"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
library(emmeans) # For estimated marginal means and post-hoc tests
Data Preparation
First, let’s prepare our data and ensure it’s in the right format for analysis:
analysis_data <- survey_data_full %>%
  select(age_group = agegp3, educ, toptim, tpstress) # rename agegp3 to age_group, as used below
# Display the structure of our prepared data
glimpse(analysis_data)
Visual Exploration
The ggpubr package creates beautiful and informative visualizations:
# Create violin plot with boxplot and individual points
ggviolin(analysis_data,
x = "age_group",
y = "toptim",
fill = "age_group",
add = "boxplot",
add.params = list(fill = "white")) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Optimism Levels by Age Group",
y = "Total Optimism Score",
x = "Age Group") +
theme(legend.position = "none")
# Add descriptive statistics
desc_stats <- analysis_data %>%
group_by(age_group) %>%
get_summary_stats(toptim, type = "common")
print(desc_stats)
Checking Assumptions
Modern packages provide comprehensive tools for checking ANOVA assumptions:
- Normality Check
We can check normality both visually and statistically:
# Create Q-Q plots by group
ggqqplot(analysis_data, "toptim", facet.by = "age_group")
# Shapiro-Wilk test for each group
analysis_data %>%
group_by(age_group) %>%
shapiro_test(toptim)
- Homogeneity of Variances
The car package provides both Levene’s and Brown-Forsythe tests:
# Levene's test using car package
leveneTest(toptim ~ age_group, data = analysis_data)
# Brown-Forsythe test using car package
leveneTest(toptim ~ age_group, data = analysis_data, center = median)
Conducting the ANOVA
The rstatix package provides comprehensive ANOVA functions:
# Perform one-way ANOVA
anova_results <- analysis_data %>%
anova_test(toptim ~ age_group) %>%
add_significance()
# Display results
anova_results
# Welch's ANOVA (robust to heterogeneity of variance)
welch_results <- analysis_data %>%
welch_anova_test(toptim ~ age_group)
print("Welch's ANOVA results:")
welch_results
Post-hoc Analysis
Modern packages offer various methods for post-hoc analysis:
# Tukey's HSD using emmeans
emmeans_result <- emmeans(aov(toptim ~ age_group, data = analysis_data),
"age_group")
pairs(emmeans_result)
# Games-Howell post-hoc test (robust to heterogeneity of variance)
games_howell_result <- analysis_data %>%
games_howell_test(toptim ~ age_group)
print("Games-Howell test results:")
games_howell_result
Effect Size Calculation
The effectsize package helps us understand the practical significance:
# Calculate eta squared
eta_squared <- effectsize::eta_squared(aov(toptim ~ age_group, data = analysis_data))
print(eta_squared)
# Interpret effect size
interpret_eta_squared(eta_squared$Eta2)
Creating Powerful Visualizations
We can create publication-ready plots that combine statistical information:
# Create plot with statistical annotations
ggboxplot(analysis_data,
x = "age_group",
y = "toptim",
color = "age_group") +
stat_compare_means(method = "anova") +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Optimism Levels by Age Group with ANOVA Results",
y = "Total Optimism Score",
x = "Age Group") +
theme(legend.position = "none")
Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The rstatix package provides:
  - Tidyverse-compatible statistical functions
  - Automatic assumption checking
  - Built-in effect size calculations
- The ggpubr package creates:
  - Publication-ready plots
  - Easy statistical annotations
  - Combined plots with results
- The emmeans package offers:
  - Sophisticated post-hoc analyses
  - Flexible contrast specifications
  - Multiple comparison adjustments
Writing Up Results
Here’s how to write up the results in APA format:
# Function to create formatted output
report_anova_results <- function(analysis) {
with(analysis, {
cat("Results:\n")
cat("\nDescriptive Statistics:\n")
print(descriptives)
cat("\nANOVA Results:\n")
print(anova_test)
cat("\nEffect Size:\n")
print(effect_size)
cat("\nPost-hoc Analysis:\n")
print(posthoc)
})
}
# Example usage (stress_analysis is created in Step F3 below)
report_anova_results(stress_analysis)
4.3 Step F3: Performing the One-Way ANOVA (Stress Levels across Age Groups)
Let’s apply what we’ve learned to analyze stress levels (tpstress) across age groups (agegp3). First, we create some helpful visualizations:
# Create boxplot with jittered points of stress levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = tpstress)) +
geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
labs(
title = "Stress Levels by Age Group",
x = "Age Group",
y = "Total Stress Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent age group
Before diving into the ANOVA, let's examine our data descriptively:
# Get descriptives for total perceived stress by age group
# The function `get_descriptives()` was created in Step F2
descriptives <- get_descriptives(survey_data_full$tpstress,
survey_data_full$agegp3)
print("Descriptive Statistics:")
print(descriptives)
Before conducting our ANOVA, let's test whether our data meets the requirements. We can perform the Shapiro-Wilk test, and if p > 0.05, the data are approximately normally distributed and we can use Levene's test; otherwise we should use the Brown-Forsythe test.
# Normality test (Shapiro-Wilk test)
shapiro.test(survey_data_full$tpstress)
# Perform Levene's test (if p > 0.05 from Shapiro-Wilk test)
# The function `levene_test()` was created in Step F2
levene_p <- levene_test(survey_data_full$tpstress, survey_data_full$agegp3)
cat("Levene's test p-value:", levene_p, "\n")
# Brown-Forsythe test (if p < 0.05 from Shapiro-Wilk test)
# The function `bf_test()` was created in Step F2
bf_p <- bf_test(survey_data_full$tpstress, survey_data_full$agegp3)
cat("Brown-Forsythe test p-value:", bf_p, "\n")
Now we'll perform either the standard ANOVA or its robust alternative:
# Standard one-way ANOVA
anova_result <- aov(tpstress ~ agegp3, data = survey_data_full)
print("Standard ANOVA results:")
print(summary(anova_result))
# Welch's ANOVA (robust to unequal variances)
welch_result <- oneway.test(tpstress ~ agegp3,
data = survey_data_full,
var.equal = FALSE)
cat("\nWelch's ANOVA results:\n")
print(welch_result)
If we find significant differences, we need to know which groups differ from each other:
# Tukey's HSD test for pairwise comparisons
tukey_result <- TukeyHSD(anova_result)
print("Tukey's HSD test results:")
print(tukey_result)
# Visualize the Tukey results
plot(tukey_result)
Let's calculate eta-squared to understand the practical significance:
# Calculate eta-squared
aov_summary <- summary(anova_result)
eta_squared <- aov_summary[[1]]$"Sum Sq"[1] / sum(aov_summary[[1]]$"Sum Sq")
cat("Eta-squared:", round(eta_squared, 3))
Similar to the one-way ANOVA for total optimism toptim across age groups agegp3, we can use R packages to conduct a one-way ANOVA for stress levels tpstress across age groups agegp3.
Complete and Comprehensive Analysis
# Install packages if needed
install.packages(c("tidyverse", "car", "rstatix", "ggpubr", "effectsize", "emmeans"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(car) # For Levene's test and additional diagnostics
library(rstatix) # For statistical analysis
library(ggpubr) # For publication-ready plots
library(effectsize) # For effect size calculations
library(emmeans) # For estimated marginal means and post-hoc tests
# Create boxplot with jittered points of stress levels by age group
ggplot(survey_data_full, aes(x = factor(agegp3), y = tpstress)) +
geom_boxplot(aes(fill = factor(agegp3)), outlier.shape = NA) + # Boxplot without default outliers
geom_jitter(width = 0.2, alpha = 0.5, color = "darkgray") + # Individual points (jittered)
scale_fill_manual(values = c("lightblue", "lightgreen", "lightpink")) +
labs(
title = "Stress Levels by Age Group",
x = "Age Group",
y = "Total Stress Score"
) +
theme_minimal() +
theme(legend.position = "none") # Remove legend since colors only represent age group
# Function for comprehensive ANOVA analysis
analyze_group_differences <- function(data, dv, group_var) {
# Descriptive statistics
desc <- data %>%
group_by(!!sym(group_var)) %>%
get_summary_stats(!!sym(dv), type = "common")
# Assumption checks
normality <- data %>%
group_by(!!sym(group_var)) %>%
shapiro_test(!!sym(dv))
# Levene's test
homogeneity <- leveneTest(as.formula(paste(dv, "~", group_var)), data = data)
# ANOVA
anova <- anova_test(data = data, as.formula(paste(dv, "~", group_var)))
# Welch's ANOVA
welch <- welch_anova_test(data = data, as.formula(paste(dv, "~", group_var)))
# Post-hoc tests
posthoc <- games_howell_test(data = data, as.formula(paste(dv, "~", group_var)))
# Effect size
model <- aov(as.formula(paste(dv, "~", group_var)), data = data)
effect <- effectsize::eta_squared(model)
# Return results
list(
descriptives = desc,
normality_test = normality,
variance_test = homogeneity,
anova_test = anova,
welch_test = welch,
posthoc = posthoc,
effect_size = effect
)
}
# Run complete analysis
stress_analysis <- analyze_group_differences(
survey_data_full,
"tpstress",
"agegp3"
)
# Display results
print(stress_analysis)
Creating Powerful Visualizations
We can create publication-ready plots that combine statistical information:
# Create plot with statistical annotations
ggboxplot(survey_data_full,
x = "agegp3",
y = "tpstress",
color = "agegp3") +
stat_compare_means(method = "anova") +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Stress Levels by Age Group with ANOVA Results",
y = "Total Stress Score",
x = "Age Group") +
theme(legend.position = "none")
Writing Up Results
Here’s how to write up the results in APA format:
# Function to create formatted output
report_anova_results <- function(analysis) {
with(analysis, {
cat("Results:\n")
cat("\nDescriptive Statistics:\n")
print(descriptives)
cat("\nANOVA Results:\n")
print(anova_test)
cat("\nEffect Size:\n")
print(effect_size)
cat("\nPost-hoc Analysis:\n")
print(posthoc)
})
}
report_anova_results(stress_analysis)
4.4 Step F4: A Step-by-Step Guide to Interpreting Results
When interpreting ANOVA results, follow these steps:
- Check Assumptions
- Review Levene’s and Brown-Forsythe test results.
- Consider whether observations are truly independent.
- ANOVA can handle unequal sample sizes, but you should pay extra attention to the homogeneity of variance assumption.
- Examine the boxplots for outliers and distribution shape. If the data isn’t normally distributed, consider using the non-parametric Kruskal-Wallis test:
# Example of non-parametric alternative
kruskal.test(toptim ~ agegp3, data = survey_data_full)
- Examine the ANOVA Results
- Look at the F-statistic and p-value
- Consider which ANOVA version to trust (standard or Welch’s)
- Examine effect sizes
- Interpret Post-hoc Tests
- Look for significant pairwise differences
- Consider the practical meaning of the differences
- Think about the confidence intervals
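Steps 1 and 2 fit together as a simple decision: if the homogeneity-of-variance check fails, trust Welch's ANOVA. This sketch uses simulated data with deliberately unequal variances; the median-centered check mirrors the bf_test() logic from Step F2:

```r
# Choose between standard and Welch's ANOVA (simulated data)
set.seed(3)
dat <- data.frame(y = c(rnorm(30, 20, 2), rnorm(30, 22, 6), rnorm(30, 21, 4)),
                  g = factor(rep(c("A", "B", "C"), each = 30)))
# Brown-Forsythe check: ANOVA on absolute deviations from group medians
bf_p <- anova(lm(abs(dat$y - ave(dat$y, dat$g, FUN = median)) ~ dat$g))[1, "Pr(>F)"]
if (bf_p < 0.05) {
  print(oneway.test(y ~ g, data = dat, var.equal = FALSE))  # Welch's ANOVA
} else {
  print(summary(aov(y ~ g, data = dat)))                    # standard ANOVA
}
```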
4.5 Step F5: Writing Up Results
Here’s how to write up ANOVA results in APA format:
“A one-way between-groups analysis of variance was conducted to explore the impact of age on optimism levels. Participants were divided into three groups according to their age (Group 1: 18-29 years; Group 2: 30-44 years; Group 3: 45+ years). There was a [significant/non-significant] difference in optimism scores for the three age groups: F(2, XXX) = XX.XX, p = .XXX. The effect size, calculated using eta squared, was .XX. Post-hoc comparisons using the Tukey HSD test indicated that the mean score for Group 1 (M = XX.XX, SD = XX.XX) was significantly different from Group 2 (M = XX.XX, SD = XX.XX)…”
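The F statistic, degrees of freedom, p-value, and eta squared for this template can be pulled straight from the aov() summary. The data below are simulated, so the numbers are illustrative only:

```r
# Extract the APA numbers from an aov() fit (simulated data)
set.seed(1)
dat <- data.frame(score = rnorm(90, mean = rep(c(20, 22, 21), each = 30), sd = 4),
                  group = factor(rep(c("18-29", "30-44", "45+"), each = 30)))
tab  <- summary(aov(score ~ group, data = dat))[[1]]
eta2 <- tab$"Sum Sq"[1] / sum(tab$"Sum Sq")
apa_line <- sprintf("F(%.0f, %.0f) = %.2f, p = %.3f, eta-squared = %.2f",
                    tab$Df[1], tab$Df[2], tab$"F value"[1], tab$"Pr(>F)"[1], eta2)
apa_line
```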
5 Exercise G: Chi-Square Test
Chi-Square tests help us analyze categorical data - information that falls into distinct groups or categories. Think of these tests as ways to determine whether the patterns we see in our data are meaningful or just due to chance. We’ll explore two main types of Chi-Square tests:
- Chi-Square Test for Goodness of Fit: Compares our observed data to what we would expect based on some theory or known population values.
- Chi-Square Test for Independence: Examines whether two categorical variables are related to each other.
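Before working through the survey data, here is a minimal, self-contained illustration of both tests using fabricated counts (these numbers are invented for the example):

```r
# Goodness of fit: do 40 smokers out of 150 match an expected 20% rate?
observed <- c(Smokers = 40, NonSmokers = 110)
gof <- chisq.test(observed, p = c(0.2, 0.8))
print(gof)

# Independence: is smoking status related to gender? (fabricated 2x2 counts)
tab <- matrix(c(20, 80,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Smoking = c("Smoker", "Non-smoker")))
ind <- chisq.test(tab)
print(ind)
```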
5.1 Step G1: Chi-Square Test for Goodness of Fit (Smoking Rates)
Imagine you’re wondering whether the proportion of smokers in your sample matches what you’d expect based on national statistics. This is exactly what a goodness of fit test helps us determine.
In this example, we check whether our sample’s smoking rate matches the expected population rate of 20%:
# First, let's look at our data
smoker_table <- table(survey_data_full$smoke)
print("Observed frequencies:")
print(smoker_table)
# Calculate percentages
smoker_props <- prop.table(smoker_table) * 100
cat("\nObserved percentages:\n")
print(round(smoker_props, 1))
# Perform Chi-Square test
# Expected proportions (20% smokers, 80% non-smokers)
expected_props <- c(0.2, 0.8)
chisq_result <- chisq.test(smoker_table, p = expected_props)
# Display detailed results
cat("\nChi-Square Test Results:\n")
print(chisq_result)
# Calculate and display expected frequencies
n_total <- sum(smoker_table)
expected_freq <- n_total * expected_props
cat("\nComparison of observed vs expected frequencies:\n")
comparison_df <- data.frame(
Category = c("Smokers", "Non-smokers"),
Observed = as.numeric(smoker_table),
Expected = expected_freq,
Difference = as.numeric(smoker_table) - expected_freq
)
print(comparison_df)
5.2 Step G2: Interpreting Goodness of Fit Results
The Chi-Square test gives us several pieces of information:
- The Chi-Square statistic (χ²): Measures how different our observed values are from what we expected
- Degrees of freedom (df): The number of categories minus 1
- p-value: Tells us whether any difference is statistically significant
In our example:
- If p < 0.05: The smoking rate in our sample is significantly different from 20%
- If p ≥ 0.05: We don’t have evidence that our sample differs from the expected 20%
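A common effect size for the goodness-of-fit test is Cohen's w = sqrt(χ² / N), which chisq.test() does not report directly. A quick sketch with made-up counts:

```r
# Cohen's w = sqrt(chi-squared / N); counts here are made up
observed <- c(45, 155)                       # e.g. smokers vs non-smokers
res <- chisq.test(observed, p = c(0.2, 0.8)) # expected 20% / 80%
cohens_w <- sqrt(unname(res$statistic) / sum(observed))
round(cohens_w, 3)  # conventionally, 0.1/0.3/0.5 = small/medium/large
```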
Modern R packages provide enhanced capabilities for conducting and visualizing Chi-Square analyses. We’ll use several packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "rstatix", "ggstatsplot", "knitr"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(rstatix) # For statistical analysis
library(ggstatsplot) # For statistical visualization
library(knitr) # For dynamic report generation
Enhanced Visualization of Observed vs Expected
The ggstatsplot package creates informative visualizations that include statistical results:
# Create enhanced pie chart with statistical results
ggpiestats(
data = survey_data_full,
x = smoke,
title = "Observed vs Expected Smoking Rates",
xlab = "Smoking Status",
ylab = "Count",
results.subtitle = TRUE,
subtitle = "Testing against expected 20% smoking rate",
type = "parametric",
paired = FALSE,
ratio = c(0.2, 0.8) # Expected proportions for the goodness-of-fit test
)
Here is an alternative approach to visualizing the data and results in a bar chart using ggplot2:
# Alternative approach using ggplot2
smoker_counts <- table(survey_data_full$smoke)
smoker_props <- prop.table(smoker_counts)
expected_props <- c(0.2, 0.8)
expected_counts <- sum(smoker_counts) * expected_props
# Create data frame for plotting
plot_data <- data.frame(
Status = names(smoker_counts),
Observed = as.vector(smoker_counts),
Expected = expected_counts
)
# Plot
ggplot(plot_data, aes(x = Status)) +
geom_bar(aes(y = Observed), stat = "identity", fill = "steelblue", alpha = 0.7) +
geom_point(aes(y = Expected), color = "red", size = 3) +
geom_line(aes(y = Expected, group = 1), color = "red", linetype = "dashed") +
theme_minimal() +
labs(title = "Observed vs Expected Smoking Rates",
subtitle = paste("Chi-square test: p =",
round(chisq.test(smoker_counts, p = expected_props)$p.value, 3)),
x = "Smoking Status",
y = "Count")
Comprehensive Statistical Analysis
The rstatix package provides clear statistical output:
# Filter to only include Yes/No responses
filtered_data <- survey_data_full %>%
filter(smoke %in% c("Yes", "No"))
# Now proceed with calculating observed frequencies on filtered data
obs_freq <- filtered_data %>%
count(smoke) %>%
mutate(
prop = n / sum(n), # Observed proportions
expected_prop = c(0.2, 0.8), # Expected proportions
expected_n = sum(n) * expected_prop, # Expected counts
contrib = (n - expected_n)^2 / expected_n # Contribution to Chi-square
)
# Create a contingency table (observed vs expected)
contingency_table <- data.frame(
Category = obs_freq$smoke,
Observed = obs_freq$n,
Expected = obs_freq$expected_n
)
# Perform Chi-square goodness-of-fit test using `chisq_test()`
chisq_result <- chisq_test(obs_freq$n, p = obs_freq$expected_prop)
# Add significance stars
chisq_result <- chisq_result %>%
add_significance()
# Display observed vs expected frequencies
obs_freq %>%
kable(digits = 3, caption = "Observed vs. Expected Smoking Frequencies")
# Display results
chisq_result
Advantages of Modern Packages
- Enhanced Visualization
- One-click visualization of observed vs expected frequencies
- Automated proportion plots with error bars
- Clear representation of deviations from expected values
- Clearer Output
- Formatted tables comparing observed and expected frequencies
- Clear presentation of test statistics
- Automatic computation of standardized residuals
- Additional Tools
- Multiple effect size measures (Cohen’s w, Cramer’s V)
- Power analysis capabilities
- Easy comparison with multiple expected distributions
- Better Integration
- Simple workflow from data to visualization
- Easy export of results for reporting
- Consistent syntax across analyses
Best Practices for Reporting
- Report:
- Sample size and degrees of freedom
- Expected proportions and their source/justification
- Chi-square statistic and exact p-value
- Effect size (typically Cohen’s w)
- Visual comparison of observed vs expected frequencies
- Include:
- Clear statement of the null hypothesis
- Table of observed and expected frequencies
- Standardized residuals for each category
- Power analysis if the result is non-significant
- Consider:
- Whether categories should be combined
- If sample size is adequate
- Alternative tests for small samples
5.3 Step G3: Chi-Square Test for Independence (Smoking and Gender)
Now let’s examine whether there’s a relationship between smoking and gender. This test helps us determine if two categorical variables are independent of each other.
First, let’s create a cross-tabulation and examine our data:
# Create cross-tabulation
cross_tab <- table(survey_data_full$sex, survey_data_full$smoke)
# Add row and column names for clarity
dimnames(cross_tab) <- list(
Gender = c("Male", "Female"),
Smoking = c("Smoker", "Non-smoker")
)
# Display the cross-tabulation
print("Cross-tabulation of Gender and Smoking:")
print(cross_tab)
# Calculate and display percentages by gender
prop_table <- prop.table(cross_tab, margin = 1) * 100
cat("\nPercentages within each gender:\n")
print(round(prop_table, 1))
Before performing the Chi-Square test of independence, we should check if our data meets the assumptions:
# Calculate expected frequencies
expected <- chisq.test(cross_tab)$expected
cat("\nExpected frequencies:\n")
print(round(expected, 2))
# Check if any expected frequencies are less than 5
min_expected <- min(expected)
# Print result using cat()
cat("\nMinimum expected frequency:", min_expected, "\n")
# Print additional warning if expected frequencies are less than 5
if(min_expected < 5) {
print("Warning: Some expected frequencies are less than 5!")
}
Now we can perform the Chi-Square test of independence:
# Perform Chi-Square test with continuity correction (for 2x2 tables)
chi_result <- chisq.test(cross_tab, correct = TRUE)
cat("\nChi-Square Test Results:\n")
print(chi_result)
# Calculate effect size (Cramer's V)
n <- sum(cross_tab)
cramer_v <- sqrt(chi_result$statistic / (n * (min(dim(cross_tab)) - 1)))
cat("\nCramer's V:", round(cramer_v, 3))
To understand effect sizes in a 2x2 table, we can interpret Cramer's V as follows:
- 0.1: Small effect
- 0.3: Medium effect
- 0.5: Large effect
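These thresholds can be wrapped in a small helper of our own (not from any package):

```r
# Our own helper applying the rule-of-thumb thresholds above
interpret_cramers_v <- function(v) {
  if (v >= 0.5) "large"
  else if (v >= 0.3) "medium"
  else if (v >= 0.1) "small"
  else "negligible"
}
interpret_cramers_v(0.12)  # "small"
interpret_cramers_v(0.35)  # "medium"
```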
Standardized residuals help us understand which cells contribute most to any significant results:
# Calculate adjusted standardized residuals
stdres <- chisq.test(cross_tab)$stdres
dimnames(stdres) <- dimnames(cross_tab)
print("\nAdjusted standardized residuals:")
print(round(stdres, 2))
Let’s create some helpful visualizations:
# Create barplot of smoking rates by gender
barplot(prop_table,
beside = TRUE,
main = "Smoking Status by Gender",
xlab = "Smoking Status",
ylab = "Percentage",
col = c("lightblue", "pink"),
legend.text = TRUE)
We can also compute the proportions manually and then plot them with ggplot2:
# Load ggplot2 (part of the tidyverse) if it is not already loaded
library(ggplot2)
# Create contingency table of counts
table_counts <- table(survey_data_full$sex, survey_data_full$smoke)
# Convert counts to proportions (row-wise percentages)
prop_table <- prop.table(table_counts, margin = 1) * 100 # Convert to percentage
# Convert to data frame (since ggplot2 requires a data frame)
gender <- rep(rownames(prop_table), times = ncol(prop_table))
smoke <- rep(colnames(prop_table), each = nrow(prop_table))
percentage <- as.vector(prop_table)
plot_data <- data.frame(gender, smoke, percentage)
# Create bar plot with ggplot2
ggplot(plot_data, aes(x = smoke, y = percentage, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") + # Grouped bars
labs(
title = "Smoking Status by Gender",
x = "Smoking Status",
y = "Percentage"
) +
scale_fill_manual(values = c("lightblue", "pink")) + # Custom colors
theme_minimal()
Modern R packages provide enhanced capabilities for conducting and visualizing Chi-Square analyses. We’ll use several packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "rstatix", "ggstatsplot", "janitor",
"effectsize"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(rstatix) # For statistical analysis
library(ggstatsplot) # For statistical visualization
library(janitor) # For examining and cleaning data
library(effectsize) # For effect size calculations
Enhanced Cross-Tabulation Visualization with ggstatsplot
# Create an enhanced visualization of the relationship
ggbarstats(
data = survey_data_full,
x = sex,
y = smoke,
title = "Relationship between Gender and Smoking Status",
xlab = "Gender",
ylab = "Proportion",
label = "percentage",
results.subtitle = TRUE
)
Comprehensive Analysis with rstatix
# Create a contingency table (cross-tabulation of sex and smoking status)
contingency_table <- table(survey_data_full$sex, survey_data_full$smoke)
# Perform Chi-square test for independence
independence_test <- chisq_test(contingency_table) %>%
add_significance()
# Calculate effect size (Cramér’s V)
independence_effect <- cramer_v(contingency_table)
# Display results
list(
"Chi-square Test Results" = independence_test,
"Effect Size (Cramér's V)" = independence_effect
)
Post-hoc Analysis
If we find significant associations, we can examine standardized residuals:
# Create contingency table
contingency_table <- table(survey_data_full$sex, survey_data_full$smoke)
# Perform Chi-square test using base R (needed for residuals)
chisq_result <- chisq.test(contingency_table)
# Extract standardized residuals
residuals_matrix <- chisq_result$stdres # Standardized residuals matrix
# Convert matrix to a data frame
residuals_analysis <- as.data.frame(as.table(residuals_matrix))
colnames(residuals_analysis) <- c("sex", "smoke", "residual")
# Create heatmap of standardized residuals
ggplot(residuals_analysis, aes(x = sex, y = smoke, fill = residual)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white") +
geom_text(aes(label = round(residual, 2))) +
theme_minimal() +
labs(
title = "Standardized Residuals from Chi-Square Test",
fill = "Residual"
)Effect Size Visualization
The effectsize package helps understand practical significance:
# Calculate Cramer's V effect size
effect_size <- effectsize::cramers_v(
table(survey_data_full$sex, survey_data_full$smoke)
)
# Print result
effect_size
Creating a Complete Report
We can create a comprehensive analysis function:
analyze_categorical <- function(data, var1, var2 = NULL) {
if (is.null(var2)) {
# One-variable analysis
result <- list(
# Frequency table
frequency = data %>%
count(!!sym(var1)) %>%
mutate(proportion = n / sum(n)),
# Pie chart visualization (ggbarstats() requires two variables,
# so use ggpiestats() for the one-variable case)
visualization = ggpiestats(
data = data,
x = !!sym(var1)
),
# Chi-square goodness-of-fit test
test = data %>%
count(!!sym(var1)) %>%
pull(n) %>%
chisq.test() %>%
broom::tidy() %>%
add_significance(p.col = "p.value"),
# Effect size (not applicable for one variable)
effect = NA
)
} else {
# Two-variable independence test
contingency_table <- table(data[[var1]], data[[var2]]) # Create contingency table
result <- list(
# Cross-tabulation (contingency table)
frequency = as.data.frame.matrix(contingency_table),
# Bar chart visualization
visualization = ggbarstats(
data = data,
x = !!sym(var1),
y = !!sym(var2)
),
# Chi-square independence test
test = chisq_test(contingency_table) %>%
add_significance(),
# Cramér’s V effect size
effect = cramer_v(contingency_table)
)
}
return(result)
}
# Example usage
smoking_analysis <- analyze_categorical(
data = survey_data_full,
var1 = "sex",
var2 = "smoke"
)
# Display results
smoking_analysis
Advantages of Modern Packages
- Enhanced Visualization
- Mosaic plots with statistical annotations
- Heat maps of associations
- Interactive contingency tables
- Clearer Output
- Well-formatted contingency tables
- Clear presentation of cell percentages
- Highlighted significant associations
- Additional Tools
- Multiple association measures (Phi, Cramer’s V, Lambda)
- Post-hoc cell-wise tests
- Assumption checking for expected frequencies
- Better Integration
- Streamlined workflow for multiple variables
- Easy integration with other categorical analyses
- Consistent reporting formats
Best Practices for Reporting
- Report:
- Complete contingency table with row and column totals
- Chi-square statistic and exact p-value
- Effect size (Cramer’s V or Phi for 2x2 tables)
- Degrees of freedom
- Smallest expected frequency
- Include:
- Cross-tabulation with percentages
- Standardized residuals for significant results
- Visual representation of the relationship
- Confidence intervals for proportions
- Consider:
- Cell size assumptions
- Need for Fisher’s exact test
- Multiple comparison corrections
- Direction and strength of associations
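On the Fisher’s exact test point: when expected cell counts fall below 5 in a 2x2 table, base R’s `fisher.test()` is the usual alternative to the Chi-Square test, since it does not rely on large-sample approximations. A minimal sketch with made-up counts (not from the survey data):

```r
# Hypothetical small-sample 2x2 table where chi-square assumptions may fail
small_tab <- matrix(c(3, 9, 7, 2), nrow = 2,
                    dimnames = list(Gender = c("Male", "Female"),
                                    Smoking = c("Smoker", "Non-smoker")))
# Fisher's exact test computes the p-value exactly from the table
fisher_result <- fisher.test(small_tab)
print(fisher_result$p.value)
```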
5.4 Step G4: Writing Up Results
Here’s how to write up the results of a goodness of fit test in a formal report:
“A chi-square goodness of fit test was performed to determine whether the sample’s smoking rate differed from the expected population rate of 20%. The test revealed [insert significance statement based on results]. The observed rate was [X]% compared to the expected 20%.”
And how to write up the results of an independence test in a formal report:
“A chi-square test of independence was conducted to examine the relationship between gender and smoking status. All expected cell frequencies were greater than 5. The test revealed [insert significance statement based on results]. Among males, [X]% were smokers compared to [Y]% of females.”
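Both templates can also be filled programmatically from the test objects, which avoids transcription errors. A sketch for the independence write-up, using a made-up table so it runs on its own (in practice you would reuse `cross_tab` and `chi_result` from Step G3):

```r
# Made-up counts for illustration; replace with the real cross-tabulation
cross_tab <- matrix(c(30, 20, 50, 40), nrow = 2,
                    dimnames = list(Gender = c("Male", "Female"),
                                    Smoking = c("Smoker", "Non-smoker")))
chi_result <- chisq.test(cross_tab, correct = TRUE)
# Row-wise smoking percentages for the narrative
pct <- prop.table(cross_tab, margin = 1)[, "Smoker"] * 100
sentence <- sprintf(
  paste("A chi-square test of independence was conducted to examine the",
        "relationship between gender and smoking status. The test revealed",
        "X2(%d, N = %d) = %.2f, p = %.3f. Among males, %.1f%% were smokers",
        "compared to %.1f%% of females."),
  as.integer(chi_result$parameter), sum(cross_tab),
  chi_result$statistic, chi_result$p.value,
  pct["Male"], pct["Female"])
cat(sentence)
```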
6 Exercise H: Exploring the Staff Survey Data
6.1 Step H1: Setup
You can continue working in the same project or working directory that you created in the previous lab and have been working in for the past exercises. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your
scriptsfolder with an appropriate name, e.g.,Lab2_Exercise_H.R.
Try to start with the empty script that you created above (Lab2_Exercise_H.R) and work through this exercise before looking at the solutions. But you can also download the R script for Exercise H and save the downloaded file in the scripts folder.
Now you are ready to begin your work in R and continue with Step H2!
6.2 Step H2: Loading the Dataset
Please download the dataset and download the code book and save both in the data folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
staff_survey_data <- read.csv("data/lab2-staff-survey.csv")
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(staff_survey_data)
# Examine the structure
str(staff_survey_data)
# Get basic summary statistics
summary(staff_survey_data)
In line with the code book, we want to assign labels to the variables in the staff_survey_data data frame using factors with labeled levels. This method ensures the data remains categorical but with human-readable labels for easier interpretation and analysis.
# First, let's create vectors for our common level labels since they're reused often
extent_levels <- c("not at all", "to a slight extent", "to a moderate extent",
"to a great extent", "to a very great extent")
importance_levels <- c("not important", "slightly important", "moderately important",
"very important", "extremely important")
# Create age groups labels
staff_survey_data$age <- factor(staff_survey_data$age,
levels = 1:5,
labels = c("under 20", "21 to 30", "31 to 40",
"41 to 50", "over 50"))
# Employment status
staff_survey_data$employstatus <- factor(staff_survey_data$employstatus,
levels = 1:2,
labels = c("permanent", "casual"))
# Now let's handle the repeated patterns for Q1a through Q10a
# We'll use a loop to avoid repetitive code
for(i in 1:10) {
# Convert "extent" questions (Q1a through Q10a)
staff_survey_data[[paste0("Q", i, "a")]] <- factor(
staff_survey_data[[paste0("Q", i, "a")]],
levels = 1:5,
labels = extent_levels
)
# Convert "importance" questions (Q1imp through Q10imp)
staff_survey_data[[paste0("Q", i, "imp")]] <- factor(
staff_survey_data[[paste0("Q", i, "imp")]],
levels = 1:5,
labels = importance_levels
)
}
# Finally, convert the recommend variable
staff_survey_data$recommend <- factor(staff_survey_data$recommend,
levels = c(0, 1),
labels = c("no", "yes"))
# Let's add a verification step to check our work
# This will print the levels of each variable to confirm they're correctly labeled
verify_factors <- function(data) {
for(col in names(data)) {
if(is.factor(data[[col]])) {
cat("\nLevels for", col, ":\n")
print(levels(data[[col]]))
}
}
}
# Run the verification
verify_factors(staff_survey_data)
The summary() function can also be used to confirm that the labels have been applied correctly to the variables.
# Verify changes by printing a summary
summary(staff_survey_data)
Try working through the following steps and answer the questions before looking at the solutions.
6.3 Step H3: Frequency Tables
Create frequency tables of the demographic variables city, service, and employstatus.
Note: The solution is provided in the callout notes below. We’ll explore the staff_survey_data using only base R functions (orange callout note) and also provide solutions using R packages (in the usual green callout notes). While modern packages offer many conveniences, understanding base R approaches provides a solid foundation and helps us appreciate the underlying statistical processes.
Let’s begin by examining the distribution of our key demographic variables. Base R provides several functions for creating and formatting frequency tables:
# Create frequency table for city
city_table <- table(staff_survey_data$city)
city_props <- prop.table(city_table) * 100
# Combine counts and percentages
city_summary <- cbind(
Frequency = as.vector(city_table),
Percentage = as.vector(city_props)
)
rownames(city_summary) <- names(city_table)
# Display the results with formatting
print("Distribution of Staff by City:")
print(round(city_summary, 1))
# Create frequency table for employment status
status_table <- table(staff_survey_data$employstatus)
status_props <- prop.table(status_table) * 100
status_summary <- cbind(
Frequency = as.vector(status_table),
Percentage = as.vector(status_props)
)
rownames(status_summary) <- names(status_table)
print("\nDistribution of Employment Status:")
print(round(status_summary, 1))
# Create frequency table for service
# First, let's create reasonable bins for years of service
service_breaks <- seq(0, max(staff_survey_data$service, na.rm = TRUE) + 5, by = 5)
service_cats <- cut(staff_survey_data$service,
breaks = service_breaks,
include.lowest = TRUE)
service_table <- table(service_cats)
service_props <- prop.table(service_table) * 100
service_summary <- cbind(
Frequency = as.vector(service_table),
Percentage = as.vector(service_props)
)
rownames(service_summary) <- names(service_table)
print("\nDistribution of Years of Service:")
print(round(service_summary, 1))
When interpreting these results, consider:
- Look for notable imbalances in group sizes
- Consider whether distributions match expectations
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
First, we’ll create clear, informative frequency tables for our key demographic variables:
# Function to create a nicely formatted frequency table
create_freq_table <- function(data, variable) {
data %>%
count({{variable}}) %>%
mutate(Percentage = n/sum(n) * 100) %>%
kable(digits = 1,
col.names = c("Category", "Count", "Percentage"),
caption = paste("Distribution of", deparse(substitute(variable)))) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
}
# Create tables for each demographic variable
create_freq_table(staff_survey_data, city)
create_freq_table(staff_survey_data, service)
create_freq_table(staff_survey_data, employstatus)
6.4 Step H4: Histogram
Create a histogram of the demographic variable service and look for outliers.
We’ll create a histogram to understand the distribution of service years and identify potential outliers:
# Stack the histogram and box plot in one plotting window
par(mfrow = c(2, 1))
# Create histogram
hist(staff_survey_data$service,
main = "Distribution of Years of Service",
xlab = "Years of Service",
ylab = "Frequency",
breaks = 20,
col = "lightblue",
border = "white")
# Add a box plot for outlier detection
boxplot(staff_survey_data$service,
horizontal = TRUE,
main = "Box Plot of Years of Service",
xlab = "Years of Service",
col = "lightblue")
# Reset plotting parameters
par(mfrow = c(1, 1))
# Calculate summary statistics
service_summary_stats <- summary(staff_survey_data$service)
service_sd <- sd(staff_survey_data$service, na.rm = TRUE)
# Print summary statistics
print("\nSummary Statistics for Years of Service:")
print(service_summary_stats)
print(paste("Standard Deviation:", round(service_sd, 2)))
# Identify potential outliers using the 1.5 * IQR rule
Q1 <- quantile(staff_survey_data$service, 0.25, na.rm = TRUE)
Q3 <- quantile(staff_survey_data$service, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
outlier_threshold_upper <- Q3 + 1.5 * IQR
outlier_threshold_lower <- Q1 - 1.5 * IQR
outliers <- staff_survey_data$service[staff_survey_data$service > outlier_threshold_upper |
staff_survey_data$service < outlier_threshold_lower]
if(length(outliers) > 0) {
print("\nPotential outliers identified:")
print(sort(outliers))
}
When interpreting these results, consider:
- Note the shape of the distribution
- Identify any unusual patterns or gaps
- Consider whether outliers represent errors or true values
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
Let’s create a detailed histogram to understand the distribution of years of service and identify any potential outliers:
# Create histogram with density curve
ggplot(staff_survey_data, aes(x = service)) +
geom_histogram(aes(y = after_stat(density)),
bins = 30,
fill = "lightblue",
color = "black") +
geom_density(color = "red") +
theme_minimal() +
labs(x = "Years of Service",
y = "Density",
title = "Distribution of Years of Service",
subtitle = "With density curve overlay") +
# Add boxplot for outlier visualization
geom_boxplot(aes(y = -0.02), width = 0.1)
# Calculate summary statistics for service
service_summary <- staff_survey_data %>%
summarize(
Mean = mean(service, na.rm = TRUE),
Median = median(service, na.rm = TRUE),
SD = sd(service, na.rm = TRUE),
Q1 = quantile(service, 0.25, na.rm = TRUE),
Q3 = quantile(service, 0.75, na.rm = TRUE),
IQR = IQR(service, na.rm = TRUE),
Min = min(service, na.rm = TRUE),
Max = max(service, na.rm = TRUE)
)
# Print summary statistics
kable(service_summary, digits = 1,
caption = "Summary Statistics for Years of Service") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# Identify potential outliers
outliers <- staff_survey_data %>%
filter(service > (service_summary$Q3 + 1.5 * service_summary$IQR) |
service < (service_summary$Q1 - 1.5 * service_summary$IQR))
Looking at the distribution of years of service, we can observe:
- The shape of the distribution (whether it’s symmetric or skewed)
- Any unusual patterns or gaps
- Potential outliers, which we’ve identified using the 1.5 × IQR rule
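The 1.5 × IQR rule can be seen in isolation on a tiny made-up vector (values chosen so one point is an obvious outlier):

```r
# Made-up service years; 30 is a deliberate outlier
service_years <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)
q <- quantile(service_years, c(0.25, 0.75))   # Q1 = 3, Q3 = 7
iqr_val <- q[2] - q[1]                        # IQR = 4
# Flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
outliers <- service_years[service_years < q[1] - 1.5 * iqr_val |
                          service_years > q[2] + 1.5 * iqr_val]
print(outliers)   # 30
```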
6.5 Step H5: Cross-Tabulation
Cross-tabulate and study the age groups (agerecode) against employstatus.
Let’s examine the relationship between age groups and employment status using a cross-tabulation:
# Create cross-tabulation
age_employ_table <- table(staff_survey_data$agerecode,
staff_survey_data$employstatus)
# Calculate row percentages
age_employ_props <- prop.table(age_employ_table, margin = 1) * 100
# Print the results
print("Counts by Age Group and Employment Status:")
print(age_employ_table)
print("\nPercentages within Age Groups:")
print(round(age_employ_props, 1))
# Create a visual representation using base R
# Set up the plotting area
barplot(t(age_employ_props),
beside = TRUE,
col = c("lightblue", "lightgreen"),
main = "Employment Status by Age Group",
xlab = "Age Group",
ylab = "Percentage",
legend.text = colnames(age_employ_props))
# Perform chi-square test of independence
chi_sq_test <- chisq.test(age_employ_table)
print("\nChi-square test of independence:")
print(chi_sq_test)
When interpreting these results, consider:
- Look for patterns in employment type across age groups
- Consider the chi-square test results
- Think about implications for workforce planning
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
Let’s examine the relationship between age groups and employment status:
# Create cross-tabulation
age_employ_table <- table(staff_survey_data$agerecode,
staff_survey_data$employstatus)
# Convert to percentages and format nicely
age_employ_props <- prop.table(age_employ_table, margin = 1) * 100
# Combine counts and percentages in a nice table
age_employ_combined <- cbind(
as.data.frame.matrix(age_employ_table),
as.data.frame.matrix(age_employ_props)
) %>%
setNames(c("Permanent (n)", "Casual (n)",
"Permanent (%)", "Casual (%)"))
kable(age_employ_combined,
digits = 1,
caption = "Age Groups by Employment Status") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# Create a visualization
ggplot(staff_survey_data,
aes(x = agerecode, fill = employstatus)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(x = "Age Group",
y = "Proportion",
fill = "Employment Status",
title = "Employment Status Distribution by Age Group") +
coord_flip()
6.6 Step H6: Total Staff Satisfaction
How do you think total staff satisfaction (totsatis) is calculated? Is there a difference in Total staff satisfaction among the two staff statuses? (Hint: t-test)
To understand how total staff satisfaction (totsatis) might be calculated, let’s examine its relationship with individual satisfaction items:
# Identify satisfaction-related variables (Q*a)
satisfaction_vars <- grep("Q.*a$", names(staff_survey_data), value = TRUE)
# Calculate correlations with total satisfaction
# (the Q*a items were converted to factors in Step H2, so convert them
# back to their numeric codes before computing correlations)
correlations <- sapply(staff_survey_data[satisfaction_vars],
function(x) cor(as.numeric(x), staff_survey_data$totsatis,
use = "complete.obs"))
# Print correlations
print("Correlations with Total Satisfaction:")
print(round(correlations, 3))
# Create scatterplot matrix of selected variables
# (convert the factor items to numeric codes so pairs() can plot them)
pairs(data.frame(lapply(staff_survey_data[satisfaction_vars[1:5]], as.numeric),
totsatis = staff_survey_data$totsatis),
main = "Relationships between Satisfaction Measures")
Let’s examine whether there are differences in total satisfaction between permanent and casual staff:
# Calculate descriptive statistics by group
tapply(staff_survey_data$totsatis,
staff_survey_data$employstatus,
function(x) c(n = length(x),
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE)))
# Create box plot for visual comparison
boxplot(totsatis ~ employstatus,
data = staff_survey_data,
main = "Total Satisfaction by Employment Status",
xlab = "Employment Status",
ylab = "Total Satisfaction Score",
col = "lightblue")
# Add individual points for better visualization
stripchart(totsatis ~ employstatus,
data = staff_survey_data,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = "darkgray")
# Perform t-test
satisfaction_ttest <- t.test(totsatis ~ employstatus,
data = staff_survey_data)
# Print t-test results
print("\nt-test Results:")
print(satisfaction_ttest)
# Calculate effect size (Cohen's d)
group1 <- staff_survey_data$totsatis[staff_survey_data$employstatus == "permanent"]
group2 <- staff_survey_data$totsatis[staff_survey_data$employstatus == "casual"]
cohens_d <- function(x, y) {
nx <- length(x)
ny <- length(y)
pooled_sd <- sqrt(((nx-1)*var(x) + (ny-1)*var(y))/(nx+ny-2))
abs(mean(x) - mean(y))/pooled_sd
}
effect_size <- cohens_d(group1, group2)
# Print effect size
cat("\nEffect size (Cohen's d):", round(effect_size, 3), "\n")
When interpreting these results, consider:
- Consider both statistical significance (p-value) and practical significance (effect size)
- Look at the distribution of scores within each group
- Think about potential confounding variables
We can also use R packages that work together to create comprehensive and intuitive analyses:
# Install packages if needed
install.packages(c("tidyverse", "knitr", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(kableExtra) # For enhanced table formatting
The total staff satisfaction score (totsatis) is likely calculated as a composite measure of various satisfaction-related questions in the survey. Let’s examine its components and distribution:
# Examine the structure of satisfaction-related variables
satisfaction_vars <- names(staff_survey_data)[grep("Q.*a$", names(staff_survey_data))]
# Look at correlations between these variables and totsatis
# (convert the factor items back to their numeric codes first)
satisfaction_correlations <- staff_survey_data %>%
select(all_of(c(satisfaction_vars, "totsatis"))) %>%
mutate(across(all_of(satisfaction_vars), as.numeric)) %>%
cor(use = "complete.obs")
# Print correlations with total satisfaction
kable(satisfaction_correlations["totsatis", ],
digits = 3,
caption = "Correlations with Total Satisfaction") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
To examine whether there are differences in total satisfaction between permanent and casual staff, we’ll conduct a t-test:
# First, let's look at descriptive statistics
satisfaction_by_status <- staff_survey_data %>%
group_by(employstatus) %>%
summarise(
n = n(),
mean = mean(totsatis, na.rm = TRUE),
sd = sd(totsatis, na.rm = TRUE),
se = sd/sqrt(n)
)
# Print descriptive statistics
kable(satisfaction_by_status,
digits = 2,
caption = "Satisfaction Scores by Employment Status") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# Create visualization
ggplot(staff_survey_data, aes(x = employstatus, y = totsatis)) +
geom_boxplot(fill = "lightblue") +
geom_jitter(width = 0.2, alpha = 0.2) +
theme_minimal() +
labs(x = "Employment Status",
y = "Total Satisfaction Score",
title = "Distribution of Satisfaction Scores by Employment Status")
# Perform t-test
satisfaction_ttest <- t.test(totsatis ~ employstatus,
data = staff_survey_data)
# Calculate effect size (Cohen's d)
cohens_d <- function(x, y) {
nx <- length(x)
ny <- length(y)
pooled_sd <- sqrt(((nx-1)*var(x) + (ny-1)*var(y))/(nx+ny-2))
abs(mean(x) - mean(y))/pooled_sd
}
effect_size <- with(staff_survey_data,
cohens_d(totsatis[employstatus == "permanent"],
totsatis[employstatus == "casual"]))
# Print t-test results
print(satisfaction_ttest)
cat("\nEffect size (Cohen's d):", round(effect_size, 3))
7 Exercise I: Logistic Regression
7.1 Step I1: Setup
You can continue working in the same project or working directory that you created in the previous lab and have been working in for the past exercises. You should, however, do the following:
- Create a new R script: Go to File > New File > R Script.
- Save the script in your
scriptsfolder with an appropriate name, e.g.,Lab2_Exercise_I.R.
You can either work through the following steps and copy/paste the respective code into the Lab2_Exercise_I.R file that you created in Step I1, or download the R script for Exercise I, follow the instructions below, and save the downloaded file in your scripts folder.
Now you are ready to begin your work in R and continue with Step I2!
7.2 Step I2: Loading the Dataset
Please download the dataset and save the file in the data folder within your project folder or working directory.
# Load the CSV file stored in the "data" folder
sleep_survey_data <- read.csv("data/lab2-sleep-survey.csv")
You can easily explore and check the basic structure of your data and get a summary:
# View the first few rows using head()
head(sleep_survey_data)
# Examine the structure
str(sleep_survey_data)
# Get basic summary statistics
summary(sleep_survey_data)
We want to assign labels to some of the categorical variables in the sleep_survey_data data frame using factors with labeled levels. This method ensures the data remains categorical but with human-readable labels for easier interpretation and analysis.
# Let's start by creating our factor labels in a clear, organized way.
# When working with binary variables (0/1), it's especially important to be
# consistent with our labeling approach.
# First, let's handle the binary (0/1) variables
# We'll create these first since they share the same structure
sleep_survey_data$sex <- factor(sleep_survey_data$sex,
levels = c(0, 1),
labels = c("female", "male"))
sleep_survey_data$probsleep <- factor(sleep_survey_data$probsleep,
levels = c(0, 1),
labels = c("no", "yes"))
sleep_survey_data$fallsleep <- factor(sleep_survey_data$fallsleep,
levels = c(0, 1),
labels = c("no", "yes"))
sleep_survey_data$staysleep <- factor(sleep_survey_data$staysleep,
levels = c(0, 1),
labels = c("no", "yes"))
# Now let's handle the categorical variables with multiple levels
# For these, we'll use more descriptive labels
sleep_survey_data$marital <- factor(sleep_survey_data$marital,
levels = 1:4,
labels = c("single",
"married/defacto",
"divorced",
"widowed"))
sleep_survey_data$edlevel <- factor(sleep_survey_data$edlevel,
levels = 1:5,
labels = c("primary school",
"secondary school",
"trade training",
"undergraduate degree",
"postgraduate degree"))
# Let's add a verification step to make sure our conversions worked correctly
# This function will print the levels of each factor variable we created
verify_factors <- function(data) {
cat("Checking factor levels for all converted variables:\n\n")
# List of variables we converted
factor_vars <- c("sex", "probsleep", "fallsleep", "staysleep",
"marital", "edlevel")
for(var in factor_vars) {
cat(paste0("Levels for ", var, ":\n"))
print(levels(data[[var]]))
cat("\n")
}
}
# Run the verification
verify_factors(sleep_survey_data)
7.3 Step I3: Logistic Regression
7.3.1 What is Logistic Regression?
Logistic regression is a statistical method used when we want to predict a categorical outcome (like yes/no, pass/fail) based on one or more predictor variables. Think of it as answering questions like “What factors influence whether someone has sleep problems?” or “What characteristics predict if a student will pass an exam?”
We use logistic regression when:
- Our outcome variable is categorical (usually binary).
- We have multiple predictor variables (can be continuous or categorical).
- We want to understand both prediction and relationships.
In our sleep study, we’re trying to predict whether someone has sleep problems (yes/no) based on various characteristics and behaviors.
Understanding the Math (In Simple Terms)
While regular regression predicts actual values (like height or weight), logistic regression predicts the probability of belonging to a category (like having sleep problems). It uses a special S-shaped curve called a logistic function that keeps predictions between 0 and 1 (or 0% to 100% probability).
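Base R’s `plogis()` computes exactly this S-shaped logistic function, so we can see how any linear-predictor value is squeezed into the 0–1 range (a standalone illustration with arbitrary inputs, separate from the model fitting in this exercise):

```r
# The logistic function: p = 1 / (1 + exp(-x)), implemented by plogis()
linear_predictor <- seq(-6, 6, by = 2)
probs <- plogis(linear_predictor)
round(probs, 3)
# A linear predictor of 0 corresponds to a probability of exactly 0.5
plogis(0)
```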
7.3.2 Initial Data Preparation
Let’s prepare our data and examine our sample:
# First, let's check our sample size and missing data
initial_sample <- nrow(sleep_survey_data)
complete_cases <- sum(complete.cases(sleep_survey_data[
c("probsleep", "sex", "age", "fallsleep", "staysleep", "hrswknight")
]))
# Print sample information
cat("Total cases:", initial_sample, "\n")
cat("Complete cases:", complete_cases, "\n")
cat("Missing cases:", initial_sample - complete_cases, "\n")
# Check coding of categorical variables
cat("\nCoding of categorical variables:\n")
cat("\nSleep Problems (probsleep):\n")
print(table(sleep_survey_data$probsleep))
cat("\nSex:\n")
print(table(sleep_survey_data$sex))
cat("\nTrouble Falling Asleep (fallsleep):\n")
print(table(sleep_survey_data$fallsleep))
cat("\nTrouble Staying Asleep (staysleep):\n")
print(table(sleep_survey_data$staysleep))
7.3.3 Building the Logistic Regression Model
Now let’s create our model. We’ll use base R’s glm (Generalized Linear Model) function with the logit link:
# Create the logistic regression model
sleep_model <- glm(probsleep ~ sex + age + fallsleep + staysleep + hrswknight,
family = binomial(link = "logit"),
data = sleep_survey_data)
# Display the summary of our model
summary_results <- summary(sleep_model)
print(summary_results)
# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(sleep_model))
conf_int <- exp(confint(sleep_model))
# Combine results in a readable format
results_table <- cbind(
Estimate = round(coef(sleep_model), 3),
"Odds Ratio" = round(odds_ratios, 3),
"CI Lower" = round(conf_int[,1], 3),
"CI Upper" = round(conf_int[,2], 3),
"p-value" = round(summary_results$coefficients[,4], 3)
)
# Print results
cat("\nDetailed Results:\n")
print(results_table)
# Calculate model fit statistics
null_deviance <- sleep_model$null.deviance
model_deviance <- sleep_model$deviance
pseudo_r2 <- 1 - (model_deviance / null_deviance)
cat("\nModel Fit Statistics:\n")
cat("Null deviance:", round(null_deviance, 2), "\n")
cat("Model deviance:", round(model_deviance, 2), "\n")
cat("McFadden's Pseudo R²:", round(pseudo_r2, 3), "\n")
7.4 Step I4: Interpreting the Results
Let’s break down what these results tell us.
7.4.1 Model Convergence and Overall Fit
Our model successfully converged, which means it found a stable solution. The reduction in deviance from the null model (model with no predictors) to our final model indicates how much our predictors help explain sleep problems.
7.4.2 Individual Predictors
Let’s examine each predictor’s contribution:
# Create a function to interpret odds ratios
interpret_or <- function(or, p_value) {
direction <- if(or > 1) "increase" else "decrease"
magnitude <- abs(1 - or) * 100
significance <- if(p_value < 0.05) "significant" else "not significant"
sprintf("%.1f%% %s in odds (%s)", magnitude, direction, significance)
}
# Print interpretations for each predictor
cat("Interpretation of Effects:\n\n")
for(i in 2:nrow(results_table)) {
var_name <- rownames(results_table)[i]
or <- results_table[i, "Odds Ratio"]
p_val <- results_table[i, "p-value"]
cat(var_name, ":", interpret_or(or, p_val), "\n")
}

7.4.3 Key Findings
Our analysis reveals that:
- Staying Asleep (p < 0.001): This is our strongest predictor. People who have trouble staying asleep are significantly more likely to report general sleep problems.
- Falling Asleep (p = 0.035): Difficulty falling asleep is also a significant predictor, though not as strong as staying asleep.
- Hours of Sleep (p = 0.007): Each additional hour of sleep is associated with a decrease in the odds of reporting sleep problems.
- Age and Sex: Neither demographic variable significantly predicts sleep problems (p > 0.05).
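To see how an estimate in the results table translates into statements like these, consider a purely hypothetical coefficient (not taken from our output):

```r
# Hypothetical example: from a log-odds coefficient to a percent change in odds
b <- -0.35                    # hypothetical coefficient for hrswknight
or <- exp(b)                  # odds ratio per additional hour of sleep
pct_change <- (or - 1) * 100  # percent change in the odds
round(or, 3)          # 0.705
round(pct_change, 1)  # -29.5, i.e. roughly a 30% drop in odds per extra hour
```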
7.4.4 Model Accuracy
We can examine how well our model predicts sleep problems:
# Calculate predicted probabilities
predicted_probs <- predict(sleep_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
actual_class <- as.numeric(sleep_survey_data$probsleep) - 1  # assumes probsleep is a factor, so levels 1/2 map back to 0/1
# Create confusion matrix
conf_matrix <- table(Predicted = predicted_class, Actual = actual_class)
cat("\nConfusion Matrix:\n")
print(conf_matrix)
# Calculate accuracy metrics (assumes both classes occur among the predictions)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity <- conf_matrix[2,2] / sum(conf_matrix[,2])
specificity <- conf_matrix[1,1] / sum(conf_matrix[,1])
cat("\nModel Performance Metrics:\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("Sensitivity:", round(sensitivity, 3), "\n")
cat("Specificity:", round(specificity, 3), "\n")

7.4.5 Practical Implications
Our analysis suggests several important practical implications:
- Sleep Maintenance: The strong effect of staying asleep suggests that interventions focused on sleep maintenance might be most beneficial.
- Sleep Duration: The significant effect of hours of sleep indicates that strategies to increase sleep duration could help reduce sleep problems.
- Universal Impact: The lack of significant age and sex effects suggests that sleep problems affect people regardless of these demographic factors.
7.4.6 Limitations and Considerations
When interpreting these results, consider:
- Self-reported data may be subject to recall bias.
- The binary nature of our outcome variable might oversimplify complex sleep issues.
- There might be other important predictors we haven’t included in the model.
Modern R packages provide enhanced capabilities for conducting and interpreting logistic regression analyses. We’ll use several packages that work together to create a comprehensive analysis while making the process more intuitive and the results more interpretable.
# Install packages if needed
install.packages(c("tidyverse", "broom", "performance", "sjPlot",
"marginaleffects", "plotROC", "kableExtra"))
# Load required packages
library(tidyverse) # For data manipulation and visualization
library(broom) # For tidying model output
library(performance) # For model diagnostics and performance metrics
library(sjPlot) # For plotting odds ratios and model summaries
library(marginaleffects) # For calculating and plotting marginal effects

Understanding our Data
The tidyverse package provides powerful tools for initial data exploration:
# Create our analysis dataset
sleep_data <- sleep_survey_data %>%
select(probsleep, sex, age, fallsleep, staysleep, hrswknight) %>%
# Convert to factors with meaningful labels if not already done
mutate(across(c(probsleep, fallsleep, staysleep),
~factor(., levels = c(0, 1),
labels = c("No", "Yes"))),
# sex gets its own labels (assuming it is coded 0 = Male, 1 = Female)
sex = factor(sex, levels = c(0, 1),
labels = c("Male", "Female")))
# Examine our data structure
glimpse(sleep_data)
# Check missing data patterns
missing_pattern <- sleep_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Missing_Count")
ggplot(missing_pattern,
aes(x = reorder(Variable, Missing_Count), y = Missing_Count)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Missing Data Pattern",
x = "Variable",
y = "Number of Missing Values")

Building the Logistic Regression Model
The tidyverse and associated packages make model building and interpretation more straightforward:
# Fit the model
sleep_model <- glm(probsleep ~ sex + age + fallsleep + staysleep + hrswknight,
family = binomial(link = "logit"),
data = sleep_data)
# Get tidy model summary
model_summary <- tidy(sleep_model,
conf.int = TRUE,
exponentiate = TRUE)
# Create a nice summary table
model_summary %>%
mutate(across(where(is.numeric), ~round(., 3))) %>%
knitr::kable(col.names = c("Predictor", "Odds Ratio", "Std. Error",
"z value", "p-value", "CI Lower", "CI Upper"),
caption = "Logistic Regression Results") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Model Diagnostics and Performance
The performance package provides comprehensive model diagnostics:
# Check model performance
model_performance <- check_model(sleep_model)
# Get various performance metrics
performance_metrics <- model_performance(sleep_model)
# Plot ROC curve (geom_roc() and style_roc() come from the plotROC package)
library(plotROC)
pred_data <- sleep_data %>%
mutate(predicted = predict(sleep_model, type = "response"))
ggplot(pred_data, aes(d = as.numeric(probsleep) - 1, m = predicted)) +
geom_roc(n.cuts = 0) +
style_roc() +
theme_minimal() +
labs(title = "ROC Curve for Sleep Problems Model")

Understanding Variable Effects
The marginaleffects package helps us understand how our predictors influence sleep problems:
# Calculate average marginal effects
marg_effects <- avg_slopes(sleep_model)
# Plot marginal effects
plot_slopes(sleep_model,
variables = "hrswknight",
condition = "staysleep") +
theme_minimal() +
labs(title = "Effect of Sleep Hours by Staying Asleep Status",
x = "Hours of Sleep per Weeknight",
y = "Predicted Probability of Sleep Problems")
# Create effects plot for categorical variables
plot_model(sleep_model,
type = "pred",
terms = c("staysleep", "fallsleep")) +
theme_minimal() +
labs(title = "Interaction between Falling and Staying Asleep",
y = "Predicted Probability of Sleep Problems")

Visualizing Model Results
sjPlot provides excellent visualization tools for model results:
# Plot odds ratios
plot_model(sleep_model,
show.values = TRUE,
title = "Odds Ratios for Sleep Problems") +
theme_minimal()
# Create predicted probabilities plot
plot_model(sleep_model,
type = "pred",
terms = "hrswknight",
title = "Effect of Sleep Hours on Problem Probability") +
theme_minimal()

Advantages of Using Modern Packages
These modern R packages offer several benefits over base R:
- The tidyverse provides:
  - More intuitive data manipulation
  - Enhanced visualization capabilities
  - Consistent syntax across operations
- broom offers:
  - Standardized model output formats
  - Easy extraction of model statistics
  - Simple integration with other tidyverse tools
- performance gives us:
  - Comprehensive model diagnostics
  - Easy-to-interpret visualizations
  - Multiple performance metrics
- marginaleffects enables:
  - Intuitive interpretation of variable effects
  - Beautiful visualization of interactions
  - Straightforward prediction scenarios
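As a quick illustration of the ergonomic difference, compare extracting the coefficient table in base R with broom's tidy output (a sketch, assuming sleep_model has already been fitted):

```r
# Base R: a numeric matrix; access is by row/column indexing only
base_tab <- coef(summary(sleep_model))

# broom: a tidy data frame that pipes straight into dplyr/ggplot2 workflows
library(broom)
tidy_tab <- tidy(sleep_model, conf.int = TRUE, exponentiate = TRUE)
```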
Creating a Complete Report
We can create a comprehensive analysis report using these tools:
# Function to create a complete model summary
create_model_report <- function(model) {
# Model summary
summary_stats <- glance(model)
# Coefficients and odds ratios
coef_table <- tidy(model, conf.int = TRUE, exponentiate = TRUE)
# Predictions
pred_data <- augment(model, type.predict = "response")
# Return as list
list(
model_stats = summary_stats,
coefficients = coef_table,
predictions = pred_data
)
}
# Generate report
model_report <- create_model_report(sleep_model)
# Print key findings
cat("Model Performance Metrics:\n")
print(model_report$model_stats)
cat("\nKey Predictors:\n")
print(model_report$coefficients)

Practical Implementation
Here’s how to use these results in practice:
# Create a prediction function for new cases
predict_sleep_problems <- function(new_data) {
predictions <- predict(sleep_model,
newdata = new_data,
type = "response")
tibble(
probability = predictions,
predicted_outcome = if_else(predictions > 0.5, "Yes", "No")
)
}
# Example usage
example_cases <- tibble(
sex = factor(c("Male", "Female"), levels = c("Male", "Female")),
age = c(30, 45),
fallsleep = factor(c("Yes", "No"), levels = c("No", "Yes")),
staysleep = factor(c("No", "Yes"), levels = c("No", "Yes")),
hrswknight = c(7, 6)
)
predictions <- predict_sleep_problems(example_cases)
print(predictions)

8 Summary
This lab introduced you to performing a variety of statistical tests and analyses in R, mirroring the SPSS Lab 2. If you wrote your own scripts, you can download the R scripts for Exercises D+E+F+G, H, and I below to compare: