Introduction to R for Analytics

Categories: R, Analytics, Introduction

Published: March 22, 2023

R is a language built for statistical computing, data analysis, and visualization. This guide walks through practical examples of using R for real-world analytics tasks.
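
If you want to follow along, the handful of packages used throughout this guide can be installed in one go. The snippet below is a one-time setup; skip any packages you already have (the names match the library() calls that appear later):

# One-time setup: install the packages used in this guide
install.packages(c("ggplot2", "dplyr", "tibble", "tidyr",
                   "patchwork", "corrplot", "naniar", "mice"))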

Exploring Your First Dataset

R comes with several built-in datasets perfect for practice. Let’s start by examining the mtcars dataset:

# View the first few rows
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Quick summary of the dataset structure
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Statistical summary of key variables
summary(mtcars[, c("mpg", "wt", "hp")])
      mpg              wt              hp       
 Min.   :10.40   Min.   :1.513   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:2.581   1st Qu.: 96.5  
 Median :19.20   Median :3.325   Median :123.0  
 Mean   :20.09   Mean   :3.217   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:3.610   3rd Qu.:180.0  
 Max.   :33.90   Max.   :5.424   Max.   :335.0  

The mtcars dataset contains information about 32 cars from Motor Trend magazine, including fuel efficiency (mpg), weight (wt), and horsepower (hp).
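
Because mtcars ships with base R, its help page (which documents each column and the 1974 Motor Trend source) is always available, along with a few quick structural checks:

# Open the built-in documentation for the dataset
?mtcars

# Dimensions and variable names at a glance
dim(mtcars)
names(mtcars)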

Effective Data Visualization

Visualization is essential for understanding patterns in your data. Let’s create some informative plots:

library(ggplot2)

# 1. A scatter plot with regression line
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp, color = factor(cyl)), alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, color = "#2c3e50") +
  labs(title = "Car Weight vs. Fuel Efficiency",
       subtitle = "Size represents horsepower, color represents cylinders",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1", name = "Cylinders")

# 2. Distribution of fuel efficiency
p2 <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(bins = 10, alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Fuel Efficiency",
       x = "Miles Per Gallon",
       y = "Count") +
  scale_fill_brewer(palette = "Set1", name = "Cylinders") +
  theme_minimal()

# Combine the two plots with the patchwork package (p1 stacked above p2)
library(patchwork)
p1 / p2

These visualizations reveal:

  • A clear negative correlation between car weight and fuel efficiency (quantified in the quick check below)
  • Cars with more cylinders tend to be heavier and get lower MPG
  • The MPG distribution shifts noticeably with cylinder count
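
To put a number on the relationship between weight and MPG noted above, a quick correlation test (a small check, not part of the plots) reports the coefficient along with a confidence interval and p-value:

# Quantify the weight vs. fuel efficiency relationship
cor(mtcars$wt, mtcars$mpg)       # Pearson correlation coefficient
cor.test(mtcars$wt, mtcars$mpg)  # adds a confidence interval and p-value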

Practical Data Transformation

Data rarely comes in the exact format you need. The dplyr package makes transformations straightforward:

# Load required packages
library(dplyr)
library(tibble)  # For rownames_to_column function

# Create an enhanced version of the dataset
mtcars_enhanced <- mtcars %>%
  # Add car names as a column (they're currently row names)
  rownames_to_column("car_name") %>%
  # Create useful derived metrics
  mutate(
    # Efficiency ratio (higher is better)
    efficiency_ratio = mpg / wt,
    
    # Power-to-weight ratio (higher is better)
    power_to_weight = hp / wt,
    
    # Categorize cars by efficiency
    efficiency_category = case_when(
      mpg > 25 ~ "High Efficiency",
      mpg > 15 ~ "Medium Efficiency",
      TRUE ~ "Low Efficiency"
    )
  ) %>%
  # Arrange from most to least efficient
  arrange(desc(efficiency_ratio))

# Display the top 5 most efficient cars
head(mtcars_enhanced[, c("car_name", "mpg", "wt", "hp", "efficiency_ratio", "efficiency_category")], 5)
        car_name  mpg    wt  hp efficiency_ratio efficiency_category
1   Lotus Europa 30.4 1.513 113         20.09253     High Efficiency
2    Honda Civic 30.4 1.615  52         18.82353     High Efficiency
3 Toyota Corolla 33.9 1.835  65         18.47411     High Efficiency
4       Fiat 128 32.4 2.200  66         14.72727     High Efficiency
5      Fiat X1-9 27.3 1.935  66         14.10853     High Efficiency
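
As a quick follow-up (not shown above), you can count how many cars land in each bucket defined by the case_when() call:

# How many cars fall into each efficiency category?
mtcars_enhanced %>%
  count(efficiency_category)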

Answering Business Questions with Data

Let’s use our enhanced dataset to answer some practical questions:

# Question 1: What are the average characteristics by cylinder count?
cylinder_analysis <- mtcars_enhanced %>%
  group_by(cyl) %>%
  summarize(
    count = n(),
    avg_mpg = mean(mpg),
    avg_weight = mean(wt),
    avg_horsepower = mean(hp),
    avg_efficiency_ratio = mean(efficiency_ratio),
    avg_power_to_weight = mean(power_to_weight)
  ) %>%
  arrange(cyl)

# Display the results
cylinder_analysis
# A tibble: 3 × 7
    cyl count avg_mpg avg_weight avg_horsepower avg_efficiency_ratio
  <dbl> <int>   <dbl>      <dbl>          <dbl>                <dbl>
1     4    11    26.7       2.29           82.6                12.7 
2     6     7    19.7       3.12          122.                  6.44
3     8    14    15.1       4.00          209.                  3.95
# ℹ 1 more variable: avg_power_to_weight <dbl>
# Question 2: Which transmission type is more fuel efficient?
transmission_efficiency <- mtcars_enhanced %>%
  # am: 0 = automatic, 1 = manual
  mutate(transmission = if_else(am == 1, "Manual", "Automatic")) %>%
  group_by(transmission) %>%
  summarize(
    count = n(),
    avg_mpg = mean(mpg),
    median_mpg = median(mpg),
    mpg_std_dev = sd(mpg)
  )

# Display the results
transmission_efficiency
# A tibble: 2 × 5
  transmission count avg_mpg median_mpg mpg_std_dev
  <chr>        <int>   <dbl>      <dbl>       <dbl>
1 Automatic       19    17.1       17.3        3.83
2 Manual          13    24.4       22.8        6.17
# Visualize the difference
ggplot(mtcars, aes(x = factor(am, labels = c("Automatic", "Manual")), y = mpg, fill = factor(am))) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  labs(title = "Fuel Efficiency by Transmission Type",
       x = "Transmission Type",
       y = "Miles Per Gallon") +
  theme_minimal() +
  theme(legend.position = "none")
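
The summary table and box plot suggest manual cars average higher MPG, but with only 32 cars it is worth asking whether the gap is statistically meaningful. A Welch two-sample t-test is one quick way to check (a sketch only; it does not adjust for confounders such as weight, and the manual cars in this dataset tend to be lighter):

# Is the MPG difference between transmission types statistically significant?
t.test(mpg ~ am, data = mtcars)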

Correlation Analysis for Decision Making

Understanding relationships between variables is crucial for business decisions:

# Calculate correlations
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp", "disp", "qsec")])
cor_df <- round(cor_matrix, 2)

# Display correlation matrix
cor_df
       mpg    wt    hp  disp  qsec
mpg   1.00 -0.87 -0.78 -0.85  0.42
wt   -0.87  1.00  0.66  0.89 -0.17
hp   -0.78  0.66  1.00  0.79 -0.71
disp -0.85  0.89  0.79  1.00 -0.43
qsec  0.42 -0.17 -0.71 -0.43  1.00
# Visualize correlations (requires the corrplot package)
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black")

# Scatter plot matrix of key variables
pairs(mtcars[, c("mpg", "wt", "hp", "disp")], 
      main = "Scatter Plot Matrix of Key Variables",
      pch = 21, bg = "lightblue", cex = 1.2)

Working with Real-World Datasets

Let’s analyze the famous Iris dataset to demonstrate a complete workflow:

# Load packages
library(tidyr)
library(dplyr)
library(ggplot2)

# Examine the dataset
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Calculate summary statistics by species
iris_stats <- iris %>%
  group_by(Species) %>%
  summarize(across(where(is.numeric), 
                   list(mean = mean, 
                        median = median,
                        sd = sd,
                        min = min,
                        max = max)))

# View summary for Sepal.Length
iris_stats %>% select(Species, starts_with("Sepal.Length"))
# A tibble: 3 × 6
  Species Sepal.Length_mean Sepal.Length_median Sepal.Length_sd Sepal.Length_min
  <fct>               <dbl>               <dbl>           <dbl>            <dbl>
1 setosa               5.01                 5             0.352              4.3
2 versic…              5.94                 5.9           0.516              4.9
3 virgin…              6.59                 6.5           0.636              4.9
# ℹ 1 more variable: Sepal.Length_max <dbl>
# Create a visualization comparing all measurements across species
iris_long <- iris %>%
  pivot_longer(
    cols = -Species,
    names_to = "Measurement",
    values_to = "Value"
  )

# Box plots with data points
ggplot(iris_long, aes(x = Species, y = Value, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.5, color = "darkgrey") +
  facet_wrap(~Measurement, scales = "free_y") +
  labs(title = "Iris Measurements Across Species",
       subtitle = "Box plots with individual observations") +
  theme_minimal() +
  theme(legend.position = "none")

# Find the most distinguishing features between species
iris_wide <- iris_long %>%  # reuse the long format created above
  group_by(Measurement, Species) %>%
  summarise(mean_value = mean(Value), .groups = "drop") %>%
  pivot_wider(names_from = Species, values_from = mean_value) %>%
  mutate(versicolor_vs_setosa = abs(versicolor - setosa),
         virginica_vs_setosa = abs(virginica - setosa),
         virginica_vs_versicolor = abs(virginica - versicolor),
         max_difference = pmax(versicolor_vs_setosa, virginica_vs_setosa, virginica_vs_versicolor))

# Display the results ordered by maximum difference
iris_wide %>% arrange(desc(max_difference))
# A tibble: 4 × 8
  Measurement  setosa versicolor virginica versicolor_vs_setosa
  <chr>         <dbl>      <dbl>     <dbl>                <dbl>
1 Petal.Length  1.46        4.26      5.55                2.80 
2 Petal.Width   0.246       1.33      2.03                1.08 
3 Sepal.Length  5.01        5.94      6.59                0.93 
4 Sepal.Width   3.43        2.77      2.97                0.658
# ℹ 3 more variables: virginica_vs_setosa <dbl>, virginica_vs_versicolor <dbl>,
#   max_difference <dbl>
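
Petal.Length shows the largest between-species gaps, so even a single threshold on that one variable separates the groups surprisingly well. As a rough illustration (threshold picked by eye from the box plots, not fitted):

# Cross-tabulate species against a simple petal-length threshold (in cm)
table(iris$Species, iris$Petal.Length > 2.5)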

Handling Missing Data in Practice

Let’s tackle the common challenge of missing data using a practical example:

# Create a simulated customer dataset with missing values
set.seed(123) # For reproducibility

customers <- data.frame(
  customer_id = 1:100,
  age = sample(18:70, 100, replace = TRUE),
  income = round(rnorm(100, 50000, 15000)),
  years_as_customer = sample(0:20, 100, replace = TRUE),
  purchase_frequency = sample(1:10, 100, replace = TRUE)
)

# Introduce missing values randomly
set.seed(456)
customers$age[sample(1:100, 10)] <- NA
customers$income[sample(1:100, 15)] <- NA
customers$purchase_frequency[sample(1:100, 5)] <- NA

# 1. Identify missing data
missing_summary <- sapply(customers, function(x) sum(is.na(x)))
missing_summary
       customer_id                age             income  years_as_customer 
                 0                 10                 15                  0 
purchase_frequency 
                 5 
# 2. Visualize the pattern of missing data
library(naniar) # May need to install this package
vis_miss(customers)

# 3. Handle missing data with multiple approaches

# Option A: Remove rows with any missing values
clean_customers <- na.omit(customers)
nrow(customers) - nrow(clean_customers) # Number of rows removed
[1] 26
# Option B: Impute with mean/median (numeric variables only)
imputed_customers <- customers %>%
  mutate(
    age = ifelse(is.na(age), median(age, na.rm = TRUE), age),
    income = ifelse(is.na(income), mean(income, na.rm = TRUE), income),
    purchase_frequency = ifelse(is.na(purchase_frequency), 
                               median(purchase_frequency, na.rm = TRUE), 
                               purchase_frequency)
  )
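
One caveat with mean/median imputation: it keeps all 100 rows, but the imputed values sit exactly at the center of the observed data, which shrinks the variable's spread. A quick before-and-after check (a sketch) makes this visible:

# Mean imputation shrinks variability: compare standard deviations
sd(customers$income, na.rm = TRUE)  # observed values only
sd(imputed_customers$income)        # after mean imputation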

# Option C: Model-based imputation with mice (predictive mean matching)
library(mice) # For more sophisticated, model-based imputation
# Quick default model: mice uses the other columns as predictors.
# In practice you'd exclude identifiers like customer_id and check convergence.
imputed_data <- mice(customers, m = 5, method = "pmm", printFlag = FALSE)
customers_complete <- complete(imputed_data)

# Compare the approaches by calculating a customer value score
calculate_value <- function(df) {
  df %>%
    mutate(customer_value = (income/10000) * (purchase_frequency/10) * log(years_as_customer + 1)) %>%
    arrange(desc(customer_value)) %>%
    select(customer_id, customer_value, everything())
}

# Top 5 customers by value (original with NAs removed)
head(calculate_value(clean_customers), 5)
  customer_id customer_value age income years_as_customer purchase_frequency
1           7       24.63960  67  82249                19                 10
2          54       15.73965  22  70961                15                  8
3          59       15.67045  50  70649                15                  8
4          84       15.09251  21  55732                14                 10
5          72       14.27848  23  61853                12                  9
# Top 5 customers by value (with imputed values)
head(calculate_value(customers_complete), 5)
  customer_id customer_value age income years_as_customer purchase_frequency
1           7       24.63960  67  82249                19                 10
2          54       15.73965  22  70961                15                  8
3          59       15.67045  50  70649                15                  8
4          84       15.09251  21  55732                14                 10
5          72       14.27848  23  61853                12                  9
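
The two top-5 lists are identical because those customers happened to have complete records: dropping incomplete rows never touches them, and imputation never alters observed values. The approaches only diverge for customers who had gaps, which you can see by comparing the income distribution under each strategy (a quick sketch):

# Compare the income distribution across the three approaches
summary(clean_customers$income)      # incomplete rows dropped
summary(imputed_customers$income)    # mean-imputed
summary(customers_complete$income)   # mice (predictive mean matching)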