The data.table
package is a high-performance extension of R’s data.frame
that provides concise syntax for data manipulation. It is particularly efficient for large datasets.
Getting Started with data.table
First, install and load the package:
# Install if needed (uncomment to run)
# install.packages("data.table")
# Load the package
library (data.table)
# Convert the built-in mtcars dataset to a data.table
dt_cars <- as.data.table (mtcars, keep.rownames = TRUE )
setnames (dt_cars, "rn" , "model" ) # Rename the rownames column
head (dt_cars)
model mpg cyl disp hp drat wt qsec vs am
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
gear carb
<num> <num>
1: 4 4
2: 4 4
3: 4 1
4: 3 1
5: 3 2
6: 3 1
Basic Syntax: [i, j, by]
data.table uses a concise syntax based on [i, j, by]
: - i
: Subset rows (WHERE) - j
: Compute on columns (SELECT) - by
: Group by columns (GROUP BY)
Subsetting Rows (i)
# Select cars with 6 cylinders
dt_cars[cyl == 6 ]
model mpg cyl disp hp drat wt qsec vs am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4
2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4
3: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
4: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3
5: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4
6: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4
7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5
carb
<num>
1: 4
2: 4
3: 1
4: 1
5: 4
6: 4
7: 6
# Multiple conditions
dt_cars[cyl == 6 & mpg > 20 ]
model mpg cyl disp hp drat wt qsec vs am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
3: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3
carb
<num>
1: 4
2: 4
3: 1
# Select specific rows by position
dt_cars[1 : 5 ]
model mpg cyl disp hp drat wt qsec vs am
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
gear carb
<num> <num>
1: 4 4
2: 4 4
3: 4 1
4: 3 1
5: 3 2
Selecting and Computing on Columns (j)
# Select specific columns
dt_cars[1 : 10 , .(mpg, hp, cyl)]
mpg hp cyl
<num> <num> <num>
1: 21.0 110 6
2: 21.0 110 6
3: 22.8 93 4
4: 21.4 110 6
5: 18.7 175 8
6: 18.1 105 6
7: 14.3 245 8
8: 24.4 62 4
9: 22.8 95 4
10: 19.2 123 6
# Compute new values
dt_cars[1 : 10 , .(kpl = mpg * 0.425 )]
kpl
<num>
1: 8.9250
2: 8.9250
3: 9.6900
4: 9.0950
5: 7.9475
6: 7.6925
7: 6.0775
8: 10.3700
9: 9.6900
10: 8.1600
# Select and compute multiple columns
dt_cars[1 : 10 , .(model, kpl = mpg * 0.425 , hp_per_cyl = hp/ cyl)]
model kpl hp_per_cyl
<char> <num> <num>
1: Mazda RX4 8.9250 18.33333
2: Mazda RX4 Wag 8.9250 18.33333
3: Datsun 710 9.6900 23.25000
4: Hornet 4 Drive 9.0950 18.33333
5: Hornet Sportabout 7.9475 21.87500
6: Valiant 7.6925 17.50000
7: Duster 360 6.0775 30.62500
8: Merc 240D 10.3700 15.50000
9: Merc 230 9.6900 23.75000
10: Merc 280 8.1600 20.50000
# Apply functions
dt_cars[1 : 10 , .(avg_mpg = mean (mpg), max_hp = max (hp))]
avg_mpg max_hp
<num> <num>
1: 20.37 245
Grouping (by)
# Group by cylinder and calculate statistics
dt_cars[, .(count = .N, avg_mpg = mean (mpg)), by = cyl]
cyl count avg_mpg
<num> <int> <num>
1: 6 7 19.74286
2: 4 11 26.66364
3: 8 14 15.10000
# Multiple grouping variables
dt_cars[, .(count = .N, avg_mpg = mean (mpg)), by = .(cyl, gear)]
cyl gear count avg_mpg
<num> <num> <int> <num>
1: 6 4 4 19.750
2: 4 4 8 26.925
3: 6 3 2 19.750
4: 8 3 12 15.050
5: 4 3 1 21.500
6: 4 5 2 28.200
7: 8 5 2 15.400
8: 6 5 1 19.700
# Grouping with expressions
dt_cars[, .(count = .N), by = .(cyl, high_mpg = mpg > 20 )]
cyl high_mpg count
<num> <lgcl> <int>
1: 6 TRUE 3
2: 4 TRUE 11
3: 8 FALSE 14
4: 6 FALSE 4
Special Symbols in data.table
data.table provides special symbols for common operations:
# .N: number of rows
dt_cars[, .N]
cyl N
<num> <int>
1: 6 7
2: 4 11
3: 8 14
# .SD: Subset of Data
dt_cars[, lapply (.SD, mean), by = cyl, .SDcols = c ("mpg" , "hp" , "wt" )]
cyl mpg hp wt
<num> <num> <num> <num>
1: 6 19.74286 122.28571 3.117143
2: 4 26.66364 82.63636 2.285727
3: 8 15.10000 209.21429 3.999214
# .I: Row numbers
dt_cars[, .I[1 : 2 ], by = cyl] # First two row numbers for each cyl group
cyl V1
<num> <int>
1: 6 1
2: 6 2
3: 4 3
4: 4 8
5: 8 5
6: 8 7
Modifying Data
data.table allows efficient in-place modifications:
# Create a copy to avoid modifying the original
dt_copy <- copy (dt_cars)
# Add a new column
dt_copy[, efficiency : = mpg/ wt]
head (dt_copy)
model mpg cyl disp hp drat wt qsec vs am
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
gear carb efficiency
<num> <num> <num>
1: 4 4 8.015267
2: 4 4 7.304348
3: 4 1 9.827586
4: 3 1 6.656299
5: 3 2 5.436047
6: 3 1 5.231214
# Update existing values
dt_copy[cyl == 4 , mpg : = mpg * 1.1 ] # Increase mpg by 10% for 4-cylinder cars
head (dt_copy)
model mpg cyl disp hp drat wt qsec vs am
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.00 6 160 110 3.90 2.620 16.46 0 1
2: Mazda RX4 Wag 21.00 6 160 110 3.90 2.875 17.02 0 1
3: Datsun 710 25.08 4 108 93 3.85 2.320 18.61 1 1
4: Hornet 4 Drive 21.40 6 258 110 3.08 3.215 19.44 1 0
5: Hornet Sportabout 18.70 8 360 175 3.15 3.440 17.02 0 0
6: Valiant 18.10 6 225 105 2.76 3.460 20.22 1 0
gear carb efficiency
<num> <num> <num>
1: 4 4 8.015267
2: 4 4 7.304348
3: 4 1 9.827586
4: 3 1 6.656299
5: 3 2 5.436047
6: 3 1 5.231214
# Delete columns
dt_copy[, c ("carb" , "vs" ) : = NULL ]
head (dt_copy)
model mpg cyl disp hp drat wt qsec am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.00 6 160 110 3.90 2.620 16.46 1 4
2: Mazda RX4 Wag 21.00 6 160 110 3.90 2.875 17.02 1 4
3: Datsun 710 25.08 4 108 93 3.85 2.320 18.61 1 4
4: Hornet 4 Drive 21.40 6 258 110 3.08 3.215 19.44 0 3
5: Hornet Sportabout 18.70 8 360 175 3.15 3.440 17.02 0 3
6: Valiant 18.10 6 225 105 2.76 3.460 20.22 0 3
efficiency
<num>
1: 8.015267
2: 7.304348
3: 9.827586
4: 6.656299
5: 5.436047
6: 5.231214
Keys and Indexing
Setting keys enables fast subsetting and joins:
# Set a key
setkey (dt_copy, cyl)
dt_copy
Key: <cyl>
model mpg cyl disp hp drat wt qsec am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Datsun 710 25.08 4 108.0 93 3.85 2.320 18.61 1 4
2: Merc 240D 26.84 4 146.7 62 3.69 3.190 20.00 0 4
3: Merc 230 25.08 4 140.8 95 3.92 3.150 22.90 0 4
4: Fiat 128 35.64 4 78.7 66 4.08 2.200 19.47 1 4
5: Honda Civic 33.44 4 75.7 52 4.93 1.615 18.52 1 4
6: Toyota Corolla 37.29 4 71.1 65 4.22 1.835 19.90 1 4
7: Toyota Corona 23.65 4 120.1 97 3.70 2.465 20.01 0 3
8: Fiat X1-9 30.03 4 79.0 66 4.08 1.935 18.90 1 4
9: Porsche 914-2 28.60 4 120.3 91 4.43 2.140 16.70 1 5
10: Lotus Europa 33.44 4 95.1 113 3.77 1.513 16.90 1 5
11: Volvo 142E 23.54 4 121.0 109 4.11 2.780 18.60 1 4
12: Mazda RX4 21.00 6 160.0 110 3.90 2.620 16.46 1 4
13: Mazda RX4 Wag 21.00 6 160.0 110 3.90 2.875 17.02 1 4
14: Hornet 4 Drive 21.40 6 258.0 110 3.08 3.215 19.44 0 3
15: Valiant 18.10 6 225.0 105 2.76 3.460 20.22 0 3
16: Merc 280 19.20 6 167.6 123 3.92 3.440 18.30 0 4
17: Merc 280C 17.80 6 167.6 123 3.92 3.440 18.90 0 4
18: Ferrari Dino 19.70 6 145.0 175 3.62 2.770 15.50 1 5
19: Hornet Sportabout 18.70 8 360.0 175 3.15 3.440 17.02 0 3
20: Duster 360 14.30 8 360.0 245 3.21 3.570 15.84 0 3
21: Merc 450SE 16.40 8 275.8 180 3.07 4.070 17.40 0 3
22: Merc 450SL 17.30 8 275.8 180 3.07 3.730 17.60 0 3
23: Merc 450SLC 15.20 8 275.8 180 3.07 3.780 18.00 0 3
24: Cadillac Fleetwood 10.40 8 472.0 205 2.93 5.250 17.98 0 3
25: Lincoln Continental 10.40 8 460.0 215 3.00 5.424 17.82 0 3
26: Chrysler Imperial 14.70 8 440.0 230 3.23 5.345 17.42 0 3
27: Dodge Challenger 15.50 8 318.0 150 2.76 3.520 16.87 0 3
28: AMC Javelin 15.20 8 304.0 150 3.15 3.435 17.30 0 3
29: Camaro Z28 13.30 8 350.0 245 3.73 3.840 15.41 0 3
30: Pontiac Firebird 19.20 8 400.0 175 3.08 3.845 17.05 0 3
31: Ford Pantera L 15.80 8 351.0 264 4.22 3.170 14.50 1 5
32: Maserati Bora 15.00 8 301.0 335 3.54 3.570 14.60 1 5
model mpg cyl disp hp drat wt qsec am gear
efficiency
<num>
1: 9.827586
2: 7.648903
3: 7.238095
4: 14.727273
5: 18.823529
6: 18.474114
7: 8.722110
8: 14.108527
9: 12.149533
10: 20.092531
11: 7.697842
12: 8.015267
13: 7.304348
14: 6.656299
15: 5.231214
16: 5.581395
17: 5.174419
18: 7.111913
19: 5.436047
20: 4.005602
21: 4.029484
22: 4.638070
23: 4.021164
24: 1.980952
25: 1.917404
26: 2.750234
27: 4.403409
28: 4.425036
29: 3.463542
30: 4.993498
31: 4.984227
32: 4.201681
efficiency
# Fast subsetting using key
dt_copy[.(6 )] # All rows where cyl == 6
Key: <cyl>
model mpg cyl disp hp drat wt qsec am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 1 4
2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 1 4
3: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 0 3
4: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 0 3
5: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 0 4
6: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 0 4
7: Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 1 5
efficiency
<num>
1: 8.015267
2: 7.304348
3: 6.656299
4: 5.231214
5: 5.581395
6: 5.174419
7: 7.111913
# Multiple keys
setkey (dt_copy, cyl, gear)
dt_copy[.(6 , 4 )] # All rows where cyl == 6 and gear == 4
Key: <cyl, gear>
model mpg cyl disp hp drat wt qsec am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 1 4
2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 1 4
3: Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 0 4
4: Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 0 4
efficiency
<num>
1: 8.015267
2: 7.304348
3: 5.581395
4: 5.174419
Joins in data.table
data.table provides efficient joins using keys:
# Create sample data.tables
manufacturers <- data.table (
make = c ("Honda" , "Toyota" , "Ford" , "BMW" , "Mercedes" ),
country = c ("Japan" , "Japan" , "USA" , "Germany" , "Germany" )
)
cars <- data.table (
model = c ("Civic" , "Corolla" , "Focus" , "3 Series" , "Fiesta" ),
make = c ("Honda" , "Toyota" , "Ford" , "BMW" , "Ford" )
)
# Set keys for joining
setkey (manufacturers, make)
setkey (cars, make)
# Inner join
cars[manufacturers]
Key: <make>
model make country
<char> <char> <char>
1: 3 Series BMW Germany
2: Focus Ford USA
3: Fiesta Ford USA
4: Civic Honda Japan
5: <NA> Mercedes Germany
6: Corolla Toyota Japan
# Left join
manufacturers[cars, nomatch= NA ]
Key: <make>
make country model
<char> <char> <char>
1: BMW Germany 3 Series
2: Ford USA Focus
3: Ford USA Fiesta
4: Honda Japan Civic
5: Toyota Japan Corolla
# Non-equi joins
dt_cars[dt_cars[, .(max_mpg = max (mpg)), by = cyl], on = .(mpg = max_mpg, cyl)]
model mpg cyl disp hp drat wt qsec vs am gear
<char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
2: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4
3: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3
carb
<num>
1: 1
2: 1
3: 2
Reshaping data
data.table provides functions for reshaping data:
# Create a sample data.table
dt <- data.table (
id = rep (1 : 3 , each = 2 ),
variable = rep (c ("height" , "weight" ), 3 ),
value = c (170 , 68 , 155 , 52 , 182 , 75 )
)
dt
id variable value
<int> <char> <num>
1: 1 height 170
2: 1 weight 68
3: 2 height 155
4: 2 weight 52
5: 3 height 182
6: 3 weight 75
# Wide to long
dt_wide <- dcast (dt, id ~ variable, value.var = "value" )
dt_wide
Key: <id>
id height weight
<int> <num> <num>
1: 1 170 68
2: 2 155 52
3: 3 182 75
# Long to wide
dt_long <- melt (dt_wide, id.vars = "id" , variable.name = "measure" , value.name = "value" )
dt_long
id measure value
<int> <fctr> <num>
1: 1 height 170
2: 2 height 155
3: 3 height 182
4: 1 weight 68
5: 2 weight 52
6: 3 weight 75