Mutating columns

Jeff Stevens

2023-02-15

Review

Data wrangling

Mutating columns

Mental model of mutating columns

Set-up

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Changing and creating columns

dplyr::mutate()

Changing columns

Changing existing columns

Changing existing columns

mutate(flights, month = as.character(month))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <chr> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013 1         1      517            515         2      830            819
 2  2013 1         1      533            529         4      850            830
 3  2013 1         1      542            540         2      923            850
 4  2013 1         1      544            545        -1     1004           1022
 5  2013 1         1      554            600        -6      812            837
 6  2013 1         1      554            558        -4      740            728
 7  2013 1         1      555            600        -5      913            854
 8  2013 1         1      557            600        -3      709            723
 9  2013 1         1      557            600        -3      838            846
10  2013 1         1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Changing existing columns

Conditional changes

ifelse()

mutate(flights, month = ifelse(month < 10, # conditional statement
                               paste0("0", month), # what to do if TRUE
                               as.character(month))) # what to do if FALSE
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <chr> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013 01        1      517            515         2      830            819
 2  2013 01        1      533            529         4      850            830
 3  2013 01        1      542            540         2      923            850
 4  2013 01        1      544            545        -1     1004           1022
 5  2013 01        1      554            600        -6      812            837
 6  2013 01        1      554            558        -4      740            728
 7  2013 01        1      555            600        -5      913            854
 8  2013 01        1      557            600        -3      709            723
 9  2013 01        1      557            600        -3      838            846
10  2013 01        1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Changing existing columns

Conditional changes

dplyr::if_else()

mutate(flights, month = if_else(month < 10,  # conditional statement
                                paste0("0", month),  # what to do if TRUE
                                as.character(month),  # what to do if FALSE
                                NA)) # what to do if missing
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <chr> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013 01        1      517            515         2      830            819
 2  2013 01        1      533            529         4      850            830
 3  2013 01        1      542            540         2      923            850
 4  2013 01        1      544            545        -1     1004           1022
 5  2013 01        1      554            600        -6      812            837
 6  2013 01        1      554            558        -4      740            728
 7  2013 01        1      555            600        -5      913            854
 8  2013 01        1      557            600        -3      709            723
 9  2013 01        1      557            600        -3      838            846
10  2013 01        1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Changing existing columns

Multiple changes

mutate(flights, 
       month = if_else(month < 10, paste0("0", month), as.character(month), NA),
       day = if_else(day < 10, paste0("0", day), as.character(day), NA)
)
# A tibble: 336,776 × 19
    year month day   dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <chr> <chr>    <int>          <int>     <dbl>    <int>          <int>
 1  2013 01    01         517            515         2      830            819
 2  2013 01    01         533            529         4      850            830
 3  2013 01    01         542            540         2      923            850
 4  2013 01    01         544            545        -1     1004           1022
 5  2013 01    01         554            600        -6      812            837
 6  2013 01    01         554            558        -4      740            728
 7  2013 01    01         555            600        -5      913            854
 8  2013 01    01         557            600        -3      709            723
 9  2013 01    01         557            600        -3      838            846
10  2013 01    01         558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Creating columns

Creating new columns

Creating new columns

mutate(flights, speed = distance / air_time * 60)
# A tibble: 336,776 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>

Creating new columns

Move column when creating

mutate(flights, speed = distance / air_time * 60, .after = day)
# A tibble: 336,776 × 20
    year month   day speed dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int> <dbl>    <int>          <int>     <dbl>    <int>
 1  2013     1     1  370.      517            515         2      830
 2  2013     1     1  374.      533            529         4      850
 3  2013     1     1  408.      542            540         2      923
 4  2013     1     1  517.      544            545        -1     1004
 5  2013     1     1  394.      554            600        -6      812
 6  2013     1     1  288.      554            558        -4      740
 7  2013     1     1  404.      555            600        -5      913
 8  2013     1     1  259.      557            600        -3      709
 9  2013     1     1  405.      557            600        -3      838
10  2013     1     1  319.      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Creating new columns

Conditionals with multiple outcomes

dplyr::case_when

mutate(flights, season = case_when(month %in% c(3:5) ~ "spring",
                                   month %in% c(6:8) ~ "summer",
                                   month %in% c(9:11) ~ "fall",
                                   month %in% c(12, 1:2) ~ "winter"),
       .after = day)
# A tibble: 336,776 × 20
    year month   day season dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int> <chr>     <int>          <int>     <dbl>    <int>
 1  2013     1     1 winter      517            515         2      830
 2  2013     1     1 winter      533            529         4      850
 3  2013     1     1 winter      542            540         2      923
 4  2013     1     1 winter      544            545        -1     1004
 5  2013     1     1 winter      554            600        -6      812
 6  2013     1     1 winter      554            558        -4      740
 7  2013     1     1 winter      555            600        -5      913
 8  2013     1     1 winter      557            600        -3      709
 9  2013     1     1 winter      557            600        -3      838
10  2013     1     1 winter      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Creating new columns

Multiple columns

mutate(flights, speed = distance / air_time * 60, .after = dep_time) %>%
  mutate(month = if_else(month < 10, paste0("0", month), as.character(month), NA),
         day = if_else(day < 10, paste0("0", day), as.character(day), NA),
         date = paste(year, month, day, sep = "-"), 
         .after = day)
# A tibble: 336,776 × 21
    year month day   date       dep_time speed sched_dep_time dep_delay arr_time
   <int> <chr> <chr> <chr>         <int> <dbl>          <int>     <dbl>    <int>
 1  2013 01    01    2013-01-01      517  370.            515         2      830
 2  2013 01    01    2013-01-01      533  374.            529         4      850
 3  2013 01    01    2013-01-01      542  408.            540         2      923
 4  2013 01    01    2013-01-01      544  517.            545        -1     1004
 5  2013 01    01    2013-01-01      554  394.            600        -6      812
 6  2013 01    01    2013-01-01      554  288.            558        -4      740
 7  2013 01    01    2013-01-01      555  404.            600        -5      913
 8  2013 01    01    2013-01-01      557  259.            600        -3      709
 9  2013 01    01    2013-01-01      557  405.            600        -3      838
10  2013 01    01    2013-01-01      558  319.            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Creating new columns

Remove old columns

mutate(flights, date = paste(year, month, day, sep = "-"), 
       .before = 1, 
       .keep = "unused")
# A tibble: 336,776 × 17
   date     dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <chr>       <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1 2013-1-1      517            515         2      830            819        11
 2 2013-1-1      533            529         4      850            830        20
 3 2013-1-1      542            540         2      923            850        33
 4 2013-1-1      544            545        -1     1004           1022       -18
 5 2013-1-1      554            600        -6      812            837       -25
 6 2013-1-1      554            558        -4      740            728        12
 7 2013-1-1      555            600        -5      913            854        19
 8 2013-1-1      557            600        -3      709            723       -14
 9 2013-1-1      557            600        -3      838            846        -8
10 2013-1-1      558            600        -2      753            745         8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Creating new columns

Keep only new column

mutate(flights, date = paste(year, month, day, sep = "-"), 
       .keep = "none")
# A tibble: 336,776 × 1
   date    
   <chr>   
 1 2013-1-1
 2 2013-1-1
 3 2013-1-1
 4 2013-1-1
 5 2013-1-1
 6 2013-1-1
 7 2013-1-1
 8 2013-1-1
 9 2013-1-1
10 2013-1-1
# ℹ 336,766 more rows

Working with multiple columns

Apply functions to multiple columns

mutate(flights, min_dep_time = min(dep_time, sched_dep_time, na.rm = TRUE), 
       .after = arr_time)
# A tibble: 336,776 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time min_dep_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>        <int>
 1  2013     1     1      517            515         2      830            1
 2  2013     1     1      533            529         4      850            1
 3  2013     1     1      542            540         2      923            1
 4  2013     1     1      544            545        -1     1004            1
 5  2013     1     1      554            600        -6      812            1
 6  2013     1     1      554            558        -4      740            1
 7  2013     1     1      555            600        -5      913            1
 8  2013     1     1      557            600        -3      709            1
 9  2013     1     1      557            600        -3      838            1
10  2013     1     1      558            600        -2      753            1
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Apply functions to multiple columns

dplyr::rowwise()

rowwise(flights) %>% 
  mutate(min_dep_time = min(dep_time, sched_dep_time), .after = arr_time)
# A tibble: 336,776 × 20
# Rowwise: 
    year month   day dep_time sched_dep_time dep_delay arr_time min_dep_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>        <int>
 1  2013     1     1      517            515         2      830          515
 2  2013     1     1      533            529         4      850          529
 3  2013     1     1      542            540         2      923          540
 4  2013     1     1      544            545        -1     1004          544
 5  2013     1     1      554            600        -6      812          554
 6  2013     1     1      554            558        -4      740          554
 7  2013     1     1      555            600        -5      913          555
 8  2013     1     1      557            600        -3      709          557
 9  2013     1     1      557            600        -3      838          557
10  2013     1     1      558            600        -2      753          558
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Changing multiple columns

dplyr::across()

mutate(flights, across(contains("_time"), as.character))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int> <chr>    <chr>              <dbl> <chr>    <chr>         
 1  2013     1     1 517      515                    2 830      819           
 2  2013     1     1 533      529                    4 850      830           
 3  2013     1     1 542      540                    2 923      850           
 4  2013     1     1 544      545                   -1 1004     1022          
 5  2013     1     1 554      600                   -6 812      837           
 6  2013     1     1 554      558                   -4 740      728           
 7  2013     1     1 555      600                   -5 913      854           
 8  2013     1     1 557      600                   -3 709      723           
 9  2013     1     1 557      600                   -3 838      846           
10  2013     1     1 558      600                   -2 753      745           
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Changing multiple columns

dplyr::across()

What if you need to pass arguments to your function?

mutate(flights, across(contains("_time"), ~ .x / 60))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
 1  2013     1     1     8.62           8.58         2     13.8           13.6
 2  2013     1     1     8.88           8.82         4     14.2           13.8
 3  2013     1     1     9.03           9            2     15.4           14.2
 4  2013     1     1     9.07           9.08        -1     16.7           17.0
 5  2013     1     1     9.23          10           -6     13.5           14.0
 6  2013     1     1     9.23           9.3         -4     12.3           12.1
 7  2013     1     1     9.25          10           -5     15.2           14.2
 8  2013     1     1     9.28          10           -3     11.8           12.0
 9  2013     1     1     9.28          10           -3     14.0           14.1
10  2013     1     1     9.3           10           -2     12.6           12.4
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Changing multiple columns

dplyr::across()

What if you need to pass arguments to your function?

print(mutate(flights, across(contains("_time"), ~ .x / 60)), n = 5)

Note

  • Start with ~
  • Replace where the column name should be with .x
  • Note dividing these numbers by 60 doesn’t make sense—just an example

Mental model of mutating columns

Let’s code!

Mutating data [Rmd]