Data Wrangling in R with the Tidyverse (Part 2)

# Data Wrangling in R with the Tidyverse (Part 2)
### Jessica Minnier, PhD & Meike Niederhausen, PhD <a href="https://www.ohsu.edu/xd/research/centers-institutes/octri/education-training/octri-research-forum.cfm">OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop</a> 
### 2019/04/18 (Part 1) & 2019/04/25 (Part 2) and again! 2019/05/16 (Part 1) & 2019/05/23 (Part 2) slides: <a href="http://bit.ly/berd_tidy2">bit.ly/berd_tidy2</a> pdf: <a href="http://bit.ly/berd_tidy2_pdf">bit.ly/berd_tidy2_pdf</a>

---

---

# Load files for today's workshop

.pull-left-40[
- Open the slides of this workshop: [bit.ly/berd_tidy2](https://bit.ly/berd_tidy2)
- If you haven't already,
  + Open the [pre-workshop homework](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_pre_course_homework.html)
  + Follow steps 1-5
  + Download zip folder
  + Open `berd_tidyverse_project.Rproj`
- __Open a new R script and run the commands to the right__
]

```r
# install.packages("tidyverse","janitor","glue") 
library(tidyverse)
library(lubridate) 
library(janitor)
library(glue)
demo_data <- read_csv("data/yrbss_demo.csv")
qn_data <- read_csv("data/yrbss_qn.csv")
```
<center><img src="img/horst_tidyverse.jpg" width="90%" height="90%"></center>[Allison Horst](https://github.com/allisonhorst/stats-illustrations)

]

---

# Learning objectives

.pull-left[
Previously, in Part 1:
- What is data wrangling?
- A few good practices in R/RStudio
- What is tidy data?
- What is tidyverse?
- Manipulate data 
]
.pull-right[
Part 2:
- Summarizing data
- Combine (join) data sets
- Reshaping data: wide vs. long formats 
- Data cleaning:
 + Missing data
 + Strings/character vectors
 + Dates
 + Factors
 + Messy names
]

---

# More data wrangling

[Alison Horst](https://github.com/allisonhorst/stats-illustrations)

---

# Review Part 1: manipulating data

## Columns ([part 1 slides](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part1.html#28))

- `select()` to subset columns
- `rename()` to rename columns
- `mutate()` to add new columns or change values within existing columns
  + `separate()` and `unite()` are shortcuts for specific `mutate` type operations

## Rows ([part 1 slides](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part1.html#25))

- `filter()` to subset rows
  + `na.omit()` and `distinct()` are shortcuts for specific `filter` type operations
- `arrange()` to order the data

---

# Changing *all* columns at once ([part 1 slides](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part1.html#48))

`mutate_all()`, `rename_all()`: applies a function to *all* columns: `mutate_all(FUNCTION, FUNCTION_ARGUMENTS)`

```r
# mutate_all changes the data in all columns
demo_data %>% mutate_all(as.character) %>% head(2)
```

```
# A tibble: 2 x 8
 record age sex grade race4 race7 bmi stweight
 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
1 931897 15 years old Female 10th White White 17.179 54.43 
2 333862 17 years old Female 12th White White 20.2487 57.15 
```

```r
# rename_all changes all column names
demo_data %>% rename_all(str_sub, end = 2) %>% head(3)
```

```
# A tibble: 3 x 8
 re ag se gr ra ra bm st
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
3 36253 18 years old or… Male 11th Hispanic/La… Hispanic/L… NA NA 
```

---

# Changing *some* columns at once ([part 1 slides](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part1.html#48))

`mutate_at()`, `rename_at()`: uses `vars()` to select specific variables to apply a function to i.e. `mutate_at(vars(SELECT), FUNCTION, FUNCTION_ARGUMENTS)`

```r
# mutate_at changes the data in specified columns
demo_data %>% mutate_at(vars(contains("race"), sex), as.factor) %>% head(2)
```

```
# A tibble: 2 x 8
 record age sex grade race4 race7 bmi stweight
 <dbl> <chr> <fct> <chr> <fct> <fct> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
```

```r
# rename_at changes specified column names
demo_data %>% rename_at(vars(record:grade),toupper) %>% head(3)
```

```
# A tibble: 3 x 8
 RECORD AGE SEX GRADE race4 race7 bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
3 36253 18 years old o… Male 11th Hispanic/L… Hispanic/… NA NA 
```

---

# Changing *some* columns at once ([part 1 slides](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part1.html#48))

`mutate_if()`, `rename_if()`, `select_if()`: uses a function that returns TRUE/FALSE to select columns and applies function on the TRUE columns: `mutate_if(BOOLEAN, FUNCTION, FUNCTION_ARGUMENTS)`

```r
demo_data %>% mutate_if(is.numeric, round, digits = 0) %>% head(3)
```

```
# A tibble: 3 x 8
 record age sex grade race4 race7 bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17 54
2 333862 17 years old Female 12th White White 20 57
3 36253 18 years old o… Male 11th Hispanic/L… Hispanic/… NA NA
```

```r
demo_data %>% rename_if(is.character, str_sub, end = 2) %>% head(3)
```

```
# A tibble: 3 x 8
 record ag se gr ra ra bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
3 36253 18 years old o… Male 11th Hispanic/L… Hispanic/… NA NA 
```

---

# Add one or more rows: `add_row()`

```r
*demo_data %>% add_row(record=100, age=NA, sex="Female", grade="9th") %>%
  arrange(record) %>% head(3)
```

```
# A tibble: 3 x 8
 record age sex grade race4 race7 bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 100 <NA> Female 9th <NA> <NA> NA NA
2 30592 <NA> <NA> <NA> <NA> <NA> NA NA
3 30593 <NA> <NA> 9th Hispanic/Latino Hispanic/Latino NA NA
```

```r
*demo_data %>% add_row(record=100:102, bmi=c(25,30,18)) %>%
  arrange(record) %>% head(3)
```

```
# A tibble: 3 x 8
 record age sex grade race4 race7 bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 100 <NA> <NA> <NA> <NA> <NA> 25 NA
2 101 <NA> <NA> <NA> <NA> <NA> 30 NA
3 102 <NA> <NA> <NA> <NA> <NA> 18 NA
```

---

# Add one or more columns: `add_column()`

```r
*demo_data %>% add_column(study_date = "2019-04-10", .after="record") %>%
  head(3)
```

```
# A tibble: 3 x 9
 record study_date age sex grade race4 race7 bmi stweight
 <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 2019-04-10 15 years … Female 10th White White 17.2 54.4
2 333862 2019-04-10 17 years … Female 12th White White 20.2 57.2
3 36253 2019-04-10 18 years … Male 11th Hispani… Hispan… NA NA 
```

```r
*demo_data %>% add_column(id = 1:nrow(demo_data), .before="record") %>%
  head(3)
```

```
# A tibble: 3 x 9
 id record age sex grade race4 race7 bmi stweight
 <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 931897 15 years old Female 10th White White 17.2 54.4
2 2 333862 17 years old Female 12th White White 20.2 57.2
3 3 36253 18 years ol… Male 11th Hispanic… Hispanic… NA NA 
```

---
class: center, inverse, middle

# Quick tips on summarizing data

## categorical data
## numerical data

[janitor](https://cran.r-project.org/web/packages/janitor/readme/README.html),   [dplyr](https://dplyr.tidyverse.org/)

---

# Frequency tables: `janitor` package's `tabyl` function

```r
# default table
demo_data %>% tabyl(grade)
```

```
 grade n percent valid_percent
 10th 4907 0.24535 0.2504338
 11th 4891 0.24455 0.2496172
 12th 4577 0.22885 0.2335919
 9th 5219 0.26095 0.2663570
 <NA> 406 0.02030 NA
```

```r
# output can be treated as tibble
demo_data %>% tabyl(grade) %>% select(-n)
```

```
 grade percent valid_percent
 10th 0.24535 0.2504338
 11th 0.24455 0.2496172
 12th 0.22885 0.2335919
 9th 0.26095 0.2663570
 <NA> 0.02030 NA
```
]
.pull-right[

`adorn_` your table!

```r
demo_data %>%
  tabyl(grade) %>%
* adorn_totals("row") %>%
* adorn_pct_formatting(digits=2)
```

```
 grade n percent valid_percent
 10th 4907 24.54% 25.04%
 11th 4891 24.45% 24.96%
 12th 4577 22.88% 23.36%
 9th 5219 26.10% 26.64%
 <NA> 406 2.03% -
 Total 20000 100.00% 100.00%
```

]

---

# 2x2 `tabyl`s

```r
# default 2x2 table
demo_data %>% tabyl(grade, sex)
```

```
 grade Female Male NA_
 10th 2332 2539 36
 11th 2365 2496 30
 12th 2277 2263 37
 9th 2492 2684 43
 <NA> 126 195 85
```

What adornments does the tabyl to right have?
]
.pull-right-60[

```r
demo_data %>% tabyl(grade, sex) %>%
* adorn_percentages(denominator = "col") %>%
* adorn_totals("row") %>%
* adorn_pct_formatting(digits = 1) %>%
* adorn_ns()
```

```
 grade Female Male NA_
 10th 24.3% (2332) 24.9% (2539) 15.6% (36)
 11th 24.7% (2365) 24.5% (2496) 13.0% (30)
 12th 23.7% (2277) 22.2% (2263) 16.0% (37)
 9th 26.0% (2492) 26.4% (2684) 18.6% (43)
 <NA> 1.3% (126) 1.9% (195) 36.8% (85)
 Total 100.0% (9592) 100.0% (10177) 100.0% (231)
```
]

- Notice `grade` is not sorted in a pleasing way. We will learn how to deal with this when we discuss `factors` as a data type in R.
- Base R has a `table` function, but it is clunkier and the output is not a data frame. 
- See the [tabyl vignette](https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html) for more information, adorn options, & 3-way `tabyl`s

---

# Numerical data summaries: `summarize()`

- We can summarize data as a whole, or in groups with `group_by()`
- `group_by()` is very powerful, see [data wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
- Can also use `summarize_at()`, `summarize_if()`, `summarize_all()`

```r
# summary of all data as a whole
demo_data %>% 
* summarize(bmi_mean =mean(bmi,na.rm=TRUE),
*           bmi_sd = sd(bmi,na.rm=TRUE))
```

```
# A tibble: 1 x 2
 bmi_mean bmi_sd
 <dbl> <dbl>
1 23.5 4.99
```

What does `na.rm=TRUE` do and what happens if we leave it out?
]

```r
# summary by group variable
demo_data %>% 
* group_by(grade) %>%
  summarize(n_per_group = n(), 
            bmi_mean =mean(bmi,na.rm=TRUE),
            bmi_sd = sd(bmi,na.rm=TRUE))
```

```
# A tibble: 5 x 4
 grade n_per_group bmi_mean bmi_sd
 <chr> <int> <dbl> <dbl>
1 <NA> 406 23.5 6.45
2 10th 4907 23.2 4.76
3 11th 4891 23.8 4.89
4 12th 4577 24.2 5.20
5 9th 5219 22.8 4.92
```
]

---

# Combining data sets

<center>
<img src="img/combine_cases.png" width="25%" height="25%"> 
<img src="img/combine_variables.png" width="40%" height="40%"> be careful!!! 
<a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf">dplyr data transformation cheatsheet</a>
</center>

---

# Rows (cases): paste data below each other

`bind_rows()` combines rows from different data sets  
& accounts for different column names

```r
data1
```

```
# A tibble: 2 x 4
 id name height age
 <int> <chr> <dbl> <dbl>
1 1 Nina 2 4
2 2 Yi 1 2
```

```r
data2
```

```
# A tibble: 3 x 4
 id name height years
 <int> <chr> <dbl> <dbl>
1 7 Bo 2 3
2 8 Al 1.7 1
3 9 Juan 1.8 2
```
]
.pull-right[

```r
*bind_rows(data1,data2, .id = "group")
```

```
# A tibble: 5 x 6
 group id name height age years
 <chr> <int> <chr> <dbl> <dbl> <dbl>
1 1 1 Nina 2 4 NA
2 1 2 Yi 1 2 NA
3 2 7 Bo 2 NA 3
4 2 8 Al 1.7 NA 1
5 2 9 Juan 1.8 NA 2
```

<center><img src="img/bind_rows_cheatsheet.png" width="80%" height="80%"> 
<a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf">dplyr data transformation cheatsheet</a>
</center>
]

---

# Columns (variables): *DO NOT USE `bind_cols()`!!*

- `bind_cols()` blindly pastes columns next to each other without preserving order of variables that they have in common  
    + Use `join` to preserve ordering - see next slides

```r
# datasets must have same number of rows to use bind_cols()
demo_sub <- demo_data %>% slice(1:20) # first 20 rows of demo_data
qn_sub <- qn_data %>% slice(1:20) # first 20 rows of qn_data
*bind_cols(demo_sub, qn_sub) # blindly bind columns; 2nd record column got renamed
```

```
# A tibble: 20 x 13
 record age sex grade race4 race7 bmi stweight record1 q8 q12 
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
 1 9.32e5 15 y… Fema… 10th White White 17.2 54.4 922375 Neve… <NA> 
 2 3.34e5 17 y… Fema… 12th White White 20.2 57.2 36718 Neve… <NA> 
 3 3.63e4 18 y… Male 11th Hisp… Hisp… NA NA 187667 Did … <NA> 
 4 1.10e6 15 y… Male 10th Blac… Blac… 28.0 85.7 109951 Did … <NA> 
 5 1.30e6 14 y… Male 9th All … Mult… 24.5 66.7 1109733 Neve… <NA> 
 6 2.62e5 17 y… Male 9th All … <NA> NA NA 761127 Most… <NA> 
 7 9.27e5 16 y… Male 11th All … Asian 20.5 70.3 930811 Neve… <NA> 
 8 1.31e6 17 y… Male 12th White White 19.3 59.0 401921 Rare… <NA> 
 9 5.06e5 18 y… Male 12th Hisp… Hisp… 33.1 123. 42244 Rare… <NA> 
10 1.80e5 14 y… Male 10th Blac… Blac… NA NA 271375 Alwa… <NA> 
11 1.10e6 15 y… Male 9th <NA> <NA> 17.1 45.4 639521 Rare… <NA> 
12 1.31e6 16 y… Male 10th Hisp… Hisp… 21.8 66.7 1309762 Did … 6 to…
13 3.12e4 15 y… Male 10th All … <NA> NA NA 1313278 Did … All …
14 1.09e5 16 y… Male 11th White White NA NA 640176 Did … <NA> 
15 1.81e5 15 y… Male 9th White White NA NA 925706 Did … <NA> 
16 1.31e6 15 y… Fema… 10th White White 22.0 65.8 1306447 Neve… I di…
17 7.70e5 16 y… Fema… 10th White White 32.4 86.2 31079 Neve… <NA> 
18 9.38e5 18 y… Fema… 12th White White 21.7 64.9 185891 Some… <NA> 
19 1.31e6 16 y… Male 11th White White 28.3 102. 270777 Alwa… <NA> 
20 2.72e5 18 y… Fema… 11th White White NA NA 1313665 Neve… 0 da…
# … with 2 more variables: q31 <chr>, qn24 <lgl>
```

---

# *`join`*ing your data sets

.pull-left[
- `Join` uses overlapping or selected columns to combine two or more data sets.
- Also called "merging" or "mutating join". 
- Function names are based off of SQL operations for databases.

<center>
<img src="img/joins_x_y.png" width="90%" height="90%">
</center>
]

.pull-right[
<center>
<img src="img/joins_cheatsheet_expl.png" width="90%" height="90%"> 
<a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf">dplyr data transformation cheatsheet</a>
</center>
]

---

# `join` options visually

<center><img src="img/dplyr_join_venn.png" width="30%" height="30%"> 
<a href="https://twitter.com/yutannihilation/status/551572539697143808">Hiroaki Yutani</a></center>

---

# Most commonly used: `left_join()`

- `left_join(x,y)` includes all observations in `x`, regardless of whether they match ones in `y` or not.
- It includes all columns in `y`, but only rows that match `x`'s observations.

```r
df1 <- tibble(a = c(1, 2), b = 2:1)
df2 <- tibble(a = c(1, 3), c = 10:11)
df1
df2
```

```
# A tibble: 2 x 2
 a b
 <dbl> <int>
1 1 2
2 2 1
# A tibble: 2 x 2
 a c
 <dbl> <int>
1 1 10
2 3 11
```
]
.pull-right[

```r
left_join(df1, df2)
```

```
# A tibble: 2 x 3
 a b c
 <dbl> <int> <int>
1 1 2 10
2 2 1 NA
```

- Which common column(s) were used to merge the datasets?  
- What if we want to specify which columns to join by when merging?  
see next slide...
]
---

# Which columns will be used to join?

- If no columns are specified to join by, then *all* overlapping (intersecting) column names will be used
- Often we want to specify which columns to use, 
    + and also how to rename duplicated columns that were not merged

.pull-right[
<center>
<img src="img/joins_cheatsheet_arguments.png" width="90%" height="90%"> 
<a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf">dplyr data transformation cheatsheet</a>
</center>
]

---

# Check for overlapping column names

Goal: merge the demographics (`demo_data`) and questionnaire (`qn_data`) together.

What column names do these datasets have in common?

```r
colnames(demo_data)
```

```
[1] "record"   "age"      "sex"      "grade"    "race4"    "race7"   
[7] "bmi"      "stweight"
```

```r
colnames(qn_data)
```

```
[1] "record" "q8"     "q12"    "q31"    "qn24"  
```

```r
*intersect(colnames(demo_data), colnames(qn_data))
```

```
[1] "record"
```

---

# Merge `demo_data` and `qn_data` together

```r
merged_data <- 
 full_join(demo_data, qn_data, 
 by = "record")

# Check dimensions of original and new datasets
```

]
.pull-right[

```r
dim(demo_data); dim(qn_data); dim(merged_data) 
```

```
[1] 20000     8
```

```
[1] 10000     5
```

```
[1] 20000    12
```
]

```r
merged_data
```

```
# A tibble: 20,000 x 12
 record age sex grade race4 race7 bmi stweight q8 q12 q31 
 <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
 1 9.32e5 15 y… Fema… 10th White White 17.2 54.4 Neve… <NA> Yes 
 2 3.34e5 17 y… Fema… 12th White White 20.2 57.2 <NA> <NA> <NA> 
 3 3.63e4 18 y… Male 11th Hisp… Hisp… NA NA Neve… <NA> <NA> 
 4 1.10e6 15 y… Male 10th Blac… Blac… 28.0 85.7 Neve… <NA> Yes 
 5 1.30e6 14 y… Male 9th All … Mult… 24.5 66.7 <NA> <NA> <NA> 
 6 2.62e5 17 y… Male 9th All … <NA> NA NA <NA> <NA> <NA> 
 7 9.27e5 16 y… Male 11th All … Asian 20.5 70.3 <NA> <NA> <NA> 
 8 1.31e6 17 y… Male 12th White White 19.3 59.0 Neve… 3 to… No 
 9 5.06e5 18 y… Male 12th Hisp… Hisp… 33.1 123. <NA> <NA> <NA> 
10 1.80e5 14 y… Male 10th Blac… Blac… NA NA Neve… <NA> No 
# … with 19,990 more rows, and 1 more variable: qn24 <lgl>
```

---

# Learn more about `join`ing data

- [Two-table verbs vignette for `dplyr` package](https://dplyr.tidyverse.org/articles/two-table.html)
- [Jenny Bryan's STAT545 dplyr cheatsheet for join](https://stat545.com/bit001_dplyr-cheatsheet.html)
- [R for Data Science's "Relational data" chapter (great diagrams)](https://r4ds.had.co.nz/relational-data.html)

---

# Practice

1. Add a column of `1`'s to `qn_data` called `qn_yes` and save the resulting data as `qn_data2`.

1. Join `demo_data` and `qn_data2` by column `record`. Keep all rows from `demo_data` and only rows from `qn_data2` that match records in `demo_data`. Call the resulting data `all_data`.

1. Create a `tabyl()` of `qn_yes` for the data `all_data`.

1. Create a 2x2 table of `qn_yes` vs `grade`.

Note about the data:

- q8 = How often wear bicycle helmet
- q12 = Texted while driving
- q31 = Ever smoked
- qn24 = Bullied past 12 months

---

# Reshaping data

wide vs. long

.pull-left[
<img src="img/horst_tidyr_spread_gather.png" width="80%" height="80%"> 
[Allison Horst](https://github.com/allisonhorst/stats-illustrations)
]
.pull-right[
<img src="img/hex-tidyr.png" width="70%" height="70%"> 
[tidyr](https://tidyr.tidyverse.org)
]
---

# Wide vs. long data

- __Wide__ data has one row per subject, with multiple columns for their repeated measurements
- __Long__ data has multiple rows per subject, with one column for the measurement variable and another indicating from when/where the repeated measures are from

.pull-left[
wide 
<img src="img/SBP_wide2.png" width="73%" height="73%">
]
.pull-right[
long 
<img src="img/SBP_long2.png" width="30%" height="30%">
]

---

# Example wide dataset

Copy and paste the code below into R to create this example dataset

```r
BP_wide <- tibble(id = letters[1:4],
 sex = c("F", "M", "M", "F"),
 SBP_v1 = c(130, 120, 130, 119),
 SBP_v2 = c(110, 116, 136, 106),
 SBP_v3 = c(112, 122, 138, 118))
BP_wide
```

```
# A tibble: 4 x 5
 id sex SBP_v1 SBP_v2 SBP_v3
 <chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
```

- What do you think the data in the table are measures of?
- How can we tell the data are wide?

---

# Wide to long: `gather()`

```r
BP_wide
```

`gather` columns into rows to make the data long. Need to __specify__:
- __new column names__
    + __key__: stores row names of wide data's gathered columns 
    + __value__: stores data values
- __which columns to gather__
]
.pull-right[

```r
BP_long <- BP_wide %>% 
 gather(key = "visit", value = "SBP", 
 SBP_v1:SBP_v3)
BP_long
```

```
# A tibble: 12 x 4
 id sex visit SBP
 <chr> <chr> <chr> <dbl>
 1 a F SBP_v1 130
 2 b M SBP_v1 120
 3 c M SBP_v1 130
 4 d F SBP_v1 119
 5 a F SBP_v2 110
 6 b M SBP_v2 116
 7 c M SBP_v2 136
 8 d F SBP_v2 106
 9 a F SBP_v3 112
10 b M SBP_v3 122
11 c M SBP_v3 138
12 d F SBP_v3 118
```
]

---

# Long to wide: `spread()`

```r
BP_long
```

]
.pull-right[
`spread` rows into columns to make the data wide. Need to __specify__ which columns in the long data to use:
- __key__ column: has the variable names 
- __value__ column: has the data values

```r
BP_wide2 <- BP_long %>% 
 spread(key = "visit", value = "SBP")
BP_wide2
```

```
# A tibble: 4 x 5
 id sex SBP_v1 SBP_v2 SBP_v3
 <chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
```
]

---

# Clean up long data's visit column (key column)

```r
BP_long
```

```r
BP_long2 <- BP_long %>% 
 mutate(visit = 
 str_replace(visit,"SBP_v","")) 
BP_long2
```

```
# A tibble: 12 x 4
 id sex visit SBP
 <chr> <chr> <chr> <dbl>
 1 a F 1 130
 2 b M 1 120
 3 c M 1 130
 4 d F 1 119
 5 a F 2 110
 6 b M 2 116
 7 c M 2 136
 8 d F 2 106
 9 a F 3 112
10 b M 3 122
11 c M 3 138
12 d F 3 118
```
]

---

# Make cleaned-up long data wide

```r
head(BP_long2, 2)
```

```
# A tibble: 2 x 4
 id sex visit SBP
 <chr> <chr> <chr> <dbl>
1 a F 1 130
2 b M 1 120
```

```r
BP_wide3 <- BP_long2 %>% 
 spread(key = "visit", value = "SBP")
BP_wide3
```

```
# A tibble: 4 x 5
 id sex `1` `2` `3`
 <chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
```
]

*Problem*: have numbers as column names, since `spread`'s default is to use the levels of the `key` as the new row names.

*Solution*: have row names start with the `key` column's name `sep`arated by a character

```r
BP_wide4 <- BP_long2 %>% 
 spread(key = "visit", value = "SBP",
 sep="_") # specify separating character
BP_wide4
```

```
# A tibble: 4 x 5
 id sex visit_1 visit_2 visit_3
 <chr> <chr> <dbl> <dbl> <dbl>
1 a F 130 110 112
2 b M 120 116 122
3 c M 130 136 138
4 d F 119 106 118
```
]

---

# Practice

Copy and paste the code below into R to create the dataset `DBP_wide`

```r
DBP_wide <- tibble(id = letters[1:4],
 sex = c("F", "M", "M", "F"),
 v1.DBP = c(88, 84, 102, 70),
 v2.DBP = c(78, 78, 96, 76),
 v3.DBP = c(94, 82, 94, 74),
 age=c(23, 56, 41, 38)
 )
```

1. Make `DBP_wide` into a long dataframe based on the repeated DBP columns and save it as `DBP_long`.
1. Clean up the visit column of `DBP_long` so that the values are 1, 2, 3, and save it as `DBP_long`.

1. Make `DBP_long` wide with column names `visit.1, visit.2, visit.3` for the DBP values, and save it as `DBP_wide2`.

1. Join `DBP_long` with `BP_long2` so that we have one data frame with columns id, sex, visit, SBP, DBP, and age. Save this as `BP_both_long`.

---

# Data cleaning

## (messy NAs, names, strings, dates, factors)

---

# Removing missing data: `drop_na()`

A small data example:

```r
mydata <- tibble(id = 7:9, 
 name = c("Bo","Al","Juan"), 
 height = c(2, NA, 1.8), 
 years = c(51,35,NA))
mydata
```

```
# A tibble: 3 x 4
 id name height years
 <int> <chr> <dbl> <dbl>
1 7 Bo 2 51
2 8 Al NA 35
3 9 Juan 1.8 NA
```
]

Remove *all* rows with **any missing data**

```r
mydata %>% drop_na()
```

```
# A tibble: 1 x 4
 id name height years
 <int> <chr> <dbl> <dbl>
1 7 Bo 2 51
```

Remove rows with `NA` in **selected columns**

```r
mydata %>% drop_na(height)
```

```
# A tibble: 2 x 4
 id name height years
 <int> <chr> <dbl> <dbl>
1 7 Bo 2 51
2 9 Juan 1.8 NA
```
]

---

# Replace `NA`s with another value: `replace_na()`

```r
mydata
```

```
# A tibble: 3 x 4
 id name height years
 <int> <chr> <dbl> <dbl>
1 7 Bo 2 51
2 8 Al NA 35
3 9 Juan 1.8 NA
```

]
.pull-right-60[

```r
mydata %>% 
* mutate(height = replace_na(height, "Unknown"),
*        years = replace_na(years, 0) )
```

```
# A tibble: 3 x 4
 id name height years
 <int> <chr> <chr> <dbl>
1 7 Bo 2 51
2 8 Al Unknown 35
3 9 Juan 1.8 0
```
]

---

# `replace_na()` advanced example

Replaces `NAs` in all columns starting with "q" with the string "No answer"

```r
qn_data %>% 
* mutate_at(vars(starts_with("q")),
*           .funs = list(~replace_na(.,"No answer"))) %>%
  tabyl(q8, q31)
```

```
                             q8   No No answer  Yes
           Always wore a helmet  411        58  216
         Did not ride a bicycle 1318       343 1223
 Most of the time wore a helmet  320        31  173
            Never wore a helmet 1429       418 1672
                      No answer  267       212  300
           Rarely wore a helmet  481        73  405
        Sometimes wore a helmet  351        53  246
```

---

# Convert (i.e. "No answer", 9999, etc) to `NA`: `na_if()`

```r
all_data %>% tabyl(race4)
```

```
 race4 n percent valid_percent
 All other races 4713 0.23565 0.2443235
 Black or African American 4093 0.20465 0.2121825
 Hispanic/Latino 4670 0.23350 0.2420943
 White 5814 0.29070 0.3013997
 <NA> 710 0.03550 NA
```

```r
all_data %>%
* mutate(race4 = na_if(race4, "All other races")) %>%
  tabyl(race4)
```

```
 race4 n percent valid_percent
 Black or African American 4093 0.20465 0.2807848
 Hispanic/Latino 4670 0.23350 0.3203677
 White 5814 0.29070 0.3988475
 <NA> 5423 0.27115 NA
```

---
# `na_if()` for all your data

**Avoid this** by reading in your data correctly:

```r
smalldata <- read_csv("data/small_data.csv",
* na = c("","9999","NA")) # specify your own missing values
```

**Otherwise** `na_if()` everything:

```r
# replace all "" with NA
all_data %>%
* mutate_if(is.character, .funs = na_if(.,""))

# replace all 9999's with NA 
all_data %>%
* mutate_if(is.numeric, .funs = na_if(.,9999))
```

---

# Working with character strings

- [Use the package `stringr`](https://stringr.tidyverse.org/) (loaded with `tidyverse`)
- Paste strings or values together [with package `glue`](https://glue.tidyverse.org/) (installed, not loaded w/ `tidyverse`)
- *advanced tip*: learn ["regular expressions"](https://stringr.tidyverse.org/articles/regular-expressions.html) ([regex](https://xkcd.com/208/)) for pattern matching (see [cheatsheet](https://www.rstudio.com/resources/cheatsheets/#stringr)) and matching multiple characters/strings at once

<center><img src="img/stringr_cheatsheet_clip.png" width="100%" height="90%"></center>
[stringr cheatsheet](https://www.rstudio.com/resources/cheatsheets/#stringr)

---

# `str_detect()` find strings

```r
mydata <- tibble(name = c("J.M.","Ella","Jay"), state = c("New Mexico","New York","Oregon"))
```

Filter based on string detection

```r
mydata %>% filter(str_detect(name,"J"))       
```

```
# A tibble: 2 x 2
 name state 
 <chr> <chr> 
1 J.M. New Mexico
2 Jay Oregon 
```
]
.pull-right[

Creates a column of TRUE/FALSE if detected

```r
mydata %>% mutate(
  new_state = str_detect(state,"New"))    
```

```
# A tibble: 3 x 3
 name state new_state
 <chr> <chr> <lgl> 
1 J.M. New Mexico TRUE 
2 Ella New York TRUE 
3 Jay Oregon FALSE 
```
]
---

# `str_replace_all()`, `str_replace()`

```r
mydata %>% mutate(state_old = str_replace_all(state, "New", "Old"))
```

```
# A tibble: 3 x 3
 name state state_old 
 <chr> <chr> <chr> 
1 J.M. New Mexico Old Mexico
2 Ella New York Old York 
3 Jay Oregon Oregon 
```

```r
mydata %>% mutate(
  name2 = str_replace(name, "l", "-"),           # first instance
  name3 = str_replace_all(name, "l", "-"),       # all instances
  name4 = str_replace_all(name, fixed("."), "")) # special characters with fixed()
```

```
# A tibble: 3 x 5
 name state name2 name3 name4
 <chr> <chr> <chr> <chr> <chr>
1 J.M. New Mexico J.M. J.M. JM 
2 Ella New York E-la E--a Ella 
3 Jay Oregon Jay Jay Jay 
```

---

# `str_sub()`: shorten strings

Based on position `1` (`start = 1`) to length of string (`end = -1`)

```r
mydata %>% mutate(
  short_name  = str_sub(name, start = 1, end = 2),   # specify start to end
  short_name2 = str_sub(name, end = 2),              # specify only end
  short_state = str_sub(state, end = -3)             # negative endices, from end
  )
```

```
# A tibble: 3 x 5
 name state short_name short_name2 short_state
 <chr> <chr> <chr> <chr> <chr> 
1 J.M. New Mexico J. J. New Mexi 
2 Ella New York El El New Yo 
3 Jay Oregon Ja Ja Oreg 
```

---

# Paste strings together with `glue()`

- `paste()` is the base R way of pasting strings (surprise, it's hard to use)
- `glue()` is most useful when pasting data columns together
- **column *names* or function *operations* go inside `{}`**
- See the [glue vignette](https://glue.tidyverse.org/index.html)

```r
all_data %>%
* mutate(info = glue("Student {record} is {age} with BMI = {round(bmi,1)}")) %>%
  select(record, info) %>% head(5)
```

```
# A tibble: 5 x 2
 record info 
 <dbl> <S3: glue> 
1 931897 Student 931897 is 15 years old with BMI = 17.2 
2 333862 Student 333862 is 17 years old with BMI = 20.2 
3 36253 Student 36253 is 18 years old or older with BMI = NA
4 1095530 Student 1095530 is 15 years old with BMI = 28 
5 1303997 Student 1303997 is 14 years old with BMI = 24.5 
```

---

# Using `glue` to summarize data

- Useful for tables (will cover this more in another session)
- Example, calculate the S.E. of the mean and create a column with "mean (SE)" of bmi:

```r
demo_data %>% 
  group_by(sex) %>%
  summarize(n_sex = n(),
            bmi_mean = mean(bmi,na.rm=TRUE),
            bmi_sd = sd(bmi,na.rm=TRUE)) %>%
* mutate(bmi_mean_se = glue("{round(bmi_mean,1)} ({signif(bmi_sd/sqrt(n_sex),2)})"))
```

```
# A tibble: 3 x 5
 sex n_sex bmi_mean bmi_sd bmi_mean_se 
 <chr> <int> <dbl> <dbl> <S3: glue> 
1 <NA> 231 NaN NaN NaN (NaN) 
2 Female 9592 23.3 4.96 23.3 (0.051)
3 Male 10177 23.7 5.01 23.7 (0.05) 
```

---

# Wrangle dates with `lubridate`

.pull-left[
- Convert characters to special "Date" type
- Convert *terrible excel date formats* into workable data
- Easy date magic examples:
    + add and subtract dates
    + convert to minutes/years/etc
    + change timezones
    + add 1 month to a date...
- [`lubridate` cheat sheet](https://www.rstudio.com/resources/cheatsheets/#lubridate)
- `read_csv` and `read_excel` etc automatically import dates correctly

]

.pull-right[
<center><img src="img/horst_lubridate.png" width="100%" height="100%"><a href="https://github.com/allisonhorst/stats-illustrations"> Allison Horst</a>
</center>
]

---

# What kind of date do you have?

.pull-left-40[
<center><img src="img/lubridate_parse_date_times.png" width="100%" height="100%"></center>
[`lubridate` cheat sheet](https://www.rstudio.com/resources/cheatsheets/#lubridate)
]
.pull-right-60[

```r
timedata <- 
 tibble(name = c("Yi","Bo","DJ"), 
 dob=c("10/31/1952","1/12/1984","2/02/2002"))
timedata %>% 
 mutate(dob_date = mdy(dob),
 dob_wrong = dmy(dob)) # wrong order
```

```
# A tibble: 3 x 4
 name dob dob_date dob_wrong 
 <chr> <chr> <date> <date> 
1 Yi 10/31/1952 1952-10-31 NA 
2 Bo 1/12/1984 1984-01-12 1984-12-01
3 DJ 2/02/2002 2002-02-02 2002-02-02
```
]

---

# Math with dates

```r
timedata %>% mutate(
  dob = mdy(dob),                            # convert to a date
  dob_year = year(dob),                      # extract the year
  time_since_birth = dob %--% today(),       # create an "interval"
  age = time_since_birth %/% years(1),       # modulus on "years"
  dobplus = dob + days(10)                   # add 10 days
  )                  
```

```
# A tibble: 3 x 6
 name dob dob_year time_since_birth age dobplus 
 <chr> <date> <dbl> <S4: Interval> <dbl> <date> 
1 Yi 1952-10-31 1952 1952-10-31 UTC--2019-05-23 UTC 66 1952-11-10
2 Bo 1984-01-12 1984 1984-01-12 UTC--2019-05-23 UTC 35 1984-01-22
3 DJ 2002-02-02 2002 2002-02-02 UTC--2019-05-23 UTC 17 2002-02-12
```

---

# Factors - categorical data

.pull-left-40[
- Clean and order factors with `forcats` package
- Will go over this for `ggplot2` (visualization), statistical modeling (i.e. for `lm()`), and probably a workshop for creating tables
- See [`forcats` cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/factors.pdf)
and [`forcats` vignette]()
]
.pull-right-60[
<center><img src="img/forcats_cheatsheet.png" width="90%" height="90%"></center>
]

---

# `forcats` examples - specify levels `fct_relevel()`

```r
mydata <- tibble(
 id = 1:4, 
 grade=c("9th","10th","11th","9th")) %>%
* mutate(grade_fac = factor(grade))
levels(mydata$grade_fac)
```

```
[1] "10th" "11th" "9th" 
```

```r
mydata %>% arrange(grade_fac)
```

```
# A tibble: 4 x 3
 id grade grade_fac
 <int> <chr> <fct> 
1 2 10th 10th 
2 3 11th 11th 
3 1 9th 9th 
4 4 9th 9th 
```

]

```r
mydata <- mydata %>% 
 mutate(
* grade_fac =
* fct_relevel(grade_fac,
* c("9th","10th","11th")))
levels(mydata$grade_fac)
```

```
[1] "9th"  "10th" "11th"
```

```r
mydata %>% arrange(grade_fac)
```

```
# A tibble: 4 x 3
 id grade grade_fac
 <int> <chr> <fct> 
1 1 9th 9th 
2 4 9th 9th 
3 2 10th 10th 
4 3 11th 11th 
```

]

---

# `forcats` examples - collapse levels

```r
mydata <- tibble(loc = c("SW","NW","NW","NE","SE","SE"))

mydata %>% mutate(
  loc_fac = factor(loc),
* loc2 = fct_collapse(loc_fac,                         # collapse levels
*                     south = c("SW","SE"),
*                     north = c("NE","NW")),
  loc3 = fct_lump(loc_fac, n=2, other_level = "other") # most common 2 levels + other
  )
```

```
# A tibble: 6 x 4
 loc loc_fac loc2 loc3 
 <chr> <fct> <fct> <fct>
1 SW SW south other
2 NW NW north NW 
3 NW NW north NW 
4 NE NE north other
5 SE SE south SE 
6 SE SE south SE 
```

---

# Other "janitor" work

---

# Clean messy column names with `clean_names()`

```r
mydata <- tibble("First Name"= c("Yi","DJ"), "last init" = c("C","R"),
 "% in" = c(0.1, 0.5), "ñ$$$"= 1:2, " "=3:2," hi"=c("a","b"), 
 "null"=c(NA,NA))
mydata
```

```
# A tibble: 2 x 7
 `First Name` `last init` `% in` `ñ$$$` ` ` ` hi` null 
 <chr> <chr> <dbl> <int> <int> <chr> <lgl>
1 Yi C 0.1 1 3 a NA 
2 DJ R 0.5 2 2 b NA 
```

```r
*mydata %>% clean_names() %>%        # in the janitor package
  remove_empty(c("rows","cols"))    # also useful
```

```
# A tibble: 2 x 6
 first_name last_init percent_in n x hi 
 <chr> <chr> <dbl> <int> <int> <chr>
1 Yi C 0.1 1 3 a 
2 DJ R 0.5 2 2 b 
```

---

# Clean names of your excel sheet

```r
library(readxl)
read_excel("data/messy_names.xlsx", .name_repair = janitor::make_clean_names)
```

```
# A tibble: 20,000 x 9
 record age grade_string race race_2 bmi_kg_m_2 weight_kg x x_2 
 <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
 1 931897 15 y… 10th White White 17.2 54.4 some… <NA> 
 2 333862 17 y… 12th White White 20.2 57.2 notes <NA> 
 3 36253 18 y… 11th Hisp… Hispa… NA NA I th… <NA> 
 4 1095530 15 y… 10th Blac… Black… 28.0 85.7 why WHY? 
 5 1303997 14 y… 9th All … Multi… 24.5 66.7 <NA> <NA> 
 6 261619 17 y… 9th All … <NA> NA NA <NA> <NA> 
 7 926649 16 y… 11th All … Asian 20.5 70.3 <NA> <NA> 
 8 1309082 17 y… 12th White White 19.3 59.0 <NA> <NA> 
 9 506337 18 y… 12th Hisp… Hispa… 33.1 123. <NA> <NA> 
10 180494 14 y… 10th Blac… Black… NA NA <NA> <NA> 
# … with 19,990 more rows
```

---

# Practice

Copy and paste the code below into R to create the dataset `messy_data`

```r
messy_data <- tibble(NAME = c("J N","A C","D E"), 
 `months follow up` = c("", 10, 11), 
 `Date of visit` = c("July 31, 2003", "Nov 12, 2005", "Aug 3, 2007"))
```

1. Clean column names with `clean_names()`.

1. Replace missing ("") data in `months_follow_up` with NA.

1. Convert `months_follow_up` to a numeric variable.

1. Convert `date_of_visit` to a date.

1. Create a column called `date_last_visit` that is the date of visit *plus* months of follow up.

1. Remove rows (cases) with missing data in `months_follow_up`.

1. Remove the spaces in `name`.

---

# This was a *lot*, but learning R gets easier!

<center><img src="img/horst_r_first_then.png" width="40%" height="90%"></center>[Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

# Resources - `tidyverse` & data wrangling

Links

- [Learn the tidyverse](https://www.tidyverse.org/learn/)
- [Data wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

Some of this is drawn from materials in online books/lessons:

- [R for Data Science](https://r4ds.had.co.nz/index.html) - by Garrett Grolemund & Hadley Wickham
- [Modern Dive](https://moderndive.com/) - An Introduction to Statistical and Data Sciences via R by Chester Ismay & Albert Kim
- [A gRadual intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse) - Workshop for Cascadia R 2017 by Chester Ismay and Ted Laderas
- ["Tidy Data" by Hadley Wickham](https://vita.had.co.nz/papers/tidy-data.pdf)
---

# Possible future workshop topics:

- reproducible reports in R (probably next)
  + R Markdown and knitr
  + Create dynamic Word, html, pdf documents with code + output
- tables
- ggplot2 visualization
- advanced tidyverse: functions, purrr
- statistical modeling in R

**Fill out feedback forms to suggest more or help us prioritize!**

---

## Contact info:

Jessica Minnier: _minnier@ohsu.edu_

Meike Niederhausen: _niederha@ohsu.edu_

## This workshop info:

- Code for these slides on github: [jminnier/berd_r_courses](https://github.com/jminnier/berd_r_courses)
- all the [R code in an R script](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part2.R)
- answers to practice problems can be found here: [html](https://jminnier-berd-r-courses.netlify.com/02-data-wrangling-tidyverse/02_data_wrangling_slides_part2_practice_solutions.html)