Data Wrangling in R with the Tidyverse (Part 2)

class: left, middle, inverse, title-slide

# Data Wrangling in R with the Tidyverse (Part 2) - Practice Solutions
### Jessica Minnier, PhD & Meike Niederhausen, PhD <a href="https://www.ohsu.edu/xd/research/centers-institutes/octri/education-training/octri-research-forum.cfm">OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop</a> 
### 2019/04/25 (Part 2) and again! 2019/05/23 (Part 2) solutions: <a href="http://bit.ly/berd_tidy2_solns">bit.ly/berd_tidy2_solns</a> slides: <a href="http://bit.ly/berd_tidy2">bit.ly/berd_tidy2</a> pdf: <a href="http://bit.ly/berd_tidy2_pdf">bit.ly/berd_tidy2_pdf</a>

---

layout: true

---

# Load the data and packages

```r
# install.packages("tidyverse","janitor","glue") 
library(tidyverse)
library(lubridate) 
library(janitor)
library(glue)
demo_data <- read_csv("data/yrbss_demo.csv")
qn_data <- read_csv("data/yrbss_qn.csv")
```

---

# Practice 1

1. Add a column of `1`'s to `qn_data` called `qn_yes` and save the resulting data as `qn_data2`.

1. Join `demo_data` and `qn_data2` by column `record`. Keep all rows from `demo_data` and only rows from `qn_data2` that match records in `demo_data`. Call the resulting data `all_data`.

1. Create a `tabyl()` of `qn_yes` for the data `all_data`.

1. Create a 2x2 table of `qn_yes` vs `grade`.

Note about the data:

- q8 = How often wear bicycle helmet
- q12 = Texted while driving
- q31 = Ever smoked
- qn24 = Bullied past 12 months

---

```r
qn_data2 <- qn_data %>% add_column(qn_yes = 1)
all_data <- left_join(demo_data, qn_data2)
all_data %>% tabyl(qn_yes)
```

```
 qn_yes     n percent valid_percent
      1 10000     0.5             1
     NA 10000     0.5            NA
```

```r
all_data %>% tabyl(qn_yes,grade)
```

```
 qn_yes 10th 11th 12th  9th NA_
      1 2443 2498 2287 2573 199
     NA 2464 2393 2290 2646 207
```

---

# Practice 2

1. Make `DBP_wide` into a long dataframe based on the repeated DBP columns and save it as `DBP_long`.
1. Clean up the visit column of `DBP_long` so that the values are 1, 2, 3, and save it as `DBP_long`.

1. Make `DBP_long` wide with column names `visit.1, visit.2, visit.3` for the DBP values, and save it as `DBP_wide2`.

1. Join `DBP_long` with `BP_long2` so that we have one data frame with columns id, sex, visit, SBP, DBP, and age. Save this as `BP_both_long`.

---

# Practice 2 Initial Data

Copy and paste the code below into R to create the datasets:

```r
DBP_wide <- tibble(id = letters[1:4],
 sex = c("F", "M", "M", "F"),
 v1.DBP = c(88, 84, 102, 70),
 v2.DBP = c(78, 78, 96, 76),
 v3.DBP = c(94, 82, 94, 74),
 age=c(23, 56, 41, 38)
 )
BP_wide <- tibble(id = letters[1:4],
 sex = c("F", "M", "M", "F"),
 SBP_v1 = c(130, 120, 130, 119),
 SBP_v2 = c(110, 116, 136, 106),
 SBP_v3 = c(112, 122, 138, 118))
BP_long <- BP_wide %>% 
 gather(key = "visit", value = "SBP", SBP_v1:SBP_v3)
BP_long2 <- BP_long %>% 
 mutate(visit = str_replace(visit,"SBP_v","")) 
```
---

# Practice 2 solutions (1/2)

.pull-left[

```r
DBP_long <- DBP_wide %>%
 gather(key = "visit", value = "DBP", 
 v1.DBP, v2.DBP, v3.DBP) %>%
 mutate(visit = str_replace(
 visit, c("v"), "")) %>%
 mutate(visit = str_replace(
 visit, ".DBP", "")) 
DBP_long
```

```
# A tibble: 12 x 5
 id sex age visit DBP
 <chr> <chr> <dbl> <chr> <dbl>
 1 a F 23 1 88
 2 b M 56 1 84
 3 c M 41 1 102
 4 d F 38 1 70
 5 a F 23 2 78
 6 b M 56 2 78
 7 c M 41 2 96
 8 d F 38 2 76
 9 a F 23 3 94
10 b M 56 3 82
11 c M 41 3 94
12 d F 38 3 74
```
]
.pull-right[

```r
DBP_wide2 <- DBP_long %>% 
 spread(
 key = "visit", value = "DBP",
 sep=".") # specify separating character
DBP_wide2
```

```
# A tibble: 4 x 6
 id sex age visit.1 visit.2 visit.3
 <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a F 23 88 78 94
2 b M 56 84 78 82
3 c M 41 102 96 94
4 d F 38 70 76 74
```
]
---

# Practice 2 solutions (2/2)

```r
BP_both_long <- left_join(BP_long2, DBP_long, by = c("id", "sex", "visit"))
BP_both_long
```

```
# A tibble: 12 x 6
 id sex visit SBP age DBP
 <chr> <chr> <chr> <dbl> <dbl> <dbl>
 1 a F 1 130 23 88
 2 b M 1 120 56 84
 3 c M 1 130 41 102
 4 d F 1 119 38 70
 5 a F 2 110 23 78
 6 b M 2 116 56 78
 7 c M 2 136 41 96
 8 d F 2 106 38 76
 9 a F 3 112 23 94
10 b M 3 122 56 82
11 c M 3 138 41 94
12 d F 3 118 38 74
```

---

# Practice 3

```r
messy_data <- tibble(NAME = c("J N","A C","D E"), 
 `months follow up` = c("", 10, 11), 
 `Date of visit` = c("July 31, 2003", "Nov 12, 2005", "Aug 3, 2007"))
```

1. Clean column names with `clean_names()`.
1. Replace missing ("") data in `months_follow_up` with NA.
1. Convert `months_follow_up` to a numeric variable.
1. Convert `date_of_visit` to a date.
1. Create a column called `date_last_visit` that is the date of visit *plus* months of follow up.
1. Remove rows (cases) with missing data in `months_follow_up`.
1. Remove the spaces in `name`.

```r
messy_data
```

```
# A tibble: 3 x 3
 NAME `months follow up` `Date of visit`
 <chr> <chr> <chr> 
1 J N "" July 31, 2003 
2 A C 10 Nov 12, 2005 
3 D E 11 Aug 3, 2007 
```

---

# Practice solutions 3 (1/2)

```r
clean_data <- messy_data %>% 
 clean_names() %>%
 mutate(
 months_follow_up = replace_na(months_follow_up,""),
 months_follow_up = as.numeric(months_follow_up),
 date_of_visit = mdy(date_of_visit),
 date_last_visit = date_of_visit + months(months_follow_up))
clean_data
```

```
# A tibble: 3 x 4
 name months_follow_up date_of_visit date_last_visit
 <chr> <dbl> <date> <date> 
1 J N NA 2003-07-31 NA 
2 A C 10 2005-11-12 2006-09-12 
3 D E 11 2007-08-03 2008-07-03 
```

---

# Practice solutions 3 (2/2)

```r
clean_data <- clean_data %>%
 drop_na(months_follow_up) %>%
 mutate(name = str_replace_all(name," ",""))
clean_data
```

```
# A tibble: 2 x 4
 name months_follow_up date_of_visit date_last_visit
 <chr> <dbl> <date> <date> 
1 AC 10 2005-11-12 2006-09-12 
2 DE 11 2007-08-03 2008-07-03 
```