class: center, middle, inverse, title-slide

# Getting Started with R and RStudio

### Jessica Minnier, PhD & Meike Niederhausen, PhD

### OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop

2019/02/26 & 2019/03/07
Slides available at
http://bit.ly/berd_r_intro
pdf version:
http://bit.ly/berd_r_intro_pdf

---

# Pre-course installation

## Install R

- Windows: Download from https://cran.rstudio.com/bin/windows/base/
- Mac OS X: Download the latest .pkg file (currently R-3.5.2.pkg) from https://cran.rstudio.com/bin/macosx/

## Install RStudio Desktop Open Source License

- Select the download file corresponding to your operating system from https://www.rstudio.com/products/rstudio/download/#download

---

# Questions

- Who has used R?
- What other statistical software have you used?
- Has anyone used other programming languages (C, Java, Python, etc.)?
- Why do you want to learn R?

---

# Learning Objectives

- Basic operations in R/RStudio
- Understand data structures
- Be able to load in data
- Basic operations on data
- Be able to make a plot
- Know how to get help

---

class: center, inverse, middle

# Introduction

Rrrrrr?

---

# What is R?

.pull-left[
- A programming language
- Focus on statistical modeling and data analysis
    + import data, manipulate data, run statistics, make plots
- Useful for "Data Science"
- Great visualizations
- Also useful for most anything else you'd want to tell a computer to do
- Interfaces with other languages, e.g. Python, C++, bash
]

.pull-right[
![](img/R_logo.png)
]

For the history and details: [Wikipedia](https://bit.ly/1efFmaY)

- an interpreted language (run it through a command line)
- procedural programming with functions
- Why "R"?? Scheme (?) inspired S (invented at Bell Labs in 1976) which inspired R (**free and open source!** in 1993)

---

# What is RStudio?

![](img/01_md_rstudio.png)

- R is a programming language
- RStudio is an integrated development environment (IDE) = an interface to use R (with perks!)

from [Modern Dive](https://moderndive.com/2-getting-started.html); see also [DataCamp's video discussion on the difference](https://campus.datacamp.com/courses/working-with-the-rstudio-ide-part-1/orientation?ex=1)

---

# Start RStudio

![](img/01_md_r.png)

from [Modern Dive](https://moderndive.com/2-getting-started.html)

---

# RStudio anatomy

![](img/RStudio_Anatomy.svg)

from [Emma Rand](http://www-users.york.ac.uk/~er13/17C%20-%202018/pracs/01IntroductionToModuleAndRStudio.html#what_are_r_and_rstudio)

---

# RStudio demo

---

class: inverse, middle, center

# Let's code!

---

# Coding in the console

.pull-left[
__Typing and executing code in the console__

* Type code in the console
* Press __return__ to execute the code

<br>

_Coding in the console is not advisable for most situations!_

* We only recommend this for short pieces of code that you don't need to save
]

.pull-right[
```r
> 7
```

```
[1] 7
```

```r
> 3 + 5
```

```
[1] 8
```

```r
> "hello"
```

```
[1] "hello"
```

```r
> # this is a comment, nothing happens
> # 5 - 8
```
]

---

# We can do math

.pull-left[
```r
> 10^2
```

```
[1] 100
```

```r
> 3 ^ 7
```

```
[1] 2187
```

```r
> 6/9
```

```
[1] 0.6666667
```

```r
> 9-43
```

```
[1] -34
```
]

--

.pull-right[
R follows the usual rules for order of operations and ignores spaces between numbers (or objects) and operators

```r
> 4^3-2* 7+9 /2
```

```
[1] 54.5
```

The expression above is computed as `$$4^3 - (2 \cdot 7) + \frac{9}{2}$$`
]

---

# Logarithms and exponentials

.pull-left[
Logarithms: `log()` is base `\(e\)`

```r
> log(10)
```

```
[1] 2.302585
```

```r
> log10(10)
```

```
[1] 1
```
]

--

.pull-right[
Exponentials

```r
> exp(1)
```

```
[1] 2.718282
```

```r
> exp(0)
```

```
[1] 1
```
]

<br>

--

Check that `log()` is base `\(e\)`

```r
> log(exp(1))
```

```
[1] 1
```

---

# Variables

Data, information, everything is stored as a variable

* You can assign a value to a variable using either `=` or `<-`
    - Using `<-` is preferable

.pull-left[
Assigning just one value:

```r
> x = 5
> x
```

```
[1] 5
```

```r
> x <- 5
> x
```

```
[1] 5
```
]

--

.pull-right[
Assigning a __vector__ of values

* Consecutive integers

```r
> a <- 3:10
> a
```

```
[1]  3  4  5  6  7  8  9 10
```

* __Concatenate__ a series of numbers with `c()`

```r
> b <- c(5, 12, 2, 100, 8)
> b
```

```
[1]   5  12   2 100   8
```
]

---

# We can do math with variables

.pull-left[
Math using variables with just one value

```r
> x <- 5
> x
```

```
[1] 5
```

```r
> x + 3
```

```
[1] 8
```

```r
> y <- x^2
> y
```

```
[1] 25
```
]

--

.pull-right[
Math on vectors of values: element-wise computation

```r
> a <- 3:6
> a
```

```
[1] 3 4 5 6
```

```r
> a+2
```

```
[1] 5 6 7 8
```

```r
> a*3
```

```
[1]  9 12 15 18
```

```r
> a*a
```

```
[1]  9 16 25 36
```
]

---

# Variables can include text (characters)

```r
> hi <- "hello"
> hi
```

```
[1] "hello"
```

```r
> greetings <- c("Guten Tag", "Hola", hi)
> greetings
```

```
[1] "Guten Tag" "Hola"      "hello"
```

---

# Viewing list of defined variables

<!-- __List of defined variables (and other objects)__ -->

* The R command to see what objects have been defined is `ls()`.
* This list includes all defined objects (including data frames, functions, etc.)

```r
> ls()
```

```
[1] "a"         "b"         "greetings" "hi"        "x"         "y"
```

* You can also look at the list in the Environment window:

![](img/01_ls_screenshot.png)

---

# Removing defined variables

* The R command to delete an object is `rm()`.

```r
> ls()
```

```
[1] "a"         "b"         "greetings" "hi"        "x"         "y"
```

```r
> rm("greetings", hi) # Can run with or without quotes
> ls()
```

```
[1] "a" "b" "x" "y"
```

* Remove EVERYTHING - _Be careful!!_

```r
> rm(list=ls())
> ls()
```

```
character(0)
```

* You can also remove everything using the _Clear Workspace_ option in the _Session_ menu.

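
The `rm()` step above can also be checked object by object. The sketch below is an illustration rather than part of the original workshop code: the object names `tmp_a` and `tmp_b` are hypothetical, and `exists()` is used only to confirm whether a name is still defined.

```r
# Hypothetical objects, created only for this illustration
tmp_a <- 3:6
tmp_b <- "hello"

# exists() takes the object name as a string and returns TRUE or FALSE
exists("tmp_a")

# Remove a single object; rm() accepts names with or without quotes
rm(tmp_a)

# tmp_a is gone, tmp_b is untouched
exists("tmp_a")
exists("tmp_b")
```

Removing objects one at a time like this is usually safer than `rm(list=ls())`, since nothing else in the workspace is affected.
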

---

# Common console errors

__Incomplete commands__

.pull-left[
* When the console is waiting for a new command, the prompt line begins with `>`
    + If the console prompt is `+`, then a previous command is incomplete
    + You can finish typing the command in the console window
]

.pull-right[
Example:

```r
> 3 + (2*6
+ )
```

```
[1] 15
```
]

--

__Object is not found__

* This happens when text is entered for a non-existent variable (object)

Example:

```r
> hello
```

```
Error in eval(expr, envir, enclos): object 'hello' not found
```

---

class: inverse, center, middle

# R scripts (save your work!)

---

# Coding in a script (1/3)

<!-- * Note that both of these options show the keyboard shortcut for your operating system -->

* __Create a new script__ by
    + selecting `File -> New File -> R Script`,
    + or clicking on ![](img/01_Script_create.png) (the leftmost button at the top of the scripting window), and then selecting the first option `R Script`

* __Type code__ in the script
    - Type each R command on its own line
    - Use `#` to convert text to comments so that text doesn't accidentally get executed as an R command

![](img/01_Scripting_practice1.png)

---

# Coding in a script (2/3)

* __Select code__ you want to execute, by
    - placing the cursor in the line of code you want to execute,
    - or highlighting the code you want to execute

* __Execute code__ in the script, by
    - clicking on the ![](img/01_Script_Run.png) button in the top right corner of the scripting window,
    - or typing one of the following key combinations to execute the code
        + __Windows__: __ctrl + return__
        + __Mac__: __command + return__

![](img/01_Scripting_practice2.png)

---

# Coding in a script (3/3)

* The screenshot below shows code in the scripting window (top left window)
* The executed highlighted code and its output appear in the console window (bottom left window)

![](img/01_Scripting_practice3.png)

---

# Saving a script

* __Save a script__ by
    + selecting `File -> Save`,
    + or clicking on ![](img/01_Script_Save.png) (towards the left above the scripting window)

* You will need to specify
    + a __filename__ to save the script as
        - ALWAYS use __.R__ as the filename extension for R scripts
    + the __folder__ to save the script in

---

class: center, inverse, middle

# Practice time!

---

# Practice questions

1. Create a vector of all integers from 4 to 10, and save it as `a1`.
2. Create a vector of _even_ integers from 4 to 10, and save it as `a2`.
3. What is the sum of `a1` and `a2`?
4. What does the command `sum(a1)` do?
5. What does the command `length(a1)` do?
6. Use these two commands to calculate the average of the values in `a1`.
7. The formula for the sum of the first `\(n\)` integers is `\(n(n+1)/2\)`. Compute the sum of all integers from 1 to 100 to verify that this formula holds for `\(n=100\)`.
8. Compute the sum of the squares of all integers from 1 to 100.
9. Take a break!

---

class: inverse, middle, center

# Object types

---

# Data frames

__Vectors__ vs. __data frames__: a data frame is a collection (or array or table) of vectors of the same length

```r
> df <- data.frame(IDs=1:3,
+                  gender=c("male", "female", "Male"),
+                  age=c(28, 35.5, 31),
+                  trt = c("control", "1", "1"),
+                  Veteran = c(FALSE, TRUE, TRUE))
> df
```

```
  IDs gender  age     trt Veteran
1   1   male 28.0 control   FALSE
2   2 female 35.5       1    TRUE
3   3   Male 31.0       1    TRUE
```

* A data frame allows different columns to be of different data types (e.g. numeric vs. text). Numeric and text values can even be mixed within a column, but they are then all stored together as text.
* Vectors and data frames are examples of _objects_ in R.
    + There are other types of R objects to store data, such as matrices, lists, and tibbles.
    + These will be discussed in future R workshops.

---

# Variable types

* integer: integer-valued numbers
* numeric: numbers that are decimals
* factor: how categorical variables are stored
* character: text
* logical (TRUE, FALSE)

Each variable (column) in a data frame can be of a different type.

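
As a small illustration of the types listed above (not part of the original workshop code), `class()` reports the type of a value. The values below are made up for this sketch; the `L` suffix is how R writes an integer literal.

```r
# Made-up values, one per variable type listed above
class(3L)                        # integer ("L" marks an integer literal)
class(3.5)                       # numeric
class(factor(c("a", "b", "a")))  # factor
class("some text")               # character
class(TRUE)                      # logical
```

One version-related caveat: the output in these slides comes from R 3.5, where `data.frame()` and `read.csv()` convert text columns to factors by default. In R 4.0.0 and later the default is `stringsAsFactors = FALSE`, so `str()` will show `chr` for those columns unless `stringsAsFactors = TRUE` is passed explicitly.
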
* View the __structure__ of our data frame to see what the variable types are:

```r
> str(df)
```

```
'data.frame':	3 obs. of  5 variables:
 $ IDs    : int  1 2 3
 $ gender : Factor w/ 3 levels "female","male",..: 2 1 3
 $ age    : num  28 35.5 31
 $ trt    : Factor w/ 2 levels "1","control": 2 1 1
 $ Veteran: logi  FALSE TRUE TRUE
```

<!-- * Note that the ID column is _integer_ type since the values are all whole numbers, although we likely would think of it as being a categorical variable and thus prefer it to be a factor. -->

---

# Data frame cells, rows, or columns

<!-- * Our data frame `df` -->

.pull-left[
Show whole data frame

```r
> df
```

```
  IDs gender  age     trt Veteran
1   1   male 28.0 control   FALSE
2   2 female 35.5       1    TRUE
3   3   Male 31.0       1    TRUE
```

Specific cell value: `DatSetName[row#, column#]`

```r
> # Second row, Third column
> df[2, 3]
```

```
[1] 35.5
```
]

.pull-right[
Entire column: `DatSetName[, column#]`

```r
> # Third column
> df[, 3]
```

```
[1] 28.0 35.5 31.0
```

Entire row: `DatSetName[row#, ]`

```r
> # Second row
> df[2,]
```

```
  IDs gender  age trt Veteran
2   2 female 35.5   1    TRUE
```
]

---

class: inverse, center, middle

# Getting the data into RStudio

---

# Load a data set

* Open the csv file directly from the internet:

```r
> mydata <- read.csv(url("http://bit.ly/berd_data_csv"))
```

* Or, download the file and open the saved file using the Import Dataset button in the Environment window: ![](img/01_Import_Dataset.png).
    + If you use this option, then copy the import code from the console and paste it into your script, so that you have a record of where and how you loaded the data set.

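
The import code worth keeping in a script is typically a single `read.csv()` call pointing at the saved file. The sketch below imitates that step under stated assumptions: instead of a real download it writes a tiny made-up CSV to a temporary file, and the names `csv_path` and `toy_data` are hypothetical.

```r
# Hypothetical stand-in for a downloaded file: write a tiny CSV to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,bmi", "1,27.5", "2,29.3"), csv_path)

# The kind of line to keep in your script so the import can be repeated
toy_data <- read.csv(csv_path)

# Quick checks that the file was read as expected
dim(toy_data)
str(toy_data)
```

With a real file, `csv_path` would simply be the location where the csv was saved.
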
```r > View(mydata) > # Can also view the data by clicking on its name in the Environment tab ``` <!-- ![](img/01_View_data_screenshot.png) --> <img src="img/01_View_data_screenshot2.png" width="110%" height="110%"> --- # About the data Data from the CDC's [Youth Risk Behavior Surveillance System (YRBSS) ](https://www.cdc.gov/healthyyouth/data/yrbs/index.htm) - complex survey data - national school-based survey conducted by CDC and state, territorial, tribal, and local surveys conducted by state, territorial, and local education and health agencies and tribal governments - monitors six categories of health-related behaviors that contribute to the leading causes of death and disability among youth and adults (including alcohol & drug use, unhealthy & dangerous behaviors, sexuality, physical activity); see [Questionnaires](https://www.cdc.gov/healthyyouth/data/yrbs/questionnaires.htm) - this data is a small subset (20 rows) of data in the R package [`yrbss`](https://github.com/hadley/yrbss) which includes YRBSS from 1991-2013 - we will use the full R data set in a future workshop teaching data cleaning <img src="img/01_yrbss.png" width="110%" height="110%"> --- # Data set summary ```r > summary(mydata) ``` ``` id age sex grade Min. : 335340 14 years old :1 Female:12 10th:8 1st Qu.: 925193 15 years old :4 Male : 8 11th:4 Median :1207132 16 years old :7 12th:4 Mean :1093150 17 years old :7 9th :4 3rd Qu.:1313188 18 years old or older:1 Max. :1316123 race4 bmi weight_kg All other races :5 Min. :17.48 Min. :43.09 Black or African American:3 1st Qu.:20.36 1st Qu.:57.27 Hispanic/Latino :6 Median :22.23 Median :64.86 White :4 Mean :23.01 Mean :64.09 NA's :2 3rd Qu.:26.58 3rd Qu.:70.31 Max. :29.35 Max. 
:84.82 text_while_driving_30d smoked_ever bullied_past_12mo 0 days : 5 No :10 Mode :logical 1 or 2 days : 2 Yes : 6 FALSE:11 3 to 5 days : 1 NA's: 4 TRUE :7 All 30 days : 1 NA's :2 I did not drive the past 30 days: 1 NA's :10 ``` --- # Data set info ```r > dim(mydata) ``` ``` [1] 20 10 ``` ```r > nrow(mydata) ``` ``` [1] 20 ``` ```r > ncol(mydata) ``` ``` [1] 10 ``` ```r > names(mydata) ``` ``` [1] "id" "age" [3] "sex" "grade" [5] "race4" "bmi" [7] "weight_kg" "text_while_driving_30d" [9] "smoked_ever" "bullied_past_12mo" ``` --- # Data structure * What are the different __variable types__ in this data set? ```r > str(mydata) # structure of data ``` ``` 'data.frame': 20 obs. of 10 variables: $ id : int 335340 638618 922382 923122 923963 925603 933724 935435 1096564 1108114 ... $ age : Factor w/ 5 levels "14 years old",..: 4 3 1 2 2 3 3 4 2 4 ... $ sex : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 2 1 ... $ grade : Factor w/ 4 levels "10th","11th",..: 1 4 4 4 1 1 1 3 1 4 ... $ race4 : Factor w/ 4 levels "All other races",..: 4 NA 4 4 2 1 1 1 1 2 ... $ bmi : num 27.6 29.3 18.2 21.4 19.6 ... $ weight_kg : num 66.2 84.8 57.6 60.3 63.5 ... $ text_while_driving_30d: Factor w/ 5 levels "0 days","1 or 2 days",..: NA NA NA NA NA NA NA NA NA NA ... $ smoked_ever : Factor w/ 2 levels "No","Yes": NA 2 2 2 1 1 2 1 NA 1 ... $ bullied_past_12mo : logi NA NA FALSE FALSE TRUE TRUE ... 
``` --- # View the beginning of a data set ```r > head(mydata) ``` ``` id age sex grade race4 bmi 1 335340 17 years old Female 10th White 27.5671 2 638618 16 years old Female 9th <NA> 29.3495 3 922382 14 years old Male 9th White 18.1827 4 923122 15 years old Male 9th White 21.3754 5 923963 15 years old Male 10th Black or African American 19.5988 6 925603 16 years old Male 10th All other races 22.1910 weight_kg text_while_driving_30d smoked_ever bullied_past_12mo 1 66.23 <NA> <NA> NA 2 84.82 <NA> Yes NA 3 57.61 <NA> Yes FALSE 4 60.33 <NA> Yes FALSE 5 63.50 <NA> No TRUE 6 70.31 <NA> No TRUE ``` ```r > head(mydata, 2) ``` ``` id age sex grade race4 bmi weight_kg 1 335340 17 years old Female 10th White 27.5671 66.23 2 638618 16 years old Female 9th <NA> 29.3495 84.82 text_while_driving_30d smoked_ever bullied_past_12mo 1 <NA> <NA> NA 2 <NA> Yes NA ``` --- # View the end of a data set ```r > tail(mydata) ``` ``` id age sex grade race4 15 1313153 16 years old Female 11th Hispanic/Latino 16 1313291 16 years old Female 11th White 17 1313477 16 years old Female 10th All other races 18 1315121 17 years old Female 11th <NA> 19 1315850 17 years old Female 12th Hispanic/Latino 20 1316123 18 years old or older Female 12th Black or African American bmi weight_kg text_while_driving_30d smoked_ever 15 26.5781 68.04 0 days No 16 24.8047 63.50 3 to 5 days No 17 25.0318 76.66 0 days No 18 22.2687 54.89 I did not drive the past 30 days Yes 19 19.4922 49.90 0 days <NA> 20 27.4894 74.84 All 30 days Yes bullied_past_12mo 15 TRUE 16 FALSE 17 TRUE 18 FALSE 19 FALSE 20 FALSE ``` --- class: inverse, center, middle # Working with the data --- # The $ Suppose we want to single out the column of BMI values. * How did we previously learn to do this? 
--

```r
> mydata[, 6]
```

```
 [1] 27.5671 29.3495 18.1827 21.3754 19.5988 22.1910 20.9913 17.4814
 [9] 22.4593 26.5781 21.1874 19.4637 20.6121 27.4648 26.5781 24.8047
[17] 25.0318 22.2687 19.4922 27.4894
```

The problem with this method is that we need to know the column number, which can change as we make changes to the data set.

--

* Use the `$` instead: `DatSetName$VariableName`

```r
> mydata$bmi
```

```
 [1] 27.5671 29.3495 18.1827 21.3754 19.5988 22.1910 20.9913 17.4814
 [9] 22.4593 26.5781 21.1874 19.4637 20.6121 27.4648 26.5781 24.8047
[17] 25.0318 22.2687 19.4922 27.4894
```

---

# Basic plots of numeric data (1/3)

## Histogram

```r
> hist(mydata$bmi)
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

With extra features:

```r
> hist(mydata$bmi, xlab = "BMI", main="BMI's of students")
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

---

# Basic plots of numeric data (2/3)

## Boxplot

.pull-left[
```r
> boxplot(mydata$bmi)
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />
]

--

.pull-right[
```r
> boxplot(mydata$bmi ~ mydata$sex,
+         horizontal = TRUE,
+         xlab = "BMI", ylab = "sex",
+         main = "BMI's of students by sex")
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />
]

---

# Basic plots of numeric data (3/3)

## Scatterplot

.pull-left[
```r
> plot(mydata$weight_kg, mydata$bmi)
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]

.pull-right[
```r
> plot(mydata$weight_kg, mydata$bmi,
+      xlab = "weight (kg)", ylab = "BMI",
+      main = "BMI vs. Weight")
```

<img src="01_getting_started_slides_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]

---

# Summary stats of numeric data (1/2)

## Standard R `summary` command

```r
> summary(mydata$bmi)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  17.48   20.36   22.23   23.01   26.58   29.35
```

## Mean and standard deviation

```r
> mean(mydata$bmi)
```

```
[1] 23.00838
```

```r
> sd(mydata$bmi)
```

```
[1] 3.56471
```

---

# Summary stats of numeric data (2/2)

<!-- QQ: Why is (2/2) being cut off? It's not cut off for (1/2).-->

## Min, max, & median

.pull-left[
```r
> min(mydata$bmi)
```

```
[1] 17.4814
```

```r
> max(mydata$bmi)
```

```
[1] 29.3495
```
]

.pull-right[
```r
> median(mydata$bmi)
```

```
[1] 22.22985
```
]

## Quantiles

```r
> quantile(mydata$bmi, probs=c(0, .25, .5, .75, 1))
```

```
      0%      25%      50%      75%     100%
17.48140 20.35878 22.22985 26.57810 29.34950
```

---

# Add height column to data frame

Since `\(\textrm{BMI} = \frac{kg}{m^2}\)`, we have `\(\textrm{height}(m) = \sqrt{\frac{\textrm{weight}(kg)}{\textrm{BMI}}}\)`

<!-- * UPDATE: need correct units!
--> ```r > mydata$height_m <- sqrt( mydata$weight_kg / mydata$bmi) > mydata$height_m ``` ``` [1] 1.550000 1.699999 1.779999 1.680001 1.799998 1.780000 1.469998 [8] 1.570002 1.879998 1.600001 1.779998 1.699999 1.730001 1.600001 [15] 1.600001 1.600000 1.750001 1.569998 1.599999 1.650001 ``` ```r > dim(mydata); names(mydata) ``` ``` [1] 20 11 ``` ``` [1] "id" "age" [3] "sex" "grade" [5] "race4" "bmi" [7] "weight_kg" "text_while_driving_30d" [9] "smoked_ever" "bullied_past_12mo" [11] "height_m" ``` --- # Access specific columns in data set .pull-left[ Previously we used `DatSetName[, column#]` ```r > mydata[, c(2, 6)] # 2nd & 6th columns ``` ``` age bmi 1 17 years old 27.5671 2 16 years old 29.3495 3 14 years old 18.1827 4 15 years old 21.3754 5 15 years old 19.5988 6 16 years old 22.1910 7 16 years old 20.9913 8 17 years old 17.4814 9 15 years old 22.4593 10 17 years old 26.5781 11 16 years old 21.1874 12 17 years old 19.4637 13 17 years old 20.6121 14 15 years old 27.4648 15 16 years old 26.5781 16 16 years old 24.8047 17 16 years old 25.0318 18 17 years old 22.2687 19 17 years old 19.4922 20 18 years old or older 27.4894 ``` ] .pull-right[ The code below uses _column names_ instead of numbers. ```r > mydata[, c("age", "bmi")] ``` ``` age bmi 1 17 years old 27.5671 2 16 years old 29.3495 3 14 years old 18.1827 4 15 years old 21.3754 5 15 years old 19.5988 6 16 years old 22.1910 7 16 years old 20.9913 8 17 years old 17.4814 9 15 years old 22.4593 10 17 years old 26.5781 11 16 years old 21.1874 12 17 years old 19.4637 13 17 years old 20.6121 14 15 years old 27.4648 15 16 years old 26.5781 16 16 years old 24.8047 17 16 years old 25.0318 18 17 years old 22.2687 19 17 years old 19.4922 20 18 years old or older 27.4894 ``` ] <!-- This is the same as `mydata$bmi`. --> --- # Access specific rows in data set <!-- Below is code that uses the column names instead of row and column numbers. 
-->

* Rows for 14 year olds only

```r
> mydata[mydata$age == "14 years old",]
```

```
      id          age  sex grade race4     bmi weight_kg
3 922382 14 years old Male   9th White 18.1827     57.61
  text_while_driving_30d smoked_ever bullied_past_12mo height_m
3                   <NA>         Yes             FALSE 1.779999
```

In this case the output is only one row since there is only one 14 year old.

* Rows for teens with BMI less than 19

```r
> mydata[mydata$bmi < 19,]
```

```
      id          age    sex grade           race4     bmi weight_kg
3 922382 14 years old   Male   9th           White 18.1827     57.61
8 935435 17 years old Female  12th All other races 17.4814     43.09
  text_while_driving_30d smoked_ever bullied_past_12mo height_m
3                   <NA>         Yes             FALSE 1.779999
8                   <NA>          No             FALSE 1.570002
```

---

# Access specific values in data set

* Grade and race for 15 year olds only

```r
> mydata[mydata$age == "15 years old", c("age", "grade", "race4")]
```

```
            age grade                     race4
4  15 years old   9th                     White
5  15 years old  10th Black or African American
9  15 years old  10th           All other races
14 15 years old  10th           Hispanic/Latino
```

* Age, sex, and BMI for students with BMI less than 19

```r
> mydata[mydata$bmi < 19, c("age", "sex", "bmi")]
```

```
           age    sex     bmi
3 14 years old   Male 18.1827
8 17 years old Female 17.4814
```

---

# Practice

1. Create data frames for males and females separately.
2. Do males and females have similar BMIs? Weights? Compare means, standard deviations, ranges, and boxplots.
3. Plot BMI vs. weight for each gender separately. Do they have similar relationships?
4. Are males or females more likely to have been bullied in the past 12 months? Calculate the percentage bullied for each gender.
5. Are students who were bullied in the past year more likely to have smoked in the past? Does this vary by gender?

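
Row filters such as `mydata[mydata$bmi < 19,]` behave differently when the column being tested contains missing values, as several columns in this data set do. The sketch below shows the effect on a toy data frame; the data frame `toy` and its column names are hypothetical, and `which()` is one possible way to drop the missing cases, not the only one.

```r
# Toy data frame; the names and values are made up for this illustration
toy <- data.frame(id = 1:4,
                  bullied = c(TRUE, NA, FALSE, TRUE))

# When the condition is NA, the result contains a row filled with NAs
with_na <- toy[toy$bullied == TRUE, ]
nrow(with_na)

# which() keeps only the positions where the condition is TRUE
no_na <- toy[which(toy$bullied == TRUE), ]
nrow(no_na)

# The mean of a logical vector is the proportion of TRUE values;
# na.rm = TRUE leaves the missing values out of the calculation
mean(toy$bullied, na.rm = TRUE)
```

Whether missing values should be dropped or reported separately depends on the question being asked, so it is worth checking `summary()` for `NA's` before computing percentages.
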

---

# Save data frame

* Save __.RData__ file: the standard R format, which is recommended if saving data for future use in R

```r
> save(mydata, file = "mydata.RData")
```

You can load .RData files using the `load()` command:

```r
> load("mydata.RData")
```

<br>

* Save __csv__ file: comma-separated values

```r
> # write.csv() always writes column names; it ignores col.names with a warning
> write.csv(mydata, file = "mydata.csv", row.names = FALSE)
```

---

class: inverse, center, middle

# The more you know

---

# Installing and using packages

(Packages are to R/RStudio what apps are to your phone/OS)

## CRAN = package mothership

[Comprehensive R Archive Network](https://cran.r-project.org/)

You can also use the "Packages" tab in the Files/Plots/Packages/Help/Viewer window

```r
> # Install a package from CRAN (main package repository)
> install.packages("tidyverse")   # only do this ONCE
> # Load the package
> library(tidyverse)
```

## Other places (e.g. GitHub) = wild west

```r
> install.packages("devtools")   # only do this ONCE
> library(devtools)
> # Install a package from GitHub (often in development, no testing)
> # https://github.com/hadley/yrbss
> install_github("hadley/yrbss")
> library(yrbss)
```

---

# How to get help (1/2)

Use `?` in front of a function name in the console. Try this:

![](img/01_help_screenshot.png)

---

# How to get help (2/2)

- Use `??` (e.g. `??dplyr` or `??read_csv`) for searching all documentation in installed packages (including unloaded packages)
- search [Stack Overflow #r tag](https://stackoverflow.com/questions/tagged/r)
- google your question + rcran or + r (e.g. "make a boxplot rcran" "make a boxplot r")
- google the error in quotes (e.g. "Evaluation error: invalid type (closure) for variable '***'")
- search [GitHub](https://github.com/search/advanced?q=language:R) for your function name (to see examples) or error
- [RStudio community](https://community.rstudio.com/)
- [twitter #rstats](https://twitter.com/search?q=%23rstats&src=typd)

---

# Resources

- [RStudio IDE Cheatsheet](https://resources.rstudio.com/rstudio-cheatsheets/rstudio-ide-cheat-sheet)
- Install R/RStudio [help video](https://www.youtube.com/watch?v=kOQDdJZ7Hl4&feature=youtu.be)
- [Basic Basics](http://rladiessydney.org/post/2018/11/05/basicbasics/) Interactive lessons
- [DataCamp](https://www.datacamp.com)
    + [Introduction to R (free course)](https://www.datacamp.com/courses/free-introduction-to-r)
    + [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse)
    + [Intermediate R](https://www.datacamp.com/courses/intermediate-r)

Some of this is drawn from materials in online books/lessons:

- [Intro to R/RStudio](http://www-users.york.ac.uk/~er13/17C%20-%202018/pracs/01IntroductionToModuleAndRStudio.html) by Emma Rand
- [Modern Dive](https://moderndive.com/): An Introduction to Statistical and Data Sciences via R by Chester Ismay & Albert Kim
- [Cookbook for R](http://www.cookbook-r.com/) by Winston Chang

---

# Local resources

- OHSU's [BioData club](https://biodata-club.github.io/) + active slack channel
- Portland's [R user meetup group](https://www.meetup.com/portland-r-user-group/) + active slack channel
- [R-ladies PDX](https://www.meetup.com/R-Ladies-PDX/) meetup group
- in June in Portland, the [WNAR Annual meeting](http://www.wnar.org/event-3013994) (biostats conference) will have R related workshops
- in June in Redmond, the [Cascadia R conference](https://www.eventbank.com/event/cascadia-r-conference-2019-11944/) will have presentations

---

# Possible Future Workshop Topics?

- data wrangling with the tidyverse
- reproducible reports in R
- tables
- ggplot2 visualization
- advanced tidyverse: functions, purrr
- statistical modeling in R

## Contact info:

Jessica Minnier: _minnier@ohsu.edu_

Meike Niederhausen: _niederha@ohsu.edu_

## This workshop info:

- Code for these slides on github: [jminnier/berd_r_courses](https://github.com/jminnier/berd_r_courses)
- all the [R code in an R script](https://jminnier-berd-r-courses.netlify.com/01-getting-started/01_getting_started_slides.R)
- answers to practice problems can be found here: [html](https://jminnier-berd-r-courses.netlify.com/01-getting-started/01_getting_started_Practice_Answers.html), [pdf](https://jminnier-berd-r-courses.netlify.com/01-getting-started/01_getting_started_Practice_Answers.pdf)