class: left, middle, inverse, title-slide # An Introduction to R and RStudio for Exploratory Data Analysis ### Jessica Minnier, PhD & Meike Niederhausen, PhD
OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop
###
Part 1: 2020/09/16 & Part 2: 2020/09/17
slides:
bit.ly/berd_intro_part1
pdf:
bit.ly/berd_intro_part1_pdf
---
layout: true

<!-- <div class="my-footer"><span>bit.ly/berd_tidy</span></div> -->

---

## An Introduction to R and RStudio for Exploratory Data Analysis (Part 1)

Instructors: Meike Niederhausen, PhD & Jessica Minnier, PhD<br>
[OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop](https://www.ohsu.edu/octri/octri-research-forum-your-monthly-clinical-and-translational-research-event)

### *Do this now:*

1. **Open html slides**: [bit.ly/berd_intro_part1](http://bit.ly/berd_intro_part1)
    + You will be able to copy and paste code/links from here
1. Make sure you have already **installed R & RStudio**
    + instructions here [bit.ly/berd_install](http://bit.ly/berd_install)
    + If you need help, let us or a helper know
1. **Open the Google doc** for asking questions: [bit.ly/berd_doc](https://bit.ly/berd_doc)
    + Helpers will be monitoring this; you can ask questions and paste code or screenshots.

---

### Zoom rules (note: we are recording):

1. **[Change your name in Zoom](https://teaching.nmc.edu/knowledgebase/changing-your-name-in-a-zoom-meeting/)** to a made-up name/animal/word if you *do not want your name in the recording*
    + Show the participants list, then click Rename next to your name
1. **Turn off your video** to save bandwidth, and for recording privacy. If you prefer to have video on during breakout rooms, go ahead!
1. Asking questions: **No private messages to instructors**, we won’t see them. **Chat message everyone or “Helpers” for help or to go to a breakout room**. You may also unmute yourself during lecture.
1. **Breakout rooms** are for getting help with R or with exercises in smaller groups.
    + The # of your breakout room corresponds to “your” helper. During breaks and exercises, helpers will be in breakout rooms.
    + You won’t be able to see what is going on in the main room while you are in your breakout room.
    + You can stay in the main room during exercises if you prefer, and can ask the presenters questions there during that time.
---

# Learning Objectives

.pull-left[
- Basic operations in R/RStudio
- Understand data structures
- Be able to load in data
- Basic operations on data
]

.pull-right[
- Some data wrangling
- Use RStudio projects
- Be able to make a plot
- Basics of tidyverse and ggplot
- Know how to get help
]

<center><img src="img/horst_monster_support.jpg" width="70%" height="75%"><a href="https://github.com/allisonhorst/stats-illustrations"><br>Allison Horst</a></center>

---
class: center, inverse, middle

# Introduction Rrrrrr?

---

# What is R?

.pull-left-60[
- A programming language
- Focus on statistical modeling and data analysis
    + import data, manipulate data, run statistics, make plots
- Useful for "Data Science"
- Great visualizations
- Also useful for most anything else you'd want to tell a computer to do
- Interfaces with other languages, e.g. Python, C++, bash
]

.pull-right-40[
![](img/R_logo.png)
]

For the history and details: [Wikipedia](https://bit.ly/1efFmaY)

- an interpreted language (run it through a command line)
- procedural programming with functions
- Why "R"?? Scheme (?) inspired S (invented at Bell Labs in 1976) which inspired R (**free and open source!** in 1992)

---

# Why R?

.pull-left[
- Free + Cross-platform (Mac/Windows)
- Flexible, fun, many more modern statistics methods, large community for learning and help
- One of the most popular data science tools for statistics in academia and industry
- SAS and STATA (and SPSS) are still used but becoming less popular (expensive, not as versatile/comprehensive)
- Constantly evolving and improving
- If you want a job doing stats and do not want to be limited to specific research groups or some pharma companies, you absolutely *need to know R*
]

.pull-right[
<center><img src="img/r4stats_popularity_articles_trend.png" width="100%" height="100%"><a href="http://r4stats.com/articles/popularity/"><br>r4stats Robert A. Muenchen</a></center>
]

---

# What is RStudio?
.pull-left[ R is a programming language] .pull-right[ RStudio is an integrated development environment (IDE) = an interface to use R (with perks!) ] <center><img src="img/01_md_rstudio.png" width="78%" height="78%"><a href="https://moderndive.com/1-getting-started.html#r-rstudio"><br>Modern Dive</a></center> --- # Start RStudio <center><img src="img/01_md_r.png" width="78%" height="78%"><a href="https://moderndive.com/1-getting-started.html#using-r-via-rstudio"><br>Modern Dive</a></center> --- <center><img src="img/RStudio_Anatomy.svg" width="100%" height="100%"><a href="http://www-users.york.ac.uk/~er13/17C%20-%202018/pracs/01IntroductionToModuleAndRStudio.html#what_are_r_and_rstudio"><br>Emma Rand</a></center> --- # RStudio demo - Start RStudio and explore **Bonus lessons** - [gifs showing how to adjust panels, personalize how Rstudio looks, etc](https://www.pipinghotdata.com/posts/2020-09-07-introducing-the-rstudio-ide-and-r-markdown/#background) --- # Installing and using packages .pull-left[<center><img src="img/r_vs_r_packages.png" width="55%" height="30%"><a href="https://moderndive.com/1-getting-started.html#r-rstudio"><br>Modern Dive</a></center> - Packages contain additional functions and data - Install packages with `install.packages()` + Or use "Packages" tab in Files/Plots/Packages/Help/Viewer window + *Only install once (unless you want to update)* + Installs from [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) = package mothership ] .pull-right[ - **"Install the app"** = **Install** package once ```r # only do this ONCE, use quotes install.packages("dplyr") ``` - **"Open the app"** = **Load** package to use: At the top of your script or Rmd include **`library()`** commands to load each required package *every* time you open Rstudio or knit your Rmd. ```r # keep in Rmd # run every time you open Rstudio library(dplyr) ``` ] --- class: inverse, middle, center # Let's code! 
R Basics

![](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/exploder.gif)
<center><a href="https://github.com/allisonhorst/stats-illustrations"><br>Allison Horst</a></center>

---

# Coding in the console

.pull-left[
When you first open R, the console should be empty.

<img src="img/01_console_empty.png" width="90%" height="100%">
]

.pull-right[
__Typing and executing code in the console__

* Type code in the console (blue text)
* Press __return__ to execute the code
* Output is shown below in black

<img src="img/01_console_commands2.png" width="90%" height="10%">
]

---

# Math calculations using R

.pull-left[
```r
> 10^2
```

```
[1] 100
```

```r
> 3 ^ 7
```

```
[1] 2187
```

```r
> 6/9
```

```
[1] 0.6666667
```

```r
> 9-43
```

```
[1] -34
```
]

--

.pull-right[
* Rules for order of operations are followed
* Spaces between numbers and operators are ignored

```r
> 4^3-2* 7+9 /2
```

```
[1] 54.5
```

The expression above is computed as

`$$4^3 - (2 \cdot 7) + \frac{9}{2}$$`
]

---

# Variables

Variables are used to store data, figures, model output, etc.
.pull-left[ * Can assign a variable using either `=` or `<-` - Using `<-` is preferable - type name of variable to print Assign just one value: ```r > x = 5 > x ``` ``` [1] 5 ``` ```r > x <- 5 > x ``` ``` [1] 5 ``` ] -- .pull-right[ Assign a __vector__ of values: * Consecutive integers using `:` ```r > a <- 3:10 > a ``` ``` [1] 3 4 5 6 7 8 9 10 ``` * __Concatenate__ a string of numbers ```r > b <- c(5, 12, 2, 100, 8) > b ``` ``` [1] 5 12 2 100 8 ``` ] --- # We can do math with variables .pull-left[ Math using variables with just one value ```r > x <- 5 > x ``` ``` [1] 5 ``` ```r > x + 3 ``` ``` [1] 8 ``` ```r > y <- x^2 > y ``` ``` [1] 25 ``` ] -- .pull-right[ Math on vectors of values: __element-wise__ computation ```r > a <- 3:6 > a ``` ``` [1] 3 4 5 6 ``` ```r > a+2; a*3 ``` ``` [1] 5 6 7 8 ``` ``` [1] 9 12 15 18 ``` ```r > a*a ``` ``` [1] 9 16 25 36 ``` ] --- # Variables can include text (characters) ```r > hi <- "hello" > hi ``` ``` [1] "hello" ``` ```r > greetings <- c("Guten Tag", "Hola", hi) > greetings ``` ``` [1] "Guten Tag" "Hola" "hello" ``` --- # Using functions * `mean()` is an example of a function * functions have "arguments" that are specified within the `()` * `?mean` in console will show help for `mean()` .pull-left[ Arguments specified by name: ```r > mean(x = 1:4) ``` ``` [1] 2.5 ``` ```r > seq(from = 1, to = 12, by = 3) ``` ``` [1] 1 4 7 10 ``` ```r > seq(by = 3, to = 12, from = 1) ``` ``` [1] 1 4 7 10 ``` ] .pull-right[ Arguments not specified, but listed in order: ```r > mean(1:4) ``` ``` [1] 2.5 ``` ```r > seq(1,12,3) ``` ``` [1] 1 4 7 10 ``` ] --- # Common console errors (1/2) __Incomplete commands__ .pull-left[ * When the console is waiting for a new command, the prompt line begins with `>` + If the console prompt is `+`, then a previous command is incomplete + You can finish typing the command in the console window ] .pull-right[ Example: ```r > 3 + (2*6 + ) ``` ``` [1] 15 ``` ] --- # Common console errors (2/2) __Object is not found__ * 
This happens when text is entered for a non-existent variable (object) Example: ```r > hello ``` ``` Error in eval(expr, envir, enclos): object 'hello' not found ``` * Can be due to missing quotes ```r > install.packages(dplyr) # need install.packages("dplyr") ``` ``` Error in install.packages(dplyr): object 'dplyr' not found ``` --- class: inverse, center, middle # Saving your code with R Markdown (Rmd) ## or, creating reproducible reports <center><img src="img/horst_rmarkdown_wizards.png" width="60%" height="100%"><a href="https://github.com/allisonhorst/stats-illustrations"><br>Allison Horst</a></center> --- # Create an R Markdown file (`.Rmd`) <!-- * Note that both of these options show the keyboard shortcut for your operating system --> Two options: 1. click on File `\(\rightarrow\)` New File `\(\rightarrow\)` R Markdown `\(\rightarrow\)` OK , or 1. in upper left corner of RStudio click on <img src="img/green_plus_create_file.png"> `\(\rightarrow\)` <img src="img/select_RMarkdown_option.png"> .pull-left[ Pop-up window: * Enter a title and your name * Keep default HTML output format * Then click OK <!-- ![](img/01_rmd_screenshot_popup.png) --> <center><img src="img/01_rmd_screenshot_popup.png" width="60%" height="100%"></center> ] .pull-right[ * You should then see the following text in your editor window: ![](img/01_rmd_screenshot.png) <!-- img src="img/01_rmd_screenshot.png" width="60%" height="100%"> --> ] --- # Save the Markdown file (`.Rmd`) * __Save the file__ by + selecting `File -> Save`, + or clicking on ![](img/01_Script_Save.png) (towards the left above the scripting window), + or keyboard shortcut * PC: _Ctrl + s_ * Mac: _Command + s_ * You will need to specify + a __filename__ to save the file as - ALWAYS use __.Rmd__ as the filename extension for R markdown files + the __folder__ to save the file in --- # Compare the .Rmd file with its html output .pull-left[ .Rmd file <img src="img/default_rmd_html.png" width="84%" height="10%"> ] .pull-right[ 
html output

<img src="img/default_html.png" width="64%" height="40%">
]

---

# Compare the .Rmd file with its html output

<center><img src="img/screenshot_default_rmd2html_markedup.png" width="87%" height="70%"></center>

---

# How to create the html file? _Knit_ the .Rmd file!

<!-- *Before knitting the .Rmd file, you must first **save it**. * -->

To **knit** the .Rmd file, either

1. click on the knit icon <img src="img/knit_icon.png"> at the top of the editor window
1. or use keyboard shortcuts
    * Mac: *Command+Shift+K*
    * PC: *Ctrl+Shift+K*

* A new window will open with the html output.
* You will now see both .Rmd and .html files in the folder where you saved the .Rmd file.

__Note:__

* The template .Rmd file that RStudio creates will knit to an html file by default

---

# 3 types of R Markdown content

1. <span style="color:darkorange">__Code chunks__</span>: type R code and execute it to see code output
2. __Text__: write about your analyses
3. __YAML metadata__: customize the report

* This workshop will focus on using <span style="color:darkorange">code chunks</span>.
* Watch the [Reproducible Reports with R Markdown](https://github.com/jminnier/berd_r_courses) workshop for customization options and different output formats (Word, pdf, slides).
    + Slides at https://jminnier-berd-r-courses.netlify.com/03-rmarkdown/03_rmarkdown_slides.html.

---

# Create a code chunk

Code chunks can be created by either

1. Clicking on ![](img/icon_insert.png) `\(\rightarrow\)` ![](img/icon_insert_Rchunk.png) at top right of the editor window, or
1. __Keyboard shortcut__
    * Mac: _Command + Option + I_
    * PC: _Ctrl + Alt + I_

* An empty code chunk looks like this:

<center><img src="img/01_rmd_chunk_empty.png" width="40%" height="40%"></center>
<!-- ![](img/01_rmd_chunk_empty.png) -->

* Note that a code chunk starts with ` ```{r} ` and ends with ` ``` `.
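---

# Example: what might go inside a code chunk

The lines between the chunk delimiters are ordinary R code. Here is a minimal sketch that reuses the vector `b` from the Variables slide; the name `b_mean` is made up for this illustration.

```r
# Contents of a hypothetical code chunk (type these lines between the chunk delimiters)
b <- c(5, 12, 2, 100, 8)   # vector of values from the Variables slide
b_mean <- mean(b)          # store the average in a new variable
b_mean                     # typing the name prints the stored value
```

Running the chunk should print the stored average below it. With the default template settings, knitting the file shows both the code and its output in the html report.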
---

# Enter and run code (1/2)

.pull-left[
* __Type R code__ inside code chunks
* __Select code__ you want to run, by
    - placing the cursor in the line of code you want to run,
    - __*or*__ highlighting the code you want to run
* __Run selected code__ by
    - clicking on the ![](img/01_Script_Run.png) button in the top right corner of the scripting window and choosing "Run Selected Line(s)",
    - or typing one of the following key combinations:
        + __Windows__: __ctrl + return__
        + __Mac__: __command + return__
]

.pull-right[
<center><img src="img/01_rmd_coding1b.png" width="100%" height="100%"></center>
<center><img src="img/01_rmd_coding1c.png" width="40%" height="40%"></center>
]

---

# Enter and run code (2/2)

.pull-left-40[
* __Run all code__ in a chunk by
    - clicking the play button in the top right corner of the chunk
* The code output appears below the code chunk
]

.pull-right-60[
<center><img src="img/01_rmd_coding2.png" width="100%" height="100%"></center>
]

---

# Useful keyboard shortcuts

.pull-left-60[
action | mac | windows/linux
---| ---| ---
Run code in Rmd or script | cmd + enter | ctrl + enter
`<-`| option + - | alt + -
]

.pull-right-40[
Try typing in Rmd (with shortcut) and running

```r
y <- 5
y
```
]

## Others: ([see full list](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts))

action | mac | windows/linux
---| ---| ---
interrupt currently executing command | esc | esc
in console, go to previously run code | up/down | up/down
keyboard shortcut help | option + shift + k | alt + shift + k

---
class: center, inverse, middle

# Practice time!

---

# Practice 1 (pg. 1)

1. Create a new Rmd file for the code and answers to the tasks below.
1. Remove the template text starting with line 12 (keep the YAML header and setup code chunk), and save the file as `Practice1.Rmd`
1. Create a new code chunk.
1. Create a vector of all integers from 4 to 10, and save it as `a1`.
1. What does the command `sum(a1)` do?
1. 
What does the command `length(a1)` do?
1. Use the `sum` and `length` commands to calculate the average of the values in `a1`.
1. Knit the Rmd file.

---

# Practice 1 (pg. 2)

* Run the code below to install the `tidyverse` and `janitor` packages in R, which we will be using in upcoming slides.
    + If you get a message about restarting R, click Yes.
    + If you get an error message (warnings are ok), ask a helper.

```r
install.packages("tidyverse")
install.packages("janitor")
```

* After running the code, comment it out by putting `#` in front of the commands so that they do not run when knitting the file.
    + *We only need to install packages once* and thus do not need to run this code again.

Check that it worked by running this code with no errors:

```r
library(tidyverse)
library(janitor)
```

* __Take a break!__

---
class: inverse, middle, center

# Intro to Data

---

# How is data stored, how do we use it?

- Often, data is in an Excel sheet or a plain text file (.csv, .txt)
- .csv files open in Excel automatically, but are actually plain text
- Usually, columns are variables/measures and rows are observations (e.g. one person's measurements)

## Our example data:

[**Download data csv file** link](http://bit.ly/penguin_data) and pay attention to *where* it downloads on your computer

- Make sure it is a .csv file and not a "web archive" or something else.

**Open the data file `penguins.csv` and look at it**

- What are the columns? What are the rows?

---

# About the penguins data

- A data set about penguins at Palmer Station, Antarctica! More info at [github.com/allisonhorst/palmerpenguins](https://github.com/allisonhorst/palmerpenguins)
- Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
- Each row holds the measurements for one penguin
- Some values were artificially set to missing (`NA`) for practice in this workshop.
<center><img src="img/penguins_pic.png" width="50%" height="50%"><img src="img/penguins_bill.png" width="50%" height="50%"></center> --- # Workflow - Keep it together! **Steps for a new data analysis project or homework:** 1. Create a folder to contain all your files. 2. Move data file (`penguins.csv`) into this folder. 3. Create an RStudio project inside this folder. (next slides) 4. Create a new Rmd for your analyses/homework. ## Do steps 1 & 2 now! <center><img src="img/workflow_folder.png" width="60%" height="50%"></center> --- # R Projects (.Rproj file) & Good Practices __Use projects to keep everything together__ ([read this](https://r4ds.had.co.nz/workflow-projects.html)) - A project keeps track of your coding environment and file structure. - Create an RStudio project for each data analysis project, for each homework assignment, etc. - A project is associated with a directory folder + Keep data files there + Keep code scripts there; edit them, run them in bits or as a whole + Save your outputs (plots and cleaned data) there - Only use relative paths, never absolute paths + relative (good): `read.csv("data/mydata.csv")` + absolute (bad): `read.csv("/home/yourname/Documents/stuff/mydata.csv")` __Advantages of using projects__ - standardizes file paths - keep everything together - a whole folder can be easily shared and run on another computer - when you open the project everything is as you left it --- # Create a new R project Let's go through it together. 
([Read this for more](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects))

.pull-left-60[
- Click ![](img/rproj_new.png) in top left *or* File -> New Project
- Click *Existing Directory*
- Browse to your folder with the data
- *Optional:* check the "Open in new session" checkbox
- Click "Create project"
]

.pull-right-40[
<center><img src="img/rproj_existing.png" width="80%" height="50%"></center>
<center><img src="img/rproj_browse.png" width="80%" height="50%"></center>
]

**Bonus lessons**

- [Video on projects in R, most useful info in minutes 2:00-13:00](https://rstudio.com/resources/webinars/managing-part-1-projects-in-rstudio/)

---

# The data file will be in your Files pane:

and your workspace folder location will be shown at the top (e.g. `Home/Desktop/workshop_practice`)

<center><img src="img/rproj_files.png" width="87%" height="70%"></center>

---

# Data in R/RStudio

**Open `penguins.csv` in RStudio and look at it**

- Click on `penguins.csv` in the Files pane, then click *View File*

<center><img src="img/01_filestruc.png" width="87%" height="70%"></center>

**We will show you how to store and use this data in R as a data frame**

Currently it is still just a file in your folder.

---

## Now What? Coding!

Recall the workflow:

**Steps for a new data analysis project or homework:**

1. Create a folder to contain all your files.
2. Move data file (`penguins.csv`) into this folder.
3. Create an RStudio project inside this folder.
4. **Create a new Rmd for your analyses/homework.**

<center><img src="img/new_rmd_gif.gif" width="75%" height="70%"></center>

---

# To run and save your code: Create a new Rmd!

- Then save it with a meaningful filename.
- You will be prompted to save it in your current working folder.
<center><img src="img/rproj_newrmd.png" width="87%" height="70%"></center> --- # Load the packages we need in the Rmd Add this code to the setup chunk in the Rmd and run that chunk: .pull-left[ ```r library(tidyverse) library(janitor) ``` ] .pull-right[ <center><img src="img/rmd_library.png" width="87%" height="70%"></center> ] Now we can use functions in these packages, such as `read_csv()` and `%>%` and `mutate()` and `tabyl()` ## Remove everything in the Rmd below this code - Loading library code should always be at the top of your Rmd so you can use these packages in code "lower down" --- # Load the data set into R * Create a new code chunk (Code -> insert chunk) * Read in csv file from file path with code (filepath relative to Rproj directory) * Copy this code to that code chunk and run it. ```r penguins <- read_csv("penguins.csv") ``` * Or, open saved file using Import Dataset button in Environment window: ![](img/01_Import_Dataset.png) + From Text(readr). + If you use this option, **then copy and paste the importing code to your Rmd** so you have a record of from where and how you loaded the data set. ```r View(penguins) # Run in console # Can also view the data by clicking on its name in the Environment tab ``` <!-- ![](img/01_View_data_screenshot.png) --> <img src="img/view_penguins.png" width="110%" height="110%"> --- # Your Rmd should look something like this: Try knitting it! <img src="img/rmd_penguins.png" width="60%" height="80%"> --- # Load a data set: bonus lessons - [Importing Data, Rstudio support topic](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio) ![](https://support.rstudio.com/hc/article_attachments/360017333414/data-import-rstudio-overview.gif) --- class: inverse, middle, center # Object types --- # Data frames (aka "tibbles" in tidyverse) .pull-left-60[ __Vectors__ vs. 
__data frames__: a data frame is a collection (or array or table) of vectors ```r penguins ``` ``` ## # A tibble: 342 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 1689 Adelie Torge… 39.1 18.7 181 ## 2 4274 Adelie Torge… NA 17.4 186 ## 3 4539 Adelie Torge… 40.3 18 195 ## 4 2435 Adelie Torge… 36.7 19.3 193 ## 5 2326 Adelie Torge… 39.3 20.6 190 ## 6 2637 Adelie Torge… 38.9 17.8 181 ## 7 4443 Adelie Torge… NA 19.6 195 ## 8 2102 Adelie Torge… 34.1 18.1 193 ## 9 2975 Adelie Torge… 42 20.2 190 ## 10 3966 Adelie Torge… 37.8 17.1 186 ## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>, ## # year <dbl> ``` ] .pull-right-40[ * Different columns can be of different data types (i.e. numeric vs. text) * Both numeric and text can be stored within a column (stored together as *text*). * Vectors and data frames are examples of _**objects**_ in R. + There are other types of R objects to store data, such as matrices, lists. ] --- # Variable (column) types type | description ---|--- **double/numeric** | **numbers that are decimals** **character** | **text, "strings"** integer | integer-valued numbers factor | categorical variables stored with levels (groups) logical | boolean (TRUE, FALSE) - We will focus on double & character, as most data will be of this type when using `read_csv()` to read in your data sets - If you see `int` = integer as a column type, you can treat it as a double for most intents and purposes. <!-- Each variable (column) in a data frame can be of a different type. --> <!-- * Note that the ID column is _integer_ type since the values are all whole numbers, although we likely would think of it as being a categorical variable and thus prefer it to be a factor. --> --- # Data structure * What are the different __variable types__ in this data set? * What is `NA`? 
```r glimpse(penguins) # structure of data ``` ``` ## Rows: 342 ## Columns: 9 ## $ id <dbl> 1689, 4274, 4539, 2435, 2326, 2637, 4443, 2102, 297… ## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "… ## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen",… ## $ bill_length_mm <dbl> 39.1, NA, 40.3, 36.7, 39.3, 38.9, NA, 34.1, 42.0, 3… ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.… ## $ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 1… ## $ body_mass_g <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 425… ## $ sex <chr> "male", "female", "female", "female", "male", "fema… ## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200… ``` --- # Data set summary ```r summary(penguins) ``` ``` ## id species island bill_length_mm ## Min. :1001 Length:342 Length:342 Min. :32.10 ## 1st Qu.:2031 Class :character Class :character 1st Qu.:39.45 ## Median :2984 Mode :character Mode :character Median :44.70 ## Mean :3031 Mean :44.00 ## 3rd Qu.:4073 3rd Qu.:48.52 ## Max. :4969 Max. :59.60 ## NA's :6 ## bill_depth_mm flipper_length_mm body_mass_g sex ## Min. :13.10 Min. :172.0 Min. :2700 Length:342 ## 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550 Class :character ## Median :17.30 Median :197.0 Median :4050 Mode :character ## Mean :17.15 Mean :200.9 Mean :4202 ## 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750 ## Max. :21.50 Max. :231.0 Max. :6300 ## ## year ## Min. :2007 ## 1st Qu.:2007 ## Median :2008 ## Mean :2008 ## 3rd Qu.:2009 ## Max. :2009 ## ``` --- # Show (print) whole data frame Tibble truncates the output to ten rows, so you can't actually see it all. 
```r penguins ``` ``` ## # A tibble: 342 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 1689 Adelie Torge… 39.1 18.7 181 ## 2 4274 Adelie Torge… NA 17.4 186 ## 3 4539 Adelie Torge… 40.3 18 195 ## 4 2435 Adelie Torge… 36.7 19.3 193 ## 5 2326 Adelie Torge… 39.3 20.6 190 ## 6 2637 Adelie Torge… 38.9 17.8 181 ## 7 4443 Adelie Torge… NA 19.6 195 ## 8 2102 Adelie Torge… 34.1 18.1 193 ## 9 2975 Adelie Torge… 42 20.2 190 ## 10 3966 Adelie Torge… 37.8 17.1 186 ## # … with 332 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>, ## # year <dbl> ``` --- # View whole data frame We showed this already, very handy to see *all* data. Run in console since it's more interactive. ```r View(penguins) ``` *or* click on window pane next to data frame name in Environment tab. <img src="img/view_penguins_env.png" width="60%" height="80%"> --- # Data set info .pull-left-40[ ```r dim(penguins) ``` ``` ## [1] 342 9 ``` ```r nrow(penguins) ``` ``` ## [1] 342 ``` ```r ncol(penguins) ``` ``` ## [1] 9 ``` ] .pull-right-60[ ```r names(penguins) ``` ``` ## [1] "id" "species" "island" ## [4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" ## [7] "body_mass_g" "sex" "year" ``` ] --- # View the beginning of a data set ```r head(penguins) ``` ``` ## # A tibble: 6 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1689 Adelie Torge… 39.1 18.7 181 3750 ## 2 4274 Adelie Torge… NA 17.4 186 3800 ## 3 4539 Adelie Torge… 40.3 18 195 3250 ## 4 2435 Adelie Torge… 36.7 19.3 193 3450 ## 5 2326 Adelie Torge… 39.3 20.6 190 3650 ## 6 2637 Adelie Torge… 38.9 17.8 181 3625 ## # … with 2 more variables: sex <chr>, year <dbl> ``` --- # View the end of a data set ```r tail(penguins) ``` ``` ## # A tibble: 6 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1947 Chinst… Dream 
45.7 17 195 3650 ## 2 4452 Chinst… Dream 55.8 19.8 207 4000 ## 3 2420 Chinst… Dream 43.5 18.1 202 3400 ## 4 4861 Chinst… Dream 49.6 18.2 193 3775 ## 5 4865 Chinst… Dream 50.8 19 210 4100 ## 6 4162 Chinst… Dream 50.2 18.7 198 3775 ## # … with 2 more variables: sex <chr>, year <dbl> ``` --- # Specify how many rows to view at beginning or end of a data set ```r head(penguins, 3) ``` ``` ## # A tibble: 3 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1689 Adelie Torge… 39.1 18.7 181 3750 ## 2 4274 Adelie Torge… NA 17.4 186 3800 ## 3 4539 Adelie Torge… 40.3 18 195 3250 ## # … with 2 more variables: sex <chr>, year <dbl> ``` ```r tail(penguins, 1) ``` ``` ## # A tibble: 1 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 4162 Chinst… Dream 50.2 18.7 198 3775 ## # … with 2 more variables: sex <chr>, year <dbl> ``` --- ## Data frame cells, rows, or columns (rarely used) .pull-left-60[ Specific cell: `DatSetName[row#, column#]` ```r # Second row, Third column penguins[2, 3] ``` ``` ## # A tibble: 1 x 1 ## island ## <chr> ## 1 Torgersen ``` Entire row: `DatSetName[row#, ]` ```r # Second row penguins[2,] ``` ``` ## # A tibble: 1 x 9 ## id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 4274 Adelie Torge… NA 17.4 186 3800 ## # … with 2 more variables: sex <chr>, year <dbl> ``` ] .pull-right-40[ Entire col: `DatSetName[, column#]` ```r # Third column penguins[, 3] ``` ``` ## # A tibble: 342 x 1 ## island ## <chr> ## 1 Torgersen ## 2 Torgersen ## 3 Torgersen ## 4 Torgersen ## 5 Torgersen ## 6 Torgersen ## 7 Torgersen ## 8 Torgersen ## 9 Torgersen ## 10 Torgersen ## # … with 332 more rows ``` ] --- class: inverse, center, middle # Working with the data --- # The $ Suppose we want to single out the column of bill length values. 
.pull-left[ * How did we previously learn to do this? ```r penguins[, 4] ``` ``` ## # A tibble: 342 x 1 ## bill_length_mm ## <dbl> ## 1 39.1 ## 2 NA ## 3 40.3 ## 4 36.7 ## 5 39.3 ## 6 38.9 ## 7 NA ## 8 34.1 ## 9 42 ## 10 37.8 ## # … with 332 more rows ``` ] .pull-right[ The problem with this method, is that we need to know the column number which can change as we make changes to the data set. * Use the `$` instead: `DatSetName$VariableName` ```r penguins$bill_length_mm ``` ``` ## [1] 39.1 NA 40.3 36.7 39.3 38.9 NA 34.1 42.0 37.8 37.8 41.1 38.6 34.6 36.6 ## [16] 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5 39.5 ## [31] 37.2 39.5 40.9 36.4 39.2 38.8 NA 37.6 39.8 36.5 40.8 36.0 44.1 37.0 39.6 ## [46] 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6 35.7 ## [61] 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5 42.8 ## [76] 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 NA 36.9 38.3 38.9 35.7 ## [91] 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9 39.7 ## [106] 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1 36.2 ## [121] 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1 41.1 ## [136] 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 NA 37.8 36.0 ## [151] 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5 48.4 ## [166] 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8 48.2 ## [181] 50.0 47.3 NA 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3 49.6 ## [196] 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5 43.2 ## [211] 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2 46.5 ## [226] 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3 47.5 ## [241] 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2 49.1 ## [256] 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8 47.2 ## [271] 46.8 50.4 45.2 
49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0 51.3 46.6 ## [286] 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6 46.7 52.0 ## [301] 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9 53.5 49.0 ## [316] 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7 42.5 52.2 ## [331] 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2 ``` ] --- # Basic plots of numeric data: Histogram ```r hist(penguins$bill_length_mm) ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> With extra features: ```r hist(penguins$bill_length_mm, xlab = "Length (mm)", main="Penguin bills") ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> --- # Basic plots of numeric data: Boxplot .pull-left[ ```r boxplot(penguins$bill_length_mm) ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ ```r boxplot(penguins$bill_length_mm ~ penguins$sex, horizontal = TRUE, xlab = "Length (mm)", ylab = "Sex", main = "Penguin bills by sex") ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" /> ] --- # Basic plots of numeric data: Scatterplot .pull-left[ ```r plot(penguins$flipper_length_mm, penguins$bill_length_mm) ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r plot(penguins$flipper_length_mm, penguins$bill_length_mm, xlab = "Flipper", ylab = "Bill", main = "Bill vs. flipper length") ``` <img src="01_intro_r_eda_part1_files/figure-html/unnamed-chunk-54-1.png" style="display: block; margin: auto;" /> ] --- # Summary stats of numeric data (1/3) * Standard R `summary` command ```r summary(penguins$flipper_length_mm) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
##   172.0   190.0   197.0   200.9   213.0   231.0
```

* Mean and standard deviation

```r
mean(penguins$flipper_length_mm)
```

```
## [1] 200.9152
```

```r
sd(penguins$flipper_length_mm)
```

```
## [1] 14.06171
```

---

# Summary stats of numeric data (2/3)

* Min, max, & median

.pull-left[
```r
min(penguins$flipper_length_mm)
```

```
## [1] 172
```

```r
max(penguins$flipper_length_mm)
```

```
## [1] 231
```
]

.pull-right[
```r
median(penguins$flipper_length_mm)
```

```
## [1] 197
```
]

* Quantiles

```r
quantile(penguins$flipper_length_mm, probs = c(0, .25, .5, .75, 1))
```

```
##   0%  25%  50%  75% 100%
##  172  190  197  213  231
```

---

# Summary stats of numeric data (3/3)

.pull-left-60[
* Find the mean bill length

```r
mean(penguins$bill_length_mm)
```

```
## [1] NA
```

*Why did we get `NA` for the mean?*
]

.pull-right-40[
Since there are missing values (`NA`), we need to tell R to remove them from the data when calculating the mean.

```r
mean(penguins$bill_length_mm, na.rm = TRUE)
```

```
## [1] 44.00387
```
]

```r
summary(penguins$bill_length_mm)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   32.10   39.45   44.70   44.00   48.52   59.60       6
```

---

# Practice 2

Create a new Rmd for Practice 2 or continue in your current Rmd.

1. Find the median bill length. Is the median bill length similar to the mean?
1. What is the distance between the smallest and largest bill *depths*?
1. What does the `range()` command do? Try it out on the bill depths.
1. Make a scatterplot with bill length on the x-axis and bill depth on the y-axis. What is the relationship between bill length and depth?
1. Knit your Rmd file.
1. If you have time:
    * install the package `skimr`
    * load the package
    * run the command `skim(penguins)`
    * what does the `skim` command do?
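---

# Recap sketch: missing values in summary functions

As a recap of the `na.rm` slide, here is a small sketch on a made-up vector. The name `bill` and its values are hypothetical stand-ins for a data column, not the workshop data, and only base R is assumed.

```r
# Made-up values standing in for a numeric column with one missing entry
bill <- c(39.1, NA, 40.3, 36.7)

mean(bill)                 # returns NA because one value is missing
mean(bill, na.rm = TRUE)   # removes the NA first, then averages the rest
sd(bill, na.rm = TRUE)     # sd() takes the same na.rm argument
sum(is.na(bill))           # counts how many values are missing
```

Many base R summary functions accept `na.rm`. Typing `?mean` or `?sd` in the console shows whether a given function does.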
---

# End of Day 1

- [Practice Solution Link](https://jminnier-berd-r-courses.netlify.app/01-intro-r-eda/01_intro_r_eda_practice_answers_part1)

# Part 2 slides

- Link for slides for day 2: [bit.ly/berd_intro_part2](http://bit.ly/berd_intro_part2)