R – Workshop Nov 2018       School of Life Sciences, UKZN   Prof. Ursula Scharler & Dr. Anna Bastian

      R Workshop for Postgraduate Students
                          School of Life Sciences
       Prof. Ursula Scharler and Dr. Anna Bastian

                            1 November 2018
                          and 13 November 2018

This workshop is designed to expose School of Life Sciences
postgraduate students to R, and provide you with basic skills in data
management, analysis and visualisation.

___________________________________________________________________________

2019:
If you are interested in carrying on working with R, and extending your skills beyond the
basics, we will hold bi-weekly meetings in 2019. These will take the form of discussions and
finding out together about functions, packages, visualisations and analyses in R.
Announcements for these will follow in January.

WELCOME to R!

   Here are the basic themes that we will be working on today:
       1. DATA MANAGEMENT
       2. R AND R-STUDIO
       3. SIMPLE OPERATIONS, MAKING OBJECTS
       4. IMPORTING/EXPORTING FILES, WORKING DIRECTORY
       5. WHAT IS AN R PACKAGE?
       6. VISUALISATIONS - EXAMPLES OF SIMPLE GRAPHS
       7. EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS
       8. WHERE TO LOOK FOR HELP
       9. ADDITIONAL PLOT FUNCTIONS

Contents
1     DATA MANAGEMENT
    1.1      BEST PRACTICE: SETTING UP AND MAINTAINING A DATABASE
      1.1.1         Criteria of good data management
    1.2      DATA FORMATS (SAVE AS…)
2     R AND R-STUDIO
    2.1      DOWNLOAD AND INSTALL R
    2.2      DOWNLOAD AND INSTALL RSTUDIO
    2.3      RSTUDIO LAYOUT
3     SIMPLE OPERATIONS IN R, AND MAKING OBJECTS
    3.1      FUNCTIONS
    3.2      VECTORS, MATRICES, ARRAYS, DATA FRAMES, AND LISTS
      3.2.1         Data types
      3.2.2         Manipulating vectors and matrices
      3.2.3         Summary statistics for vectors and matrices
4     IMPORTING AND EXPORTING FILES, SETTING WORKING DIRECTORY, SCRIPT
    4.1      SCRIPT
5     WHAT ARE ‘R PACKAGES’?
    5.1      SOME USEFUL PACKAGES
6     VISUALISATION - EXAMPLES OF SIMPLE GRAPHS
    6.1      SCATTERPLOT
    6.2      BOXPLOT
7     EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS
    7.1      ANOVA
      7.1.1         Assumptions to check before running an ANOVA
      Testing whether a distribution is normal (Shapiro-Wilk’s test)
      Testing for homogeneity of variance (Levene’s test)
    7.2      REGRESSION ANALYSIS
    7.3      Assumptions to check before running a Regression Analysis
8     WHERE TO LOOK FOR HELP
9     GRAPHICS VISUALISATION WITH GGPLOT2
    9.1      AN EXAMPLE: SCATTERPLOT

1 DATA MANAGEMENT:
Data are the basic ingredients of your research. These basic ingredients should be stored in
such a way that anyone is clear on the following points:
   1. Which type of data are they?
   2. What type of study do they relate to?
   3. When, where and why were the measurements taken/data produced?
   4. By whom were the measurements taken/data produced?
   5. What units do the data have?
   6. Do the data show original measurements, or transformed measurements, or
      transformed data?
   7. Where is the backup stored?
   8. Who has knowledge of/access to the backup?
   9. … and others depending on the study.

1.1 BEST PRACTICE: SETTING UP AND MAINTAINING A DATABASE
We are using MS Excel to capture the collected data as it is user friendly, widely used, and
can save data in various formats for downstream applications. However, any other open
source spreadsheet software will essentially perform the same tasks.

 1.1.1 Criteria of good data management:

   1. The dataset contains Metadata. This is a basic description of the data, including basic
      information on the study or programme the study is a part of.
   2. Each datapoint is identifiable in terms of when, where and by whom it was taken.
   3. Each datapoint has a unit attached to it. Make sure to use SI units and/or ISO
      standards and to be consistent.
   4. Variables (horizontal): Proper naming of columns. For example “Temperature (°C)” or
      “Diameter (m)”. Cases (vertical): unique identifiers are crucial as they link the
      different variable values to each case. For example “Population 1” or “Treatment A”.
   5. No cell contains more than one piece of information, i.e. date and sampling site, or
      measurement and unit, should NOT be in the same cell. Different bits of information
      should be captured in separate columns, e.g. sampling information such as “caught
      on 01-11-2018 at 10:45am” should appear in at least two columns “Sampling date
      (YYYY-MM-DD)” and “Sampling time (hh:mm:ss)”.
      This is important because a datapoint cannot be identified for analysis when it is
      mixed with other information.
   6. Always keep a Mastercopy of your data, i.e. a version of your original dataset that you
      do not change.

   7. Databases are usually updated continuously. To avoid losing crucial information, keep
      backups before any major changes. In addition, an accompanying log file is used to
      keep track of the changes that have been made.
   8. Keep track of any changes you make to the original dataset. For instance, you might
      want to change datapoints after you applied some form of quality control. Or you
      might want to exclude datapoints when you are not sure of their validity (e.g.
      because of faulty equipment, or because the person taking the measurements was
      inexperienced or careless). Also list the reasons why you do not trust a datapoint,
      or why you decide to remove it altogether.
   9. Depending on the amount of data you produce, consider using dedicated data
      management software.
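
As a rough illustration of criteria 4 and 5, the small R sketch below builds a table in which every column holds exactly one piece of information. All column names and values are invented for this example and do not come from a real dataset.

```r
# Hypothetical example data: one piece of information per column.
catch <- data.frame(
  case_id       = c("Population 1", "Population 2"),       # unique identifier per case
  sampling_date = as.Date(c("2018-11-01", "2018-11-01")),  # YYYY-MM-DD
  sampling_time = c("10:45:00", "11:30:00"),               # hh:mm:ss
  temperature_C = c(21.5, 22.1),                           # unit stated in the column name
  stringsAsFactors = FALSE
)

# Because date, time and measurement are kept apart,
# each column can be used on its own in an analysis.
mean(catch$temperature_C)
```

Had date and time been typed into one cell together with other text, the measurement column could not have been averaged this directly.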

1.2 DATA FORMATS (SAVE AS…)
The different formats such as .xlsx (Excel Workbook) contain hidden formatting information
written into the file, which can interfere with other software programmes. For instance, ‘cells’
in Excel and in R-Studio do not contain the same hidden formatting.
It is recommended to save the database in the default .xlsx format and to export the Excel
workbook as a tab-delimited plain text file (.txt), or comma-delimited plain text file (.csv).
Plain text files contain a very small amount of hidden formatting. Plain text files can also be
imported back into Excel (and SPSS).
When you intend moving between software, first check which formats the receiving software
can read, and which format the source software can save as.

2 R AND R-STUDIO:
2.1     DOWNLOAD AND INSTALL R

      As you know, R is entirely free of charge. You can download it from:
      
      If you work in Windows, go to 'Download R for Windows' (choose the most recent
      version) and follow the steps you are prompted for. Choose default answers for all
      questions. If you work on a different operating system, choose the version for the
      operating system.

      If you need to install a later version of R over an older version of R, follow the
      instructions on .

      Make sure you are working with the latest version (currently 3.5.1) as some packages
      won't install or run properly if you are using an older version of R.

2.2     DOWNLOAD AND INSTALL RSTUDIO

      You can work with R from R Commander, or from R-Studio, depending on the application
      (i.e. what you use R for). Whereas R Commander is a GUI (Graphical User Interface) to R
      with drop-down menus for e.g. statistical analyses, R-Studio is an IDE (Integrated
      Development Environment) that allows you to develop programmes in R, and of course
      to run them. You can also run all of your statistical analyses from R-Studio.

             To install RStudio, go to:
             
             Choose the most recent version.

             If you work in Windows, click Download RStudio Desktop, and choose the free
             version of RStudio Desktop.

2.3     RSTUDIO LAYOUT:

      RStudio consists of four windows, showing your work and the output of your work. You
      can change the size of each window by dragging the dividing line.

   On the bottom left is the console (or command) window. Here you can type and execute
   commands.

   On the top left is the script (or editor) window. Here you can type, edit and save your
   commands (or scripts). This is a very convenient window, since it allows you to type and
   edit your commands before you execute them. If the window does not show when you
   start RStudio, you can open it by clicking on File > New File > R Script.
   When typing a command into the script window, it does not run automatically, but you
   need to run it with Ctrl+Enter, or by clicking Run on top of the script window. Either one
   line, or several lines, or an entire script can be run at once.

   On the top right is the Environment/History window. In this workspace you can see all
   the code R has in its memory (History tab), and the data that are in the memory
   (Environment tab). You can click on any of them to view and edit. In this window, you
   can also import datasets into R (see steps below).

   On the bottom right, there is the Files/Plots/Packages/Help window. Here you can see
   and open your files, view plots that you made, load and install packages, and access
   Help.

3 SIMPLE OPERATIONS IN R, AND MAKING OBJECTS:

         Add two numbers:
         > 3 + 3
         [1] 6

         Make an object of the sum:
         > x <- 3 + 3
         > x
         [1] 6

         This is the same as:
         > x = 3 + 3
         > x
         [1] 6

   Note that “=” and “<-” can both be used to assign a value to an object.

         Divide two numbers:
         > 25/5
         [1] 5

         Making an object:
         > y = 25/5

         > y
         [1] 5

         Subtract two objects:
         > x - y
         [1] 1

         Subtracting y from x and making a third object:
         > z = x - y
         > z
         [1] 1

  Having made the objects 'x' and 'y' saves us from typing the operation in a more
  cumbersome way, i.e.:
         > (3+3)-(25/5)
         [1] 1

  … and our objects are re-usable for other operations, e.g.:
         > x/y + z
         [1] 2.2

  One often creates several objects during a working session. Sometimes one cannot
  remember all the objects one has created. To find out which objects you have created,
  type:
         > ls()                                             # “ls” stands for list

  If you want to remove an object from your workspace, type:
         > rm()                                             # “rm” stands for remove

  I.e. if you want to remove the object x that we created previously, type:
         > rm(x)

  If you want to remove all objects in your workspace (the entire list), type:
         > rm(list = ls())
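
To see these commands working together, here is a short sketch. The object names first_obj and second_obj are made up for this example.

```r
# Create two objects, check the workspace, then remove one of them.
first_obj  <- 3 + 3
second_obj <- 25 / 5

"first_obj" %in% ls()     # TRUE: the object is listed in the workspace

rm(first_obj)

"first_obj" %in% ls()     # FALSE: the object has been removed
"second_obj" %in% ls()    # TRUE: the other object is still there
```

Be careful with rm(list = ls()): it removes everything at once and cannot be undone.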

  What names can objects have?
     Letters in upper and lower case (e.g. x, y, X, Y), so object names are case sensitive.
     Symbols (e.g. ‘.’)
     Numbers (following a letter)
     NB object names cannot have spaces, i.e. use ‘my_data’ rather than ‘my data’.
          Keep the object name short (otherwise the name is often difficult to distinguish
           from other, similar names, and it is more cumbersome to type a long name
           rather than a short one)
          Avoid using function names (e.g. data, factor, sqrt) as object names. It gets
           confusing to call up a function of a certain name on an object of the same name.
          There are reserved words and signs in programming languages. To see the
           reserved words/signs type
           > ?reserved

           and R returns a list with all reserved words.
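
If you are unsure whether a name is allowed, the base R function make.names() offers a quick check: it returns the name unchanged when it is valid, and a corrected version when it is not. The three names below are invented examples.

```r
# make.names() turns any text into a syntactically valid object name.
make.names("my data")     # "my.data": the space is replaced
make.names("my_data")     # "my_data": already a valid name
make.names("2nd_site")    # "X2nd_site": a name cannot start with a number
```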

Typical errors:
    Incomplete operation: You may notice a plus-sign at the beginning of a new line after
    you hit enter. This means the function is not complete yet and R waits for another input.

    Note that R is tolerant when it comes to spacing. “3 + 3” is the same as if we type “3+3”,
    and a space between a function name and its brackets is also accepted. However, this is
    NOT the case within the name of a function. Names of functions are fixed and R will not
    perform a task if the name is misspelled, including spaces added inside the name:
           > sqrt(9)
           [1] 3
           > sqrt (9)
           [1] 3
           > sq rt(9)
           Error: unexpected symbol in "sq rt"

3.1 FUNCTIONS

    We can perform more complex calculations following the general order of operators in
    algebra. To do so we need to know the code for the operators in R as otherwise we will
    get an error message:
           > 10+(3X5^2+4-2)
           Error: unexpected symbol in "10+(3X5"

    To go back and correct the “X” to “*”, which is the correct operator for multiplication,
    click the arrow up key on your keyboard, then the arrow left key to go to the “X”. That
    way you can scroll back through all the commands you have used before.

           > 10+(3*5^2+4-2)
           [1] 87

      Functions help us to avoid repetitive tasks. Functions are sequences of logical statements
      which will perform a named calculation or computation.

      The function which will take the square root of a number is “sqrt()”. There are many
      built-in functions, such as rounding a number “round()”. The number which we want to
      round will be placed inside the brackets. Everything inside the brackets is called an argument.
      For example, we can specify the output more by stating that we want two decimal
      places:
             > round(4.34567789, 2)
             [1] 4.35

      This is the same as:
             > round(x=4.34567789, digits=2)
             [1] 4.35

      Knowing the different options for specifying arguments is very helpful when it comes to
      more complex functions. Learning the names of functions and what they do is basically
      the same as learning the vocabulary of a new language, or expanding the vocabulary of
      your spoken or written language.
      Functions can be queried by using a ‘?’. This opens the help page, which explains how
      to use the function and what its arguments are.
      For example:
             > ?round
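
A brief sketch of how arguments behave, using round() again. The numbers are the same as above; nothing here goes beyond base R.

```r
# Named arguments can be given in any order.
round(digits = 2, x = 4.34567789)    # 4.35

# If 'digits' is left out, round() falls back on its default (0 decimal places).
round(4.34567789)                    # 4

# args() prints the argument names of a function in the console.
args(round)
```

Naming the arguments takes more typing, but it makes a script easier to read later on.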

3.2     VECTORS, MATRICES, ARRAYS, DATA FRAMES, AND LISTS:

      Some datasets consist of matrices and vectors. Here we will learn a few operations
      dealing with matrices and vectors, and learn about a few other data formats.

 3.2.1 Data types:
      Vectors and matrices are data structures. You can make your own vectors (enter them
      by hand) by using ‘:’ or ‘c()’. Here is a vector consisting of integer numbers:

             > 1:7
             [1] 1 2 3 4 5 6 7

      Or:
             > c(1, 2, 3, 4, 5, 6, 7)                          # “c” stands for combine
             [1] 1 2 3 4 5 6 7

  You can also make it an object:
        > x = 1:7
        > x
        [1] 1 2 3 4 5 6 7
  (We have now simply overwritten our previous object 'x', which was 3 + 3)

  You can connect vectors into columns. But first, create a second vector called y:
        > y = 1:7
        > cbind(x,y)                            # “cbind” takes a sequence of arguments
                                                  and combines it by columns.
               x   y
        [1,]   1   1
        [2,]   2   2
        [3,]   3   3
        [4,]   4   4
        [5,]   5   5
        [6,]   6   6
        [7,]   7   7

  You can connect the same vectors into rows:
        > rbind(x,y)                            # “rbind” takes a sequence of arguments
                                                  and combines it by rows.
            [,1] [,2] [,3] [,4] [,5] [,6] [,7]
        x      1    2    3    4    5    6    7
        y      1    2    3    4    5    6    7

  This is how one can create a matrix, here called M:
        > M = matrix(data = 1:16, nrow = 4, ncol = 4)
        > M
             [,1] [,2] [,3] [,4]
        [1,]    1    5    9   13
        [2,]    2    6   10   14
        [3,]    3    7   11   15
        [4,]    4    8   12   16

  Another data type is an array, which can have more dimensions than a matrix (see
  below Figure).
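
As a small sketch, an array can be made with the base R function array(). The object name A and its dimensions are chosen only for illustration.

```r
# A 3-dimensional array: 2 rows, 3 columns and 2 "layers".
A <- array(data = 1:12, dim = c(2, 3, 2))

dim(A)         # 2 3 2
A[1, 2, 2]     # row 1, column 2 of the second layer: 9
A[, , 1]       # the whole first layer, which is a 2 x 3 matrix
```

Like a matrix, an array is filled column by column, and then layer by layer.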

  In a data frame, columns can have different data formats, or modes, i.e. they can be
  numeric, character, factor, and others.

  How can you check which class and mode of data you have? Here a check of our vectors
  x and y, and the matrix M.
        > class(x)
        [1] "integer"
        > mode(x)
        [1] "numeric"
        > class(y)
        [1] "integer"
        > mode(y)
        [1] "numeric"
        > class(M)
        [1] "matrix"
        > mode(M)
        [1] "numeric"

  You can change the data mode if you wish, i.e. change the numeric mode of the vector x
  to a character mode.
        > x = as.character(x)
        > x
        [1] "1" "2" "3" "4" "5" "6" "7"
        > class(x)
        [1] "character"
        > mode(x)
        [1] "character"

  Note that the inverted commas distinguish the data as character data. Be aware that if
  your data are character data you will not be able to perform functions on them which
  are based on numeric data.
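
The sketch below shows what this means in practice, and how to convert character data back to numbers with as.numeric(). The object names x_chr and x_num are made up for this example.

```r
x_chr <- as.character(1:7)

# mean() cannot work with character data: it gives a warning and returns NA.
suppressWarnings(mean(x_chr))    # NA

# Convert back to numeric data before calculating.
x_num <- as.numeric(x_chr)
mean(x_num)                      # 4
```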

  Here is how to create a data frame with a numeric and a character vector:
         > x = 1:5
         > y = c("a", "b", "c", "d", "e")
         > data.frame(x, y)
           x y
         1 1 a
         2 2 b
         3 3 c
         4 4 d
         5 5 e

  Another data type is a list. This is a collection of objects (components). A variety of
  objects that are possibly unrelated can be collected in a list, which is given a new name.
  You can for instance create a list of matrix M and vector x:

         > list(M, x)
         [[1]]
              [,1] [,2] [,3] [,4]
         [1,]    1    5    9   13
         [2,]    2    6   10   14
         [3,]    3    7   11   15
         [4,]    4    8   12   16

         [[2]]
         [1] 1 2 3 4 5
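
It is often handy to give the components of a list their own names, so that they can be pulled out again later. A minimal sketch follows; the names my_list, mat and vec are invented for this example.

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)
x <- 1:5

# Give each component a name when creating the list.
my_list <- list(mat = M, vec = x)

names(my_list)            # "mat" "vec"
my_list$vec               # the vector: 1 2 3 4 5
my_list[["mat"]][1, 4]    # row 1, column 4 of the matrix: 13
```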

3.2.2 Manipulating vectors and matrices:

  Vectors and matrices can be manipulated in R. Here are a few operations.
  Displaying a single value from a vector:

         > x = 1:7
         > x[4]
         [1] 4

  This displays the 4th element of the vector, which in this case is 4.

  If you want to take out several values from the vector:

         > x[c(2, 4, 6)]
         [1] 2 4 6

  If the values you want to pull out are in a sequence, you can use:

         > x[2:5]
         [1] 2 3 4 5

  To pull out all values of the vector x that are smaller than 3:
         > x[x < 3]
         [1] 1 2
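
To see what happens behind the scenes in this kind of selection, here is a short sketch using the same vector.

```r
x <- 1:7

x < 3                # TRUE for every element smaller than 3, FALSE otherwise
which(x < 3)         # positions of the elements smaller than 3: 1 2
x[x > 2 & x < 6]     # two conditions combined with "&": 3 4 5
```

R first works out TRUE or FALSE for every element, and the square brackets then keep only the elements marked TRUE.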

  The operations conducted above on a vector can also be conducted on a matrix, e.g. pull
  out one number from a matrix, here on our matrix M, denoting the row and column
  where the value is to be found:

         > M
              [,1] [,2] [,3] [,4]
         [1,]    1    5    9   13
         [2,]    2    6   10   14
         [3,]    3    7   11   15
         [4,]    4    8   12   16
         > M[1,4]
         [1] 13

  One can also call up a whole row or column from a matrix, demonstrated again on the
  matrix M:
         > M
              [,1] [,2] [,3] [,4]
         [1,]    1    5    9   13
         [2,]    2    6   10   14
         [3,]    3    7   11   15
         [4,]    4    8   12   16
         > M[1, ]
         [1]  1  5  9 13
         > M[, 3]
         [1]  9 10 11 12

  … and find all values of a row that are higher than 4:

         > M
              [,1] [,2] [,3] [,4]
         [1,]    1    5    9   13
         [2,]    2    6   10   14
         [3,]    3    7   11   15
         [4,]    4    8   12   16
         > M[1, ]
         [1]  1  5  9 13

  Are there values >4?

         > M[1,] >4
         [1] FALSE  TRUE  TRUE  TRUE

  Which are the values that are >4?

         > M[1, M[1, ] > 4]
         [1]  5  9 13

  Sorting the matrix by the first column. order() is a function that allows you to sort
  variables:
         > order (M[,1])
         [1] 1 2 3 4

  The above result shows the current order of the values in column 1. We can see that the
  smallest value of column 1 is in position 1, the next smallest value is in position 2, etc.

         > order (M[,1], decreasing = TRUE)
         [1] 4 3 2 1

  The above result shows the order of the values in column 1, now in the opposite
  direction. We can see that the largest value of column 1 is in position 4, the next
  largest value is in position 3, etc. and the smallest value is in position 1.
  Check:
         > M
                [,1] [,2] [,3] [,4]
         [1,]      1    5    9   13
         [2,]      2    6   10   14
         [3,]      3    7   11   15
         [4,]      4    8   12   16

  This function orders and displays the sorted matrix:
         > M[order(M[,1], decreasing = TRUE), ]
              [,1] [,2] [,3] [,4]
         [1,]    4    8   12   16
         [2,]    3    7   11   15
         [3,]    2    6   10   14
         [4,]    1    5    9   13

  If your matrix has column headings, you can specify the column heading of the column to
  be sorted instead of the number of the column (in this case it was column 1).
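
A minimal sketch of sorting by column heading is given below. The headings "a" to "d" are hypothetical, since our matrix M has no headings of its own.

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)
colnames(M) <- c("a", "b", "c", "d")    # hypothetical column headings

# Sort the rows by the column called "a", largest value first.
M_sorted <- M[order(M[, "a"], decreasing = TRUE), ]

M_sorted[, "a"]    # 4 3 2 1
```

Using the heading rather than the column number keeps the script working even if columns are later added or moved.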

3.2.3 Summary statistics for vectors and matrices:

  Summary statistics can be calculated for vectors and matrices. Here for instance various
  operations on vectors:

        > y = 2:8
        > length(x)
        [1] 7
        > mean(x)
        [1] 4
        > sqrt(x)
        [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
        > sd(x)
        [1] 2.160247
        > var(x)
        [1] 4.666667

  A quick check on their correlation:
        > cor(x,y)
        [1] 1
        > cor(x,sqrt(y))
        [1] 0.9953148

  How to calculate summary statistics for matrices? Here an example on the mean for all
  rows in our matrix M, where MARGIN = 1 indicates we want to apply the function to
  rows, and MARGIN = 2 indicates we want to apply it to columns.

        > M
             [,1] [,2] [,3] [,4]
        [1,]    1    5    9   13
        [2,]    2    6   10   14
        [3,]    3    7   11   15
        [4,]    4    8   12   16
        > apply(M, MARGIN = 1, FUN = mean)
        [1]  7  8  9 10

  And here for columns:
        > M
             [,1] [,2] [,3] [,4]
        [1,]    1    5    9   13
        [2,]    2    6   10   14
        [3,]    3    7   11   15
        [4,]    4    8   12   16
        > apply(M, MARGIN = 2, FUN = mean)
        [1]  2.5  6.5 10.5 14.5
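
For means, base R also offers the shortcuts rowMeans() and colMeans(), and apply() itself accepts other functions as well. A short sketch on the same matrix:

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)

rowMeans(M)                        # same result as apply(M, 1, mean): 7 8 9 10
colMeans(M)                        # same result as apply(M, 2, mean): 2.5 6.5 10.5 14.5
apply(M, MARGIN = 2, FUN = max)    # largest value of each column: 4 8 12 16
```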

4 IMPORTING AND EXPORTING FILES, SETTING WORKING
  DIRECTORY, SCRIPT:

   Setting a working directory means specifying a place where your files will be saved to,
   and stored. After you set a working directory, you do not have to type in, or look for, the
   directory when saving, or importing a file.

   Datasets can be imported into R, either as a text file, or an Excel file. People often prefer
   to work with their data in text format, because text files do not carry the hidden
   formatting (usually invisible on the screen) that may interfere with the data format. To
   have your data in text format, make sure your data table is in .csv format. E.g. if you
   work in Excel, save your file as .csv, which is a text file that can be read in easily.

   There are two ways you can set your working directory:
   1. > setwd("D:/Documents/YourFavouriteDirectory")
   2. Go to Session in the top menu, go to Set Working Directory, and Choose Directory.

   To see where the working directory is, type: > getwd()
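
The commands above can be tried out safely in a temporary folder, as in the sketch below. The folder name "MyProject" is hypothetical; in your own work you would use a real folder on your computer instead of tempdir().

```r
old_wd <- getwd()                                    # remember the current working directory

project_dir <- file.path(tempdir(), "MyProject")     # hypothetical project folder
dir.create(file.path(project_dir, "data"), recursive = TRUE)   # project folder with a data folder inside

setwd(project_dir)      # make the project folder the working directory
getwd()                 # now points to the project folder
list.files()            # shows the "data" folder

setwd(old_wd)           # go back to where we started
```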

 It is good practice to make a working directory before you start a new project, and make a
 data folder within that working directory. You can save the script in the working directory, and
 can import from/export to the data folder.

   To import a file, go to the Environment/History panel, click Import Dataset, and choose
   either From Excel or From Text (both are described below).

   To import your data from Excel into R, install and load the package readxl:

          > install.packages("readxl")
          > library(readxl)

   From Excel. Choose the file you want to import. In the panel, specify how your first row
   and column of the xlsx file will be displayed in your R matrix. Alternatively, you can type

          > read_excel("YourFile.xlsx")

   From Text. Choose the file you want to import. This interface looks slightly different to
   the one when importing an Excel file, but contains the same information. In the panel,
   specify how your first row and column of the csv file will be displayed in your R matrix.
   Alternatively, you can type

          > read.csv("YourFile.csv")

  Since you set the working directory, you do not have to specify the entire path, including
  all the directories, where the file is found. The filename is sufficient.

  Saving your imported data as an object allows for better functionality going forward and
  prevents you from having to read it in every time you want to use it, e.g.

         > my_data <- read.csv("YourFile.csv")

  To export an object from R, e.g. our matrix M, as a .csv file, type:

         > write.csv(M,'Mnew',row.names=FALSE)

  Again, since you set your working directory, you only need to specify the name of the
  .csv file, which I named here 'Mnew'.
  'Mnew' is a text file. As it is a text file of the format .csv, you can open it in Excel. There
  you need to specify which separation the file has between the values. For .csv files, that
  is commas.

  If you specify in the same function that it is a csv file, you can open the file in Excel
  directly:
     > write.csv(M,'Mnew.csv',row.names=FALSE)
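
  To check that an export worked, you can read the file straight back into R. The sketch below writes to a temporary folder so that nothing in your own directories is touched; the object names csv_file and M_back are made up for this example.

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)

csv_file <- file.path(tempdir(), "Mnew.csv")    # temporary location for this example
write.csv(M, csv_file, row.names = FALSE)

M_back <- read.csv(csv_file)

class(M_back)    # "data.frame": read.csv() returns a data frame, not a matrix
dim(M_back)      # 4 4: the same number of rows and columns as M
names(M_back)    # "V1" "V2" "V3" "V4": default headings, as M had none
```

  If you need a matrix again, as.matrix(M_back) converts the data frame back.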

  Needless to say, save your work FREQUENTLY. In R Studio, you can save not only your
  script and the figures you produce, but also the entire working environment.


4.1 SCRIPT
   You might find that your first analyses in R proceed by trial and error, even when
   following instructions. It is therefore good practice to set up your own script
   before starting, tailored to your data types, file names, object names etc. This
   can also serve as a template for future analyses or when you need to make minor
   changes to the analyses.
   Although it might seem time consuming, it is important to keep track of your work in R.

    In order to remember what your code means, it is customary to make notes. These
    should be descriptive and remind you of what you scripted. However, they should
    also be concise. If you add a #, everything thereafter will be disregarded by R, i.e. it
    will not be part of the code. For instance:
    # This is the code for writing my matrix to a .csv format, excluding row names.
          > write.csv(M,'Mnew.csv',row.names=FALSE)

    Or:
            > 3 + 3 # addition
            [1] 6

            > x <- 3 + 3   # assign the result to x
            > x
            [1] 6

            > 25/5   # division
            [1] 5

    You can also write notes regarding common errors and how to avoid or fix them.

   To write a script, you can use a text editor such as Notepad or Notepad++, or the
   built-in editor in R-Studio (File > New File > R Script).
      Notepad is included in MS Windows
      Notepad++ can be downloaded for free from: https://notepad-plus-plus.org/

   At a more advanced stage, you can write an R Markdown document. Markdown is a
   simple formatting syntax for authoring HTML, PDF, and MS Word documents. You can
   also embed code in an R Markdown document. For more details on using R Markdown see
   .


5 WHAT ARE ‘R PACKAGES’?

   In R, packages are amalgamations of functions which enable us to complete certain
   tasks. These can range from very general tasks such as data organisation and
   visualisation or plotting, to packages which are used in very specific niche tasks such as
   certain types of data analyses. Anyone can make a new package, and offer it free of
   charge to R users.

   Depending on your objective, you may require certain packages to conduct a certain
   analysis. Some packages come with instructional documentation, called vignettes, which
   provide the user with a breakdown of the package capabilities, complete with examples
   of functions and their outputs.

   To install a package you can make use of the GUI in R-Studio. In the bottom right panel
   select the ‘Packages’ tab. Here click on ‘Install’, and a window will pop up. The repository
   drop down is defaulted to CRAN – this will almost always be the desired repository. Type
   in the name of the package you require. Checking ‘Install Dependencies’ means that if
   there are packages that are required for your desired package to function, these will be
   installed automatically. Click install.

   Alternatively, you can type
      > install.packages("package_name")

   This does the same as the above explanation. Once you have a package installed, you
   need to call it from your package library in order to load it into your active R session. This
   is done by:
      > library(package_name)

   Note that quotation marks are not needed here.
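
   The two steps above (install once, then load in every session) can be combined so that
   a script only installs a package when it is missing. The following is a sketch under
   that assumption; the helper name load_or_install is hypothetical and not part of R or
   of the workshop material.

```r
# Hypothetical helper: install a package only if it is not yet available,
# then load it into the current session
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE is needed because pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded in this example
load_or_install("stats")
```

   In your own script you would call the helper with the package you need, e.g.
   load_or_install("readxl").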

5.1 SOME USEFUL PACKAGES

   dplyr – subsetting, summarising and joining of datasets
   tidyr – changing the layout of datasets, keeping your data tidy
   stringr – expressions and character string manipulation
   ggplot2 – sleek customisable plotting

   These are just a few examples of the many packages available, so search for the ones
   appropriate for what you are trying to do and refer to their respective vignettes for
   guidelines.


6 VISUALISATION - EXAMPLES OF SIMPLE GRAPHS
   One of the major advantages of working in R is its versatility for data visualisations. Here
   we provide two simple examples.

   Take the data frame chickwts, which is available in the automatically loaded datasets
   package. The data come from an experiment which was conducted to measure and
   compare the effectiveness of various feed supplements on the growth rate of chickens.
   Newly hatched chicks were randomly allocated into six groups, and each group was given
   a different feed supplement. Their weights in grams after six weeks are given along with
   feed types.

             > library(help = "datasets")
             > chickwts

   Have a quick look at the summary data:
             > summary(chickwts)
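
   Besides summary(), a few other base R functions give a quick overview of a data frame.
   The following sketch uses only functions that come with R; the object name group_means
   is chosen here for illustration.

```r
# Structure: number of rows, column names and column types
str(chickwts)

# First six rows of the data frame
head(chickwts)

# Number of chicks in each feed group
table(chickwts$feed)

# Mean weight (g) per feed group
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
round(group_means, 1)
```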

   The basic version of R comes with a plot() function, which can create a wide variety of
   graphs (type ?plot in the command line for details), and the lattice package is also
   helpful. However, the ggplot2 package is one of the most commonly used for graphics.

             > install.packages("ggplot2")
             > library(ggplot2)

      Below you’ll find two different ways of generating plots.


6.1     SCATTERPLOT

      Although the dataset is not ideally visualised through a scatterplot (boxplots are the
      better choice), you can plot it by using the function plot():

> plot(chickwts$weight,chickwts$feed,xlab="Weight",ylab="Feed")

# By using the "$" symbol, R returns all of the values in the column labelled "weight".

      Using ggplot will produce a similar scatterplot:
           > scatter <- ggplot(chickwts, aes(weight, feed))
           > scatter + geom_point() + labs(x = "Weight (g)", y = "Feed")


    Note that the variables are presented differently: weight is on the y-axis here and was
    on the x-axis before. The reason is that when the x-variable is a factor, the built-in
    function plot() automatically produces boxplots (this can be seen in the next section).
    This would be the correct way to choose the data, with weight on the y-axis:
         > plot(chickwts$feed,chickwts$weight,xlab="Feed",ylab="Weight")

    ggplot allows the user to choose which graph type is used.


6.2 BOXPLOT

   Still using the same dataset and the same function, we switch the x- and y-axes:

        > plot(chickwts$feed,chickwts$weight,xlab="Feed",ylab="Weight (g)")

   Compare this call to the plot() call we used above to produce a scatterplot.

      Using ggplot we produce a similar boxplot:

        > box <- ggplot(chickwts, aes(feed, weight))
        > box + geom_boxplot() + labs(x = "Feed", y = "Weight (g)")
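
   The same boxplot can also be requested explicitly in base R with boxplot() and the
   formula notation, which states which variable is explained by which. The sketch below
   also shows how to obtain the numbers behind the boxes; the object name box_stats is
   chosen here for illustration.

```r
# feed is a factor and weight is numeric, which is why plot() draws boxes
class(chickwts$feed)
class(chickwts$weight)

# Formula interface: weight as explained by feed
boxplot(weight ~ feed, data = chickwts, xlab = "Feed", ylab = "Weight (g)")

# The values behind the boxes (whisker ends, hinges, median) can be
# extracted without drawing anything
box_stats <- boxplot(weight ~ feed, data = chickwts, plot = FALSE)
box_stats$names        # group labels
box_stats$stats[3, ]   # medians, one per feed group
```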


7 EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS

7.1 ANOVA

   R allows you to easily construct an ANOVA table for the chick weight test using the
   built-in aov function:
        > chick.anova <- aov(weight ~ feed, data = chickwts)
        > summary(chick.anova)

   Note that you employ formula notation weight~feed to specify the measurement
   variable of interest, i.e. weight, as modelled by the categorical-nominal variable of
   interest, feed type.


   By default, R annotates model-based summary output like this with significance stars.
   These show intervals of significance: a dot marks a p-value between 0.05 and 0.1, and the
   number of stars increases as the p-value decreases further (* below 0.05, ** below 0.01,
   *** below 0.001).

   In our example, the very small p-value provides strong evidence against the null
   hypothesis that the mean chick weights are the same for the different diets.
   5.94e-10 is the scientific notation of 0.000000000594.
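
   If you need the F statistic or the p-value as numbers, for example to report them in a
   text, they can be extracted from the summary object. This is a sketch only; the object
   names chick_fit, anova_table, f_value and p_value are chosen here for illustration.

```r
# Fit the one-way ANOVA
chick_fit <- aov(weight ~ feed, data = chickwts)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(chick_fit)[[1]]

# Pull out the F statistic and p-value for the feed term (first row)
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Show the p-value in scientific and in fixed notation
format(p_value, digits = 3)
format(p_value, scientific = FALSE, digits = 3)
```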

7.1.1 Assumptions to check before running an ANOVA

   Before we can test our hypothesis using an ANOVA, we have to confirm that the dataset
   is suitable for this kind of analysis.
   The assumptions under which the F-statistic is reliable are the same as for all parametric
   tests based on the normal distribution:
       - The variances in each experimental condition need to be fairly similar (homogeneity
         of variance). Note: violating this assumption only matters if you have unequal
         group sizes.
       - Observations should be independent.
       - The dependent variable should be measured on at least an interval scale.
       - The values within each group should be normally distributed.

 Testing whether a distribution is normal (Shapiro-Wilk’s test)

   If the test is non-significant (p > .05) it tells us that the distribution of the sample is not
   significantly different from a normal distribution.
        > shapiro.test(chickwts$weight)

   This will give you the overall test of normality for “weight”. To check if the residuals of
   the data for each group (“feed”) are normally distributed, we first need to create a new
   object, specify that the test is done on residuals and then extract the p-values for each
   group:
        > SW_norm <- lm(weight ~ feed, data = chickwts)
        > shapiro.test(residuals(SW_norm))
        > do.call("rbind", with(chickwts, tapply(weight, feed, function(x)
          unlist(shapiro.test(x)[c("statistic", "p.value")]))))
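
   The per-group test can also be written in smaller steps, which some readers may find
   easier to follow. This is an alternative sketch, not a replacement for the code above;
   the object names weights_by_feed and sw_p are chosen here for illustration.

```r
# Split the weights into one vector per feed group
weights_by_feed <- split(chickwts$weight, chickwts$feed)

# Run the Shapiro-Wilk test on each group and keep only the p-value
sw_p <- sapply(weights_by_feed, function(x) shapiro.test(x)$p.value)

# One p-value per feed group
round(sw_p, 3)

# Groups (if any) that deviate significantly from normality at the .05 level
names(sw_p)[sw_p < 0.05]
```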

 Testing for homogeneity of variance (Levene’s test)

   If Levene’s test is significant (Pr (>F) in the R output is less than .05) then the variances
   are significantly different in different groups.


  To use Levene’s test, we use the leveneTest() function from the car package:

       > install.packages("car")

       > library(car)

  We enter two variables into the function: first the outcome variable of which we want to
  test the variances; and second, the grouping variable, which must be a factor.

       > leveneTest(chickwts$weight, chickwts$feed)
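
  To see what the test does, it can be reproduced with base R functions. With its default
  settings, leveneTest() centres each group on its median, and the test is then a one-way
  ANOVA on the absolute deviations from those medians. The sketch below follows that
  recipe; the object names group_median, abs_dev and levene_by_hand are chosen here for
  illustration.

```r
# Absolute deviation of each weight from the median of its feed group
group_median <- ave(chickwts$weight, chickwts$feed, FUN = median)
abs_dev <- abs(chickwts$weight - group_median)

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ chickwts$feed))
levene_by_hand

# p-value for the grouping factor
levene_by_hand[["Pr(>F)"]][1]
```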

  Run the Anova again:

       > anova(SW_norm)


7.2 REGRESSION ANALYSIS

   For this analysis, we will use another one of the loaded datasets, namely ChickWeight. It
   provides four variables, but we will only use two of them (weight, time as a proxy of age).

   First, plot your data. To do so, we can use the scatterplot function we have learned
   already. We specify the name of the dataset (ChickWeight), and of the two variables
   (Time, weight):
> scatterchicks <- ggplot(ChickWeight, aes(Time, weight))
> scatterchicks + geom_point() + labs(x = "Time", y = "weight")

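   To plot your data including a fitted line, use e.g. the following (by default
   stat_smooth() draws a smooth curve; write stat_smooth(method = "lm") for a straight
   regression line):
> scatterchicks + geom_point() + labs(x = "Time", y = "weight") + stat_smooth()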

   For a regression analysis, you can use the following simple function:
> myregr <- lm(weight ~ Time, data = ChickWeight)
> summary(myregr)
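
   Individual results can be taken from a fitted regression for further use. The following
   sketch fits a simple linear regression of weight on Time and extracts the coefficients,
   the R-squared value and two predictions; the object name chick_lm and the chosen days
   are assumptions for illustration.

```r
# Fit the simple linear regression
chick_lm <- lm(weight ~ Time, data = ChickWeight)

# Intercept and slope: the slope is the average gain in grams per day
coef(chick_lm)

# Proportion of the variance in weight explained by Time
summary(chick_lm)$r.squared

# Predicted weight at day 10 and day 20
pred <- predict(chick_lm, newdata = data.frame(Time = c(10, 20)))
pred
```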

7.3 Assumptions to check before running a Regression Analysis:

   The assumptions for a simple linear regression are that two variables are linearly related,
   and that the residuals are normally distributed and homoscedastic.
   You can plot the residuals in different ways. The following function produces four plots.
   To view them all at once, use:
> par(mfrow = c(2, 2))
> plot(myregr)

   The four plots show the following (look up further information yourself, the plots are not
   exclusive to R):
       - Residuals vs fitted: is the relation linear?
       - Normal Q-Q: are the residuals normally distributed?
       - Scale-location: homogeneity of variance
       - Residuals vs leverage: are extreme values influencing the analysis?

   You can statistically test the assumptions with the functions introduced in the Anova
   section.
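
   As a sketch of such a check, the Shapiro-Wilk test can be applied to the residuals of
   the regression, and the spread of the residuals can be compared between time points.
   The object names chick_lm, res, sw_res and res_sd are chosen here for illustration.

```r
# Fit the regression and extract its residuals
chick_lm <- lm(weight ~ Time, data = ChickWeight)
res <- residuals(chick_lm)

# Normality of the residuals (p < .05 indicates a deviation from normality)
sw_res <- shapiro.test(res)
sw_res$p.value

# Rough check of homoscedasticity: standard deviation of the residuals
# at each time point; similar values support the assumption
res_sd <- tapply(res, ChickWeight$Time, sd)
round(res_sd, 1)
```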


8 WHERE TO LOOK FOR HELP

   There is a lot of help for R users, which is easy to access. When you have a question on
   how to do certain operations in R, or you are looking for an explanation of an error
   message and information on how to fix it, simply search Google with your query. Your
   query will most likely have been posted already by someone else, which means you are
   likely to find a response and a solution.
   For operations that have to do with R itself, or any of its packages, use the R website and
   read through the package vignettes. They contain the code and instructions on how to
   use the packages.
   The query function ‘?’ will provide you with information on a function or package (e.g.
   ?read.csv will give information on the read.csv function). This information will appear in
   the help window (bottom right) of R Studio. Using ‘??’ will search the help pages of all
   installed packages for a keyword, which is useful when you do not know the exact name
   of a function.
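
   A few of the built-in help tools, shown as a short sketch. The calls that open a help
   page are wrapped in interactive() so that the script also runs when it is not typed at
   the console; the object names csv_functions and write_args are chosen here for
   illustration.

```r
if (interactive()) {
  help("read.csv")      # same as ?read.csv
  help.search("csv")    # same as ??csv
}

# Find functions and objects whose names contain a pattern
csv_functions <- apropos("csv")
csv_functions

# Show the arguments of a function without opening its help page
write_args <- args(write.csv)
write_args
```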

   Some websites:
   Quick-R
   Stackoverflow.com
   www.rdocumentation.org
   www.r-bloggers.com
   stat.ethz.ch
   Github:
   https://github.com/trending/r
   Ggplot2:
   https://ggplot2.tidyverse.org/

   http://manuals.bioinformatics.ucr.edu/home/programming-in-r#TOC-Debugging-Utilities

   Books:
   Springer Series: Use R! (https://www.springer.com/series/6991)

   Tutorials and courses can be found online:
   DataCamp:
   https://www.datacamp.com/courses/free-introduction-to-r
   Coursera
   Udemy

   There are also video tutorials available on YouTube:
   UTSSC
   How to R
   …. and many more


9 GRAPHICS VISUALISATION WITH GGPLOT2

   The graphs we have produced in the previous part can be improved!

   ggplot is very versatile and provides many ways to visualize data and to customize
   graphs.
   In ggplot a graph is made up of a series of layers. You can think of a layer as a plastic
   transparency with something printed on it such as text, data points, lines, bars and so on.
   To make a final graph, these layers are placed on top of each other.
   Each layer contains visual objects such as bars, data points, text. Visual elements are
   known as geoms (short for ‘geometric objects’).

   Most common diagram types (for a full list see http://had.co.nz/ggplot2/):

   • geom_point()
   • geom_bar()
   • geom_histogram()
   • geom_density()
   • geom_line()
   • geom_boxplot()

   • geom_text(): creates a layer with text on it.

   These geoms also have aesthetic properties that determine what they look like and
   where they are plotted. These aesthetics (aes() for short) control the appearance of
   graph elements (for example, their colour, size, style and location). Aesthetics can be
   defined in general for the whole plot, or individually for a specific layer.

   > ggplot((dataset) , aes(x=(x-coord.) , y=(y-coord.),
   + colour=(variable),
   + fill=(variable),
   + shape=(factor),
   + linetype=(factor),
   + group=(factor)))

   If you want to set an aesthetic to a specific value then you don’t specify it within the
   aes() function, but if you want an aesthetic to vary then you need to place the instruction
   within aes().
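
   The difference between setting and mapping an aesthetic can be sketched with the
   chickwts data used earlier (this assumes ggplot2 is installed; the object names p_set
   and p_map are chosen here for illustration).

```r
library(ggplot2)

# Setting: every point is blue, so colour is given outside aes()
p_set <- ggplot(chickwts, aes(x = feed, y = weight)) +
  geom_point(colour = "blue", size = 2)

# Mapping: colour varies with the feed type, so it is given inside aes()
p_map <- ggplot(chickwts, aes(x = feed, y = weight, colour = feed)) +
  geom_point(size = 2)

# Display the two plots
p_set
p_map
```

   Only the second plot receives a legend, because only there does colour carry
   information from the data.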


9.1 AN EXAMPLE: SCATTERPLOT

   The dataset we are using is called “Exam Anxiety.xlsx”. Make it a tab-delimited text
   file and read it into R:

      examData

         element_rect() to change the appearance of the rectangle elements. Rectangle
          elements: plot background, panel background, legend background, etc.

  Removing the background:
         scatter + theme(panel.background = element_rect(fill = "white"),
           element_line(linetype = 'solid', colour = "black"),
           axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
           geom_point() +
           geom_smooth(method = "lm", colour = "Red", fill = "Red") +
           labs(x = "Exam Anxiety", y = "Exam Performance %")

  Larger dots:
         scatter + theme(panel.background = element_rect(fill = "white"),
           element_line(linetype = 'solid', colour = "black"),
           axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
           geom_point(size = 3) +
           geom_smooth(method = "lm", colour = "Red", fill = "Red") +
           labs(x = "Exam Anxiety", y = "Exam Performance %")

  Regression line (+CI) changed:
         scatter + theme(panel.background = element_rect(fill = "white"),
           element_line(linetype = 'solid', colour = "black"),
           axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
           geom_point(size = 3) +
           geom_smooth(method = "lm", colour = "Red", fill = "Red",
                       linetype = "dashed", alpha = 0.1) +
           labs(x = "Exam Anxiety", y = "Exam Performance %")

  Save the graph by exporting the graph and saving it as a vector file (always the best
  format to keep backups of graphs):

  Save as image (format “.svg”) and as .pdf file. Alternatively, you can type:
         ggsave("Exam Anxiety Plot2.pdf")
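
  Called with only a file name, ggsave() saves the last plot that was displayed. As a
  sketch, the plot and its size can also be stated explicitly, which makes a script
  reproducible. The plot, the file name and the dimensions below are assumptions chosen
  for illustration (ggplot2 must be installed), and the file is written to a temporary
  folder.

```r
library(ggplot2)

# Illustrative plot built from a dataset that ships with R
p <- ggplot(chickwts, aes(x = feed, y = weight)) +
  geom_boxplot() +
  labs(x = "Feed", y = "Weight (g)")

# Save as PDF (a vector format); width and height are given in centimetres
out_file <- file.path(tempdir(), "chickwts_boxplot.pdf")
ggsave(out_file, plot = p, width = 16, height = 10, units = "cm")
```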

  Before quitting RStudio you can save your work by saving the History and/or by saving
  the Environment.
  The saved History file (.R format) can be loaded into RStudio, and its commands can
  then be sent to the Source pane and run.
