R Programming: Biginner’s introduction Leson 1.

Skills
Author

Benjamini Mpinga

Published

April 30, 2023

INTRODUCTION.

It’s possible that you’re aware of the R programming language and have come across useful tips for improving your R experience. However, I’d like to introduce another intriguing aspect of R. For beginners delving into R, there’s a delightful feature that allows you to effortlessly create a single document within the R editor, which can then be shared in various formats like PDF, Word, or HTML. This isn’t just an imagination. it’s the reality of using R. In fact, the very post you’re currently reading was generated using R. R is a truly remarkable tool that empowers users with coding skills, enhances writing abilities, facilitates the generation of diverse document formats.Promotes standardization and visualization, strengthens data analysis skills, expands statistical knowledge, and offers a multitude of other advantages for individuals working with data and research.

R Mentor

By having a mentor, you can benefit from their expertise, ask questions, and receive direct assistance tailored to your specific needs. They can help you understand complex concepts, provide examples and exercises, and ensure that you grasp the fundamental principles of R effectively. While books and online content are excellent resources for expanding your knowledge, having a mentor complements those resources by providing personalized guidance and support. It creates a dynamic learning environment where you can actively engage with someone experienced in R, fostering a deeper understanding and faster progress.

In conclusion, while reading books and searching online resources are valuable components of learning R,for beginners, having a mentor can offer significant advantages. They can provide individualized guidance, help you navigate challenges, and accelerate your learning journey. So, consider seeking a mentor who can dedicate time and effort to support you as you delve into the intricacies of R programming.

RStudio INSTALATION & MODERN R.

To start your journey with the R programming language, the first step is to install R and RStudio on your computer. R is the programming language itself, while RStudio provides a comprehensive working environment that enhances your R experience.

Here’s a breakdown of what R and RStudio are:

R is a programming language and software environment designed for statistical computing and graphics. It provides a wide range of statistical and graphical techniques and is widely used in various fields such as data analysis, research, and machine learning.

RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface with powerful tools and features that streamline your coding workflow. RStudio offers a code editor, console, data viewer, and a variety of helpful features like code completion, debugging tools, and project management.

Once you have installed R and RStudio, you’ll need to install packages that support the functionality of R and RStudio. Packages in R are collections of functions, data, and documentation that extend the capabilities of the base R system. These packages provide additional functions that are necessary for performing various tasks in your daily coding activities.

To download R, you can visit the Comprehensive R Archive Network (CRAN) website. CRAN is a network of mirror servers worldwide that distribute R and R packages. It’s recommended to use the cloud mirror, which automatically selects the appropriate mirror for you. You can access it at https://cloud.r-project.org.

It’s worth noting that a new major version of R is released annually, with minor releases occurring 2-3 times a year. It’s generally recommended to update R regularly to benefit from the latest features and improvements. Upgrading R can sometimes require reinstalling your packages, which may take some effort, particularly for major version upgrades. However, it’s advisable to keep your R installation up to date to ensure compatibility and access to the latest functionalities.

By installing R, setting up RStudio, and regularly updating your software, you’ll be well-equipped to begin your R programming journey and explore the vast possibilities the language has to offer.

R Package installation.

In R, packages are fundamental in R, and installing the necessary packages is crucial for expanding the capabilities of your R environment. One highly popular and powerful package ecosystem in R is called the “tidyverse”. The tidyverse is a collection of packages that work together seamlessly to provide a cohesive and efficient workflow for data entry, manipulation, modeling, visualization, and sharing insights derived from data analysis. Some of the key packages within the tidyverse include dplyr for data manipulation, ggplot2 for data visualization, tidyr for data tidying, readr for data import, and many more.

To install the tidyverse package, you can use the following command in R:

                         [ install.packages("tidyverse") ]

INTRODUCTION TO DATA SCIENCE IN R PROGRAMME.

Data science is a huge field, and there’s no way you can master it by reading a single book, so as i explained above it is more helpful if you get a mentor to lead you through obstacles. Also is an exciting knowledge that allows you to turn raw data into understanding, insight, and knowledge. The goal of R for Data Science is to help you to learn the most important tools in R that will allow you to do data science.

Data importation

Importing data into R is an essential step in performing data analysis. Data can be stored in various formats such as files (e.g., CSV, Excel), databases, or web APIs, and loading it into R allows you to manipulate, analyze, and visualize the data. To import data into R, you typically use functions or packages specifically designed for each data source or format.

Tyding

Once you’ve imported your data, it is required to tidy your data. Tidying your data means storing it in a consistent form that matches the semantics of the data set with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure allows you to focus your struggle on questions about the data, and not fighting the data into the right form for different functions.

Once you have tidy data, a first step is to transform it. Transformation includes narrowing in on observations of interest eg.(all people in one city, or all data from the last year), creating new variables that are functions of existing variables Together, tidying and transforming are called wrangling, this is because, getting your data in a form that’s natural to work with often, it usually feels like a fight!.

Once you have tidy your data with the variables you need, there are two main engines of knowledge generation: visualization and modeling. These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.

Visualization

Visualization is a fundamentally human activity. A good visualization will show you things that you did not expect, or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question, or you need to collect different data. Visualizations can surprise you, because they require a human to interpret them.

Models

Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, every model can make assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

Communication

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.

flowchart LR
 A[(Import)] --> B[Tidy]
 B --> C(Transform)
 C --> D(Visualize) 
 C --> E(Model)
 D --> F{Communicate}
 E --> F
 F --> G[/HTML/]
 F --> H[/PDF/]
 F --> I[/WORD/]

 
 

Figure 1: How R programming work in analysis.

RULES OF THE GAME IN R

Like any other game or language, in order to perform or being grammatical correct you have to apply the correct verb for a particular tense. As well as correct action for a specific environment. The same applies to R programming language, for the better result of your data analysis you have to apply the correct package for a specific task, every package contains functions in which there are verbs that act to perform a particular task.

This will provides an introduction to data science and the R programming language. The goal here is to get your hands dirty right from the start!, We will walk through an entire data analysis, and along the way introduce different types of data analysis question, some fundamental programming concepts in R, and the basics of loading, cleaning, and visualizing data. we will dig into each of these steps in much more detail.

DATA TRANSFORMATION WITH DPLYR.

Introduction

Visualization is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. let’s focus on how to use the dplyr package, another core member of the tidyverse.

There are five key dplyr functions, which allow you to solve the vast majority of your data-manipulation challenges, These six functions, provide solutions for data manipulation.

VERBS

  • (filter()).
  • (arrange()).
  • (select()).
  • (mutate()).
  • (summarize()).
  • (group_by()).

Filter

Filter is one of the verb in tidyverse which usually deals with rows(records) only in your data frame,the main thing it does is to pick observations by their volume depends on the result you intend to achieve. (==,>,<).

Select

Select is one of the verb in tidyverse which usually deals with collumn (variables) only in your table data frame,what it does is to select the variables by their names depends on the result you intend to achieve.(=).

Arrange

Arrange is one of the verb in tidyverse which usually Reorder the rows(Records) of your data frame to allow you get result you intend.

Summarize

Summarize is one of the verb in tidyverse which Collapse many values down to a single summary

Group by

Group by is one of the verb in tidyverse used in changes of the scope of each function from operating on the entire data frame to operating on it group-by-group.

Mutate

Mutate is one of the verb in tidyverse Create new variables with functions of existing variables.

NOTE: All data frame contain Row and Column, therefore rows have to contains Records in it, while column have to contain variables in it (names). to successfully implement the above said verbs, you will have to ensure the following..

  • firstly you will have to be very clear on What kind of result you need, this depends on the questions.

  • Secondly is to understand data provided before importing it to R, for analysis. (To peruse them and see if they are tidy data or messy data).

  • Thirdly if they are tidy, and you have already understand a question, then you need to know which kind of verb you can use to get the result. but if data are messy, then you will need to tidy them before importation or if already imported then require tidying before further steps.

  • “Happy families are all alike; every unhappy family is unhappy in its own way.” ––- Leo Tolstoy

  • “Tidy data sets are all alike, but every messy data set is messy in its own way.” ––- Hadley Wickham

Shortly we have several types of data type….

  1. ` Character — these type of data are strings/names.
  2. ` Double/Numeric — these type are measurable data/decimals.
  3. ` Integer — these are numbered data/countable (0 to 9).
  4. ` Factor — these data are categorical/level.
  5. ` Date — Basically deal with dmy.

Abbreviation

  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).

R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal). When you’re new in R, the easiest mistake to make is to use = instead of == when testing for equality.

R AND TIDYVERSE.

require(tidyverse)
Loading required package: tidyverse
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.1     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(lubridate)
require(dplyr)
require(lubridate)
require(readxl)
Loading required package: readxl
require(janitor)
Loading required package: janitor

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
require(magrittr)
Loading required package: magrittr

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
require(readr)

lets try a to create a vector of 5 kids, include their names, age, food habit, and birth date. by doing this activity, we shall cover all five types of data, which are…

  1. ` character
  2. ` numeric
  3. ` integer
  4. ` factor
  5. ` date
  6. ` logical

character vector (NAMES).

names = c("grace", "daniel", "collins", "ethan", "benjunior")
names
[1] "grace"     "daniel"    "collins"   "ethan"     "benjunior"

lets test if our vector is character

is.character(names); is.vector(names); is.factor(names);class(names)
[1] TRUE
[1] TRUE
[1] FALSE
[1] "character"
names[2:5]
[1] "daniel"    "collins"   "ethan"     "benjunior"

integer vector (FOOD)

food = c(1,2,3,4,5)
food
[1] 1 2 3 4 5

lets test if our food is integer vector

is.integer(food); is.vector(food); is.character(food); is.numeric(food); is.factor(food)
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE

now as we may see our results, show that our integer vector is FALSE, and Numeric is TRUE, then we have to change from numeric to integer. You may wonder why our integer resulted to numeric while we had created an integer vector?. well the reason is that in R, most of its understanding (default) data is Numeric, therefor is so obvious to find same result over and over.

food = as.integer(food)
food
[1] 1 2 3 4 5
is.integer(food)
[1] TRUE

Numeric vector (AGE)

age = c(11.5, 5.2, 5.3, 3.2, 1.2)
age
[1] 11.5  5.2  5.3  3.2  1.2

let’s test our numeric vector

is.numeric(age); is.integer(age); is.character(age); class(age)
[1] TRUE
[1] FALSE
[1] FALSE
[1] "numeric"

factor vector (GENDER)

gender = c("female", "male", "male", "male", "male")
gender
[1] "female" "male"   "male"   "male"   "male"  

let’s test is our vector factor?

class(gender); is.integer(gender); is.numeric(gender); is.factor("gender")
[1] "character"
[1] FALSE
[1] FALSE
[1] FALSE

Again as we can see the above result, our factor has appeared as character while we created a factor class, so let’s convert it to factor. Do not stress your self with that default when happen.

gender = as.factor(gender)
gender
[1] female male   male   male   male  
Levels: female male
class(gender); is.integer(gender); is.factor(gender); is.character(gender)
[1] "factor"
[1] FALSE
[1] TRUE
[1] FALSE
levels(gender)
[1] "female" "male"  

date vector (BIRTH DATE)

birth = c(dmy(01022011), dmy(01062017), dmy(01052017), dmy(01062019), dmy(01062021))
birth
[1] "2011-02-01" "2017-06-01" "2017-05-01" "2019-06-01" "2021-06-01"

As we can see our date vector shows date, month and year, now let’s test it.

class(birth); is.integer(birth);is.numeric(birth);is.Date(birth);is.vector(birth)
[1] "Date"
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE

Sequence

seq(0:100)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101
number = seq(from = 0, to = 100, by = 2)
number
 [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
[20]  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74
[39]  76  78  80  82  84  86  88  90  92  94  96  98 100
length(number)
[1] 51

Index

number[51]
[1] 100

Repeat an element in a vector

rep(names,each = 3)
 [1] "grace"     "grace"     "grace"     "daniel"    "daniel"    "daniel"   
 [7] "collins"   "collins"   "collins"   "ethan"     "ethan"     "ethan"    
[13] "benjunior" "benjunior" "benjunior"
rep(names,times = 3)
 [1] "grace"     "daniel"    "collins"   "ethan"     "benjunior" "grace"    
 [7] "daniel"    "collins"   "ethan"     "benjunior" "grace"     "daniel"   
[13] "collins"   "ethan"     "benjunior"

Adding an index in a vector

names [6] = "amani"
names
[1] "grace"     "daniel"    "collins"   "ethan"     "benjunior" "amani"    

removing an index in vector

names = names[1:5]
names
[1] "grace"     "daniel"    "collins"   "ethan"     "benjunior"
length(names); length(food); length(age); length(birth)
[1] 5
[1] 5
[1] 5
[1] 5

DATA FRAME.

Now that we have created our vector. Let us see how we turn our vector to data frame. We shall use the function data.frame(), then we put our vectors within the function.

data.frame(names, birth, age, food, gender )
      names      birth  age food gender
1     grace 2011-02-01 11.5    1 female
2    daniel 2017-06-01  5.2    2   male
3   collins 2017-05-01  5.3    3   male
4     ethan 2019-06-01  3.2    4   male
5 benjunior 2021-06-01  1.2    5   male

Equate the data frame

watoto = data.frame(names, birth, age, food, gender)
watoto
      names      birth  age food gender
1     grace 2011-02-01 11.5    1 female
2    daniel 2017-06-01  5.2    2   male
3   collins 2017-05-01  5.3    3   male
4     ethan 2019-06-01  3.2    4   male
5 benjunior 2021-06-01  1.2    5   male

APPLICATION OF R VERBS.

VERBS

  • (filter()).
  • (arrange()).
  • (select()).
  • (mutate()).
  • (summarize()).
  • (group_by()).

fillter

watoto %>% filter(gender == "male")
      names      birth age food gender
1    daniel 2017-06-01 5.2    2   male
2   collins 2017-05-01 5.3    3   male
3     ethan 2019-06-01 3.2    4   male
4 benjunior 2021-06-01 1.2    5   male
watoto %>% filter(birth == dmy(01022011) | birth == dmy(01062019))
  names      birth  age food gender
1 grace 2011-02-01 11.5    1 female
2 ethan 2019-06-01  3.2    4   male
watoto %>% filter(birth == dmy(01022011) & gender == "female")
  names      birth  age food gender
1 grace 2011-02-01 11.5    1 female

Select

watoto %>% select(names,gender)
      names gender
1     grace female
2    daniel   male
3   collins   male
4     ethan   male
5 benjunior   male
watoto %>% select(1:4)
      names      birth  age food
1     grace 2011-02-01 11.5    1
2    daniel 2017-06-01  5.2    2
3   collins 2017-05-01  5.3    3
4     ethan 2019-06-01  3.2    4
5 benjunior 2021-06-01  1.2    5

arrange

watoto %>% arrange(watoto)
      names      birth  age food gender
1 benjunior 2021-06-01  1.2    5   male
2   collins 2017-05-01  5.3    3   male
3    daniel 2017-06-01  5.2    2   male
4     ethan 2019-06-01  3.2    4   male
5     grace 2011-02-01 11.5    1 female
watoto %>% arrange(desc(food))
      names      birth  age food gender
1 benjunior 2021-06-01  1.2    5   male
2     ethan 2019-06-01  3.2    4   male
3   collins 2017-05-01  5.3    3   male
4    daniel 2017-06-01  5.2    2   male
5     grace 2011-02-01 11.5    1 female

Mutate

watoto %>% mutate(sport = c("footbal", "footbal","voleybal", "basketbal", "netbal"))
      names      birth  age food gender     sport
1     grace 2011-02-01 11.5    1 female   footbal
2    daniel 2017-06-01  5.2    2   male   footbal
3   collins 2017-05-01  5.3    3   male  voleybal
4     ethan 2019-06-01  3.2    4   male basketbal
5 benjunior 2021-06-01  1.2    5   male    netbal

DATA IMPOTATION.

As we have seen in the beginning of this book,in order to do activities in R,you need data, without data R is like meaningless application in your computer.So since data work as a fuel in R, step one you need to know how to import data into it. Data importation in R reefers to the process of calling your data from the directory (file) in your computer into the R studio where you intend to do your analysis.

Step 1.

You have to know what kind of data you have before importing it in R. Data are usually in different extension, so you have to ensure you put your data in the right extension in your local directory. Eg. xls, csv etc…….

Step 2.

On your computer open R studio, programming is usually done on console. But in the bellow demo i will work on the editor.

Procedures.

My data names is (Tamisemi Primary Enrollment by Age and Sex,2021.xls”). So open a chunk then within a chunk call a required package then run the chunk. Then Open another chunk then within the chunk call your data from your local/external directory. Example, If your data are in Excel or csv extension, so i will use the package (tidyverse, read_excel and read_csv).

how do we call our data from our directory?

read_csv("e:/MY_STAFFS/Minning__Folder/Blogs/teneson/posts/R-introduction/BUDGET DAUDI HEALTH CENTRE 2022.csv")
New names:
Rows: 18 Columns: 11
-- Column specification
-------------------------------------------------------- Delimiter: "," chr
(6): TARGET, ACTIVITY, GSF CODE, Unit of measure, TOTAL, Source of Fund dbl
(2): Quantity, frequencer num (2): unit price, ...11 lgl (1): ...10
i Use `spec()` to retrieve the full column specification for this data. i
Specify the column types or set `show_col_types = FALSE` to quiet this message.
* `` -> `...10`
* `` -> `...11`
# A tibble: 18 x 11
   TARGET ACTIVITY `GSF CODE` `Unit of measure` Quantity frequencer `unit price`
   <chr>  <chr>    <chr>      <chr>                <dbl>      <dbl>        <dbl>
 1 Short~ To supp~ Medical e~ Kits                     6          5       12679.
 2 <NA>   <NA>     Drugs and~ Kits                     6          5       42262.
 3 <NA>   <NA>     Hospital ~ Kits                     6          5        8452.
 4 <NA>   <NA>     PPMof med~ Kits                     6          5        4226.
 5 <NA>   <NA>     Dental eq~ Kits                     6          5        8452.
 6 <NA>   <NA>     Laborator~ Kits                     6          5        8452.
 7 <NA>   <NA>     <NA>       <NA>                    NA         NA          NA 
 8 <NA>   To supp~ wire       metre                    1          1     1309250 
 9 Organ~ To prov~ extra duty Person                   0          1      165000 
10 <NA>   <NA>     Leave tra~ Person                   3          1      120000 
11 <NA>   <NA>     Medical r~ Person                   2          4      100000 
12 <NA>   <NA>     Uniform a~ Person                   2          1      120000 
13 <NA>   <NA>     Burial Ex~ Person                   2          1     1000000 
14 <NA>   <NA>     Stationary Each                     0          4       16125 
15 <NA>   <NA>     <NA>       <NA>                    NA         NA          NA 
16 <NA>   <NA>     <NA>       <NA>                    NA         NA          NA 
17 <NA>   <NA>     <NA>       <NA>                    NA         NA          NA 
18 <NA>   <NA>     <NA>       <NA>                    NA         NA          NA 
# i 4 more variables: TOTAL <chr>, `Source of Fund` <chr>, ...10 <lgl>,
#   ...11 <dbl>

the above data frame are showing the importation of data from your directory into R, using csv extension.

read_xlsx("e:/MY_STAFFS/Minning__Folder/Blogs/teneson/posts/R-introduction/BUDGET DAUDI HEALTH CENTRE 2022.xlsx")
New names:
* `` -> `...10`
# A tibble: 32 x 10
   TARGET ACTIVITY `GSF CODE` `Unit of measure` Quantity frequencer `unit price`
   <chr>  <chr>    <chr>      <chr>                <dbl>      <dbl>        <dbl>
 1 Short~ To supp~ Medical e~ Kits                     6          5        52500
 2 <NA>   <NA>     Drugs and~ Kits                     6          5       175000
 3 <NA>   <NA>     Hospital ~ Kits                     6          5        35000
 4 <NA>   <NA>     PPMof med~ Kits                     6          5        50000
 5 <NA>   <NA>     Dental eq~ Kits                     6          5        35000
 6 <NA>   <NA>     Laborator~ Kits                     6          5        35000
 7 <NA>   <NA>     <NA>       <NA>                    NA         NA           NA
 8 Organ~ To prov~ Water cha~ bills                   64         12         2500
 9 <NA>   <NA>     Electrici~ bills                 1200          6          400
10 <NA>   <NA>     Per diem   person                   5          5        80000
# i 22 more rows
# i 3 more variables: TOTAL <dbl>, `Source of Fund` <chr>, ...10 <dbl>

the above tables are showing the importation of data from your directory into R, using Xlsx extension.

To this stage it’s so clear that we have gain an insight on how to get started with R language and data importation ready to manipulate them and find results. So in next post we shall over see how we manipulate with data and lead us to final results. The next post is the continuation of our lesson, where we will cover both data manipulation and statistics.

Related post.


Stay in touch

If you enjoyed this post, then don't miss out on any future posts by subscribing to my email newsletter

Support my work by sharing the link or by contact through +255 768 596 017.

Share