In R, you can often accomplish the same task in several ways.
This document explains functions from the dplyr package and, in some places, compares them with their base R equivalents.

 

# load the dplyr library
# we are going to work with R's built-in dataset airquality
library(dplyr)
head(airquality)

 

## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6



Filter function--dplyr vs subset vs [ ]

 

You might have guessed what the filter function does. It filters (subsets, slices) the rows of the data based on one or more conditions. Here I discuss three commonly used ways of subsetting.

 

# filter data with Wind > 7.0 for the month of May
#dplyr way
filter(airquality,Wind > 7.0, Month == 5)


# base way
#first
airquality[airquality$Wind > 7.0 & airquality$Month == 5,]
#second
subset(airquality,Wind > 7.0 & Month == 5)

I don't recommend using the subset function unless you completely understand how it works. For further reading, see this Stack Overflow thread: http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset
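One practical difference between the three approaches shows up when the condition involves missing values. The sketch below is my own illustration (the variable names are made up for the example): it filters on Ozone, which contains NAs, and counts the rows each approach returns. filter and subset drop rows where the condition is NA, while [ ] keeps them as all-NA rows.

```r
library(dplyr)

# Ozone contains missing values, so Ozone > 30 is NA for some rows
n_filter  <- nrow(filter(airquality, Ozone > 30))
n_subset  <- nrow(subset(airquality, Ozone > 30))
n_bracket <- nrow(airquality[airquality$Ozone > 30, ])

# filter() and subset() drop rows where the condition is NA;
# [ ] returns one all-NA row for every NA in the condition
n_missing <- sum(is.na(airquality$Ozone))
c(filter = n_filter, subset = n_subset, bracket = n_bracket, missing = n_missing)

# a base version that matches filter(): wrap the condition in which()
n_which <- nrow(airquality[which(airquality$Ozone > 30), ])
```

So if you use [ ] on a column with missing values, wrapping the condition in which() is one way to get the same rows as filter.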

Mutate function

 

The mutate function creates new variables without affecting the existing ones. In other words, it adds the new variable and keeps the old variables (you will see below why this is worth noting). It can also be used to transform existing variables.


The transmute function (you read that right: it's transmute, not transmutate) does the same thing, but it drops all variables except the newly created one. So if you want only the new variable, use transmute; otherwise use mutate.

# mutate creates the new variable and keeps the existing variables
mutate(airquality,TempInC = ((Temp - 32) * 5 / 9))


# transmute creates the new variable and drops the existing variables (I said variables)
# it keeps only the new variable and drops all others
transmute(airquality,TempInC = ((Temp - 32) * 5 / 9))

# base way
# somehow I find this easier to use than mutate
# note: outside dplyr, Temp must be referenced as airquality$Temp
airquality$TempInC <- (airquality$Temp - 32) * 5 / 9
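To make the difference between mutate and transmute concrete, here is a small sketch that compares the column names of the two results. The object names with_new and only_new are my own, chosen just for this example.

```r
library(dplyr)

# mutate keeps every original column and appends the new one
with_new <- mutate(airquality, TempInC = (Temp - 32) * 5 / 9)

# transmute returns only the column(s) created in the call
only_new <- transmute(airquality, TempInC = (Temp - 32) * 5 / 9)

names(with_new)
names(only_new)
```

Neither call changes airquality itself; you have to assign the result if you want to keep it.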

Arrange function

 

The arrange function sorts the rows of a table by one or more variables.

 

# dplyr--arrange
arrange(airquality,Month,desc(Temp))


# base--order()
airquality[order(airquality$Month,-airquality$Temp),]
# Note: the minus sign (-) gives descending order
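If you want to convince yourself that the two approaches sort the rows the same way, a quick check along these lines should do. This is only a sketch; by_arrange, by_order and same_order are names I made up for it.

```r
library(dplyr)

# sort by Month ascending, then Temp descending, in both styles
by_arrange <- arrange(airquality, Month, desc(Temp))
by_order   <- airquality[order(airquality$Month, -airquality$Temp), ]

# compare the row sequence column by column (row names can differ)
same_order <- identical(by_arrange$Day, by_order$Day) &&
  identical(by_arrange$Temp, by_order$Temp)
same_order
```

Note that the minus trick in order() only works for numeric columns, while desc() also handles character and factor columns.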

group_by and summarise functions

 

First let me explain the summarise function, then we will move on to group_by.


The summarise function takes a vector as input and outputs a single value. You can ask for the min, max, mean, sd, var, median, etc. of a vector, and summarise outputs the result. Of course base R gives you all these summary statistics too, but there is a catch: summarise respects the groups created by group_by, while the base functions ignore them. I will explain with examples.

# Both base and summarise give the same value for an ungrouped df/tbl
mean(airquality$Temp)

 

## [1] 77.88235

 

summarise(airquality, mean(Temp))

 

## mean(Temp)
## 1 77.88235

 

There is a subtle difference between the two outputs: the first returns a double, the latter returns a data frame (which is a list). But that doesn't concern us here; the key difference appears when they are used with the group_by function.


If you know SQL, you may be misled by the group_by function. Here group_by doesn't return output for each group as you might expect; instead it creates a new grouped table. This table can then be used for many further operations on the grouped variable.

grouped_table<-group_by(airquality,Month)
head(grouped_table)

 

## Source: local data frame [6 x 6]
## Groups: Month [1]
##
## Ozone Solar.R Wind Temp Month Day
##
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6

 

#Dimensions for both original dataset and grouped dataset
dim(airquality)

 

## [1] 153 6

 

dim(grouped_table)

 

## [1] 153 6

 

Both have the same dimensions, and the first rows of grouped_table look the same as the original dataset. But grouped_table is grouped by the Month variable. The 'Groups' line in the output shows the variable(s) used in group_by.
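Besides reading the printed header, you can also inspect the grouping from code. The sketch below uses dplyr's group_vars, n_groups and ungroup helpers; the object name back_to_plain is mine.

```r
library(dplyr)

grouped_table <- group_by(airquality, Month)

# which variable(s) define the groups, and how many groups there are
group_vars(grouped_table)
n_groups(grouped_table)

# ungroup() removes the grouping again; the data itself is untouched
back_to_plain <- ungroup(grouped_table)
group_vars(back_to_plain)
```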


In general you can group by more than one variable and ask summarise for the output. Now we ask for the average (mean) using the summarise and mean functions and compare the results.

mean(grouped_table$Temp)

 

## [1] 77.88235

 

summarise(grouped_table,mean(Temp))

 

## # A tibble: 5 x 2
## Month mean(Temp)
##
## 1 5 65.54839
## 2 6 79.10000
## 3 7 83.90323
## 4 8 83.96774
## 5 9 76.90000
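For comparison, base R can produce per-group means too, just not through the grouped table. The sketch below uses tapply as one possible base equivalent and gives the summary column an explicit name; by_month and base_means are names I picked for illustration.

```r
library(dplyr)

# dplyr: group, then summarise with a named output column
by_month <- summarise(group_by(airquality, Month), mean_temp = mean(Temp))

# base: tapply splits Temp by Month and applies mean to each piece
base_means <- tapply(airquality$Temp, airquality$Month, mean)

by_month
base_means
```

The values agree; the difference is that summarise returns a table you can keep piping into other dplyr functions, while tapply returns a named array.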



dplyr::distinct vs base::unique

 

The names tell you what unique and distinct do. Both give you the unique/distinct values; distinct works on tables and returns a table, while unique also works on vectors and lists (of course).

 

distinct(airquality,Month)

 

## Month
## 1 5
## 2 6
## 3 7
## 4 8
## 5 9

 

unique(airquality$Month)

 

## [1] 5 6 7 8 9
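distinct has one more trick worth knowing: with .keep_all = TRUE it keeps the other columns of the first row found for each distinct value. A short sketch follows; the object names are made up for the example.

```r
library(dplyr)

# one row per Month, keeping all columns of the first row seen
first_per_month <- distinct(airquality, Month, .keep_all = TRUE)

# unique() on a one-column data frame also returns a data frame
unique_df <- unique(airquality["Month"])

dim(first_per_month)
dim(unique_df)
```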



dplyr::sample_n/sample_frac vs base::sample

 

dplyr's sample_n and sample_frac are wrappers around the base sample.int function that make it easy to pick random rows from a table.

 

sample_n(airquality,size=2)

 

## Ozone Solar.R Wind Temp Month Day
## 124 96 167 6.9 91 9 1
## 65 NA 101 10.9 84 7 4

 

sample_frac(airquality,size=0.01)

 

## Ozone Solar.R Wind Temp Month Day
## 57 NA 127 8.0 78 6 26
## 135 21 259 15.5 76 9 12
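Because the rows are drawn at random, your output will differ from the one shown above. If you need repeatable samples, setting a seed first should work, as in this sketch (the seed value and object names are arbitrary choices of mine). Recent dplyr versions also mark sample_n and sample_frac as superseded by slice_sample, so check the documentation of your installed version.

```r
library(dplyr)

# same seed, same sample
set.seed(123)
first_draw <- sample_n(airquality, size = 2)

set.seed(123)
second_draw <- sample_n(airquality, size = 2)

identical(first_draw, second_draw)
```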

 

Piping

 

The symbol %>% is the pipe operator. It connects function calls so they run together without saving intermediate results.


Simply put, this operator passes its left-hand side as the first argument to the function on its right-hand side. You can also use the dot (.) placeholder if you want to pass the left-hand side to a different argument position. Let me show you:

airquality %>% group_by(Month) %>% summarise(mean_wind=mean(Wind)) %>% arrange(desc(mean_wind))

 

## # A tibble: 5 x 2
## Month mean_wind
##
## 1 5 11.622581
## 2 6 10.266667
## 3 9 10.180000
## 4 7 8.941935
## 5 8 8.793548

 

The airquality data is used as the first argument to group_by. Then the intermediate grouped table is passed as the first argument to summarise. Finally, the summarised table is passed to arrange, which produces the output.
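To see that the pipe is only a different way of writing nested calls, the sketch below builds the same result both ways and also shows the dot placeholder putting the left-hand side into a non-first argument. The object names piped, nested and first_three are mine.

```r
library(dplyr)

# piped version: read left to right
piped <- airquality %>%
  group_by(Month) %>%
  summarise(mean_wind = mean(Wind)) %>%
  arrange(desc(mean_wind))

# nested version: same calls, read inside out
nested <- arrange(
  summarise(group_by(airquality, Month), mean_wind = mean(Wind)),
  desc(mean_wind)
)

same_result <- identical(piped$Month, nested$Month) &&
  identical(piped$mean_wind, nested$mean_wind)

# dot placeholder: 3 is passed as n, not as the first argument
first_three <- 3 %>% head(airquality, n = .)
```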


This is commonly used when experimenting with data. It also reduces the number of temporary variables you need to create during analysis.

Other functions to look for

na_if - converts a specified (suspicious) value to NA
coalesce - picks the first non-missing value at each position when you give it two or more vectors of the same length. Inspired by SQL COALESCE
tbl - creates a table from a data source
recode - replaces values in both numeric and character vectors. Numeric values are matched by position and character values by name.
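A few of these helper functions are easy to try on small vectors. The sketch below is only an illustration with made-up data: the value -99 stands in for a "suspicious" code, and the vector names are my own.

```r
library(dplyr)

# na_if: turn a sentinel value into NA
readings <- c(1, -99, 3)
cleaned <- na_if(readings, -99)          # 1 NA 3

# coalesce: take the first non-missing value at each position
filled <- coalesce(cleaned, c(9, 2, 9))  # 1 2 3

# recode: replace character values by name
codes <- c("a", "b", "a")
labels <- recode(codes, a = "apple", b = "banana")  # "apple" "banana" "apple"
```

recode still works in current dplyr, but newer versions document it as superseded, so it is worth checking the help page of the version you have installed.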

This completes our introduction to dplyr. It should help you start working with data. Have fun!
