R Package dplyr Comparison of functions

on 12 Jul 2016 14:35 PM
  • Rang Technologies
  • Data Science

In R, you can often accomplish the same task in several different ways.
This document explains functions from the R package dplyr and, in some places, compares them with their base R equivalents.


# load the dplyr library
# we are going to work with R's built-in dataset airquality

library(dplyr)
head(airquality)


## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6



Filter function--dplyr vs subset vs [ ]


As you might have guessed, the filter function filters (subsets, slices) the rows of the data based on one or more conditions. Here I discuss three commonly used ways of subsetting.


# filter data with Wind > 7.0 for the month of May
#dplyr way

filter(airquality,Wind > 7.0, Month == 5)


# base way
# first: logical indexing with [ ]

airquality[airquality$Wind > 7.0 & airquality$Month == 5,]

# second: subset()
subset(airquality,Wind > 7.0 & Month == 5)

I don't recommend using subset unless you fully understand how it works: it evaluates the condition with non-standard evaluation, which is convenient for interactive use but can give surprising results inside your own functions.


For further reading, see the Stack Overflow thread http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset
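
As a quick check, the sketch below compares the three approaches, first on the condition used above and then on a column that contains missing values. The names aq, by_filter, by_bracket, by_subset, na_filter, na_subset and na_bracket are introduced here for illustration only.

```r
library(dplyr)

# work on a fresh copy of the built-in dataset (aq is an illustrative name)
aq <- datasets::airquality

by_filter  <- filter(aq, Wind > 7.0, Month == 5)
by_bracket <- aq[aq$Wind > 7.0 & aq$Month == 5, ]
by_subset  <- subset(aq, Wind > 7.0 & Month == 5)

# all three select the same observations
identical(by_filter$Day, by_bracket$Day)
identical(by_bracket$Day, by_subset$Day)

# Ozone contains NA: filter() and subset() drop rows where the condition
# is NA, while [ ] returns an all-NA row for each of them
na_filter  <- filter(aq, Ozone > 30)
na_subset  <- subset(aq, Ozone > 30)
na_bracket <- aq[aq$Ozone > 30, ]

nrow(na_filter) == nrow(na_subset)
nrow(na_bracket) - nrow(na_filter) == sum(is.na(aq$Ozone))
```

So the three forms agree when the condition has no missing values, but [ ] needs an explicit !is.na() check when it does.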

Mutate function


The mutate function creates new variables without affecting the existing ones: it adds the new variable and keeps all the old variables (you will see below why this is worth noting). It can also be used to transform existing variables.


The transmute function (you read that right, it's transmute, not transmutate) does the same, but it drops all variables except the newly created ones. So if you want to keep only the new variable, use transmute; otherwise use mutate.

# mutate transforms the variable and keeps the existing variables
mutate(airquality,TempInC = ((Temp - 32) * 5 / 9))


# transmute transforms a variable and drops the existing variables (I said variables)
# it keeps only the new variable and drops all the others

transmute(airquality,TempInC = ((Temp - 32) * 5 / 9))

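# base way
# somehow I find this easier to use than mutate
# note: unlike mutate, this adds the column to airquality itself
# (remove it again with airquality$TempInC <- NULL)

airquality$TempInC <- (airquality$Temp - 32) * 5 / 9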
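
To make the difference between the three approaches concrete, here is a small sketch that compares the column names each one produces. It works on a copy so the built-in dataset is left alone; aq, with_c, only_c and base_c are illustrative names, not part of dplyr.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

with_c <- mutate(aq, TempInC = (Temp - 32) * 5 / 9)
only_c <- transmute(aq, TempInC = (Temp - 32) * 5 / 9)

names(with_c)   # the six original columns plus TempInC
names(only_c)   # TempInC only
names(aq)       # unchanged: mutate() returned a new table

# base R equivalent on a copy, so the original stays untouched
base_c <- aq
base_c$TempInC <- (base_c$Temp - 32) * 5 / 9
identical(with_c$TempInC, base_c$TempInC)
```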

Arrange function


The arrange function sorts the rows of a table by one or more variables.


# dplyr--arrange
arrange(airquality,Month,desc(Temp))


# base--order()
# note the minus sign (-) for descending order
airquality[order(airquality$Month,-airquality$Temp),]
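
The following sketch checks that the dplyr and base versions produce the same ordering, and shows one way to sort in descending order without the minus sign, which only works for numeric columns. The variable names are illustrative.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

by_arrange <- arrange(aq, Month, desc(Temp))
by_order   <- aq[order(aq$Month, -aq$Temp), ]

# order() also accepts one decreasing flag per key when method = "radix",
# which works for character columns too
by_radix <- aq[order(aq$Month, aq$Temp,
                     decreasing = c(FALSE, TRUE), method = "radix"), ]

# the sort keys come out in the same order in all three results
identical(by_arrange$Temp, by_order$Temp)
identical(by_order$Temp, by_radix$Temp)
```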

group_by and summarise functions


First let me explain the summarise function, then we move on to group_by.


The summarise function takes a vector as input and outputs a single value. You can ask for the min, max, mean, sd, var, median, etc. of a vector and summarise returns the result. Of course, base R gives you all these summary statistics too, but there is a catch: summarise respects the groups created by group_by, whereas base functions applied directly to a column ignore them. Let me explain with examples.

# Both base and summarise give the same value for an ungrouped df/tbl
mean(airquality$Temp)


## [1] 77.88235


summarise(airquality, mean(Temp))


## mean(Temp)
## 1 77.88235


There is a subtle difference between the two outputs: the first returns a double (a numeric vector), while the latter returns a data frame (which is a list). But that doesn't concern us; the key difference appears when they are used with the group_by function.
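
For readers who want to see that type difference, a minimal sketch follows. The column name mean_temp and the variable names are chosen here for illustration.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

base_mean  <- mean(aq$Temp)
dplyr_mean <- summarise(aq, mean_temp = mean(Temp))

is.double(base_mean)        # a bare numeric value
is.data.frame(dplyr_mean)   # a one-row, one-column table
is.list(dplyr_mean)         # a data frame is a list of columns

# pull() extracts the column, giving back the bare value
identical(pull(dplyr_mean, mean_temp), base_mean)
```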


If you know SQL, you may be misled by the group_by function. Here group_by doesn't return one output row per group, as you might expect; instead it creates a new grouped table. This table can then be used to perform many operations per group.

grouped_table<-group_by(airquality,Month)
head(grouped_table)


## Source: local data frame [6 x 6]
## Groups: Month [1]
##
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6


#Dimensions for both original dataset and grouped dataset
dim(airquality)


## [1] 153 6


dim(grouped_table)


## [1] 153 6


Both have the same dimensions, and the first rows of grouped_table look the same as the original dataset. But grouped_table is grouped by the Month variable: the 'Groups' line shows the variable(s) passed to group_by.


In general, you can group by more than one variable and then ask summarise for output. Now we ask for the average (mean) using both mean and summarise, and compare the results.

mean(grouped_table$Temp)


## [1] 77.88235


summarise(grouped_table,mean(Temp))


## # A tibble: 5 x 2
## Month mean(Temp)
## <int> <dbl>
## 1 5 65.54839
## 2 6 79.10000
## 3 7 83.90323
## 4 8 83.96774
## 5 9 76.90000
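
As the output shows, mean ignores the grouping and returns the overall average, while summarise returns one row per month. Base R does have its own grouped-summary tools, although they do not work on group_by tables. The sketch below compares summarise with tapply and aggregate, and then groups by two variables at once. The names mean_temp, windy, mean_ozone and n_days are illustrative choices.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

by_month <- summarise(group_by(aq, Month), mean_temp = mean(Temp))

# base R: tapply() returns a named vector, aggregate() a data frame
base_tapply    <- tapply(aq$Temp, aq$Month, mean)
base_aggregate <- aggregate(Temp ~ Month, data = aq, FUN = mean)

all.equal(by_month$mean_temp, as.numeric(base_tapply))
all.equal(by_month$mean_temp, base_aggregate$Temp)

# grouping by more than one variable and asking for several summaries;
# Ozone has missing values, so na.rm = TRUE is needed there
two_keys <- summarise(group_by(aq, Month, windy = Wind > 10),
                      mean_ozone = mean(Ozone, na.rm = TRUE),
                      n_days = n())
two_keys
```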



dplyr::distinct vs base::unique


The names say it all: both unique and distinct give you the unique/distinct values. distinct works on data frames and tables, while unique also works on plain vectors and lists.


distinct(airquality,Month)


## Month
## 1 5
## 2 6
## 3 7
## 4 8
## 5 9


unique(airquality$Month)


## [1] 5 6 7 8 9
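
A short sketch of how the two relate, assuming a dplyr version that supports the .keep_all argument of distinct. The variable names are illustrative.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

# unique() on a vector returns a vector; distinct() returns a table
identical(distinct(aq, Month)$Month, unique(aq$Month))

# unique() also accepts a data frame and then returns its distinct rows
nrow(unique(aq["Month"])) == nrow(distinct(aq, Month))

# .keep_all = TRUE keeps the other columns, taken from the first row
# found for each distinct value
first_rows <- distinct(aq, Month, .keep_all = TRUE)
first_rows
```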



dplyr::sample_n/sample_frac vs base::sample


dplyr's sample_n and sample_frac are wrappers around the base sample.int function.


sample_n(airquality,size=2)


## Ozone Solar.R Wind Temp Month Day
## 124 96 167 6.9 91 9 1
## 65 NA 101 10.9 84 7 4


sample_frac(airquality,size=0.01)


## Ozone Solar.R Wind Temp Month Day
## 57 NA 127 8.0 78 6 26
## 135 21 259 15.5 76 9 12
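
Because sampling is random, the rows you get will differ from the ones printed above. The sketch below shows one way to make a draw repeatable with set.seed, together with a base R equivalent. The seed value and variable names are arbitrary choices for illustration. Note that in dplyr 1.0.0 and later these two functions are superseded by slice_sample, although they still work.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

set.seed(2016)                       # arbitrary seed, for repeatability
first_draw <- sample_n(aq, size = 2)
set.seed(2016)
second_draw <- sample_n(aq, size = 2)
identical(first_draw, second_draw)   # same seed, same rows

# base R: sample row indices, then subset
base_draw <- aq[sample(nrow(aq), 2), ]
nrow(base_draw)
```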


Piping


The symbol %>% is the pipe operator. It connects function calls and runs them as a chain without saving intermediate results.


Simply put, this operator sends the left-hand side as the first argument to the function on the right-hand side. You can also use the . (dot) placeholder if you want to pass the left-hand side to an argument other than the first. Let me show you:

airquality %>% group_by(Month) %>% summarise(mean_wind=mean(Wind)) %>% arrange(desc(mean_wind))


## # A tibble: 5 x 2
## Month mean_wind
## <int> <dbl>
## 1 5 11.622581
## 2 6 10.266667
## 3 9 10.180000
## 4 7 8.941935
## 5 8 8.793548


The airquality data is passed as the first argument to group_by. The intermediate grouped table is then passed as the first argument to summarise. Finally, the summarised table is passed to arrange, which produces the output.
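
To see what the pipe saves you, the sketch below writes the same chain as nested calls and checks that both give the same table. It also shows the dot placeholder with lm, which takes its data as a later argument; lm and the formula are only an example picked for illustration.

```r
library(dplyr)

aq <- datasets::airquality   # fresh copy; aq is an illustrative name

piped <- aq %>%
  group_by(Month) %>%
  summarise(mean_wind = mean(Wind)) %>%
  arrange(desc(mean_wind))

# the same computation written as nested calls, read inside-out
nested <- arrange(summarise(group_by(aq, Month), mean_wind = mean(Wind)),
                  desc(mean_wind))
all.equal(piped, nested)

# the dot placeholder: send the left side to the data argument of lm()
# instead of the first argument
fit <- aq %>% lm(Ozone ~ Wind, data = .)
class(fit)
```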


This is commonly used when experimenting with data. It also reduces the number of temporary variables you need to create during analysis.

Other functions to look for

  • na_if - converts a specified (suspicious) value to NA
  • coalesce - picks the first non-missing value at each position when you pass more than one vector of the same length. Inspired by SQL's COALESCE
  • tbl - creates a table from data
  • recode - replaces values in both numeric and character vectors: numeric values are matched by position and character values by name
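
A brief sketch of na_if, coalesce and recode on small made-up vectors follows. The values (including -99 as a stand-in for a "suspicious" code) and the variable names are invented for illustration.

```r
library(dplyr)

# na_if(): turn a sentinel value (-99 here) into NA
cleaned <- na_if(c(1, -99, 3), -99)
cleaned

# coalesce(): first non-missing value at each position
filled <- coalesce(c(1, NA, 3), c(0, 2, 0))
filled

# recode(): character values are matched by name ...
by_name <- recode(c("a", "b", "a"), a = "apple", b = "banana")
by_name

# ... and numeric values by position
by_position <- recode(1:3, "low", "mid", "high")
by_position
```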

This completes our introduction to dplyr. It should help you start working with data. Have fun!
