Battle of the Programming Languages: R vs Python

on 10 Oct 2016 20:22 PM
  • Rang Technologies
  • Data Science

We are going to compare functions that exist in both R and Python for the same operations. For this we use the Titanic dataset, which contains passenger details.
Importing a CSV
Reading data is similar in both languages. The only difference is that in Python we have to import the pandas library before reading the data. Once the data is imported, we can inspect it with the functions shown in the following sections.

R:
titanic <- read.csv("train.csv")

Python:
import pandas as pd
titanic = pd.read_csv("train.csv")
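To try the Python call without having train.csv on disk, a minimal sketch can read CSV text from memory instead. The three rows are a hand-made sample that reuses the PassengerId, Survived and Pclass values shown in this post; the name titanic_sample is an assumption, and the real file has more columns.

```python
import io

import pandas as pd

# Hand-made stand-in for train.csv (assumption: only three columns kept).
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# pd.read_csv accepts a file path or any file-like object, so StringIO works.
titanic_sample = pd.read_csv(io.StringIO(csv_text))

print(titanic_sample.columns.tolist())
print(len(titanic_sample))
```

With the real file, only the argument changes: pass the path to train.csv instead of the StringIO object.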

Dimension and Shape
To check the dimensions of the imported data, use the functions below.

R:
dim(titanic)
[1] 891 12

Python:
titanic.shape
(891, 12)

Both return the number of rows (one per passenger record, 891 here) and the number of columns (12) in the data.
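Because shape is a plain tuple, it can be unpacked directly. The sketch below assumes a small toy frame built from the first rows shown in this post rather than the full data set.

```python
import pandas as pd

# Toy frame (values copied from the first rows shown in this post).
df = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]}
)

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() gives the same two numbers separately.
assert n_rows == len(df)
assert n_cols == len(df.columns)
```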

Head and Tail
To see the top or bottom rows of the data frame, both R and Python provide head and tail functions. Each takes the number of rows to return; when it is omitted, R returns 6 rows and pandas returns 5. Only the first three columns of the output are shown here.

R:
head(titanic, 2)
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1

tail(titanic, 2)
    PassengerId Survived Pclass
890         890        1      1
891         891        0      3

Python:
titanic.head(2)
   PassengerId  Survived  Pclass
0            1         0       3
1            2         1       1

titanic.tail(2)
     PassengerId  Survived  Pclass
889          890         1       1
890          891         0       3

Here head and tail are applied to the Titanic dataset to look at its first two and last two rows. Note that the index values differ between R and Python: the default pandas index starts at 0, while R numbers rows from 1.
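The index difference can be checked directly. This is a small sketch on a five-row toy frame (values taken from the rows shown in this post); shifting the index by one, as done for one_based, is only an illustration of how to get R-style row labels, not something the original analysis does.

```python
import pandas as pd

# Five toy rows (values copied from the first rows shown in this post).
df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
    }
)

first_two = df.head(2)  # index labels 0 and 1
last_two = df.tail(2)   # index labels 3 and 4

# Optional: relabel rows from 1 to mimic R's row numbering.
one_based = df.copy()
one_based.index = one_based.index + 1

print(first_two.index.tolist(), last_two.index.tolist())
print(one_based.head(2).index.tolist())
```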

Basic Statistics of Data (Summary and Describe)

Only the first two columns of the output are shown.

R:
summary(titanic)
  PassengerId       Survived
 Min.   :  1.0   Min.   :0.0000
 1st Qu.:223.5   1st Qu.:0.0000
 Median :446.0   Median :0.0000
 Mean   :446.0   Mean   :0.3838
 3rd Qu.:668.5   3rd Qu.:1.0000
 Max.   :891.0   Max.   :1.0000

Python:
titanic.describe()
       PassengerId    Survived
count   891.000000  891.000000
mean    446.000000    0.383838
std     257.353842    0.486592
min       1.000000    0.000000
25%     223.500000    0.000000
50%     446.000000    0.000000
75%     668.500000    1.000000
max     891.000000    1.000000

Both functions compute basic statistics column by column. The pandas describe() method reports two statistics that R's summary() does not: count and std. The main difference in usage is that R provides standalone functions, whereas in Python we call methods on the data object, since it follows a more object-oriented style.
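Since describe() returns an ordinary DataFrame, individual statistics can be looked up by row label, and the two extra rows can be dropped to get something closer to R's summary(). The sketch below assumes a five-row toy frame; summary_like is a name made up for this example.

```python
import pandas as pd

# Toy frame (values copied from the first rows shown in this post).
df = pd.DataFrame(
    {"PassengerId": [1, 2, 3, 4, 5], "Survived": [0, 1, 1, 1, 0]}
)

stats = df.describe()

# Rows of the result are labeled by statistic name.
print(stats.index.tolist())
print(stats.loc["mean", "Survived"])

# Dropping count and std leaves the six statistics R's summary() prints.
summary_like = stats.drop(["count", "std"])
print(summary_like.index.tolist())
```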

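Slicing the Data

R ranges are 1-based and include both endpoints, while iloc in pandas is 0-based and excludes the end position. Both calls below therefore select the first five rows and the first three columns.

R:
titanic[1:5,1:3]
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3

Python:
titanic.iloc[0:5,0:3]
   PassengerId  Survived  Pclass
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3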
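A related point that often trips up R users: pandas has a second indexer, loc, which selects by label and includes the end label, unlike the position-based iloc. A minimal sketch on a toy frame (the variable names by_position and by_label are assumptions for this example):

```python
import pandas as pd

# Toy frame (values copied from the first rows shown in this post).
df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
    }
)

# iloc: positions, end excluded -> rows at positions 0 and 1.
by_position = df.iloc[0:2, 0:2]

# loc: labels, end included -> rows labeled 0, 1 and 2.
by_label = df.loc[0:2, ["PassengerId", "Survived"]]

print(by_position.shape, by_label.shape)
```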

Subsetting Data

For subsetting, I took only a few columns from the Titanic dataset, to make the output easier to display, and stored them in sam_data. In Python:
sam_data = titanic[['PassengerId', 'Survived','Sex','Age']]

The subset below keeps the male passengers who survived. The first six matching rows are shown.

R:
head(subset(sam_data, Survived == 1 & Sex == 'male'))
   PassengerId Survived  Sex Age
18          18        1 male  NA
22          22        1 male  34
24          24        1 male  28
37          37        1 male  NA
56          56        1 male  NA
66          66        1 male  NA

Python:
sam_data[(sam_data.Sex == 'male') & (sam_data.Survived == 1)].head(6)
    PassengerId  Survived   Sex    Age
17           18         1  male    NaN
21           22         1  male  34.00
23           24         1  male  28.00
36           37         1  male    NaN
55           56         1  male    NaN
65           66         1  male    NaN

Note that missing values shown as NA in R appear as NaN in pandas.
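The same filter can be written in two ways in pandas: with a boolean mask, or with the query method, which reads closer to R's subset(). The sketch below assumes a five-row toy frame assembled from rows shown in this post. The parentheses around each comparison in the mask form are required, because & binds more tightly than == in Python.

```python
import pandas as pd

nan = float("nan")

# Toy frame (rows copied from the outputs shown in this post).
sam = pd.DataFrame(
    {
        "PassengerId": [18, 22, 24, 852, 97],
        "Survived": [1, 1, 1, 0, 0],
        "Sex": ["male"] * 5,
        "Age": [nan, 34.0, 28.0, 74.0, 71.0],
    }
)

# Boolean mask: each condition needs its own parentheses.
mask_result = sam[(sam["Sex"] == "male") & (sam["Survived"] == 1)]

# query(): the condition is written as a string expression.
query_result = sam.query("Survived == 1 and Sex == 'male'")

# isna() flags the NaN entries (NA in R).
print(mask_result["Age"].isna().sum())
```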

Ordering the Data
We ordered the sample dataset by Survived in ascending order and, within each group, by Age in descending order. In R, arrange and desc come from the dplyr package.

R:
library(dplyr)
arrange(sam_data, Survived, desc(Age))
  PassengerId Survived  Sex  Age
1         852        0 male 74.0
2          97        0 male 71.0
3         494        0 male 71.0
4         117        0 male 70.5
5         673        0 male 70.0
6         746        0 male 70.0

Python:
sam_data.sort_values(by=['Survived', 'Age'], ascending=[True, False])
     PassengerId  Survived   Sex   Age
851          852         0  male  74.0
493          494         0  male  71.0
96            97         0  male  71.0
116          117         0  male  70.5
672          673         0  male  70.0
745          746         0  male  70.0
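In current pandas releases, sorting by column values is done with sort_values. The sketch below applies it to a six-row toy frame built from rows shown in this post (deliberately without tied ages) and also shows where missing ages end up: by default they are placed last within their group.

```python
import pandas as pd

nan = float("nan")

# Toy frame in shuffled order (rows copied from outputs shown in this post).
sam = pd.DataFrame(
    {
        "PassengerId": [22, 852, 18, 117, 24, 97],
        "Survived": [1, 0, 1, 0, 1, 0],
        "Age": [34.0, 74.0, nan, 70.5, 28.0, 71.0],
    }
)

# Survived ascending, then Age descending; NaN ages go last (the default).
ordered = sam.sort_values(by=["Survived", "Age"], ascending=[True, False])

print(ordered["PassengerId"].tolist())
```

The original row labels are kept after sorting; calling reset_index(drop=True) on the result would renumber them from 0.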

Joins

To perform join operations, we created three data frames from the Titanic dataset: df_Survived, df_Sex and df_Age. In Python:

df_Survived = sam_data[['PassengerId', 'Survived']]
df_Sex = sam_data[['PassengerId', 'Sex']]
df_Age = sam_data[['PassengerId', 'Age']]

Inner Join

R:
Table_Inner_J = merge(merge(df_Survived, df_Sex, by = "PassengerId"),
                      df_Age,
                      by = "PassengerId")

Python:
Table_Inner_J = pd.merge(pd.merge(df_Survived, df_Sex, on = "PassengerId", how = "inner"),
                         df_Age,
                         on = "PassengerId", how = "inner")

Outer Join

R:
Table_Outer_J = merge(merge(df_Survived, df_Sex, by = "PassengerId", all = TRUE),
                      df_Age,
                      by = "PassengerId", all = TRUE)

Python:
Table_Outer_J = pd.merge(pd.merge(df_Survived, df_Sex, on = "PassengerId", how = "outer"),
                         df_Age,
                         on = "PassengerId", how = "outer")

Left Join

R:
Table_Left_J = merge(merge(df_Survived, df_Sex, by = "PassengerId", all.x = TRUE),
                     df_Age,
                     by = "PassengerId", all.x = TRUE)

Python:
Table_Left_J = pd.merge(pd.merge(df_Survived, df_Sex, on = "PassengerId", how = "left"),
                        df_Age,
                        on = "PassengerId", how = "left")

Right Join

R:
Table_Right_J = merge(merge(df_Survived, df_Sex, by = "PassengerId", all.y = TRUE),
                      df_Age,
                      by = "PassengerId", all.y = TRUE)

Python:
Table_Right_J = pd.merge(pd.merge(df_Survived, df_Sex, on = "PassengerId", how = "right"),
                         df_Age,
                         on = "PassengerId", how = "right")

Joins work much the same way in R and Python: both use a merge function. R's merge names the join column with by and selects the join type with all, all.x or all.y, while pandas' merge uses on and how. In Python, merge comes from the pandas library, which has to be imported first. Three data frames can be joined by applying merge twice.
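The join types only differ when the key sets differ, which is not the case for three frames cut from the same table. As a sketch, the toy frames below are deliberately trimmed so that their PassengerId values only partly overlap, and the nested merge is expressed with functools.reduce. The helper merge_all is hypothetical, introduced only for this example.

```python
from functools import reduce

import pandas as pd

# Toy frames with partly overlapping keys (trimmed on purpose for the demo).
df_survived = pd.DataFrame({"PassengerId": [18, 22, 24], "Survived": [1, 1, 1]})
df_sex = pd.DataFrame({"PassengerId": [22, 24, 37], "Sex": ["male"] * 3})
df_age = pd.DataFrame({"PassengerId": [22, 24], "Age": [34.0, 28.0]})


def merge_all(frames, how):
    """Hypothetical helper: chain pd.merge over a list of frames."""
    return reduce(
        lambda left, right: pd.merge(left, right, on="PassengerId", how=how),
        frames,
    )


frames = [df_survived, df_sex, df_age]
inner = merge_all(frames, "inner")  # keys present in all three frames
outer = merge_all(frames, "outer")  # keys present in any frame
left = merge_all(frames, "left")    # all keys of the first frame

print(sorted(inner["PassengerId"]), sorted(outer["PassengerId"]))
```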

Missing Values Treatment:

In missing values treatment, the first step is to identify the NA values by running the first block of code below. Once we know which variables contain missing values, we can impute them with the mean of the respective column.

R:
tail(is.na(sam_data))
       PassengerId Survived   Sex   Age
[886,]       FALSE    FALSE FALSE FALSE
[887,]       FALSE    FALSE FALSE FALSE
[888,]       FALSE    FALSE FALSE FALSE
[889,]       FALSE    FALSE FALSE  TRUE
[890,]       FALSE    FALSE FALSE FALSE
[891,]       FALSE    FALSE FALSE FALSE

Python:
sam_data.isnull().tail()
     PassengerId  Survived    Sex    Age
886        False     False  False  False
887        False     False  False  False
888        False     False  False   True
889        False     False  False  False
890        False     False  False  False

The second block of code replaces the NA values in Age with the mean of that column.

R:
sam_data$Age[is.na(sam_data$Age)] <- mean(sam_data$Age, na.rm = TRUE)

Python:
meanAge = sam_data.Age.mean()
sam_data.Age = sam_data.Age.fillna(meanAge)

Running the first check again shows that no missing values remain.

R:
tail(is.na(sam_data))
       PassengerId Survived   Sex   Age
[886,]       FALSE    FALSE FALSE FALSE
[887,]       FALSE    FALSE FALSE FALSE
[888,]       FALSE    FALSE FALSE FALSE
[889,]       FALSE    FALSE FALSE FALSE
[890,]       FALSE    FALSE FALSE FALSE
[891,]       FALSE    FALSE FALSE FALSE

Python:
sam_data.isnull().tail()
     PassengerId  Survived    Sex    Age
886        False     False  False  False
887        False     False  False  False
888        False     False  False  False
889        False     False  False  False
890        False     False  False  False
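Looking only at the tail can miss gaps elsewhere in the table, so a per-column count of missing values is a useful first check. The sketch below runs the count and the mean imputation on a three-row toy frame taken from rows shown in this post. Working on an explicit copy is an assumption of this example; it avoids modifying a frame that was sliced out of another one.

```python
import pandas as pd

nan = float("nan")

# Toy frame (rows copied from outputs shown in this post).
sam = pd.DataFrame(
    {
        "PassengerId": [18, 22, 24],
        "Survived": [1, 1, 1],
        "Age": [nan, 34.0, 28.0],
    }
)

# Number of missing values in every column.
missing_per_column = sam.isna().sum()

# Series.mean() skips NaN by default, like mean(..., na.rm = TRUE) in R.
mean_age = sam["Age"].mean()

# Impute on a copy so the original frame stays unchanged.
filled = sam.copy()
filled["Age"] = filled["Age"].fillna(mean_age)

print(missing_per_column["Age"], mean_age)
```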



Plotting the Data:

Python:
Import the seaborn and matplotlib libraries for plotting the Titanic dataset.

import matplotlib.pyplot as plt
import seaborn as sns
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,5))
sns.countplot(x='Embarked', data=titanic, ax=axis1)
sns.countplot(x='Survived', hue="Embarked", data=titanic, order=[1,0], ax=axis2)









Here the subplots function from matplotlib reserves space for two graphs in a single row. To display two graphs in each of two rows, request a 2 x 2 grid with plt.subplots(2, 2).
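As a rough sketch of the 2 x 2 case, the code below uses matplotlib only and toy bar heights. The Agg backend is an assumption made so that the example also runs without a display; in an interactive session it can be left out.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run, no window needed
import matplotlib.pyplot as plt

# A 2 x 2 grid: axes is a NumPy array of Axes with shape (2, 2).
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

survived_counts = [2, 3]  # toy counts for Survived = 0 and Survived = 1

# axes.flat walks the grid row by row.
for i, ax in enumerate(axes.flat):
    ax.bar([0, 1], survived_counts)
    ax.set_title("panel %d" % (i + 1))

print(axes.shape)
```

A single panel is addressed by row and column, for example axes[1, 0] for the lower-left graph, and can be passed to seaborn through its ax argument in the same way as axis1 and axis2 above.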

R:
Using the par function, we split the graph display into a 1 x 2 matrix. Note that par(mfrow) applies to base R graphics only; ggplot2 plots are not arranged by it.
par(mfrow = c(1,2))

The ggplot code for the first graph shown below is:
library(ggplot2)
library(ggthemes)  # provides theme_few()
ggplot(titanic, aes(x = Survived, fill = factor(Embarked))) +
  geom_bar(stat='count', position='dodge') +
  scale_x_continuous(breaks=c(0:10)) +
  labs(x = 'Survived') +
  theme_few()

We convert the Embarked column to numeric so that ggplot can plot its counts.
titanic$Embarked2

ggplot(titanic, aes(x = Embarked2))+
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks= titanic$Embarked2) +
labs(x = 'Embarked') +
theme_few()







