R or Python for Data Science?

The answer to the question ‘R or Python for data science?’ depends mainly on the problem to be solved, the tools required to solve it and your personal preference.

Python is a general-purpose programming language created by Guido van Rossum and first released in 1991. R was created a few years later, in the early-to-mid 1990s, by Ross Ihaka and Robert Gentleman with statisticians in mind.

R has a steep learning curve, which makes it somewhat difficult for beginners, but once the basics are clear, the advanced material becomes easier to learn. Python's simplicity and readability, on the other hand, give it a relatively gentle learning curve, which makes it a good choice for beginners.

R often offers several different ways to write the same functionality, whereas Python tends to favour a single obvious way of doing things.

RStudio is the most popular IDE for R. Spyder, IPython Notebook (now Jupyter Notebook) and Eric are some of the IDEs and interactive environments available for Python. Both R and Python have a huge number of reliable libraries. CRAN is the main repository of R packages, while PyPI is the Python package repository.

Popular R libraries include caret, dplyr, data.table, zoo, ggplot2, ggvis, stringr and lattice. Libraries such as pandas, scikit-learn, SciPy, NumPy and matplotlib make Python more attractive. Both R and Python have good support and documentation.

When it comes to data visualization, R has the upper hand over Python. ggplot2 and ggvis are two incredible visualization packages in R.

Here are a few code examples from both languages that produce the same results.

To import a .csv dataset,
R:
dataset_name <- read.csv("dataset_name.csv")

Python:
import pandas
dataset_name = pandas.read_csv("dataset_name.csv")

To find the dimensions (number of rows and columns) of the dataset,
R:
dim(dataset_name)

Python:
dataset_name.shape

To view the first few observations of a data frame (6 rows by default in R, 5 in pandas; both accept an argument n to change this),
R:
head(dataset_name)

Python:
dataset_name.head()
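
The three Python steps above (import, dimensions, first rows) can be tried together without a file on disk. The following is only a minimal sketch: the in-memory CSV text and its column names are made up for illustration and stand in for "dataset_name.csv".

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "dataset_name.csv".
csv_text = io.StringIO(
    "id,height,weight\n"
    "1,150,52\n"
    "2,160,58\n"
    "3,165,61\n"
    "4,170,66\n"
    "5,172,70\n"
    "6,175,74\n"
    "7,180,80\n"
    "8,185,88\n"
)

# read_csv accepts a file path or any file-like object.
dataset_name = pd.read_csv(csv_text)

# shape is a (rows, columns) tuple, like dim() in R.
n_rows, n_cols = dataset_name.shape

# head(n) returns the first n rows; with no argument it returns 5.
first_three = dataset_name.head(3)

print(n_rows, n_cols)
print(first_three)
```

Note that shape is an attribute, not a method, so it is written without parentheses.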

For splitting the dataset into training and test sets,
R:
RowCount <- floor(0.75 * nrow(dataset_name))
set.seed(123)
trainIndex <- sample(1:nrow(dataset_name), RowCount)
train <- dataset_name[trainIndex,]
test <- dataset_name[-trainIndex,]

Python:
train = dataset_name.sample(frac=0.75, random_state=1)
test = dataset_name.loc[~dataset_name.index.isin(train.index)]
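
As a quick sanity check, the Python split above can be run on a small data frame to confirm that the training and test sets do not overlap and together cover every row. This is a sketch under assumptions: the toy data frame and its columns x and y are hypothetical, and drop(train.index) is used as an equivalent alternative to the isin filter, which relies on the index labels being unique.

```python
import pandas as pd

# Hypothetical toy data frame with 20 rows and a default integer index.
dataset_name = pd.DataFrame({"x": range(20), "y": [v * 2 for v in range(20)]})

# Randomly pick 75% of the rows for training; random_state makes it repeatable.
train = dataset_name.sample(frac=0.75, random_state=1)

# Everything not in train goes to test (assumes unique index labels).
test = dataset_name.drop(train.index)

# The two sets should be disjoint and together cover the whole dataset.
no_overlap = train.index.intersection(test.index).empty
covers_all = len(train) + len(test) == len(dataset_name)

print(len(train), len(test), no_overlap, covers_all)
```

With 20 rows and frac=0.75, the training set has 15 rows and the test set has the remaining 5.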

R is more functional in nature and has many built-in data analysis features. Python, on the other hand, is an object-oriented language that relies mostly on packages for data analysis. For data science, both languages are important, and the choice between them is up to the data analyst. If you know both, you are definitely ahead of many others in this field.
