R Vs Python for Machine Learning

So you're thinking of building a machine learning project, and it's time to decide on a programming language. Although R and Python offer similar capabilities, they differ in syntax, libraries and community support. Let's take a closer look at the two.

Introduction

Description

What is R?

R is a language and environment for statistical computing and graphics. It was created by statisticians for statistics, specifically for working with data. It is used by companies like Deloitte, Facebook, Instagram and Google.

What is Python?

Python is a general-purpose programming language. It was developed with the goal of improving code readability. It is used by companies like Dropbox, YouTube, Instagram and Google.

Libraries

Data Collection

R supports Excel, CSV and text files, as well as files built in Minitab or in SPSS format.

Python, in comparison, supports all kinds of data formats, from CSV files to JSON sourced from the web and SQL tables. Python's requests library allows you to easily grab data from the web, although modern R packages like rvest can be used for basic web scraping.
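To make the range of formats concrete, here is a minimal sketch that reads the same records from CSV, JSON and a SQL table using only Python's standard library. The sample data, the `stats` table and all variable names are hypothetical, and fetching the JSON from a live URL is left out so that the example runs offline.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for real files, web responses and databases.
csv_text = "player,pts\nA,10\nB,20\n"
json_text = '[{"player": "A", "pts": 10}, {"player": "B", "pts": 20}]'

# CSV: each row becomes a dict keyed by the header line (values stay strings).
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed straight into Python lists and dicts.
json_rows = json.loads(json_text)

# SQL: build and query an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (player TEXT, pts INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?)", [("A", 10), ("B", 20)])
sql_rows = conn.execute(
    "SELECT player, pts FROM stats ORDER BY pts DESC"
).fetchall()
conn.close()

print(csv_rows[0]["player"], json_rows[1]["pts"], sql_rows[0])
```

In a real project you would more likely point these readers at actual files, an HTTP response body, or a production database connection.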

Data Wrangling/Exploration

R allows you to build probability distributions, apply different statistical tests, and use standard ML and data mining techniques. It is optimized for statistical analysis of large datasets, and it offers a number of different options for exploring data.

Python, in comparison, allows you to explore data with the data analysis library pandas. With this library you can filter, sort and display data in a matter of seconds.
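As a rough illustration of filtering, sorting and displaying data, the sketch below builds a small table by hand. The column names and values are made up for the example and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical player statistics, invented for this example.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "pos": ["C", "PG", "PG", "SF"],
    "pts": [410, 1250, 980, 300],
})

# Filter: keep only rows where pts exceeds 500.
high_scorers = players[players["pts"] > 500]

# Sort: order the filtered rows by pts, highest first.
ranked = high_scorers.sort_values("pts", ascending=False)

# Display: print the top rows and summary statistics for one column.
print(ranked.head())
print(players["pts"].describe())
```

Each step returns a new DataFrame, so the original `players` table is left unchanged and the steps can be chained or inspected one at a time.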

Data Visualization

Since R was built to demonstrate the results of statistical analysis, you can easily create basic charts and plots with the base graphics package. With the ggplot2 library you can create more advanced plots, such as complex scatter plots with regression lines.

Python, in comparison, is not as strong for data visualization. The Matplotlib library is used for generating basic graphs and charts, while the Seaborn library allows you to draw more attractive and informative statistical graphics.
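For comparison with the scatter-plot-plus-regression-line case, here is a minimal Matplotlib sketch. It assumes NumPy is installed and uses `numpy.polyfit` for the straight-line fit; the data points and the output filename are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements, invented for this example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight-line fit; for degree 1, polyfit returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y, label="observations")
ax.plot(x, slope * x + intercept, color="red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("scatter_with_fit.png")  # hypothetical output filename
```

Seaborn's `regplot` function draws a similar scatter plot with a fitted line in a single call, which is one reason it is often preferred for statistical graphics.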

Build your intuition. Is this statement true or false?

R allows for more advanced data visualization capabilities in comparison to Python.

Press true if you believe the statement is correct, or false otherwise.

Let's test your knowledge. Is this statement true or false?

R is designed to be highly readable in comparison to Python.

Press true if you believe the statement is correct, or false otherwise.

Code/Syntax

Because Python was created with an emphasis on code readability, it is regarded as easier to pick up than R. Let's take a look at the actual syntax for importing a CSV file and finding the mean of each column.

R Code

SNIPPET
library(readr)
library(purrr)
library(dplyr)

# Read the CSV file into a data frame
nba_data <- read_csv("nba_2013.csv")

# Compute the mean of every numeric column, ignoring missing values
nba_data %>%
  select_if(is.numeric) %>%
  map_dbl(mean, na.rm = TRUE)

Python Code

PYTHON
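import pandas

# Read the CSV file into a DataFrame
nba_data = pandas.read_csv("nba_2013.csv")

# Mean of every numeric column (non-numeric columns are skipped)
nba_data.mean(numeric_only=True)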

Comparing both languages, you can see why Python is regarded as easier to read and pick up in comparison to R.

Advantages & Disadvantages

R

Advantages

  • Open source.
  • Strong for statistical analysis.
  • Hundreds of well established packages/libraries devoted to analytics.
  • Easy to build visualizations.

Disadvantages

  • Steeper learning curve than Python.
  • Requires familiarity with a large number of packages.
  • Can run slowly because R holds its objects in memory.

Python

Advantages

  • Open source.
  • General-purpose language, so it is regarded as a better choice than R if your project demands more than just statistics.
  • Easy to read and learn, so programming skills develop faster and work tends to be more productive.
  • Integrates better than R with lower-level languages such as C and C++.
  • Growing number of libraries for data analysis.

Disadvantages

  • Processing speed can be slow.
  • Uses a large amount of memory.
  • It includes fewer statistical model packages in comparison to R.

Build your intuition. Is this statement true or false?

Both R and Python are open source programming languages.

Press true if you believe the statement is correct, or false otherwise.

Conclusion

So which is better, Python or R? The honest answer is that it really depends on your ML project.

If your project is heavily statistics-based, R is the more suitable choice, whereas if you are looking to build larger-scale, production-ready ML projects, Python is the better match.

One Pager Cheat Sheet

  • R and Python both offer great capabilities for machine learning projects, but have different syntax, libraries, and community support.
  • R is a statistical language and environment while Python is a general purpose programming language.
  • R and Python both provide a variety of libraries for data collection, data wrangling/exploration, and data visualization.
  • R offers ggplot2 for more complex graphical representations, whereas Python relies on Matplotlib and Seaborn for basic and more advanced visualizations, respectively.
  • R is generally considered less readable and accessible than Python, which was designed with code readability in mind.
  • Python is regarded as easier to read and pick up than R due to its emphasis on code readability.
  • Python and R are both open source. Python is generally considered easier to read and learn and has a growing number of libraries for data analysis, while R has more packages devoted to analytics but can run slowly because of how it stores data.
  • The source code of open source programming languages like R and Python can be freely used, modified, and distributed under the terms of their licenses, encouraging collaboration and enabling commercial and research applications.
  • It really depends on your ML project, but generally speaking, R is best for heavily statistics-based projects, while Python is better for larger-scale, production-ready projects.