Why R for data science – and not Python?

There are literally hundreds of programming languages out there; for example, the whole alphabet is already taken by one-letter programming languages. In the area of data science there are two big contenders: R and Python. Now why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions. Yet, it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming but I think it has some really serious flaws. I list some of them here:

  • It starts with which version to use! The current major version is Python 3 but there is still a lot of code written for the former version, Python 2. The problem is that Python 3 is not backward compatible with Python 2. Even print got changed: it was a statement in Python 2 and is a function in Python 3!
  • The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The best known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To just give you a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – seven (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).
  • There are several GUIs out there, and admittedly it is also a matter of taste which one to use, but in my opinion, when it comes to data science tasks, where you need a combination of interactive work and scripts, there is no better GUI than RStudio (there is now Rodeo, free download here: https://www.yhat.com/products/rodeo, but I don’t know how mature it is).
  • There is no general rule for when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That I am not the only one who struggles with this can be seen in this illuminating discussion, where people scramble to find criteria for when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function call any(df2 == 1) gives the wrong result and you have to use e.g. the method call (df2 == 1).any() instead. Very confusing and error-prone.
  • More sophisticated data science data structures are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem; the inconsistencies it brings are. To give you just one example: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to use good old loops instead.
  • Neither Python nor R is among the fastest languages, but the integration with one of the fastest, namely C++, is so much better in R (via Rcpp by Dirk Eddelbuettel) than in Python that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes and there is a generic way to write ultra-fast C++ functions that can be called like regular R functions:
    library(Rcpp)

    bmi_R <- function(weight, height) {
      weight / (height * height)
    }
    bmi_R(80, 1.85) # body mass index of person with 80 kg and 185 cm
    ## [1] 23.37473

    cppFunction("
      float bmi_cpp(float weight, float height) {
        return weight / (height * height);
      }
    ")
    bmi_cpp(80, 1.85) # same with cpp function
    ## [1] 23.37473
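
Coming back to the any() surprise from the list above: the built-in any() simply iterates over its argument, and iterating over a pandas DataFrame yields the column labels rather than the cell values. Here is a minimal sketch of that mechanism in plain Python, under the assumption that a dict of columns can stand in for the DataFrame (the names table and matches are made up for illustration, no pandas is involved):

```python
# A dict of columns stands in for a DataFrame (an assumption for illustration).
table = {"a": [0, 0], "b": [0, 0]}  # no cell is equal to 1

# Element-wise comparison, written out by hand: every entry is False.
matches = {name: [value == 1 for value in column]
           for name, column in table.items()}

# The built-in any() iterates over the dict, i.e. over the keys "a" and "b".
# Non-empty strings are truthy, so the answer says nothing about the cells.
wrong = any(matches)

# Looking at the values in each column gives the intended answer.
right = any(any(column) for column in matches.values())

print(wrong, right)
```

In pandas itself the method call (df2 == 1).any() looks at the values in each column, which is why it behaves as expected while the built-in does not.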
    
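
To make the point about vectorized code from the list above concrete, here is a small sketch that uses only base Python lists and the same body mass index calculation (the sample weights and heights are made up). Multiplying a list repeats it instead of multiplying its elements, and dividing one list by another is an error, so the element-wise calculation needs an explicit loop or comprehension:

```python
weights = [60.0, 80.0, 100.0]  # kg, made-up sample values
heights = [1.60, 1.85, 2.00]   # m, made-up sample values

# "*" on a list means repetition, not element-wise multiplication.
repeated = weights * 2

# Arithmetic between two lists is not defined at all.
try:
    weights / heights
    division_supported = True
except TypeError:
    division_supported = False


def bmi(weight, height):
    """Body mass index for a single person."""
    return weight / (height * height)


# The element-wise version has to be spelled out with a loop or comprehension.
bmis = [bmi(w, h) for w, h in zip(weights, heights)]

print(repeated)
print(division_supported)
print(bmis)
```

With NumPy arrays the same expression could be written directly on the whole vectors, which is exactly the inconsistency between base Python and its data science packages described above.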

    One of the main reasons for using Python in the data science arena is shrinking by the day: neural networks. The main frameworks like TensorFlow and APIs like Keras used to be accessible mainly from Python, but there are now excellent wrappers available for R too (https://tensorflow.rstudio.com/ and https://keras.rstudio.com/).

    All in all I think that R is really the best choice for most data science applications. The learning curve may be a little bit steeper at the beginning but when you get to more sophisticated concepts it becomes easier to use than Python.

    29 thoughts on “Why R for data science – and not Python?”

    1. Why do you even need to introduce a competitive stance, Python OR R, Python v.s. R? Each has pros and cons, but it’s comparing apples and oranges because the use cases are so different. Someone who likes Python could just as easily write a “Why Python for Data Science and not R” post, and it serves no good other than to get people arguing, akin to how one might over “vim vs. emacs.” Maybe it’s better to do data science, and share what you do, instead of fueling another “My favorite software is better than your favorite software” fire?

      1. We are not talking about religion but about tools which are either better suited for a job… or worse. It is important to have this conversation, also to help people make informed decisions about which software to choose.

        “Why Python for Data Science and not R” – I would seriously challenge anybody to write this post and post the link here.

        “Maybe it’s better to do data science, and share what you do” – no worries, I will do that – so stay tuned!

    2. Thanks for your post. I use R extensively for data science and machine learning, and I love it!

      Just wondering, other than the fact that more people know Python than R, can you come up with good technical reasons to learn / use Python for data science / machine learning projects (instead of R)?

      I’d appreciate your thoughts on this. Thanks!

    3. I have also used Python for data analysis before, for a class whose instructor is a huge Python fan, but I still prefer R. In addition to the reasons you wrote about (I haven’t gone into as much depth as you did into Python), I prefer R because I prefer RStudio and RMarkdown to Jupyter notebook (I’ve also used Jupyter with R before). RMarkdown is integrated in RStudio, so we still get to use the environment, git, help, and Terminal panes, and it’s plain text so it’s easy to see what changed through git. Jupyter notebook is json when viewed as plain text so is harder to see what changed through git. Another thing I like about R is that CRAN and Bioconductor have stricter requirements for packages than PyPI; I’ve seen packages from pip that have terrible documentation and no unit tests. I also find ggplot2 easier to reason with than matplotlib.

      1. Hadley’s “tidy whitey verse” packages are actually one of the reasons why R gets a bad reputation. dplyr (for example) is a slow package for data frame manipulation compared to pandas. The data.table R package for data frame manipulation is significantly faster than pandas in Python so it should be touted more. Hadley Wickham’s dplyr package is definitely not an edge over pandas, so using it as a selling point for why R is better than Python will have you laughed off in front of a bunch of Python developers.

        1. I %>% really() %>% dont() %>% think() %>% so() is much more understandable in code than so(think(dont(really(I)))), don’t you think so? Speed is not everything.

            1. You have made a rule: THE FASTER THE BETTER. It’s probably true, but for what? For the CPU or for the human brain? It’s a tradeoff actually, and I prefer the latter.

      1. Yes, we see an interesting development with Julia too. The only problem at the moment is that it is no comparison to R package-wise. But this might change in the future. We will see…

    4. I’d add the amazing `data.table` R package, which outperforms any other way of operating on data.frames and AFAIK has no equivalent in Python.

    5. I like **Python** for lots of stuff, including some libraries that are a bit ahead of their R counterparts when I last used them, in particular, the NLTK NLP package and the networkx graph package. I’m sure when I go looking I’ll find **R** implementations in those domains that work just as well.

      But the main reason I like Python is that it’s an imperative/procedural language, which is the model I grew up on. It’s also what gave me the steep-learning-curve curse. I kept trying to make R work like what I was used to and it refused, for the most part. It even made the comprehensive help pages incomprehensible.

      But then I dabbled in Haskell enough to come to the realization that **R** is a *functional* language.

      No one who was able to grasp `f(x) = y` should have trouble making the transition to `y <- f(x)` and, with that realize that **R** is a treasure trove of tested functions that would take years to program in Python or C++, let alone test comprehensively.

      That said, the very strength of **R** will be its competitive disadvantage until large organizations start employing programmers and data engineers comfortable in languages like Haskell. They'll pick an easy-to-translate **Python** solution over a hard-to-parse **R** implementation every time.

    6. I’m like you: I tried Python but it was too much of a hassle. My only issue with R is the “everything in RAM” problem. I am working on the Kaggle Christmas Traveling Salesman Problem with 197K cities; when you get to the route optimization matrix, R chokes and asks for another 150GB of memory. I’d like to stay with R but I think the majority of Kagglers have gone to Python because of the memory issue. If you have any ideas about R libraries to use, I would appreciate any suggestions.

    7. Not sure any non pro R comment will have some credit here, but I guess you knew this would be troll bait.

      Few of your arguments really hold, and some are even considered R weaknesses.

      The version 2 vs. version 3 topic is mostly over. If you begin now with Python for data science you may not even notice it. The Python Software Foundation had the courage to improve the language while others stay with their flaws…

      Anaconda is the de facto Python distribution for data science. It comes with 200+ packages installed, plus 1000+ more that are installable with one command: conda install .

      RStudio is really cool, and certainly one of the main pluses for R. But it is not R, and it would be great with many other languages. But apart from RStudio, you have no choice, especially when you have to work on a serious data science code base. Jupyter/JupyterLab is used a lot by data scientists (just check the numbers, no subjective assertion). Visual Studio Code is also getting large adoption, Atom + Hydrogen is very interesting, etc.

      The function or method topic is the same with R & Python. It comes from the fact that R & Python are multi-paradigm languages, i.e. they support both a functional and an object-oriented style. It happens that object-oriented programming is used more often in Python, but that has nothing to do with the properties of Python or R. If some R package developers had used object orientation, you would have the same questions in R. It happens not to be the case with the packages you are using, but should someone discover the advantage of method chaining with objects, the same question would appear.

      Having data structures as external packages is certainly an advantage from a performance point of view. While R is known for its low performance and poor resources management such as memory, external packages provide the capability to improve performance without requiring changing the core engine. Your own answer to the question about memory issues with R was to point to external packages.

      About integration with C, you have interfaces both ways between Python & C. You should have a real look at how easy it is to call C from Python. Also, Cython allows you to closely combine Python and C extensions and to compile executables for maximum speed.

      And finally, Python is more vibrant on the side of tools and packages. Many new tools (in deep learning, NLP, etc.) are released with a Python API / wrapper, not so often with an R one. Once again, check the numbers on GitHub, for example.

      But you forgot R Shiny, one of the REAL strengths of the R ecosystem for prototyping data science applications. Too bad…

      Thus, unless you rely on old, biased arguments, R is not that obvious a choice. I would even say it is getting harder. In a recent data science hackathon with 200+ data scientists competing, I had the chance to do some stats: 1 out of 10 was using R, the others were using Python. I was not looking to count this, but I was surprised how many Python screens I saw and how few R screens there were. So I did the stats.
      Also, in its latest poll, KD Nuggets concluded from the changes over time that Python is getting more & more used for data science. See: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html (there is no reason to think KD Nuggets has a Python bias, which is not the case for you and R).

      Hope this helps to have a better evaluation of pros & cons of each platform.

      1. “Not sure any non pro R comment will have some credit here” – your comment is well appreciated, thank you!

    8. One big advantage python has over R is a single object oriented system instead of THREE in the base language (S3, S4, ReferenceClasses) and a fourth, R6, as the tidyverse approach. I also think python is better at manipulating raw text files in a manner that doesn’t blow up memory. Being able to iterate over a file line by line is very intuitive in python and, while possible in R, isn’t as natural.

      But like many in these comments, I use both and am very much a R AND Python proponent. Especially when it’s so easy to interop between the two!

    9. As a statistician, I will provide my opinion. And…
      I think R is made by statisticians and for statisticians.
      I like R because you just need to open a console in the right directory, type R, then DF <- read.csv('mytable.csv'), et voilà! You can play with the data with the provided packages.
      I use Python too, and there you need more effort to obtain the same result.
      In R you can also use the methods that are common among data scientists, but I think this is not the point.

      Best
      Don't forget plumber, opencpu, rmarkdown

      1. I have used both R and Python for data science projects but I still prefer R to Python. There are 2 main reasons:
        – There is no pipe operator %>% in Python. Using the pipe makes the analysis much easier, as it brings the process of writing code closer to the flow of thinking.
        – RStudio is the outstanding GUI for data science, better than any GUI for Python (PyCharm, Jupyter Notebook, Rodeo, Spyder…)
