Why R for Data Science – and not Python?

There are literally hundreds of programming languages out there; for example, practically the whole alphabet of one-letter language names is taken. In the area of data science, there are two big contenders: R and Python. Now, why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions. Yet, it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming but I think it has some really serious flaws. I list some of them here:

  • It starts with which version to use! The current version has release number 3 and is gaining traction but there is still a lot of code based on the former version number 2. The problem is that there is no backward compatibility. Even the syntax of the print command got changed!
  • The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The most well known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To just give you a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – seven (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).
  • Talking of packages: the basis of much of data science is statistics and visualizations. It is an indisputable fact that nothing comes even close to R in this respect. Many of the standard statistical techniques are adequately covered by Python but try more unconventional stuff and you are quickly lost. There are now nearly 18,000 R packages in the official repository CRAN alone! The same is true for more sophisticated visualizations: it is no coincidence that many renowned news organizations create their impressive infographics with R!
  • There are several GUIs out there and admittedly it is also a matter of taste which one to use but in my opinion when it comes to data science tasks – where you need a combination of interactive work and scripts – there is no better GUI than RStudio.
  • There is no general rule when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That it is not only me can be seen in this illuminating discussion where people scramble to find criteria when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function any(df2 == 1) gives the wrong result and you have to use e.g. the method (df2 == 1).any(). Very confusing and error-prone.
  • More sophisticated data science data structures are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem; the inconsistencies that this brings are. To give you just one example: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to use good old loops instead.
  • Neither Python nor R is among the fastest of languages, but the integration with one of the fastest, namely C++, is so much better in R (via Rcpp by Dirk Eddelbuettel) than in Python that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes and there is a generic way to write ultra fast C++ functions that can be called like regular R functions:
    library(Rcpp)

    # plain R version
    bmi_R <- function(weight, height) {
      weight / (height * height)
    }
    bmi_R(80, 1.85) # body mass index of person with 80 kg and 185 cm
    ## [1] 23.37473

    # same function in C++, compiled and made callable from R by cppFunction()
    cppFunction("
      float bmi_cpp(float weight, float height) {
        return weight / (height * height);
      }
    ")
    bmi_cpp(80, 1.85) # same with cpp function
    ## [1] 23.37473
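
Two of the Python points in the list above (the changed print syntax between versions 2 and 3, and the lack of vectorization in base Python) can be made concrete with a small Python 3 sketch. It uses only the standard library; the helper name compiles() and the sample values are made up for illustration and are not taken from any package.

```python
# Python 3 only; no third-party packages needed.

# 1) print became a function in Python 3, so the Python 2 statement
#    form is rejected as a syntax error.
def compiles(source):
    """Hypothetical helper: True if `source` is valid syntax here."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

old_style_ok = compiles('print "hello"')    # Python 2 statement syntax
new_style_ok = compiles('print("hello")')   # Python 3 function call

# 2) Base Python lists are not vectorized: `*` repeats the list instead
#    of multiplying each element, so elementwise work needs a loop or a
#    list comprehension.
weights = [60, 80, 100]
repeated = weights * 2                  # list repetition
doubled = [w * 2 for w in weights]      # elementwise, via explicit iteration

print(old_style_ok, new_style_ok)   # False True
print(repeated)                     # [60, 80, 100, 60, 80, 100]
print(doubled)                      # [120, 160, 200]
```

For comparison, the elementwise version in R needs no loop at all: c(60, 80, 100) * 2 already returns 120 160 200.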

    One of the main reasons for using Python in the data science arena is shrinking by the day: Neural Networks. The main frameworks like TensorFlow and torch and APIs like Keras used to be callable only from Python (much of their code is written in C for performance reasons anyway) but there are now excellent wrappers available for R too (see also Teach R to see by Borrowing a Brain).

    All in all, I think that R is really the best choice for most data science applications. The learning curve may be a little bit steeper at the beginning but when you get to more sophisticated concepts it becomes easier to use than Python.

    If you now want to learn R see this post of mine: Learning R: The Ultimate Introduction (incl. Machine Learning!)

    I constantly update the post; if anything is not up to date, please let me know in the comments.

    88 thoughts on “Why R for Data Science – and not Python?”

        1. I guess you are confusing RStudio with Microsoft R Open (which was a Revolution Analytics product until Microsoft bought them).

    1. Why do you even need to introduce a competitive stance, Python OR R, Python v.s. R? Each has pros and cons, but it’s comparing apples and oranges because the use cases are so different. Someone who likes Python could just as easily write a “Why Python for Data Science and not R” post, and it serves no good other than to get people arguing, akin to how one might over “vim vs. emacs.” Maybe it’s better to do data science, and share what you do, instead of fueling another “My favorite software is better than your favorite software” fire?

      1. We are not talking about religion but about tools which are either better suited for a job… or worse. It is important to have this conversation, also to help people make informed decisions about which software to choose.

        “Why Python for Data Science and not R” – I would seriously challenge anybody to write this post and post the link here.

        “Maybe it’s better to do data science, and share what you do” – no worries, I will do that – so stay tuned!

    2. Thanks for your post. I use R extensively for data science and machine learning, and I love it!

      Just wondering, other than the fact that more people know Python than R, can you come up with good technical reasons to learn / use Python for data science / machine learning projects (instead of R)?

      I’d appreciate your thoughts on this. Thanks!

      1. The choice of Python could be reasoned with learning through kaggle competitions, where Python dominates and learning is done via sharing of solutions. This learning process is very fast (and competitive) compared to academic learning.

      2. It’s good if you want to apply your data science code to do things, since Python is a much broader language.

      3. The development pipeline is streamlined with Python. R is better for the actual data science work, in my opinion, but then you might want to deploy your software on a live server, for example, a realm where R doesn’t compete at all. It might, therefore, be more productive to write everything in Python.

    3. I have also used Python for data analysis before, for a class whose instructor is a huge Python fan, but I still prefer R. In addition to the reasons you wrote about (I haven’t gone into as much depth as you did into Python), I prefer R because I prefer RStudio and RMarkdown to Jupyter notebook (I’ve also used Jupyter with R before). RMarkdown is integrated in RStudio, so we still get to use the environment, git, help, and Terminal panes, and it’s plain text so it’s easy to see what changed through git. Jupyter notebook is json when viewed as plain text so is harder to see what changed through git. Another thing I like about R is that CRAN and Bioconductor have stricter requirements for packages than PyPI; I’ve seen packages from pip that have terrible documentation and no unit tests. I also find ggplot2 easier to reason with than matplotlib.

      1. Hadley’s “tidy whitey verse” packages are actually one of the reasons why R gets a bad reputation. dplyr (for example) is a slow package for data frame manipulation compared to pandas. The data.table R package for data frame manipulation is significantly faster than pandas in Python so it should be touted more. Hadley Wickham’s dplyr package is definitely not an edge over pandas, so using it as a selling point for why R is better than Python will have you laughed off in front of a bunch of Python developers.

        1. I %>% really() %>% dont() %>% think() %>% so() is much more understandable in code than so(think(dont(really(I)))), don’t you think so? Not everything is about speed.

            1. You have made a rule: THE FASTER THE BETTER. It’s probably true, but for what? For the CPU or for the human brain? It’s a tradeoff actually; I prefer the latter.

        2. This is some grade A garbage. So you mention the tidyverse and the only use case you mention is dplyr … ignoring the fact that the point of the tidyverse is that it’s more a paradigm/ecosystem to have a much more robust & standardized way to approach many tasks including ML, by leveraging multiple packages (purrr, broom, ggplot, tidyr and WIP like tidymodels, recipes, … and everything that is built on top) in a more standardized, easy to work with way.

          And to not stoop to your level of idiocy with blanket statements:
          – whether it’s an “edge” on Python per se is up for debate. I don’t use Python enough to judge. But to call the tidyverse a reason why “R gets a bad reputation” is ludicrous
          – data.table is great

          1. While your opinion is appreciated I would advise you to tone down your language. Qualifications like “garbage” and “idiocy” are not well received here.

        3. Motivation is the key driver to programming speed, so TidyVerse might be faster than less cosy coding prose. Depends on the programmer.

        4. Hi Michael, I disagree.

          If you are a novice user, you will jump in into DataScience faster thanks to Tidyverse in R.

          If you are an advanced user, you will clearly be able to learn fast R and find the fastest route.

          I believe your comment is absolutely biased.

      1. Yes, we see an interesting development with Julia too. The only problem at the moment is that it is no comparison to R package-wise. But this might change in the future. We will see…

    4. I’d add the amazing `data.table` R package, which outperforms any operations with data.frames and AFAIK has no analogue in Python.

    5. I like **Python** for lots of stuff, including some libraries that are a bit ahead of their R counterparts when I last used them, in particular, the NLTK NLP package and the networkx graph package. I’m sure when I go looking I’ll find **R** implementations in those domains that work just as well.

      But why I mainly like Python is that it’s an imperative/procedural language, which is the model I grew up on. It’s also what gave me steep learning curve curse. I kept trying to make R work like what I was used to and it refused, for the most part. It even made the comprehensive help pages incomprehensible.

      But then I dabbled in Haskell enough to come to the realization that **R** is a *functional* language.

      No one who was able to grasp `f(x) = y` should have trouble making the transition to `y <- f(x)` and, with that realize that **R** is a treasure trove of tested functions that would take years to program in Python or C++, let alone test comprehensively.

      That said, the very strength of **R** will be its competitive disadvantage until large organizations start employing programmers and data engineers comfortable in languages like Haskell. They'll pick an easy-to-translate **Python** solution over a hard-to-parse **R** implementation every time.

    6. I’m like you I tried Python but it was too much of a hassle. My only issue with R is the “everything in RAM” problem. I am working on the Kaggle Christmas Traveling Salesman Problem with 197K cities when you get to the optimize route matrix R chokes and asks for another 150GB of memory. I’d like to stay with R but I think the majority of kagglers have gone to Python because of the memory issue. If you have any ideas about R libraries to use, I would appreciate any suggestions.

    7. Not sure any non pro R comment will have some credit here, but I guess you knew this would be troll bait.

      Few of your arguments really hold, some are considered even as R weaknesses.

      The version 2 vs. version 3 topic is mostly over. If you begin now with Python for data science you may even not notice it. Python Software Foundation had the courage to improve the language while others stay with their flaws…

      Anaconda is the de facto Python distribution for data science, it comes with 200+ packages installed, plus more than 1000+ installable with one command: conda install .

      R Studio is really cool, and certainly one of the main plus for R. But it is not R, it would be great with many other languages. But apart R Studio, you have no choice, especially when you have to work on a serious data science code base. Jupyter/JupyterLab is used a lot by data scientists (just check the numbers, no subjective assertion). Visual Studio Code is also getting large adoption, Atom + Hydrogen is very interesting, etc.

      The function or method topic is the same with R & Python. It comes from the fact that R & Python are multi-paradigm languages, i.e. they support both a functional and an object-oriented style. It happens that object-oriented programming is more often used in Python, but it has nothing to do with the properties of Python or R. If some R package developers had used object orientation, you would have the same questions in R. It happens it is not the case with the packages you are using, but should someone discover the advantage of method chaining with objects, the same question would appear.

      Having data structures as external packages is certainly an advantage from a performance point of view. While R is known for its low performance and poor resources management such as memory, external packages provide the capability to improve performance without requiring changing the core engine. Your own answer to the question about memory issues with R was to point to external packages.

      About integration with C, you have interfaces both ways between Python & C. You should have a real look at how easy it is to call C from Python. Also, Cython allows you to closely embed Python and C extensions and to compile executables for maximum speed.

      And finally, Python is more vivid on the side of tools and packages. Many new tools (in deep learning, NLP, etc.) are released with a Python API / wrapper, not so often with R. Once again, check numbers on github for example.

      But you forgot R Shiny, one of the REAL strengths of the R ecosystem for prototyping data science applications. Too bad…

      Thus, unless you rely on old, biased arguments, R is not that obvious a choice. I would even say it is getting harder. In a recent data science hackathon with 200+ data scientists competing, I had the chance to do some stats and 1 out of 10 was using R, the others were using Python. I was not looking to count this, but I was surprised how many Python screens I saw and how few R screens there were. So I did the stats.
      Also, in its latest poll, KD Nuggets concluded that, over time, Python is getting used more & more for data science. See: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html (no reason to think KD Nuggets has a Python bias, which is not the case for you and R).

      Hope this helps to have a better evaluation of pros & cons of each platform.

      1. “Not sure any non pro R comment will have some credit here” – your comment is well appreciated, thank you!

    8. One big advantage python has over R is a single object oriented system instead of THREE in the base language (S3, S4, ReferenceClasses) and a fourth, R6, as the tidyverse approach. I also think python is better at manipulating raw text files in a manner that doesn’t blow up memory. Being able to iterate over a file line by line is very intuitive in python and, while possible in R, isn’t as natural.

      But like many in these comments, I use both and am very much a R AND Python proponent. Especially when it’s so easy to interop between the two!

    9. As a statistician, I will provide my opinion. And…
      I think R is made by statistician and for statistician.
      I like R because you just need to open a console in the right directory, type R DF<-read.csv('mytable.csv') et voilà! You can play with the data with the provided packages .
      I use python and you need more effort to obtain the same result.
      You can use in R common methods with data scientists but I think this is not the point.

      Don't forget plumber, opencpu, rmarkdown

      1. I have used both R and Python for Data Science project but I still prefer R to Python. There are 2 main reasons:
        – There is no “Pipe operator %>%” in Python. Using Pipe make the analysis much easier as it make the process of writing code and the thinking flow closer.
        – RStudio is the outstanding GUI for Data Science, which is better than any GUI for Python (Pycharm, Jupyter Notebook, Rodeo, Spyder…)

        1. Though I’m not exactly a scholar in Python since I just started using it since January of this year (2021), I miss R already and I think that I will use it exclusively for analysis work in the future.

    10. I have been dealing with virtual environments, conda, anaconda, iPython, Jupyter, PyCharm, IDLE and you name it. Now multiply this times the number of versions of each package of interest and their compatibility and it quickly becomes a mess. I felt so much relief after reading this, R and RStudio it is for me too.

    11. `sparklyr` is easier than pyspark, if you know dplyr already.
      `data.table` is faster than pandas, and widely used in finance field.
      And many other Bayesian tools like `gstat`,`inla`,`lme4`,`brms`,`shinystan`,`bnlearn` beat their Python counterparts in functionality and usability.
      Lastly, Python is available inside RStudio through the `reticulate` R package, so there is no more discussion about who the winner is.

    12. As an individual making a living with data science (“only”) you should perhaps be more concerned about the evolution of auto ML than this battle.

      1. Thank you for your comment, René!

        Well, at the moment this question is relevant for many aspiring data scientists and on top of that the question doesn’t go away even in the area of Auto ML (where I agree that this topic is going to gain a lot of momentum in the future).

        R is also well positioned in this area… I am planning a post on this topic, so stay tuned! (Or even better: would you like to contribute a guest post?)

        1. Thank you for asking to be a contributor. I would not pretend to have enough expertise. I am a seasoned business and marketing professional with solid knowledge and experience in stats and marketing research and some experience in ML using R (and SPSS), not really more. I have registered for a post degree course in data science and it is fully based on Python 🙁 (partly on Knime and Gephi).

    13. Python is catching up with C integration with its Cython library, but your remaining concerns stays as is. Inconsistency between Pandas and NumPy, and function vs method!

    14. Nice Content!
      I have recently been learning Django for my project work. Thank you for highlighting that R is the preferable language to go for when it comes to Data Science. Can you also please make a blog post on the advantages and disadvantages of R and Python? I would appreciate it.
      Keep on updating us with such great content. Cheers!

    15. Wow, it’s a great article. But I have a question: what does it take to become a data scientist? I ask because I want to become one.

      1. I have led decision science teams for big companies and I get this question a lot. I answer it this way. Data science is the combination of domain expertise (like an industry focus), programming, and mathematics. While data science is an awesome field and can be very fulfilling, it can be as frustrating and disappointing as any other career choice, if it isn’t right for you. Many people seem to think about this as a way to get a good job quickly because demand is so high right now. This is not necessarily the case. I encourage people to ask themselves why they want to become a data scientist or any other career choice they may not know much about. Once you answer that for yourself, you can find an awesome career whether it is in data science, engineering, analytics, or something else all together. Here is a link to a “day in the life” post for data scientists who work in corporate settings, that may be helpful.


    16. c’mon, writing in 2018 you should have left the 2.x vs 3.x dilemma behind… this is over and 99% of packages are 3.x compatible… seems like you desperately need to prove some weak points of Python, which is difficult vs R 😉

    17. A developer must learn both. If you work for a company that has R implemented in its platform, you will not get to change that.

    18. I am a beginner and want to start learning some advanced analytics. I am a business student and I see marketing trends, reach analysis, big data analysis and similar things. I am intending to do a PhD in business analytics as well. For a complete starter like me, who has used SPSS and JASP, what would you suggest? After reading your article, comments and discussion, I thought R for myself. But I need an endorsement. In short, I have to be a data analysis guy. R or Python? Please. And it’s 2020.

    19. My background: 40 years experience with programming, starting with Fortran, Algol, Pascal, and later on C, C++, MATLAB. Started with R about 20 years ago and do a lot of data analysis both with academic and consulting focus. Also competed on Kaggle with good results a couple of years ago. Did some programming with Python, but not much.

      My simple answer to the question: it depends on what is your actual background and expertise. The best data science language and programming environment for you is the one you know the most and in depth.

      For people more related to CS in general, I guess Python is the natural way to approach Data Science and this is, I believe, because Python became a very popular general purpose computer language used within CS during last decade or so. If you have a lot of non-data science Python expertise, lots of friends that know and use Python, data science with Python is most likely the way to go.

      People with a background more related to Statistics and/or long term expertise on R will feel more comfortable staying with R, as they know how to solve data science problems with R. And this is not because of tidyverse or RStudio. It is due to their knowledge on R and on how to solve problems with R and available packages. Things like Rcpp, R Shiny, data.table, ff*, big_*, sparklyr, are, I believe, more relevant for problem solving nowadays than tidyverse or RStudio.

      My guess is that comparing the solution given to a data science problem by two real experts, one on Python and the other on R, the final result from both approaches will be very similar in general terms. Existing differences, if any, will be due more to their level of expertise on both programming environments (and techniques used) than on the features of each environment.

      I think the original question is more relevant for a person without any knowledge on Python and/or R who wants to start a career on Data Science. For those, both alternatives look comparable to me. Flip a coin? or, better, have some exposure to both worlds and see which is more attractive to you. Both are powerful but have a steep learning curve if you want to reach a level of expertise enough for solving large scale problems. There are no easy shortcuts.

    20. Hello,

      Read your post and comments… it is also a dilemma for me – R or Python. I have started both, I like R syntax more than Python (maybe it is because I didn’t have programming experience before) but it seems that Python is evolving a little bit faster than R.
      But the question is, what would you suggest for sport data analytics? There are a few key areas which have to be covered: web scraping, sport models, machine learning, forecasting, visualization and dashboards.
      Maybe there are other important areas which I don’t know yet…

    21. Are the reasons still valid 3 years after the post was released? Hasn’t there been any improvement in Python that makes it better now?

      A post update would be very much appreciated!

      Thanks in advance.

    22. I’m an economist and I’ve been working with Data Visualization tools like Tableau and Power BI to support my choices regarding project management. I’m thinking about getting a more deep understanding of data analysis (perhaps become a data analyst) and I’m considering taking a course in this. I have 2 choices: The course from Google that uses R and the other from IBM that teaches Python. To a person that has no programming background and want to enter in the data analysis world, what would you recommend?

      1. Dear Juan, Thank you for your question.

        I would clearly recommend R! Especially when you have no programming background Python might seem tempting because the learning curve seems flatter at the beginning. But the big problem is that when you start doing data analytics with Python you will have to learn a second language on top of the first one because of the necessary packages (NumPy, pandas, etc.)

        When you learn R everything is consistent because it was originally built for doing data analytics. So you should go for the Google course! Could you perhaps post the link to that course?

      2. Hello Juan! I would like to recommend the HarvardX course R-Basics for Data Science, the first in a series of courses from Harvard via edX taught by the amazing Professor Rafael Irizarry. It was really very well taught as an introductory course and I thoroughly enjoyed it. I also took the Python course from IBM and I have to say I was very disappointed so I’d steer clear of it. If you are a beginner I would also like to recommend Hands on Programming with R (O’Reilly) as a first text, followed by Learning R (O’Reilly), and then R for Data Science – Import, Tidy, Transform, Visualize, and Model Data (O’Reilly). These are the best texts in my opinion after going through dozens of books for breadth, depth, and structure.

        Best of luck!

    23. I wholeheartedly agree with you that R is better suited for data science than Python exactly for the reasons you have described. I spent the past three months learning R, Python, as well as visualization platforms like Tableau. I worked on all three religiously with no previous background in any whatsoever. In fact, when I started, I was a little wary of R, and was very excited about Python based on how I kept seeing that “it is really easy to learn”/”it is really fun to learn and work with Python” type comments all the time. At the end of these three months, I absolutely LOVE working with R. The syntax is intuitive, the way it works with its parts is pretty straight forward. It took me almost three weeks to just get started with Python because of the many, MANY options available with their various strengths and limitations, which then creates variations in how to get started with any of them. While Python continues to try to be, as you said, “everyones darling”, I am going to continue to work towards mastering R and not Python. PS: Awesome blog!!!
