Why R for Data Science – and Not Python!


There are literally hundreds of programming languages out there; even the whole alphabet of one-letter language names is taken. In the area of data science, there are two big contenders: R and Python. Now, why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions. Yet, it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming but I think it has some really serious flaws. I list some of them here:

  • It starts with which version to use! The current version has release number 3 and is gaining traction but there is still a lot of code based on the former version number 2. The problem is that version 3 is not backward compatible with version 2. Even print got changed, from a statement to a function!
  • The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The most well known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To just give you a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – eight (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).
  • Talking of packages: the basis of much of data science is statistics and visualizations. It is an indisputable fact that nothing comes even close to R in this respect. Many of the standard statistical techniques are adequately covered by Python but try more unconventional stuff and you are quickly lost. There are now nearly 18,000 R packages in the official repository CRAN alone! The same is true for more sophisticated visualizations: it is no coincidence that all of the renowned news organizations create their impressive infographics with R!
  • There are several GUIs out there and admittedly it is also a matter of taste which one to use but in my opinion when it comes to data science tasks – where you need a combination of interactive work and scripts – there is no better GUI than RStudio. And even if you want to use Jupyter notebooks, you can do this with R too.
  • More sophisticated data structures for data science are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem; the problem is the inconsistencies this brings. To give you just one example: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to use good old loops (or list comprehensions) instead.
  • There is no general rule when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That it is not only me can be seen in this illuminating discussion where people scramble to find criteria when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function any(df2 == 1) gives the wrong result and you have to use e.g. the method (df2 == 1).any(). Very confusing and error-prone.
  • Both Python and R are not the fastest of languages but the integration with one of the fastest, namely C++, is so much better in R (via Rcpp by Dirk Eddelbuettel) than in Python that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes and there is a generic way to write ultra fast C++ functions that can be called like regular R functions:
    library(Rcpp)
    
    # plain R version
    bmi_R <- function(weight, height) {
      weight / (height * height)
    }
    bmi_R(80, 1.85) # body mass index of a person with 80 kg and 1.85 m
    ## [1] 23.37473
    
    # same function in C++; double matches R's numeric type
    cppFunction("
      double bmi_cpp(double weight, double height) {
        return weight / (height * height);
      }
    ")
    bmi_cpp(80, 1.85) # same with cpp function
    ## [1] 23.37473
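
The two Python points raised in the list above (vectorization, and function versus method) can be sketched as follows. This is a minimal illustration and not taken from the linked discussion: the data frame `df2` and all values are made up, and NumPy and pandas are assumed to be installed.

```python
import numpy as np
import pandas as pd

# Base Python: a list has no element-wise arithmetic, so a loop
# (here a list comprehension) is needed.
heights = [1.60, 1.75, 1.85]
weights = [55.0, 70.0, 80.0]
bmi_loop = [w / (h * h) for w, h in zip(weights, heights)]

# NumPy: the same formula applies to whole arrays at once.
bmi_vec = np.array(weights) / np.array(heights) ** 2

# "*" on a list repeats the list instead of multiplying its elements.
repeated = [1, 2, 3] * 2            # [1, 2, 3, 1, 2, 3]
doubled = np.array([1, 2, 3]) * 2   # array([2, 4, 6])

# Function vs. method: df2 is a hypothetical data frame containing no 1.
df2 = pd.DataFrame({"a": [0, 2], "b": [3, 4]})
mask = df2 == 1

# The built-in any() iterates over the column labels "a" and "b",
# which are non-empty strings and therefore truthy.
builtin_result = any(mask)               # True, although no value equals 1

# The DataFrame method looks at the actual values, column by column;
# the second .any() collapses the per-column result to one value.
method_result = bool(mask.any().any())   # False
```

The built-in function and the method have the same name but answer different questions, which is the kind of inconsistency the list describes.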
    

    One of the main reasons for using Python in the data science arena is losing force by the day: neural networks. The main frameworks like TensorFlow and torch and APIs like Keras used to be callable only from Python (much of their code is written in C++ for performance reasons anyway) but there are now excellent wrappers available for R too (see also Teach R to see by Borrowing a Brain).


    All in all, I think that R is really the best choice for most data science applications. The learning curve may be a little bit steeper at the beginning but when you get to more sophisticated concepts it becomes easier to use than Python.

    If you now want to learn R see this post of mine: Learning R: The Ultimate Introduction (incl. Machine Learning!)


    UPDATE
    I constantly update the post; if there is anything that is not up-to-date, please let me know in the comments.


    UPDATE September 23, 2021
    I created a video for this post (in German):

    100 thoughts on “Why R for Data Science – and Not Python!”

        1. I guess you are confusing RStudio with Microsoft R Open (which was a Revolution Analytics product until Microsoft bought them).

    1. Why do you even need to introduce a competitive stance, Python OR R, Python v.s. R? Each has pros and cons, but it’s comparing apples and oranges because the use cases are so different. Someone who likes Python could just as easily write a “Why Python for Data Science and not R” post, and it serves no good other than to get people arguing, akin to how one might over “vim vs. emacs.” Maybe it’s better to do data science, and share what you do, instead of fueling another “My favorite software is better than your favorite software” fire?

      1. We are not talking about religion but about tools which are either better suited for a job… or worse. It is important to have this conversation, also to help people make informed decisions about which software to choose.

        “Why Python for Data Science and not R” – I would seriously challenge anybody to write this post and post the link here.

        “Maybe it’s better to do data science, and share what you do” – no worries, I will do that – so stay tuned!

    2. Thanks for your post. I use R extensively for data science and machine learning, and I love it!

      Just wondering, other than the fact that more people know Python than R, can you come up with good technical reasons to learn / use Python for data science / machine learning projects (instead of R)?

      I’d appreciate your thoughts on this. Thanks!

      1. The choice of Python could be reasoned with learning through kaggle competitions, where Python dominates and learning is done via sharing of solutions. This learning process is very fast (and competitive) compared to academic learning.

      2. It’s good if you want to apply your data science code to do things, since Python is a much broader language.

      3. The development pipeline is streamlined with Python. R is better for the actual data science work, in my opinion, but then you might want to deploy your software on a live server, for example, a realm where R doesn’t compete at all. It might, therefore, be more productive to write everything in Python.

    3. I have also used Python for data analysis before, for a class whose instructor is a huge Python fan, but I still prefer R. In addition to the reasons you wrote about (I haven’t gone into as much depth as you did into Python), I prefer R because I prefer RStudio and RMarkdown to Jupyter notebook (I’ve also used Jupyter with R before). RMarkdown is integrated in RStudio, so we still get to use the environment, git, help, and Terminal panes, and it’s plain text so it’s easy to see what changed through git. Jupyter notebook is json when viewed as plain text so is harder to see what changed through git. Another thing I like about R is that CRAN and Bioconductor have stricter requirements for packages than PyPI; I’ve seen packages from pip that have terrible documentation and no unit tests. I also find ggplot2 easier to reason with than matplotlib.

      1. Hadley’s “tidy whitey verse” packages are actually one of the reasons why R gets a bad reputation. dplyr (for example) is a slow package for data frame manipulation compared to pandas. The data.table R package for data frame manipulation is significantly faster than pandas in Python so it should be touted more. Hadley Wickham’s dplyr package is definitely not an edge over pandas, so using it as a selling point for why R is better than Python will have you laughed off in front of a bunch of Python developers.

        1. I %>% really() %>% dont() %>% think() %>% so() is much more understandable in code than so(think(dont(really(I)))), don’t you think so? Speed is not everything.

            1. You have made a rule: THE FASTER THE BETTER. It’s probably true, but for what? For the CPU or for the human brain? It’s a tradeoff actually, and I prefer the latter.

        2. This is some grade A garbage. So you mention the tidyverse and the only use case you mention is dplyr … ignoring the fact that the point of the tidyverse is that it’s more of a paradigm/ecosystem offering a much more robust and standardized way to approach many tasks, including ML, by leveraging multiple packages (purrr, broom, ggplot, tidyr and WIP like tidymodels, recipes, … and everything that is built on top).

          And to not stoop to your level of idiocy with blanket statements:
          – whether it’s an “edge” on Python per se is up for debate. I don’t use Python enough to judge. But to call the tidyverse a reason why “R gets a bad reputation” is ludicrous
          – data.table is great

          1. While your opinion is appreciated I would advise you to tone down your language. Qualifications like “garbage” and “idiocy” are not well received here.

        3. Motivation is the key driver to programming speed, so TidyVerse might be faster than less cosy coding prose. Depends on the programmer.

        4. Hi Michael, I disagree.

          If you are a novice user, you will get into data science faster thanks to the tidyverse in R.

          If you are an advanced user, you will clearly be able to learn fast R and find the fastest route.

          I believe your comment is absolutely biased.

      1. Yes, we see an interesting development with Julia too. The only problem at the moment is that it is no comparison to R package-wise. But this might change in the future. We will see…

    4. I’d add the amazing `data.table` R package, which outperforms any operation on data.frames and AFAIK has no analogue in Python.

    5. I like **Python** for lots of stuff, including some libraries that are a bit ahead of their R counterparts when I last used them, in particular, the NLTK NLP package and the networkx graph package. I’m sure when I go looking I’ll find **R** implementations in those domains that work just as well.

      But why I mainly like Python is that it’s an imperative/procedural language, which is the model I grew up on. It’s also what gave me the steep-learning-curve curse. I kept trying to make R work like what I was used to and it refused, for the most part. It even made the comprehensive help pages incomprehensible.

      But then I dabbled in Haskell enough to come to the realization that **R** is a *functional* language.

      No one who was able to grasp `f(x) = y` should have trouble making the transition to `y <- f(x)` and, with that realize that **R** is a treasure trove of tested functions that would take years to program in Python or C++, let alone test comprehensively.

      That said, the very strength of **R** will be its competitive disadvantage until large organizations start employing programmers and data engineers comfortable in languages like Haskell. They'll pick an easy-to-translate **Python** solution over a hard-to-parse **R** implementation every time.

    6. I’m like you I tried Python but it was too much of a hassle. My only issue with R is the “everything in RAM” problem. I am working on the Kaggle Christmas Traveling Salesman Problem with 197K cities when you get to the optimize route matrix R chokes and asks for another 150GB of memory. I’d like to stay with R but I think the majority of kagglers have gone to Python because of the memory issue. If you have any ideas about R libraries to use, I would appreciate any suggestions.
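
A rough back-of-the-envelope check of the memory figure mentioned above, assuming the route matrix is a dense matrix of 8-byte doubles (the comment does not say how it is stored) and rounding the city count to 197,000. The function name is hypothetical.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Size in GiB of a dense n x n matrix (8-byte doubles by default)."""
    return n * n * bytes_per_value / 2**30


n_cities = 197_000                  # "197K cities", rounded (assumption)
full = dense_matrix_gib(n_cities)   # every pairwise distance stored
half = full / 2                     # roughly one triangle of a symmetric matrix

print(f"full matrix: {full:.0f} GiB, one triangle: {half:.0f} GiB")
```

Under these assumptions the full matrix comes to roughly 289 GiB and one triangle to about 145 GiB, the same order of magnitude as the 150GB request. A dense all-pairs matrix of this size is a problem in any language that keeps it in RAM.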

    7. Not sure any non pro R comment will have some credit here, but I guess you knew this would be troll bait.

      Few of your arguments really hold, some are considered even as R weaknesses.

      The version 2 vs. version 3 topic is mostly over. If you begin now with Python for data science you may even not notice it. Python Software Foundation had the courage to improve the language while others stay with their flaws…

      Anaconda is the de facto Python distribution for data science, it comes with 200+ packages installed, plus more than 1000+ installable with one command: conda install .

      RStudio is really cool, and certainly one of the main pluses for R. But it is not R; it would be great with many other languages. But apart from RStudio, you have no choice, especially when you have to work on a serious data science code base. Jupyter/JupyterLab is used a lot by data scientists (just check the numbers, no subjective assertion). Visual Studio Code is also getting large adoption, Atom + Hydrogen is very interesting, etc.

      The function or method topic is the same with R & Python. It comes from the fact that R & Python are multi-paradigm languages, i.e. they support both a functional and an object-oriented style. It happens that object-oriented programming is more often used in Python, but it has nothing to do with the properties of Python or R. If some R package developers had used object orientation, you would have the same questions in R. It happens that this is not the case with the packages you are using, but should someone discover the advantage of method chaining with objects, the same question would appear.

      Having data structures as external packages is certainly an advantage from a performance point of view. While R is known for its low performance and poor resources management such as memory, external packages provide the capability to improve performance without requiring changing the core engine. Your own answer to the question about memory issues with R was to point to external packages.

      About integration with C, you have interfaces both ways between Python & C. You should have a real look at how easy it is to call C from Python. Also Cython allows to closely embed Python and C extensions and to compile executables for maximum speed.

      And finally, Python is more vivid on the side of tools and packages. Many new tools (in deep learning, NLP, etc.) are released with a Python API / wrapper, not so often with R. Once again, check numbers on github for example.

      But you forgot R Shiny, one of the REAL strengths of the R ecosystem for prototyping data science applications. Too bad…

      Thus, unless you rely on old, biased arguments, R is not that obvious a choice. I would even say it is getting harder. In a recent data science hackathon with 200+ data scientists competing, I had the chance to do some stats and 1 out of 10 was using R, the others were using Python. I was not looking to count this, but I was surprised how many Python screens I saw and how few R screens there were. So I did the stats.
      Also, in its latest poll, KD Nuggets reported on the changes over time, with Python being used more and more for data science. See: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html (there is no reason to think KD Nuggets has a Python bias, which is not the case for you and R).

      Hope this helps to have a better evaluation of pros & cons of each platform.

      1. “Not sure any non pro R comment will have some credit here” – your comment is well appreciated, thank you!

      2. I think this is the more objective view. Or at least provides more balance.

        I have used Fortran, Matlab, R, and Python for scientific computing; and used Python, JavaScript, and Java outside of scientific realm. Both R and Python are fantastic for data science for sure.

        The real strength of R is its statistical packages made by statisticians.

        But R’s advantages more or less end there. If packages are the answer for everything – package for viz, package for speed, etc. – well, Python has a larger pool of choices (perhaps that is a negative). One can have an advantage over the other in any given year, but that can change by the next year.

        I know this post is old, but comparing ggplot2 with matplotlib is like comparing seaborn with base R plotting. Altair is a better implementation of declarative visualisation than ggplot2. Seaborn looks better than both. Plotly Express is easier to use than any. Every time I see these blog posts about Python vs R, and see ggplot2 vs matplotlib is a dead giveaway that the author doesn’t use Python. Much like when I see a university lecturer’s R code that doesn’t use the assignment ( <- ) or piping, I know they don't really know how to program (shame on anybody who still teaches with Stata or Eviews).

        In any case, I don't think packages are advantages of the language but more about the ecosystem – and fortunately, the ecosystem for both continues to improve and expand. And if you are only proficient in one, make an effort to learn the other – not just watch a couple of udemy videos, but actually.

        I guess if you are a statistician, then you don't need to: R is sufficient. If you are a data scientist… nothing would be sufficient on its own, but R is not necessary while Python most certainly is a requirement.

        A data scientist's role is fluid and quite expansive. Are you going to orchestrate a data engineering pipeline on the cloud with R? No, one would need something like Airflow, Prefect, boto3 – Python would be better suited. Would you trust fitting a GLM with statsmodels over R? Probably not, maybe more complicated models will cause statsmodels to freak out and not converge to a solution while R handles it fine. Would I use either Python or R for a heavy duty Physics simulation? No. Maybe Python, but I'd try out Julia, and in some cases Fortran can be faster than C/C++. What if I was a machine learning engineer in the gaming industry? I'd use C#'s ML.NET or wrestle with C++ perhaps. Maybe I need to custom visualise something using JavaScript and D3.js. Perhaps you need to work with truly massive data and need Spark (using any one of Scala, Python, or R) or Dask (Python).

        I'm not saying one person would need to do all of this all the time, but learning how to learn the ecosystem and code idiomatically will take you further than merely learning to write code.

        I dunno, but if you are a monolingual, or advocate for being one, you'll have a hard time being a data scientist. That's my two cents.

    8. One big advantage python has over R is a single object oriented system instead of THREE in the base language (S3, S4, ReferenceClasses) and a fourth, R6, as the tidyverse approach. I also think python is better at manipulating raw text files in a manner that doesn’t blow up memory. Being able to iterate over a file line by line is very intuitive in python and, while possible in R, isn’t as natural.

      But like many in these comments, I use both and am very much a R AND Python proponent. Especially when it’s so easy to interop between the two!
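
The line-by-line reading described above can be sketched in Python as follows. The sample file and its contents are made up for illustration; only the standard library is used.

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\n\ngamma\n")
    path = tmp.name

# Iterating over the file object yields one line at a time,
# so the whole file never has to sit in memory at once.
non_empty = 0
longest = ""
with open(path) as handle:
    for line in handle:
        text = line.rstrip("\n")
        if text:
            non_empty += 1
        if len(text) > len(longest):
            longest = text

os.remove(path)  # clean up the sample file
```

Only the running counters are kept between iterations, which is why this pattern scales to files much larger than the available RAM.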

    9. As a statistician, I will provide my opinion. And…
      I think R is made by statisticians and for statisticians.
      I like R because you just need to open a console in the right directory, type R, then DF <- read.csv('mytable.csv'), et voilà! You can play with the data with the provided packages.
      I use python and you need more effort to obtain the same result.
      You can use in R common methods with data scientists but I think this is not the point.

      Best
      Don't forget plumber, opencpu, rmarkdown

      1. I have used both R and Python for Data Science project but I still prefer R to Python. There are 2 main reasons:
        – There is no “pipe operator %>%” in Python. Using the pipe makes the analysis much easier as it brings the process of writing code closer to the flow of thinking.
        – RStudio is the outstanding GUI for Data Science, which is better than any GUI for Python (Pycharm, Jupyter Notebook, Rodeo, Spyder…)

        1. Though I’m not exactly a scholar in Python since I just started using it since January of this year (2021), I miss R already and I think that I will use it exclusively for analysis work in the future.

    10. I have been dealing with virtual environments, conda, anaconda, iPython, Jupyter, PyCharm, IDLE and you name it. Now multiply this by the number of versions of each package of interest and their compatibility and it quickly becomes a mess. I felt so much relief after reading this, R and RStudio it is for me too.

    11. `sparklyr` is easier than pyspark, if you know dplyr already.
      `data.table` is faster than pandas, and widely used in the finance field.
      And many other Bayesian tools like `gstat`,`inla`,`lme4`,`brms`,`shinystan`,`bnlearn` beat their Python counterparts in functionality and usability.
      Last, Python is available inside RStudio via the `reticulate` R package, so there is no need for further discussion about who the winner is.

    12. As an individual making a living with data science (“only”) you should perhaps be more concerned about the evolution of auto ML than this battle.

      1. Thank you for your comment, René!

        Well, at the moment this question is relevant for many aspiring data scientists and on top of that the question doesn’t go away even in the area of Auto ML (where I agree that this topic is going to gain a lot of momentum in the future).

        R is also well positioned in this area… I am planning a post on this topic, so stay tuned! (Or even better: would you like to contribute a guest post?)

        1. Thank you for asking to be a contributor. I would not pretend to have enough expertise. I am a seasoned business and marketing professional with solid knowledge and experience in stats and marketing research and some experience in ML using R (and SPSS), not really more. I have registered for a post degree course in data science and it is fully based on Python 🙁 (partly on Knime and Gephi).

    13. Python is catching up on C integration with its Cython library, but your remaining concerns stay as they are: inconsistency between pandas and NumPy, and function vs. method!

    14. Nice Content!
      I am learning Django recently for my Project work thank you for highlighting that R is the preferable language to go for when it comes to Data Science. Can you also please make a blog post on the advantages and disadvantages of r and Python, I would appreciate it.
      Keep on Updating us with such great content, Cheers!.

    15. Wow, it’s a great article. But I have a question: what does it take to become a data scientist? I want to become one.

      1. I have led decision science teams for big companies and I get this question a lot. I answer it this way. Data science is the combination of domain expertise (like an industry focus), programming, and mathematics. While data science is an awesome field and can be very fulfilling, it can be as frustrating and disappointing as any other career choice, if it isn’t right for you. Many people seem to think about this as a way to get a good job quickly because demand is so high right now. This is not necessarily the case. I encourage people to ask themselves why they want to become a data scientist or any other career choice they may not know much about. Once you answer that for yourself, you can find an awesome career whether it is in data science, engineering, analytics, or something else all together. Here is a link to a “day in the life” post for data scientists who work in corporate settings, that may be helpful.

        https://therandomvariable.com/what-does-a-data-scientist-really-do/

    16. C’mon, writing in 2018 you should have left the 2.x vs 3.x dilemma behind… this is over and 99% of packages are 3.x compatible… it seems like you desperately need to prove some weak points of Python, which is difficult vs R?

    17. A developer must learn both. If you work for a company that has R implemented in its platform, you will not get to change that.

    18. I am a beginner and want to start learning some advanced analytics. I am a business student and I look at marketing trends, reach analysis, big data analysis and similar things. I am intending to do a PhD in business analytics as well. For a complete starter like me, who has used SPSS and JASP, what would you suggest? After reading your article, the comments and the discussion, I thought R would be right for me. But I need an endorsement. In short, I have to be a data analysis guy. R or Python? Please. And it’s 2020.

    19. My background: 40 years experience with programming, starting with Fortran, Algol, Pascal, and later on C, C++, MATLAB. Started with R about 20 years ago and do a lot of data analysis both with academic and consulting focus. Also competed on Kaggle with good results a couple of years ago. Did some programming with Python, but not much.

      My simple answer to the question: it depends on what is your actual background and expertise. The best data science language and programming environment for you is the one you know the most and in depth.

      For people more related to CS in general, I guess Python is the natural way to approach Data Science and this is, I believe, because Python became a very popular general purpose computer language used within CS during last decade or so. If you have a lot of non-data science Python expertise, lots of friends that know and use Python, data science with Python is most likely the way to go.

      People with a background more related to Statistics and/or long term expertise in R will feel more comfortable staying with R, as they know how to solve data science problems with R. And this is not because of the tidyverse or RStudio. It is due to their knowledge of R and of how to solve problems with R and the available packages. Things like Rcpp, R Shiny, data.table, ff*, big_*, sparklyr, are, I believe, more relevant for present-day problem solving than the tidyverse or RStudio.

      My guess is that comparing the solution given to a data science problem by two real experts, one on Python and the other on R, the final result from both approaches will be very similar in general terms. Existing differences, if any, will be due more to their level of expertise on both programming environments (and techniques used) than on the features of each environment.

      I think the original question is more relevant for a person without any knowledge of Python and/or R who wants to start a career in Data Science. For those, both alternatives look comparable to me. Flip a coin? Or, better, have some exposure to both worlds and see which is more attractive to you. Both are powerful but have a steep learning curve if you want to reach a level of expertise sufficient for solving large scale problems. There are no easy shortcuts.

    20. Hello,

      I read your post and the comments… it is also a dilemma for me – R or Python. I have started both, and I like R syntax more than Python (maybe it is because I didn’t have programming experience before), but it seems that Python is evolving a little bit faster than R.
      But the question is, what would you suggest for sport data analytics? There are a few key areas which have to be covered: web scraping, sport models, machine learning, forecasting, visualization and dashboards.
      Maybe there are other important areas which I don’t know yet…

    21. Are the reasons still valid 3 years after the post was released? Hasn’t there been any improvement in Python that makes it better now?

      A post update would be very much appreciated!

      Thanks in advance.

    22. I’m an economist and I’ve been working with Data Visualization tools like Tableau and Power BI to support my choices regarding project management. I’m thinking about getting a deeper understanding of data analysis (perhaps becoming a data analyst) and I’m considering taking a course in this. I have 2 choices: the course from Google that uses R and the other from IBM that teaches Python. To a person who has no programming background and wants to enter the data analysis world, what would you recommend?

      1. Dear Juan, Thank you for your question.

        I would clearly recommend R! Especially when you have no programming background Python might seem tempting because the learning curve seems flatter at the beginning. But the big problem is that when you start doing data analytics with Python you will have to learn a second language on top of the first one because of the necessary packages (NumPy, pandas, etc.)

        When you learn R everything is consistent because it was originally built for doing data analytics. So you should go for the Google course! Could you perhaps post the link to that course?

      2. Hello Juan! I would like to recommend the HarvardX course R-Basics for Data Science, the first of a series of courses from Harvard via edX taught by the amazing Professor Rafael Irizarry. It was really very well taught as an introductory course and I thoroughly enjoyed it. I also took the Python course from IBM and I have to say I was very disappointed so I’d steer clear of it. If you are a beginner I would also like to recommend Hands on Programming with R (O’Reilly) as a first text, followed by Learning R (O’Reilly), and then R for Data Science – Import, Tidy, Transform, Visualize, and Model Data (O’Reilly). These are the best texts in my opinion after going through dozens of books for breadth, depth, and structure.

        Best of luck!

    23. I wholeheartedly agree with you that R is better suited for data science than Python exactly for the reasons you have described. I spent the past three months learning R, Python, as well as visualization platforms like Tableau. I worked on all three religiously with no previous background in any whatsoever. In fact, when I started, I was a little wary of R, and was very excited about Python based on how I kept seeing that “it is really easy to learn”/”it is really fun to learn and work with Python” type comments all the time. At the end of these three months, I absolutely LOVE working with R. The syntax is intuitive, the way it works with its parts is pretty straight forward. It took me almost three weeks to just get started with Python because of the many, MANY options available with their various strengths and limitations, which then creates variations in how to get started with any of them. While Python continues to try to be, as you said, “everyones darling”, I am going to continue to work towards mastering R and not Python. PS: Awesome blog!!!

    24. Hi, I’m a Python user (I can’t call myself an expert, let’s say I’m confident) and a newbie (maybe something more than a newbie) in R.
      I won’t say whether one is better than the other (I am not qualified for that). The reality is that Python is attracting way more new users than R, and that may make the difference in the long run.
      My humble opinion on why so many new users adopt Python instead of R is that Python is incredibly easy and fast to learn and, above all, it is consistent. Yes, the base language (a general-purpose language) lacks some features like vectors and matrices, but for that there is numpy, which is the de facto standard for this task. For tabular data there is pandas (built on top of numpy – consistency, remember? – with a lot of additional packages that extend its core functionality; think of geospatial data, for example). Many new packages developed to overcome the limits of pandas (big data, speed and so on) use the same APIs as pandas, so switching to the new library (mostly) means simply changing the import line! Scikit-learn is the standard for machine learning. All this means that after a short training you can open a Kaggle notebook and understand what’s going on; and it works, at least for the vast majority of use cases.
      This is not true for (base) R! In my humble opinion, (base) R is not as fast and easy to learn as Python and, most importantly, it has many functions that do (almost) the same thing with different syntax and (often) different return values. A nightmare for a new learner (even though I realise that for an experienced developer it may be an advantage; but to become experienced you must start! If you give up at the beginning because of frustration – and switch to Python – you will never become an expert). This is where the tidy universe (tidyverse and tidymodels; this comment also replies to your post about dplyr) comes into play: it “tidies up” the mess in the R world with a simple, straightforward and consistent interface for doing things. You can build upon it to extend and improve its functionality, but it aims to become the de facto standard way to operate (at least for the vast majority of use cases). Think about ggplot2 (something that, as a Python user, I greatly envy R for; data visualisation in Python is messy, with several different libraries and different syntax): ggplot2 is the base upon which to build, adding new functions to extend its core capabilities. Easy to learn and to use (and consistent). Will the tidy universe succeed? I don’t know; what I think (my humble opinion) is that either it will succeed or R will lose its battle.

      1. Thank you for your extensive comment, Andrea!

        It is interesting to see how different users experience learning a language. For me, it was the exact opposite! I am coming from assembler, Basic, a little bit of Mathematica and Pascal, and later C++. Then I started with Matlab/Octave, Python, and R, at about the same time. I didn’t like the overly commercial approach of Matlab, and Octave is just not the same, so I quit. I agree that (base) Python is easy to learn, almost like pseudocode, but I find the vector- and matrix-oriented packages totally inconsistent with it. Also, the sometimes strangely different behaviour of functions and methods irritates me.

        Concerning graphics, I agree that R is still THE standard. This is also the reason why all of the big media companies still use it for sophisticated infographics. Although I personally don’t use ggplot2 I can understand its appeal.

        Oh yeah, and the tidyverse… it might be that I am an “expert” in your sense but what I don’t like and don’t understand is why you would need quite another language on top of R! I am a big fan of functional packages (I developed one myself, OneR) but I don’t like packages which totally change the character of a language. It seems like a hostile takeover to me, some kind of land grab of a commercial company! It might be that had I come later to R I would totally be in love with the tidyverse because I wouldn’t know the elegance and strength of base R. But I do, so I am not a big fan.

        I wouldn’t be so pessimistic about the future of R. Much of the hype of Python is, in my opinion, based on neural networks which are themselves totally overhyped. As an AI “veteran” I can tell you that all those AI technologies come and go in waves (this is not the first time that neural networks are the latest craze). When there is a new kid on the block (whatever it might be) it will be a new game. It might very well be that R will become the latest craze again (btw something like that happened during the peak of the Covid pandemic because statistical modelling was important again!).

        I am optimistic that R is here to stay and will become more important in the future. Concerning the tidyverse: I don’t know, we will see…

        1. Thank you very much for your reply. That’s the point: you are an R “expert” and you don’t understand why you should change your mind to adapt to the tidyverse model. But if you were new you would find the tidyverse approach natural and consistent, easy and quick to learn. And it will work in most situations. With time you can become an expert and find more efficient ways to solve problems, or ways to fix them when the tidyverse approach does not work.
          Regarding the “inconsistencies” in Python, I would not call them inconsistencies: different libraries may have different settings for default parameters. The fact that you can use either a functional approach (numpy.any( df ==1)) or a method approach ( (df==1).any() ) does not mean the two functions are exactly the same (even though they will give the same result if you correctly set the default parameters); this may happen in R too, doesn’t it?

          1. R is much more consistent here: objects do not have methods associated with them in typical R parlance. Instead, the class of an object determines which function method will be applied to it (= polymorphism).
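The dispatch-on-class idea from the reply above (a generic function selects its implementation from the class of its argument, rather than the object carrying its own methods) can be loosely illustrated in Python with `functools.singledispatch` from the standard library. This is only an analogy to R's generic functions, not R code and not anything from the original thread; the names `summarize` and `Temperature` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Temperature:
    """Hypothetical class, used only for this illustration."""
    celsius: float


@singledispatch
def summarize(x):
    # Default implementation: used when no class-specific one is registered.
    return f"object of type {type(x).__name__}"


@summarize.register
def _(x: list):
    # Chosen automatically when the argument is a list.
    return f"list of length {len(x)}"


@summarize.register
def _(x: Temperature):
    # Chosen automatically when the argument is a Temperature.
    return f"{x.celsius:.1f} degrees Celsius"


if __name__ == "__main__":
    for value in ([1, 2, 3], Temperature(21.5), 3.14):
        print(summarize(value))
```

The caller always writes `summarize(value)`; which body runs depends only on the class of `value`, which is roughly the behaviour the reply attributes to R.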

    25. I am just learning a little R. It was very nice to read your ultimate introduction to R. Thank you very much for it! I suspect the real reason to use R is the large number of statistical packages, and also that it is a standard for statistics. Why not Python? Because I prefer Tcl/Tk and do not like Python! Python is so absurd that one must include the whole of Tcl/Tk in it to have tkinter and widgets. It is the product of an inflation of scripting languages. I do not agree with you that it is a good language for learning programming, especially because of the semantics of whitespace. FORTRAN IV is perhaps the right language for learning programming. But I do see an advantage in using a general-purpose scripting language: it makes it easy to write prototypes and glue many programs together.

    26. I just began dabbling in data science with a paid course on a website called https://pickl.ai/, https://www.edvanza.com/ and reading your take has been a bittersweet experience (they’re teaching using Python).
      Nonetheless, it was extremely good and eye-opening! I intend to learn R while working with Python. Your website has a good vibe, I’ll go on and read from the source in updates.

    27. After 16 years I can tell that:
      Python syntax is better than vanilla R syntax (maybe because Python is a general-purpose language: dictionary, enumerate, list comprehension)
      dplyr alone is better than pandas
      but the tidyverse is a mistake that weakened the R community
      (btw, pandas has pipe(), but no one uses it because Python code looks great, so you don’t escape from writing it)
      it’s easier to build a model in R than in Python
      statistical testing is easier in R, and cross-validation-based approaches are easier in Python
      in R you have functions for everything; in Python you create your own more often
      in Python you write less code, but you can’t create readable long scripts and manage project structure properly
      install.packages is great
      people do not manage versions of R packages and R itself, so after 5 years nothing works – no good tools for that in R
      so Python’s pip/conda package management systems are a big plus (you also install Python itself with them)
      but there are constant dependency problems in Python, which R doesn’t have
      R is good for academia when you do not deploy or replicate your results (I know academia too well)
      in Python you can do a lot of pure programming stuff with flask, sqlalchemy, dash, airflow
      in R only shiny (which I hate)
      R deployment is not mature – compare python-setup and r-setup github actions building time for example and quality of R repositories at GitHub is also better (I believe you don’t use GH)
      scikit-learn does not provide p-values even for a linear model (shame) – this tells a lot about the difference between the two communities…
      R people discuss R vs. Python and Python people don’t – and they have often never seen R at all
      R has all possible packages for statistical analysis and Python for anything other than SA 😉
      in R people still create libraries in FORTRAN
      Go is better than Python
      R as a language is more flexible (NSE etc.) but goes nowhere, because it’s a functional language
      R is numerically more precise than base Python
      but in pure R you can’t (unless you are …) implement ML algos like random forest
      in Python you can
      there are too many package choices you have to make in R
      in Python you make 90% of standard data analysis in pandas, numpy, scikit, matplotlib/plotly, seaborn and statsmodels and in R you don’t (for example broom vs lapply for lm in subgroups)
      no OOP in R; R6 is a mistake in comparison to Python’s OOP system
      in Python you can’t overload object constructor …
      the Python community is more open and produces more open-source packages; the R community writes more of them
      more companies support Python (are porting their solutions/packages to Python)
      the Python language is changing, R is not
      Python has PEP 8
      R’s help system and documentation are better, but in Python you can read readable internals
      Python as a language has different implementations, so there is hope for change
      Python is part of the Ubuntu distribution
      ggplot2 and data.table are examples showing that R is a more flexible language, but they create their own sublanguages, so it’s not very good for R itself
