Why R for Data Science – and Not Python!

There are literally hundreds of programming languages out there; even the whole alphabet of one-letter language names is taken. In data science, there are two big contenders: R and Python. So why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions. Yet, it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming but I think it has some really serious flaws. I list some of them here:

It starts with which version to use! The current version is Python 3, which is gaining traction, but there is still a lot of code written for its predecessor, Python 2. The problem is that Python 3 is not backward compatible. Even the syntax of print got changed: it was a statement in Python 2 and is a function in Python 3!
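To make the print change concrete, here is a minimal sketch that asks the running interpreter whether each form is valid syntax. The helper name `compiles` is made up for this illustration; only the built-in `compile()` is used.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is valid syntax for the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


py2_style = 'print "hello"'    # print statement, the Python 2 form
py3_style = 'print("hello")'   # print function, the Python 3 form

py2_ok = compiles(py2_style)
py3_ok = compiles(py3_style)
print(py2_ok, py3_ok)
```

Run under a Python 3 interpreter, the Python 2 form is rejected as a syntax error while the Python 3 form is accepted.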

The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The most well known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To just give you a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – eight (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).

Talking of packages: the basis of much of data science is statistics and visualizations. It is an indisputable fact that nothing comes even close to R in this respect. Many of the standard statistical techniques are adequately covered by Python but try more unconventional stuff and you are quickly lost. There are now nearly 18,000 R packages in the official repository CRAN alone! The same is true for more sophisticated visualizations: it is no coincidence that all of the renowned news organizations create their impressive infographics with R!

There are several GUIs out there and admittedly it is also a matter of taste which one to use, but in my opinion, when it comes to data science tasks (where you need a combination of interactive work and scripts), there is no better GUI than RStudio. And even if you want to use Jupyter notebooks, you can do this with R too.

More sophisticated data science data structures are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem; the problem is the inconsistencies that this brings. To give you just one example: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to use good old loops instead.
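The inconsistency can be sketched with nothing but base Python lists. The variable names and numbers are invented for illustration. With NumPy, `np.array(weights) * 2` would multiply element by element; on a plain list the same operators mean something else.

```python
weights = [60.0, 80.0, 100.0]

# `*` on a list repeats the list instead of multiplying each element.
repeated = weights * 2

# `+` on two lists concatenates them instead of adding pairwise.
concatenated = weights + weights

# Element-wise arithmetic in base Python needs an explicit loop or comprehension.
doubled = [w * 2 for w in weights]
summed = [a + b for a, b in zip(weights, weights)]

print(repeated)
print(concatenated)
print(doubled)
print(summed)
```

The first two results are six-element lists, while only the comprehensions give the three doubled values that vectorized code would produce.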

There is no general rule when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That it is not only me can be seen in this illuminating discussion where people scramble to find criteria when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function any(df2 == 1) gives the wrong result and you have to use e.g. the method (df2 == 1).any(). Very confusing and error-prone.
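The any() pitfall mentioned above can be reproduced in a few lines. This is a sketch that assumes pandas is installed; the contents of `df2` are made up for the example, chosen so that no cell equals 1.

```python
import pandas as pd

# Hypothetical data: no cell equals 1.
df2 = pd.DataFrame({"a": [0, 2], "b": [3, 4]})

mask = df2 == 1

# The built-in any() iterates over the DataFrame, which yields the column
# labels "a" and "b". Non-empty strings are truthy, so the result is True.
builtin_result = any(mask)

# The method reduces over the actual values: per column first, then overall.
method_result = bool(mask.any().any())

print(builtin_result, method_result)
```

The built-in function reports a match although there is none, because it never looks at the cell values. Note that `(df2 == 1).any()` on its own returns one Boolean per column, so the second `.any()` is needed to collapse it to a single value.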

Neither Python nor R is among the fastest languages, but the integration with one of the fastest, namely C++, is so much better in R (via Rcpp by Dirk Eddelbuettel) than in Python that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes and there is a generic way to write ultra fast C++ functions that can be called like regular R functions:

library(Rcpp)

bmi_R <- function(weight, height) {
  weight / (height * height)
}
bmi_R(80, 1.85) # body mass index of person with 80 kg and 185 cm
## [1] 23.37473

cppFunction("
  // R numerics are doubles, so double avoids a lossy conversion to float
  double bmi_cpp(double weight, double height) {
    return weight / (height * height);
  }
")
bmi_cpp(80, 1.85) # same with cpp function
## [1] 23.37473

One of the main reasons for using Python in the data science arena is shrinking by the day: neural networks. The main frameworks like TensorFlow and torch and APIs like Keras used to be accessible mainly from Python (much of their code is written in C++ for performance reasons anyway) but there are now excellent wrappers available for R too (see also Teach R to see by Borrowing a Brain).

All in all, I think that R is really the best choice for most data science applications. The learning curve may be a little bit steeper at the beginning but when you get to more sophisticated concepts it becomes easier to use than Python.

If you now want to learn R see this post of mine: Learning R: The Ultimate Introduction (incl. Machine Learning!)

UPDATE
I constantly update the post; if anything is not up to date, please let me know in the comments.

UPDATE September 23, 2021
I created a video for this post (in German):

100 thoughts on “Why R for Data Science – and Not Python!”

Owe Jessen says:

December 2, 2018 at 9:56 pm

Minor nitpick – RStudio is not a Microsoft product.

1. Learning Machines says:
  
  December 2, 2018 at 10:37 pm
  
  Well, Revolution Analytics was bought by Microsoft a few years ago.
  
  1. Alex says:
    
    December 2, 2018 at 11:01 pm
    
    I guess you are confusing RStudio with Microsoft R Open (which was a Revolution Analytics product until Microsoft bought them).
    
    1. Learning Machines says:
      
      December 2, 2018 at 11:13 pm
      
      Oh yes, my fault… Thank you. Will change it in the post.
      
2. Zrm solutions says:
  
  April 6, 2022 at 9:13 am
  
  A software program residence is a company that on the whole presents software program products. These agencies might also focus on enterprise or patron software merchandise wherein the organization is in particular invested in growing and dispensing software program merchandise.
  
3. tom stockfisch says:
  
  June 4, 2022 at 7:14 am
  
  Not being a microsoft product is a huge plus in my book.
  
Sebastian says:

December 3, 2018 at 1:09 am

Very good

Vanessa says:

December 3, 2018 at 1:48 am

Why do you even need to introduce a competitive stance, Python OR R, Python v.s. R? Each has pros and cons, but it’s comparing apples and oranges because the use cases are so different. Someone who likes Python could just as easily write a “Why Python for Data Science and not R” post, and it serves no good other than to get people arguing, akin to how one might over “vim vs. emacs.” Maybe it’s better to do data science, and share what you do, instead of fueling another “My favorite software is better than your favorite software” fire?

1. Learning Machines says:
  
  December 3, 2018 at 7:43 am
  
  We are not talking about religion but about tools which are either better suited for a job… or worse. It is important to have this conversation, also to help people make informed decisions about which software to choose.
  
  “Why Python for Data Science and not R” – I would seriously challenge anybody to write this post and post the link here.
  
  “Maybe it’s better to do data science, and share what you do” – no worries, I will do that – so stay tuned!
  
2. David says:
  
  October 13, 2020 at 12:34 pm
  
  Finally some common sense
  
  1. Sebastian Muller says:
    
    January 2, 2021 at 9:54 pm
    
    Agree.
    
Jean-Marc Patenaude says:

December 3, 2018 at 1:50 am

Thanks for your post. I use R extensively for data science and machine learning, and I love it!

Just wondering, other than the fact that more people know Python than R, can you come up with good technical reasons to learn / use Python for data science / machine learning projects (instead of R)?

I’d appreciate your thoughts on this. Thanks!

1. Learning Machines says:
  
  December 10, 2019 at 11:21 am
  
  In fact, I cannot!
  
2. Niels Kristian Schmidt says:
  
  March 10, 2020 at 7:31 am
  
  The choice of Python could be reasoned with learning through kaggle competitions, where Python dominates and learning is done via sharing of solutions. This learning process is very fast (and competitive) compared to academic learning.
  
3. Someone says:
  
  June 9, 2020 at 8:13 am
  
  It’s good if you want to apply your data science code to do things, since Python is a much broader language.
  
4. Alvaro Neto says:
  
  March 22, 2021 at 6:31 pm
  
  The development pipeline is streamlined with Python. R is better for the actual data science work, in my opinion, but then you might want to deploy your software on a live server, for example, a realm where R doesn’t compete at all. It might, therefore, be more productive to write everything in Python.
  
  1. Mohamed Jelassi says:
    
    October 8, 2021 at 12:38 am
    
    Have you heard about R Shiny?
    
Lambda Moses says:

December 3, 2018 at 2:09 am

I have also used Python for data analysis before, for a class whose instructor is a huge Python fan, but I still prefer R. In addition to the reasons you wrote about (I haven’t gone into as much depth as you did into Python), I prefer R because I prefer RStudio and RMarkdown to Jupyter notebook (I’ve also used Jupyter with R before). RMarkdown is integrated in RStudio, so we still get to use the environment, git, help, and Terminal panes, and it’s plain text so it’s easy to see what changed through git. Jupyter notebook is json when viewed as plain text so is harder to see what changed through git. Another thing I like about R is that CRAN and Bioconductor have stricter requirements for packages than PyPI; I’ve seen packages from pip that have terrible documentation and no unit tests. I also find ggplot2 easier to reason with than matplotlib.

Chen says:

December 3, 2018 at 6:31 am

I have heard that one of the major edges that R has compared to Python is Hadley Wickham. I think it makes sense.

1. Michael says:
  
  December 3, 2018 at 4:08 pm
  
  Hadley’s “tidy whitey verse” packages are actually one of the reasons why R gets a bad reputation. dplyr (for example) is a slow package for data frame manipulation compared to pandas. The data.table R package for data frame manipulation is significantly faster than pandas in Python so it should be touted more. Hadley Wickham’s dplyr package is definitely not an edge over pandas, so using it as a selling point for why R is better than Python will have you laughed off in front of a bunch of Python developers.
  
  1. Bernardo Lares says:
    
    December 3, 2018 at 6:15 pm
    
    I %>% really() %>% dont() %>% think() %>% so() is quite more understandable in code than so(think(dont(really(I)))), don’t you think so? Not everything is speed..
    
    1. Michael says:
      
      December 3, 2018 at 6:45 pm
      
      The %>% would be piping, which is a product of the magrittr package. Not dplyr. So my point still stands.
      
      1. Chen says:
        
        December 4, 2018 at 12:51 am
        
        You have made a rule: THE FASTER THE BETTER. It’s probably true, but for what? For the CPU or for the human brain? It’s a tradeoff actually; I prefer the latter.
    2. ag says:
      
      April 14, 2019 at 11:06 pm
      
      I
      dont
      think
      either
      is
      particularly
      readable
      
    3. Yuri says:
      
      March 3, 2020 at 12:12 am
      
      Excellent point! 🙂
      
    4. Alvaro Neto says:
      
      March 22, 2021 at 6:46 pm
      
      Not if you’re German.
      
    5. Learning Machines says:
      
      July 25, 2021 at 12:16 pm
      
      From version 4.1 on R also has a native pipe which is even more concise:
      
      Your example:
      I |> really() |> dont() |> think() |> so()
      
  2. foeffa says:
    
    December 18, 2018 at 5:19 pm
    
    This is some grade A garbage. So you mention the tidyverse and the only use case you mention is dplyr … ignoring the fact that the point of the tidyverse is that it ‘s more a paradigm/ecosystem to have a much more robust & standardized way to approach many tasks including ML, by leveraging multiple packages (purrr, broom, ggplot, tidyr and WIP like tidymodels, recipes, … and everything that is built on top) in a more standardized, easy to work with way.
    
    And to not stoop to your level of idiocy with blanket statements:
    – whether it’s an “edge” on Python per se is up for debate. I don’t use Python enough to judge. But to call the tidyverse a reason why “R gets a bad reputation” is ludicrous
    – data.table is great
    
    1. Learning Machines says:
      
      December 18, 2018 at 5:30 pm
      
      While your opinion is appreciated I would advise you to tone down your language. Qualifications like “garbage” and “idiocy” are not well received here.
      
    2. Clancy says:
      
      June 26, 2020 at 12:44 am
      
      Triggered?
      
  3. Learning Machines says:
    
    December 10, 2019 at 11:24 am
    
    Thank you for your comment, I have written a blog post on the topic:
    Why I don’t use the Tidyverse
    
  4. Niels Kristian Schmidt says:
    
    March 10, 2020 at 7:36 am
    
    Motivation is the key driver to programming speed, so TidyVerse might be faster than less cosy coding prose. Depends on the programmer.
    
  5. Sebastian Muller says:
    
    January 2, 2021 at 9:57 pm
    
    Hi Michael, I disagree.
    
    If you are a novice user, you will jump into data science faster thanks to the Tidyverse in R.
    
    If you are an advanced user, you will clearly be able to learn fast R and find the fastest route.
    
    I believe your comment is absolutely biased.
    
philip says:

December 3, 2018 at 6:46 am

Best argument for Julia I’ve read in a while

1. Learning Machines says:
  
  December 3, 2018 at 7:48 am
  
  Yes, we see an interesting development with Julia too. The only problem at the moment is that it is no comparison to R package-wise. But this might change in the future. We will see…
  
Alex Zolotovitski says:

December 3, 2018 at 6:52 am

I’d add the amazing `data.table` R package, which outperforms any operations with data.frames and AFAIK has no analogue in Python.

1. Franz Hochinger says:
  
  December 5, 2018 at 4:42 pm
  
  https://github.com/h2oai/datatable
  
Richard Careaga says:

December 3, 2018 at 7:51 am

I like **Python** for lots of stuff, including some libraries that are a bit ahead of their R counterparts when I last used them, in particular, the NLTK NLP package and the networkx graph package. I’m sure when I go looking I’ll find **R** implementations in those domains that work just as well.

But why I mainly like Python is that it’s an imperative/procedural language, which is the model I grew up on. It’s also what gave me the steep-learning-curve curse. I kept trying to make R work like what I was used to and it refused, for the most part. It even made the comprehensive help pages incomprehensible.

But then I dabbled in Haskell enough to come to the realization that **R** is a *functional* language.

No one who was able to grasp `f(x) = y` should have trouble making the transition to `y <- f(x)` and, with that realize that **R** is a treasure trove of tested functions that would take years to program in Python or C++, let alone test comprehensively.

That said, the very strength of **R** will be its competitive disadvantage until large organizations start employing programmers and data engineers comfortable in languages like Haskell. They'll pick an easy-to-translate **Python** solution over a hard-to-parse **R** implementation every time.

1. Learning Machines says:
  
  December 3, 2018 at 8:05 am
  
  Very well put!
  
  1. Sebastian Muller says:
    
    January 2, 2021 at 9:58 pm
    
    Totally agree!
    
Randall Williamson says:

December 3, 2018 at 6:05 pm

I remembered saving an infographic DataCamp produced on this topic in 2015: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis. I can’t speak to the accuracy of the analysis. Also, 3.5 years is a long time in this field.

JD says:

December 3, 2018 at 8:05 pm

I’m like you: I tried Python but it was too much of a hassle. My only issue with R is the “everything in RAM” problem. I am working on the Kaggle Christmas Traveling Salesman Problem with 197K cities; when you get to the optimize route matrix, R chokes and asks for another 150GB of memory. I’d like to stay with R but I think the majority of Kagglers have gone to Python because of the memory issue. If you have any ideas about R libraries to use, I would appreciate any suggestions.

1. Learning Machines says:
  
  December 3, 2018 at 8:25 pm
  
  There are several ways to work with large datasets – just a few pointers:
  – https://CRAN.R-project.org/package=bigmemory
  – https://CRAN.R-project.org/package=iotools
  – https://spark.rstudio.com/
  – https://www.datacamp.com/tracks/big-data-with-r
  
  Hope that gets you up to speed, no pun intended 😉
  
Harvey says:

December 4, 2018 at 1:56 am

Not sure any non pro R comment will have some credit here, but I guess you knew this would be troll bait.

Few of your arguments really hold, some are considered even as R weaknesses.

The version 2 vs. version 3 topic is mostly over. If you begin now with Python for data science you may even not notice it. Python Software Foundation had the courage to improve the language while others stay with their flaws…

Anaconda is the de facto Python distribution for data science, it comes with 200+ packages installed, plus more than 1000+ installable with one command: conda install .

RStudio is really cool, and certainly one of the main pluses for R. But it is not R; it would be great with many other languages. And apart from RStudio, you have no choice, especially when you have to work on a serious data science code base. Jupyter/JupyterLab is used a lot by data scientists (just check the numbers, no subjective assertion). Visual Studio Code is also getting large adoption, Atom + Hydrogen is very interesting, etc.

The function or method topic is the same with R & Python. It comes from the fact that R & Python are multi-paradigm languages, i.e. they support both a functional and an object-oriented style. It happens that object-oriented programming is more often used in Python, but it has nothing to do with the properties of Python or R. If some R package developers had used object orientation, you would have the same questions in R. It happens not to be the case with the packages you are using, but should someone discover the advantage of method chaining with objects, the same question would appear.

Having data structures as external packages is certainly an advantage from a performance point of view. While R is known for its low performance and poor resources management such as memory, external packages provide the capability to improve performance without requiring changing the core engine. Your own answer to the question about memory issues with R was to point to external packages.

About integration with C, you have interfaces both ways between Python & C. You should have a real look at how easy it is to call C from Python. Also Cython allows to closely embed Python and C extensions and to compile executables for maximum speed.

And finally, Python is more vivid on the side of tools and packages. Many new tools (in deep learning, NLP, etc.) are released with a Python API / wrapper, not so often with R. Once again, check numbers on github for example.

But you forgot R Shiny, one of the REAL strengths of the R ecosystem for prototyping data science applications. Too bad…

Thus, unless you rely on old, biased arguments, R is not that obvious a choice. I would even say it is getting harder. In a recent data science hackathon with 200+ data scientists competing, I had the chance to do some stats and 1 out of 10 was using R; the others were using Python. I was not looking to count this, but I was surprised how many Python screens I saw and how few R screens there were. So I did the stats.
Also, in its latest poll, KD Nuggets concluded that, over time, Python is getting more & more used for data science. See: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html (there is no reason to think KD Nuggets has a Python bias, which is not your case with R).

Hope this helps to have a better evaluation of pros & cons of each platform.

1. Learning Machines says:
  
  December 4, 2018 at 7:56 am
  
  “Not sure any non pro R comment will have some credit here” – your comment is well appreciated, thank you!
  
2. Min says:
  
  August 24, 2021 at 12:34 am
  
  I think this is the more objective view. Or at least provides more balance.
  
  I have used Fortran, Matlab, R, and Python for scientific computing; and used Python, JavaScript, and Java outside of scientific realm. Both R and Python are fantastic for data science for sure.
  
  The real strength of R is its statistical packages made by statisticians.
  
  But R’s advantages more or less end there. If packages are the answer for everything – package for viz, package for speed, etc. – well, Python has a larger pool of choices (perhaps that is a negative). One can have an advantage over the other in any given year, but that can change by the next year.
  
  I know this post is old, but comparing ggplot2 with matplotlib is like comparing seaborn with base R plotting. Altair is a better implementation of declarative visualisation than ggplot2. Seaborn looks better than both. Plotly Express is easier to use than any. Every time I see these blog posts about Python vs R, and see ggplot2 vs matplotlib is a dead giveaway that the author doesn’t use Python. Much like when I see a university lecturer’s R code that doesn’t use the assignment ( <- ) or piping, I know they don't really know how to program (shame on anybody who still teaches with Stata or Eviews).
  
  In any case, I don't think packages are advantages of the language but more about the ecosystem – and fortunately, the ecosystem for both continue to improve and expand. And if you are only proficient in one, make an effort to learn the other – not just watch a couple of udemy videos, but actually.
  
  I guess if you are a statistician, then you don't need to: R is sufficient. If you are a data scientist… nothing would be sufficient on its own, but R is not necessary while Python most certainly is a requirement.
  
  A data scientist's role is fluid and quite expansive. Are you going to orchestrate a data engineering pipeline on the cloud with R? No, one would need something like Airflow, Prefect, boto3 – Python would be better suited. Would you trust fitting a GLM with statsmodels over R? Probably not, maybe more complicated models will cause statsmodels to freak out and not converge to a solution while R handles it fine. Would I use either Python or R for a heavy duty Physics simulation? No. Maybe Python, but I'd try out Julia, and in some cases Fortran can be faster than C/C++. What if I was a machine learning engineer in the gaming industry? I'd use C#'s ML.NET or wrestle with C++ perhaps. Maybe I need to custom visualise something using JavaScript and D3.js. Perhaps you need to work with truly massive data and need Spark (using any one of Scala, Python, or R) or Dask (Python).
  
  I'm not saying one person would need to do all of this all the time, but learning how to learn the ecosystem and code idiomatically will take you further than merely learning to write code.
  
  I dunno, but if you are a monolingual, or advocate for being one, you'll have a hard time being a data scientist. That's my two cents.
  
Eric Graves says:

December 4, 2018 at 2:01 am

One big advantage python has over R is a single object oriented system instead of THREE in the base language (S3, S4, ReferenceClasses) and a fourth, R6, as the tidyverse approach. I also think python is better at manipulating raw text files in a manner that doesn’t blow up memory. Being able to iterate over a file line by line is very intuitive in python and, while possible in R, isn’t as natural.

But like many in these comments, I use both and am very much a R AND Python proponent. Especially when it’s so easy to interop between the two!

1. Learning Machines says:
  
  December 4, 2018 at 7:58 am
  
  Good points, thank you for your comment.
  
Guillaume says:

December 5, 2018 at 3:10 pm

As a statistician, I will provide my opinion. And…
I think R is made by statisticians, for statisticians.
I like R because you just need to open a console in the right directory, type R DF<-read.csv('mytable.csv') et voilà! You can play with the data with the provided packages .
I use python and you need more effort to obtain the same result.
You can use in R common methods with data scientists but I think this is not the point.

Best
Don't forget plumber, opencpu, rmarkdown

1. Duc Anh Hoang says:
  
  December 6, 2018 at 2:52 am
  
  I have used both R and Python for Data Science project but I still prefer R to Python. There are 2 main reasons:
  – There is no “Pipe operator %>%” in Python. Using the pipe makes the analysis much easier as it brings the process of writing code and the flow of thinking closer together.
  – RStudio is the outstanding GUI for Data Science, which is better than any GUI for Python (Pycharm, Jupyter Notebook, Rodeo, Spyder…)
  
  1. Alvaro Neto says:
    
    March 22, 2021 at 6:38 pm
    
    Though I’m not exactly a scholar in Python since I just started using it since January of this year (2021), I miss R already and I think that I will use it exclusively for analysis work in the future.
    
Santi says:

February 12, 2019 at 7:45 am

I have been dealing with virtual environments, conda, anaconda, iPython, Jupyter, PyCharm, IDLE and you name it. Now multiply this by the number of versions of each package of interest and their compatibility and it quickly becomes a mess. I felt so much relief after reading this; R and RStudio it is for me too.

1. Learning Machines says:
  
  July 25, 2021 at 12:13 pm
  
  Better late than never (my answer): Thank you for your great feedback, Santi! I really appreciate your view and of course, wholeheartedly agree with you.
  
Harry Zhu says:

April 15, 2019 at 3:14 am

`sparklyr` is easier than pyspark, if you know dplyr already.
`data.table` is faster than pandas, and widely used in finance field.
And many other Bayesian tools like `gstat`,`inla`,`lme4`,`brms`,`shinystan`,`bnlearn` beat their Python counterparts in functionality and usability.
Last, Python is a submodule in RStudio by `reticulate` R package, in that, no more discussion about who is winner.

1. Learning Machines says:
  
  April 15, 2019 at 1:40 pm
  
  Very valid points – Thank you, Harry!
  
Van Co Than De says:

July 5, 2019 at 9:57 am

I’m surprised that R is more popular than python in the field of data science today, can you tell me some advantages of R compared to python? Many Thanks!

1. Learning Machines says:
  
  July 5, 2019 at 10:02 am
  
  I thought I had given the advantages in the article above… or what do you mean exactly?
  
René Baumann says:

September 28, 2019 at 1:28 pm

As an individual making a living with data science (“only”) you should perhaps be more concerned about the evolution of auto ML than this battle.

1. Learning Machines says:
  
  September 28, 2019 at 1:55 pm
  
  Thank you for your comment, René!
  
  Well, at the moment this question is relevant for many aspiring data scientists and on top of that the question doesn’t go away even in the area of Auto ML (where I agree that this topic is going to gain a lot of momentum in the future).
  
  R is also well positioned in this area… I am planning a post on this topic, so stay tuned! (Or even better: would you like to contribute a guest post?)
  
  1. René Baumann says:
    
    September 30, 2019 at 10:49 am
    
    Thank you for asking to be a contributor. I would not pretend to have enough expertise. I am a seasoned business and marketing professional with solid knowledge and experience in stats and marketing research and some experience in ML using R (and SPSS), not really more. I have registered for a post degree course in data science and it is fully based on Python 🙁 (partly on Knime and Gephi).
    
Maruf Hossain says:

December 10, 2019 at 10:37 pm

Python is catching up on C integration with its Cython library, but your remaining concerns stay as they are: the inconsistency between pandas and NumPy, and function vs. method!

1. Learning Machines says:
  
  December 10, 2019 at 11:03 pm
  
  Thank you, Maruf, I really appreciate your feedback!
  
Logan Reed says:

January 23, 2020 at 12:08 pm

Nice Content!
I have recently been learning Django for my project work. Thank you for highlighting that R is the preferable language to go for when it comes to data science. Could you also please write a blog post on the advantages and disadvantages of R and Python? I would appreciate it.
Keep on updating us with such great content. Cheers!

1. Learning Machines says:
  
  January 23, 2020 at 12:51 pm
  
  Thank you for your great feedback!
  
  I am always open to suggestions. Advantages and disadvantages with respect to what (because I already covered data science)?
  
Mathew B. Bowers says:

January 28, 2020 at 11:58 am

Wow, it’s a great article. But I have a question: what does it take to become a data scientist? I want to become one.

1. Learning Machines says:
  
  January 28, 2020 at 12:06 pm
  
  Thank you… well, I write most of my posts in such a way that they can be used to learn about data science. Good starting points are the following categories: https://blog.ephorie.de/category/learning-r and https://blog.ephorie.de/category/machine-learning. I also use many of the posts as teaching material for my data science classes. If you or your company need more customized training or coaching please let me know…
  
2. Rob Reynolds says:
  
  February 3, 2020 at 6:19 am
  
  I have led decision science teams for big companies and I get this question a lot. I answer it this way. Data science is the combination of domain expertise (like an industry focus), programming, and mathematics. While data science is an awesome field and can be very fulfilling, it can be as frustrating and disappointing as any other career choice, if it isn’t right for you. Many people seem to think about this as a way to get a good job quickly because demand is so high right now. This is not necessarily the case. I encourage people to ask themselves why they want to become a data scientist or any other career choice they may not know much about. Once you answer that for yourself, you can find an awesome career whether it is in data science, engineering, analytics, or something else all together. Here is a link to a “day in the life” post for data scientists who work in corporate settings, that may be helpful.
  
  https://therandomvariable.com/what-does-a-data-scientist-really-do/
  
  1. Learning Machines says:
    
    February 3, 2020 at 8:03 am
    
    While this is certainly interesting, what is your take on the R vs. Python debate?
    
    Reply
Marek says:

February 15, 2020 at 7:20 pm

C’mon, writing in 2018 you should have left the 2.x vs. 3.x dilemma behind… this is over and 99% of packages are 3.x compatible… it seems like you desperately need to prove that Python has weak points, which is difficult vs. R?

Reply
1. Learning Machines says:
  
  February 16, 2020 at 8:51 am
  
  You are only referring to the supposedly weakest argument… what about the others?
  
  Reply
Enrique says:

February 24, 2020 at 2:26 pm

A developer must learn both. If you work for a company that has R implemented in its platform, you will not get to change that.

Reply
1. Learning Machines says:
  
  February 25, 2020 at 8:45 am
  
  It might be the case that you have to learn both languages in a company setting but this doesn’t answer which language is better suited for data science.
  
  Reply
xsmb123.com says:

March 11, 2020 at 5:22 am

Python is really difficult for programmers!

Reply
1. Learning Machines says:
  
  March 11, 2020 at 6:15 am
  
  In what way? Could you be more specific?
  
  Reply
Osama says:

May 4, 2020 at 3:43 pm

I am a beginner and want to start learning some advanced analytics. I am a business student and I see marketing trends, reach analysis, big data analysis and similar things. I intend to do a PhD in business analytics as well. For a complete starter like me, who has used SPSS and JASP, what would you suggest? After reading your article, the comments and the discussion, I thought of R for myself. But I need an endorsement. In short, I have to be a data analysis guy. R or Python? Please. And it’s 2020.

Reply
1. Learning Machines says:
  
  May 4, 2020 at 3:49 pm
  
  Well, I would still wholeheartedly root for R, especially in the areas you mentioned. In any case, have a look at Learning R: The Ultimate Introduction (incl. Machine Learning!) which shows you how easy to learn and powerful R is!
  
  Reply
Adriano Azevedo says:

July 17, 2020 at 8:00 pm

My background: 40 years of experience with programming, starting with Fortran, Algol, Pascal, and later on C, C++, MATLAB. I started with R about 20 years ago and do a lot of data analysis with both an academic and a consulting focus. I also competed on Kaggle with good results a couple of years ago. I did some programming with Python, but not much.

My simple answer to the question: it depends on what your actual background and expertise are. The best data science language and programming environment for you is the one you know best and in most depth.

For people more related to CS in general, I guess Python is the natural way to approach data science and this is, I believe, because Python became a very popular general-purpose language within CS during the last decade or so. If you have a lot of non-data-science Python expertise and lots of friends who know and use Python, data science with Python is most likely the way to go.

People with a background more related to statistics and/or long-term expertise in R will feel more comfortable staying with R, as they know how to solve data science problems with R. And this is not because of the tidyverse or RStudio. It is due to their knowledge of R and of how to solve problems with R and the available packages. Things like Rcpp, R Shiny, data.table, ff*, big_*, sparklyr are, I believe, more relevant for problem solving nowadays than the tidyverse or RStudio.

My guess is that if you compare the solutions given to a data science problem by two real experts, one in Python and the other in R, the final results of both approaches will be very similar in general terms. Existing differences, if any, will be due more to their level of expertise in their respective programming environments (and the techniques used) than to the features of each environment.

I think the original question is more relevant for a person without any knowledge of Python and/or R who wants to start a career in data science. For those, both alternatives look comparable to me. Flip a coin? Or, better, get some exposure to both worlds and see which is more attractive to you. Both are powerful but have a steep learning curve if you want to reach a level of expertise sufficient for solving large-scale problems. There are no easy shortcuts.

Reply
1. Learning Machines says:
  
  July 17, 2020 at 8:16 pm
  
  Thank you very much for your balanced view on the matter, Adriano.
  
  Reply
Pingback: 2 years old but still relevant today | Un poco logico y un poco loco
Normunds says:

September 18, 2020 at 11:18 am

Hello,

I read your post and the comments… it is also a dilemma for me: R or Python. I have started both, and I like R syntax more than Python (maybe because I didn’t have programming experience before), but it seems that Python is evolving a little bit faster than R.
But the question is, what would you suggest for sports data analytics? There are a few key areas which have to be covered: web scraping, sports models, machine learning, forecasting, visualization and dashboards.
Maybe there are other important areas which I don’t know of yet…

Reply
1. Learning Machines says:
  
  September 18, 2020 at 6:51 pm
  
  Thank you for your comment! Concerning sports analytics: I am not an expert here but the things I have seen so far were mostly written in R… but again, I may be biased.
  
  Reply
2. Learning Machines says:
  
  July 1, 2021 at 10:12 pm
  
  You might be interested in my latest post: Euro 2020: Will Switzerland kick out Spain too?
  
  Reply
Jose Luis says:

June 8, 2021 at 3:30 pm

Are the reasons still valid 3 years after the post was released? Hasn’t there been any improvement in Python that makes it better now?

A post update would be very much appreciated!

Thanks in advance.

Reply
1. Learning Machines says:
  
  July 1, 2021 at 10:11 pm
  
  I constantly update the post so all points should be valid. If there is anything that is not up-to-date please let me know!
  
  Reply
Juan Diego Castrillón Rosales says:

July 10, 2021 at 7:15 pm

I’m an economist and I’ve been working with data visualization tools like Tableau and Power BI to support my choices regarding project management. I’m thinking about getting a deeper understanding of data analysis (perhaps becoming a data analyst) and I’m considering taking a course in this. I have 2 choices: the course from Google that uses R and the other from IBM that teaches Python. For a person who has no programming background and wants to enter the data analysis world, what would you recommend?

Reply
1. Learning Machines says:
  
  July 10, 2021 at 7:28 pm
  
  Dear Juan, Thank you for your question.
  
  I would clearly recommend R! Especially when you have no programming background, Python might seem tempting because the learning curve seems flatter at the beginning. But the big problem is that when you start doing data analytics with Python you will have to learn a second language on top of the first one because of the necessary packages (NumPy, pandas, etc.).
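
One way to picture the "second language" remark, as a minimal sketch that assumes NumPy is installed (the variable names and numbers are made up for illustration): the same `+` operator concatenates plain Python lists but adds NumPy arrays element by element.

```python
import numpy as np

# Plain Python lists: "+" concatenates the two lists.
prices = [1.0, 2.0, 3.0]
fees = [0.5, 0.5, 0.5]
concatenated = prices + fees

# NumPy arrays: "+" adds element by element (vectorized arithmetic).
totals = np.array(prices) + np.array(fees)

print(concatenated)     # one list holding all six values
print(totals.tolist())  # three element-wise sums
```

Whether this counts as an inconsistency or simply as a library convention is exactly what the thread is arguing about.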
  
  When you learn R everything is consistent because it was originally built for doing data analytics. So you should go for the Google course! Could you perhaps post the link to that course?
  
  Reply
2. B Zafar says:
  
  July 25, 2021 at 10:53 am
  
  Hello Juan! I would like to recommend the HarvardX course R-Basics for Data Science, the first of a series of courses from Harvard via edX taught by the amazing Professor Rafael Irizarry. It was really very well taught as an introductory course and I thoroughly enjoyed it. I also took the Python course from IBM and I have to say I was very disappointed, so I’d steer clear of it. If you are a beginner I would also like to recommend Hands on Programming with R (O’Reilly) as a first text, followed by Learning R (O’Reilly), and then R for Data Science – Import, Tidy, Transform, Visualize, and Model Data (O’Reilly). These are the best texts in my opinion, after going through dozens of books, for breadth, depth, and structure.
  
  Best of luck!
  
  Reply
B Zafar says:

July 25, 2021 at 10:48 am

I wholeheartedly agree with you that R is better suited for data science than Python, exactly for the reasons you have described. I spent the past three months learning R, Python, as well as visualization platforms like Tableau. I worked on all three religiously with no previous background in any of them whatsoever. In fact, when I started, I was a little wary of R and was very excited about Python because I kept seeing “it is really easy to learn”/“it is really fun to learn and work with Python” type comments all the time. At the end of these three months, I absolutely LOVE working with R. The syntax is intuitive, and the way it works with its parts is pretty straightforward. It took me almost three weeks to just get started with Python because of the many, MANY options available with their various strengths and limitations, which then creates variations in how to get started with any of them. While Python continues to try to be, as you said, “everyone’s darling”, I am going to continue to work towards mastering R and not Python. PS: Awesome blog!!!

Reply
1. Learning Machines says:
  
  July 25, 2021 at 11:30 am
  
  Dear B: Wow!! That is a hell of a story. Thank you for sharing your learning journey with us and keep up the spirit!
  
  Reply
Pingback: Learning Path for “Data Science with R” – Part I – Learning Machines
Andrea Dalseno says:

August 20, 2021 at 2:35 pm

Hi, I’m a Python user (I can’t call myself an expert, let’s say I’m confident) and a newbie (maybe something more than a newbie) in R.
I won’t say whether one is better than the other (I am not qualified for that). The reality is that Python is attracting way more new users than R, and that may make the difference in the long run.
My humble opinion on why so many new users adopt Python instead of R is that Python is incredibly easy and fast to learn and, above all, consistent. Yes, the base language (a general-purpose language) lacks some features like vectors and matrices, but for that there is NumPy, which is the de facto standard for this task. For tabular data there is pandas (built on top of NumPy: consistency, remember?), with a lot of additional packages that extend its core functionality; think of geospatial data, for example. Many new packages developed to overcome the limits of pandas (big data, speed and so on) use the same APIs as pandas, so switching to the new library (mostly) means simply changing the import line! Scikit-learn is the standard for machine learning. All this means that after a short training you can open a Kaggle notebook and understand what’s going on; and it works, at least for the vast majority of use cases.
This is not true for (base) R! In my humble opinion, (base) R is not as fast and easy to learn as Python and, most importantly, it has many functions that do (almost) the same thing with different syntax and (often) different return values. A nightmare for a new learner (even though I realise that for an experienced developer it may be an advantage; but to become experienced you must start! If you give up at the beginning because of frustration, and switch to Python, you will never become an expert). This is where the tidy universe (tidyverse and tidymodels; this comment also replies to your post about dplyr) came into play: “tidy up” the mess in the R world with a simple, straightforward and consistent interface for doing things. You can build upon it to extend and improve its functionality, but it aims to become the de facto standard way to operate (at least for the vast majority of use cases). Think of ggplot2 (something that, as a Python user, I highly envy R for; data visualisation in Python is messy, with several different libraries and different syntax): ggplot2 is the base upon which to build, adding new functions to extend its core capabilities. Easy to learn and to use (and consistent). Will the tidy universe succeed? I don’t know; what I think (my humble opinion) is that either it will succeed or R will lose its battle.

Reply
1. Learning Machines says:
  
  August 20, 2021 at 3:24 pm
  
  Thank you for your extensive comment, Andrea!
  
  It is interesting to see how different users experience learning a language. For me, it was the exact opposite! I am coming from assembler, Basic, a little bit of Mathematica and Pascal, and later C++. Then I started with Matlab/Octave, Python, and R, at about the same time. I didn’t like the overly commercial approach of Matlab and Octave is just not the same, so I quit. I agree that (base) Python is easy to learn, almost like pseudocode, but I find the vector- and matrix-oriented packages totally inconsistent with it. Also, the sometimes strangely different behaviour of functions and methods irritates me.
  
  Concerning graphics, I agree that R is still THE standard. This is also the reason why all of the big media companies still use it for sophisticated infographics. Although I personally don’t use ggplot2 I can understand its appeal.
  
  Oh yeah, and the tidyverse… it might be that I am an “expert” in your sense but what I don’t like and don’t understand is why you would need quite another language on top of R! I am a big fan of functional packages (I developed one myself, OneR) but I don’t like packages which totally change the character of a language. It seems like a hostile takeover to me, some kind of land grab of a commercial company! It might be that had I come later to R I would totally be in love with the tidyverse because I wouldn’t know the elegance and strength of base R. But I do, so I am not a big fan.
  
  I wouldn’t be so pessimistic about the future of R. Much of the hype of Python is, in my opinion, based on neural networks which are themselves totally overhyped. As an AI “veteran” I can tell you that all those AI technologies come and go in waves (this is not the first time that neural networks are the latest craze). When there is a new kid on the block (whatever it might be) it will be a new game. It might very well be that R will become the latest craze again (btw something like that happened during the peak of the Covid pandemic because statistical modelling was important again!).
  
  I am optimistic that R is here to stay and will become more important in the future. Concerning the tidyverse: I don’t know, we will see…
  
  Reply
  1. Andrea Dalseno says:
    
    August 20, 2021 at 4:51 pm
    
    Thank you very much for your reply. That’s the point: you are an R “expert” and you don’t understand why you should change your mind to adapt to the tidyverse model. But if you were new you would find the tidyverse approach natural and consistent, easy and quick to learn. And it will work in most situations. With time you can become an expert and find more efficient ways to solve problems, or ways to fix them when the tidyverse approach does not work.
    Regarding the “inconsistencies” in Python, I would not call them inconsistencies: different libraries may have different settings for default parameters. The fact that you can use either a functional approach (numpy.any( df ==1)) or a method approach ( (df==1).any() ) does not mean the two functions are exactly the same (even though they will give the same result if you correctly set the default parameters); this may happen in R too, doesn’t it?
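
To make the two spellings concrete, here is a minimal sketch that assumes pandas and NumPy are installed; the data frame `df` is made up for illustration. The values are extracted with `to_numpy()` first so that the functional call is a plain NumPy reduction over the whole array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the discussion above.
df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})

# Functional style: NumPy reduces over the whole array by default (axis=None).
any_one_functional = bool(np.any((df == 1).to_numpy()))

# Method style: pandas reduces column by column by default (axis=0).
any_one_per_column = (df == 1).any()

# Setting the axis explicitly makes the method agree with the function.
any_one_method = bool((df == 1).any(axis=None))

print(any_one_functional)
print(any_one_per_column.to_dict())
print(any_one_method)
```

The difference between the second and third call is only the `axis` default, which is the "default parameters" point made above.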
    
    Reply
    1. Learning Machines says:
      
      August 20, 2021 at 5:46 pm
      
      R is much more consistent here: objects do not have methods associated with them in typical R parlance. The class of an object determines which function methods will be applied to it (= polymorphism).
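
For readers coming from Python, the generic-function style described here can be imitated with `functools.singledispatch` from the standard library. This is only a rough analogy to R's class-based dispatch, and the `describe` function and its messages are invented for illustration.

```python
from functools import singledispatch


@singledispatch
def describe(x):
    # Fallback used when no more specific implementation is registered.
    return f"object of type {type(x).__name__}"


@describe.register
def _(x: int):
    # Chosen when the argument is an int.
    return f"integer {x}"


@describe.register
def _(x: list):
    # Chosen when the argument is a list.
    return f"list of length {len(x)}"


print(describe(3))
print(describe([1, 2, 3]))
print(describe(2.5))
```

The caller always writes `describe(obj)`; which implementation runs depends on the class of the argument rather than on a method attached to the object.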
      
      Reply
Roderich says:

December 7, 2021 at 9:57 pm

I am just learning a little R. It was very nice to read your ultimate introduction to R. Thank you very much for it! I suspect the real reason to use R is the large number of statistical packages, and also that it is a standard for statistics. Why not Python? Because I prefer Tcl/Tk and do not like Python! Python is so absurd that one must include the whole of Tcl/Tk in it in order to have tkinter and widgets. It is the product of an inflation of scripting languages. I do not agree with you that it is a good language for learning programming, especially because of the semantics of whitespace. FORTRAN IV is perhaps the right language for learning programming. But I do see an advantage in using a general-purpose scripting language: to easily write prototypes and glue many programs together.

Reply
saanjh mehra says:

March 9, 2022 at 4:56 pm

I just began dabbling in data science with a paid course on a website called https://pickl.ai/, https://www.edvanza.com/ and reading your take has been a bittersweet experience (they’re teaching using Python).
Nonetheless, it was extremely good and eye-opening! I intend to learn R while working with Python. Your website has a good vibe, I’ll go on and read from the source in updates.

Reply
p says:

May 20, 2022 at 3:02 pm

After 16 years I can tell that:
Python syntax is better than vanilla R syntax (maybe because Python is a general-purpose language – dictionary, enumerate, list comprehension)
dplyr alone is better than pandas
but the tidyverse is a mistake that weakened the R community
(btw, pandas has pipe(), but no one uses it because Python code looks great, so you don’t escape from writing it)
it’s easier to build a model in R than in Python
statistical testing is easier in R and cross-validation-based approaches in Python
in R you have functions for everything, in Python you create your own more often
in Python you write less code, but you can’t create readable long scripts and manage project structure properly
install.packages is great
people do not manage versions of R packages and R itself, so after 5 years nothing works – no good tools for that in R
so the Python pip/conda package management systems are a big plus (you also install Python itself with them)
but there are constant dependency problems in Python, which R doesn’t have
R is good for academia when you do not deploy or replicate your results (I know academia too well)
in Python you can do a lot of pure programming stuff with flask, sqlalchemy, dash, airflow
in R only shiny (which I hate)
R deployment is not mature – compare python-setup and r-setup GitHub Actions build times, for example – and quality of R repositories at GitHub is also better (I believe you don’t use GH)
scikit-learn does not provide p-values even for a linear model (shame) – this tells a lot about the difference between the two communities…
R people discuss R vs. Python and Python people don’t – and they have often never seen it at all
R has all possible packages for statistical analysis and Python for anything other than SA 😉
in R people still create libraries in FORTRAN
Go is better than Python
R as a language is more flexible (NSE etc.) but goes nowhere, because it’s a functional language
R is numerically more precise than base Python
but in pure R you can’t (unless you are …) implement ML algos like random forest
in Python you can
there are too many package choices you have to make in R
in Python you do 90% of standard data analysis in pandas, numpy, scikit, matplotlib/plotly, seaborn and statsmodels and in R you don’t (for example broom vs lapply for lm in subgroups)
no OOP in R, R6 is a mistake in comparison to Python’s OOP system
in Python you can’t overload the object constructor …
the Python community is more open – and produces more open source packages, the R community writes more of them
more companies support Python (are porting their solutions/packages to Python)
the Python language is changing, R is not
Python has PEP8
the R help system and documentation are better, but in Python you can read readable internals
Python as a language has different implementations so there is hope for change
Python is part of the Ubuntu distribution
ggplot2 and data.table are examples showing that R is a more flexible language, but they create their own sublanguages, so it’s not very good for R itself
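
The general-purpose constructs named at the top of this list (dictionary, enumerate, list comprehension) look roughly as follows in plain Python; the package names and counts are made-up values used only for illustration.

```python
# Hypothetical data: package name -> number of times it was used.
usage = {"pandas": 12, "numpy": 30, "statsmodels": 4}

# List comprehension: keep the names of packages used more than 10 times.
frequent = [name for name, count in usage.items() if count > 10]

# enumerate: number the entries while looping (dicts keep insertion order).
ranked = [f"{i}. {name}" for i, name in enumerate(usage, start=1)]

print(frequent)
print(ranked)
```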

Reply