Why I don’t use the Tidyverse


There seems to be a revolution going on in the R sphere… people seem to be flocking to what is commonly known as the tidyverse, a collection of packages developed and maintained by Hadley Wickham, Chief Scientist at RStudio.

In this post, I explain what the tidyverse is and why I resist using it, so read on!

Ok, so this post is going to be controversial, I am fully aware of that. The easiest way to deal with it if you are a fan of the tidyverse is to put it into the category “this guy is a dinosaur and hasn’t yet got the point of it all”… Fine, this might very well be the case and I cannot guarantee that I will change my mind in the future, so bear with me as I share some of my musings on the topic as I feel about it today… and do not hesitate to comment below!

According to its own website, the tidyverse is an opinionated collection of R packages designed for data science [highlighting my own]. “Opinionated”… when you google that word it says:

characterized by conceited assertiveness and dogmatism.
“an arrogant and opinionated man”

If you ask me it is no coincidence that this is the first statement on the webpage!

Before continuing, I want to make clear that I believe Hadley Wickham does what he does out of a strong commitment to the R community and that his motivations are well-meaning. He obviously is also a person who is almost eerily productive (and to state the obvious: RStudio is a fantastic integrated development environment (IDE) which is still looking for its equal in the Python world!). Having said that, I think the tidyverse is creating some conflict within the community which in the end could have detrimental ramifications:

The tidyverse creates a meta-layer on top of base R, which changes the character of the language considerably. Just take the highly praised pipe operator %>%:

# Base R
temp <- mean(c(123, 987, 756))
temp
## [1] 622

# tidyverse
library(dplyr)
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
temp <- c(123, 987, 756) %>% mean
temp
## [1] 622

The problem I have with this is that the direction of the data flow is totally inconsistent: it starts in the middle with the numeric vector, goes to the right into the mean function (by the pipe operator %>%) and after that to the left into the variable (by the assignment operator <-). It is not only longer but also less clear in my opinion.

I know fans of the tidyverse will hasten to add that it can make code clearer when you have many nested functions but I would counter that there are also other ways to make your code clearer in this regard, e.g. by separating the functions into different lines of code, each with an assignment operator… which used to be the standard way!
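For illustration, here is one tiny computation written the three ways discussed above (a minimal sketch of my own; the piped line is commented out because it needs magrittr or dplyr attached):

```r
x <- c(123, 987, 756)

# Nested call: reads inside-out
result_nested <- round(sqrt(mean(x)), 1)

# Separate lines, each with an assignment operator -- the classic way
m <- mean(x)
s <- sqrt(m)
result_steps <- round(s, 1)

# Piped version (requires magrittr or dplyr):
# result_piped <- x %>% mean %>% sqrt %>% round(1)

result_nested == result_steps  # TRUE
```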

But I guess my main point is that R is becoming a different beast this way: we all know that R – as any programming language – has its quirks and idiosyncrasies. The same holds true for the tidyverse (remember: any!). My philosophy has always been to keep any programming language as pure as possible, which doesn’t mean that you have to program everything from scratch… it means that you should only e.g. add packages for functional requirements and only very cautiously for structural ones.

This is, by the way, one of my criticisms of Python: you have the basic language, but in order to do serious data science you need all kinds of additional packages, which change the structure of the language (to read more on that see here: Why R for Data Science – and not Python?)!

In the end you will in most cases have some kind of strange mixture of the different data and programming approaches, which makes the whole thing even messier. As a professor, I also see the difficulties in teaching that stuff without totally confusing my students. This is often the problem with Python + NumPy + SciPy + pandas + scikit-learn + Matplotlib, and I see the same kind of problems with R + ggplot2 + dplyr + tidyr + readr + purrr + tibble + stringr + forcats!

On top of that, the ever-growing complexity is a problem because of all the dependencies. I am always skeptical of code where dozens of packages have to be installed and loaded first. Even in the simple code above, just by loading the dplyr package (which is only one out of the eight core tidyverse packages), several base R functions are masked: filter, lag, intersect, setdiff, setequal and union.
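Masking is not destructive, though: the shadowed versions remain reachable through explicit namespace qualification, a common defensive habit. A small base-only sketch:

```r
# Even with dplyr attached and filter/lag masked, the original
# functions can still be called explicitly via the :: operator:
base::intersect(1:5, 3:7)                  # always base semantics
stats::filter(1:10, rep(1, 3), sides = 1)  # the time-series filter, not dplyr's
```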

In a way, the tidyverse feels (at least to me) like some kind of land grab, some kind of takeover. It is almost like a religion… and that I do not like! This is different with other popular packages, like Rcpp: with Rcpp you do the same stuff but faster… with the tidyverse you do the same stuff but only differently (I know, in some cases it is faster as well but that is often not the reason for using it… contrary to the excellent data.table package)!

One final thought: Hadley Wickham was asked the following question in 2016 (source: Quora):

Do you expect the tidyverse to be the part of core R packages someday?

His answer is telling:

It’s extremely unlikely because the core packages are extremely conservative so that base R code is stable, and backward compatible. I prefer to have a more utopian approach where I can be quite aggressive about making backward-incompatible changes while trying to figure out a better API.

Wow, that is something! To be honest with you: when it comes to software I like conservative! I like stable! I like backward compatible, especially in a production environment!

Everybody can (and should) do their own experiments “to figure out a better [whatever]” but please never touch a running system (if it ain’t broke don’t fix it!), especially not when millions of critical business and science applications depend on it!

Ok, so this was my little rant… now it is your turn to shoot!

110 thoughts on “Why I don’t use the Tidyverse”

    1. tidyverse is built entirely on base R. Fanboys are kinda blind to the fact that much of the internal code of tidyverse is also base R. So base R is the C of data analysis: not that fancy, but unmissable. Unlike IE…

      1. WHOOOSH!

        One should only use Internet Explorer to download a better browser.
        One should use base R to run tidyverse on top of it.

        1. I can wget any browser. Try using tidyverse without the base code in the packages. And give me a call when you do, I love a good bit of comedy.

    2. Tidyverse is the Barbie doll of data analysis, pretty (and popular with the kids!) but you can only have limited conversations with it.

  1. Cards on the table – I am a tidyverse user. That said, I can understand why someone would not be interested in learning the tidyverse – particularly if they have an existing solid foundational understanding of base R.

    I find the tidyverse useful for a number of reasons:
    It makes code readable. The names of functions and arguments, and indeed the adoption of the pipe operator (which I am encouraged to read as ‘and then…’), along with choosing reasonable names for my own variables/datasets/functions, make it easier for future-me or a colleague to pick up my code and make sense of it quickly.
    It makes R easier to teach. While I think R users should gradually get to grips with the base R world, as it is filled with useful functions not existing in the tidyverse, every time I have tried to teach it I am reminded that base R is not intuitive to new users. The intentionally consistent design choices of the tidyverse impose some order on the R world, and the available resources (R for data science & the cheatsheets) are a great asset to new users (and indeed, myself).

    Personally, I think the tidyverse comes from trying to solve a specific problem: Analysts want to use packages, as it saves writing your own functions, and packages are easy to create in R – but different teams working in neighbouring areas do not necessarily build packages that are easy to integrate. This forces the analyst to routinely build workarounds between sections of their analysis. The tidyverse builds a core collection of packages to do a significant proportion of most analysis without the need for workarounds. It also builds a common touch point for other packages to build upon, to fill in the other areas (e.g. gganimate).

    Your point regarding instability is certainly valid. There are various methods to help alleviate the instability (package managers, docker containers, good coding practice, etc.), but these themselves add complexity. Ultimately, of course, this is a balancing act – do you want stability, or do you want innovation? This is not a question with a yes or no answer; the solution for each project will fall in a different place on the spectrum.

    The programming language and packages are just a single source of instability. That is to say, yes, I have to deal with language changes – but I also have to modify my scripts to deal with other external changes anyway, such as regularly changing user requirements. As ever, what I am managing are trade-offs.
    To be clear, these scripts are run not in a production environment, but are kicked off on a weekly/monthly basis – so I am not working in the scenario you warned about, which would tip the trade-offs towards a more stable environment; but I imagine my situation is fairly common.

    I don’t expect to dissuade you from sticking with base R, as I doubt you have any illusions (or intentions) of persuading me to renounce my use of the tidyverse. But I thought you might appreciate that for people working in what I would guess is quite a different environment to your own, the tidyverse has proven to be very useful. In the world of analysis, there isn’t just room for differing opinions – it seems to be what the analytic world is made of.

    Thank you for your write up, it was a pleasure to read.

    1. Thank you very much, David, for your very well thought-out comment. I was hoping to attract arguments like this, and to your point: I would not even rule out being persuaded at some point in the future to give it a serious try in a bigger project…

      1. Thanks for the read. I think your arguments are valid. My main pro argument for the tidyverse is coding speed. The consistency just allows me to sketch thoughts out really fast for DS projects. That also goes for tibbles, mentioned in another comment: As long as one keeps things ‘tidy’, they make life much easier. Plus, the grammar nerd in me finds the whole project very pleasing, and that’s why I don’t mind the occasional radical changes.

        If I write packages, I try to keep it base. Obviously, I don’t want my packages to break over dependencies either.

        1. “tibbles” – if one bit of base R deserves a well-deserved retirement, it is data frames’ “stringsAsFactors” option. The number of hours that I have spent re-re-discovering this frustrating “feature”! And I know, I should set it up I-forget-where so that it never bothers me again, but I’m just an aging, self-taught hacker.

          1. You’ll be glad to know that the default as of R 4.0.0 is stringsAsFactors = FALSE.
            I too was frustrated by it, especially since the first tutorials I read some years back made no mention of it. And then I couldn’t understand why my early attempts failed.
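The behaviour change is easy to demonstrate in a few lines (an illustrative sketch; on R < 4.0.0 the first form was what a plain data.frame() call gave you silently):

```r
# Explicitly request the old pre-4.0.0 default behaviour:
d_old <- data.frame(name = c("a", "b"), stringsAsFactors = TRUE)
class(d_old$name)  # "factor"

# The opt-out, which is the default since R 4.0.0:
d_new <- data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
class(d_new$name)  # "character"
```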

    1. Which was exactly my reaction. Yes, that makes it harder to “skim” for the result of each line because we sort of expect to see the “output” on the left margin, but all pipe operations, whether in R or in bash, have the final result on the far right.

  2. Pipes flowing left to right can simply be finished with the compound-assignment pipe %<>% or right-assignment ->. What a silly objection.

    As to the other critiques, they aren’t miles off. R is developing more and more as an abstraction layer on top of other technologies (since it’s slow, has poor support for deep learning, etc etc), which is wonderful because it preserves the other quirks about R that make you and I love it.

    I fully expected a “I use data.table”-style post, and you’d have quite a few more arrows in your quiver…
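For what it's worth, the commenter's suggestion can be sketched with right assignment; with a recent R (>= 4.1) the native |> pipe even removes the package dependency entirely:

```r
# The whole data flow now runs strictly left to right,
# ending in a right assignment (base R >= 4.1, no packages):
c(123, 987, 756) |> mean() -> temp
temp  # 622
```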

        1. I have a different criterion for utopian. Backwards compatibility, so that my code doesn’t stop working, is my utopia. Looks like the tidyverse is in alpha. Once he figures out what the interface should be, there would be a stable platform. I thought Python suffered because Guido kept changing the spec and developers were really beta testers. There is a reason why people study formal language development. Personally, I spend little time on coding. I spend more time thinking about the best way to plot data. Lattice graphics was a big help for me.

  3. Interesting that you end your post by admitting it’s a rant – since that’s what it boils down to. What is your objective here? Having read through the post, I doubt you’ll convince any tidyverse user.

    Tidyverse is not a religion, but it is a paradigm. And proponents of a different paradigm will not be convinced by any of this… Language “purity”, limiting dependencies, a community fracture, reads more clearly/consistently vs. piping? These are mostly just opinions and attitudes, apart from the purity bit. And what does this purity add in terms of value?

    I could point out that base R code in my experience is:
    * Hieroglyphs by comparison – takes a lot more time to read for stakeholders unfamiliar with it. Especially since a lot of technical stakeholders will know SQL, so they’ll recognize the verbiage and concepts of what you’re doing in the tidyverse
    * Generally slower
    * Clunky if you have to write the equivalent of scoped verbs in tidyverse
    * Less elegant without things like map

    At least in the data.table vs. tidyverse debate, one can argue that there are obvious and irrefutable advantages to using data.table, which is 1/ incredible speed 2/ functionality like theta joins. Even then, one can mix and match.

    So… glad you have it out of your system, but I’m inclined to believe that’s all this post accomplishes, really. 🙂 Good on you for preferring base R; let’s all just use what works for us until any REAL structural issue comes up.

      1. Every script I open a week or more after writing it shows me that the time needed to get back into it is shorter when it was written in the tidyverse than in pure base R. Moreover, what can be measured in minutes or seconds for me would be hours or days if that script were opened by a new R student. The amount of time my collaborators and I save every month, counting from kick-off to report, is enormous!! The time I have saved by teaching the tidyverse to a newR, instead of starting from base R, is beaten only by the joy of seeing them become quite autonomous and enthusiastic after a single day of training from scratch, conducting simple but complete data analyses from import to report. Maybe the “expert you” may prefer base R, and sometimes or often even I prefer base R when I know I prefer it for some very specific lines of code… but every day, at every starting project, at every beginning of a script or package, by default I start with the tidyverse. And that saves me tons of hours. And it saves tons of hours for everyone else who will read my programs, scripts and analyses.

        It is not a matter of taste to me (sandwich vs. sequential nesting notation). It is not a matter of stability (it remains to be proven which was worse: R 3.5.3 -> 3.6.0, with the serialization standard changed, or dplyr 0.7.6 -> 0.8.0…). It is not a matter of old vs. new. It is not a matter of computational speed… it is a matter of respecting other people’s time (which means optimizing team productivity over single programmers’ productivity!). It is a matter of encouraging other people to understand your code easily, to allow them to easily replicate, expand or correct it. It could be a matter of preferring more effort from the expert programmer to develop tools that require less effort from users to use, or from other programmers to understand.

        I can provide my R solution for a homework assignment to a student who was not able to accomplish it, and I am absolutely sure that, thanks to the tidyverse, they understand what I did (this does not mean they are able to replicate it, but they can simply “read” my code!). I can provide the script for some clinical analyses to a medical doctor, and with the only effort being to convince them to “read” it, they can understand even my preprocessing intent! I can sit at the keyboard with people at my side telling me their needs for the analyses, and I can implement those needs in a way that makes them understand I did what they asked, and I can do it in real time!

        This is on a daily basis. The tidyverse permits me to save tons of other people’s time, which means my time as well! Then, when I know, as an expert, that base R is better for a specific task… I can decide to change paradigm, trading other people’s time and mine for computer time. There are occasions in which that makes sense, more than sometimes… but I am sure that this is the added value of an expert: understanding when it is worth it, and how to prevent it from wasting time in the future (theirs or other people’s!)

        Thanks to the tidyverse, not only nerds can read and understand code, which in turn means that more people can more easily learn and contribute to science… with their own trials and errors like mine… but easily! Moreover, the tidyverse fails fast and loud, and a newR’s tidyverse code that fails often produces less damage than an expeRt’s base R code that fails… which on more than one occasion fails not only silently but produces a silently wrong yet plausible result… I hope that all the radical baseRs at least adopt TDD to prevent “consistent and stable errors” in their code 😉

        I love R, I love base R, and I thank HW and all the tidyverse developers and community for that: the tidyverse was the reason for, and the tool through which, I have deeply understood base R, especially in knowing exactly when, why and how to use it in my R code!

        When I develop a package I use base R… and devtools, and roxygen2, and testthat… Don’t you? Don’t they implement a different paradigm in the way you create a package compared to base R?! And what are their aims?! To make packaging easier, more fun, less prone to errors of distraction, with more possibilities for others to “read” and contribute to your source code… don’t you use base R to create a package? I do, but not alone! The same happens for the tidyverse and data analyses: forcing yourself to use only and always the tidyverse means that possibly you are not yet that much of an expert… forcing yourself not to use it means that possibly… you are not yet that much of an expert either: it is a wonderful tool to save (your and especially other people’s) time. It is to save other people’s time that we invented and keep developing software, isn’t it?! 🙂

  4. FWIW: “opinionated” has a very specific meaning in software context, that is different from its meaning in plain English. (https://stackoverflow.com/questions/802050/what-is-opinionated-software)

    The tidyverse is “opinionated” in that it is fairly rigid about input types and output types, and most functions are designed to accomplish one specific task. This makes it easier for users to avoid some of the sneakier bugs that are hard to notice (for example: different output types of the apply functions). It also tends to make analyses more reproducible.

    Opinionated-ness definitely does make it harder for advanced developers to branch out and build crazy new things, which is why most devs (including the tidyverse team) tend to build new tools primarily on base functionality.

    Basically, the tidyverse is built for users, not developers. (With the exception of a few dev-specific packages.) You seem to be a user for whom the tidy style doesn’t jive, which of course is fine. 🙂

  5. I like pipe and dplyr syntax. Base R merge() and reshape() were really cumbersome until the “reshape” package came around, and later dplyr. Pipe is nice too, although I feel it makes me a one-dimensional programmer – from left to right, so to speak.

    > Basically, the tidyverse is built for users, not developers. (With the exception of a few dev-specific packages.) You seem to be a user for whom the tidy style doesn’t jive, which of course is fine.

    That’s an important point above. I was drawn to R almost 20 years ago because of its smooth (if perhaps steep) ramp from being a user to being a developer. Data analysis software popular at the time, like SPSS or Stata, segregated users from developers to the extreme – it did not allow you to do what you wanted, just what the producer implemented. This was different for R. You can program what you want or need and shape the output as you like. If you first write a script, you can later parameterize it and, boom!, you have a function others can use.

    I think `tidyverse` segregates users from developers, just as it explicitly differentiates code for interactive use vs. code for package use (e.g. the pipe is discouraged in the latter; I’ll add a reference once I recall it). I see that in people who come to my workshops after having had an R intro somewhere else. They have basic knowledge of, say, dplyr, but don’t know, say, square brackets. They are not capable of using a CRAN package if the package is not “tidy”. Until somebody “smart” writes a thin tidy wrapper and advertises it as a new thing. BTW, the segregation is not figurative, but literal: http://www.pieceofk.fr/exploring-the-cran-social-network/ see the section “Focus on the core network”. IMHO this is hurting the R ecosystem and community as a whole.

  6. I always die a little when I see someone come to R and start to code like it’s C.

    When using modern languages it makes sense to use their actual features and not try to write C. For R and Python this means using packages. That’s where their power lies. Otherwise you are just reinventing the wheel.

    There are different ways to write R, and this is one challenge of R compared with Python. Tidyverse, data.table, and others are valid approaches. But writing R like C makes no sense to me.

    1. “There are different ways to write R”. As someone who forgets all the time, this aspect has haunted me. Python’s mantra of “only one way to do it” often seems like a seductive mirage. In R, in contrast, there are usually six ways to do it … and I’ve forgotten all of them.

  7. I’m also skeptical of anything “tidy,” despite being a huge proponent of the general idea of “tidy data” and happy user of pipes. I refuse to give up my %>% connecting 10 lines of string cleaning.

    The part I dislike the most is “tidy evaluation.”

    – First off, “tidy” here has no meaning beyond the brand. And anything with a needless brand name raises red flags.

    – Second, non-standard evaluation should not be the go-to method of writing code. It requires all kinds of code gymnastics to get it to work like regular code, and it carries some pretty hefty technical details most R users don’t understand. That’s not to say they can’t understand it, but that they’ve never had to, because it’s so in-the-weeds.

    – Finally, it encourages a bad practice of writing packages with functions that take whole datasets plus arbitrary names or expressions as arguments. These work fine in `dplyr` pipelines when the data fits the package author’s assumptions, but that’s needlessly narrow and fragile. Most of the functions can be rewritten to use vectors so others don’t have to create a chimeric data.frame just to use them. If a certain type of data.frame is expected for most of your package, make a new class. Code, not documentation, should tell me if my data follows the rules.
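To illustrate that last point, here is a contrast between a frame-coupled design and a plain vector-based one (hypothetical functions of my own, purely for illustration; neither comes from the comment):

```r
# Frame-coupled: the caller must assemble a data.frame containing
# the column the function expects (hypothetical example):
standardize_col <- function(data, col) {
  data[[col]] <- (data[[col]] - mean(data[[col]])) / sd(data[[col]])
  data
}

# Vector-based: works on any numeric vector, wherever it lives:
standardize <- function(x) (x - mean(x)) / sd(x)

standardize(c(1, 2, 3))  # -1 0 1
```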

    The tidyverse is getting to be like Excel: amazing at many tasks, but overused because it’s what’s familiar, and sometimes using it is way more complicated than the alternatives its followers decry as complicated.

    1. As Q suggested, would you be willing to expand on this and create a guest post here? Perhaps by giving some more examples and best practices? That would be very helpful for many readers, think about it…

  8. Without the tidyverse R would continue to be a niche language and platform, ultimately losing to Python. IMHO, tidyverse dramatically improves productivity, code readability, and the overall language. Honestly, for me it’s synonymous with R and I literally wouldn’t use R without it. It simply makes the R experience vastly better.

      1. That’s an imaginary problem. Take some time browsing R tutorials and looking at online data science courses. Tidyverse is presented to new R users as a de facto standard. There will be only one faction in the future.

        1. True. I’m new to the R community and most of what I’ve learned is Tidyverse. Now I’m curious about this argument here and looking forward to learning more about the differences/similarities between Base R and Tidyverse.

  9. Fantastic article. My problem with the Tidyverse is that Tidyverse code works great, as long as it’s supported in the Tidyverse. For a recent project, I did a ton of work using R for GIS applications. Packages like sp, rgdal and the like are amazing for my needs… and completely incompatible with Tidyverse.

    1. I had this painful experience a couple of years ago. So try the recent developments on simple features (sf package etc) for GIS and Tidyverse. Geocomputation for R is a useful transition document for moving onto ‘sf’.

  10. I think you expressed an important viewpoint. I will admit, I am (slowly) coming around to the tidyverse approach, mostly because I am finding it is much easier to teach. From my experience, students grasp subsetting and recoding (using filter and mutate) much faster in the tidyverse than in base R (using [] and $ notation).
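The contrast the commenter describes looks roughly like this (an illustrative sketch; the base part runs as-is, while the dplyr line is commented out because it assumes dplyr is attached):

```r
# Base R subsetting and recoding with [ and $:
small <- mtcars[mtcars$cyl == 4, ]
small$kpl <- small$mpg * 0.425  # recode mpg to km per litre

# The dplyr equivalent many students find easier to read:
# small <- mtcars %>% filter(cyl == 4) %>% mutate(kpl = mpg * 0.425)
```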

    My one hangup is tibbles, for a reason you mention above: they overwrite base R behavior and change things in unpredictable ways. I have a package written back when the tidyverse was still the Hadleyverse. In base R, mydataframe[,1] will return a vector. If you want a data.frame, then you need to use mydataframe[,1,drop=FALSE]. (Sidenote: I think this is a terrible default, up there with stringsAsFactors=TRUE, but that is a different rant.) My function expects to get a vector. And I am a defensive programmer who checks the parameters; in this case I verify that is.data.frame(mydataframe) == TRUE. If I pass in a tibble, is.data.frame(mydataframe) returns TRUE but mydataframe[,1] returns a data.frame (well, technically a tibble)!!! IT IS NOT A DATA FRAME if it does not behave like a data frame. This is the most egregious example IMHO, but it happens in other places too.
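The drop behaviour is easy to reproduce (a minimal sketch; the tibble lines are commented out so it runs with base R alone and assumes the tibble package only if uncommented):

```r
df <- data.frame(x = 1:3, y = letters[1:3])

class(df[, 1])                # "integer" -- a bare vector
class(df[, 1, drop = FALSE])  # "data.frame"

# With a tibble, the same expression keeps the frame shape:
# tb <- tibble::tibble(x = 1:3, y = letters[1:3])
# is.data.frame(tb)  # TRUE
# class(tb[, 1])     # "tbl_df" "tbl" "data.frame" -- not a vector
```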

    One of the reasons I let this go is that I was there when ggplot2 came out. It is what I use for 98% of my figures and tell students to learn that approach first, it is much better and easier than base graphics IMHO. The tidyverse argument seems to be the same. Like you, I may just be too old now 😉

      1. I have a pro and cons slide for using R when I give talks on R or Intro to R workshops.

        One of my pros is: “There are multiple ways to do things.”
        One of my cons is: “There are multiple ways to do things.”

      2. If I were going to be more critical, I’d point out that any code which uses positional references into a data frame isn’t very clean code. What if the position changes because of something elsewhere in your code? And if you’re extracting by name, then why not use [["columnname"]] and guarantee a vector?

        Blaming tibbles seems a bit of a stretch if you ask me. And where’s your defence of the language purity around stringsAsFactors?

        If you’re going to be using the tidyverse, then purrr::pluck will help you get the vector you want and nicely return NULL if not. Or if you want it to fail, there’s purrr::chuck. Nobody is saying you have to use them, but the nice thing is that people are thinking about the possible issues.
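A small sketch of the name-based extraction suggested above (the purrr lines are commented out because they assume purrr is installed):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# [[ always returns the bare column vector, for data.frames and
# tibbles alike -- no positional fragility, no drop surprise:
v <- df[["y"]]
is.vector(v)  # TRUE

# The purrr equivalents mentioned above:
# purrr::pluck(df, "y")  # the vector, or NULL if the name is absent
# purrr::chuck(df, "y")  # the vector, or an error if the name is absent
```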

  11. It’s easy to imagine discussions involving other languages that would by now have devolved into an inferno. You set a good tone by taking and supporting a principled position.

    From the perspective of developing and maintaining packages, I agree with you that {base} provides huge advantages in stability compared to an alliance of packages advertising themselves as `tidy`, but not all of which are under coordinated authorship.

    I also agree with other commentators that for too many learners, `tidy` has become a universal hammer, limiting them to problems in the class of nails.

    Yet, “programs must be written for people to read, and only incidentally for machines to execute.” There is a large class of non-programmer users of `R`, for whom “logic legibility” is essential. Earlier this year I worked with a graduate researcher who had been given an `R` program consisting largely of for loops. There was nothing wrong with the code. It did what it was supposed to do, but the user, who was preparing their first paper for publication, refused to take that on faith.

    I refactored the code in tidy-ese and we were able to trace through the data wrangling, detect and cleanse some issues with the records, and arrive at the point of being able to run the models that were the objective. The researcher, by the end of the process, owned the program behind the analysis.

    I think that was a beneficent result that makes a foundational case for `tidy`.

    Of course, there is a long way to go for all of `R` to preserve reproducibility. The same shifting standards that trapped vast warehouses of EBCDIC in proprietary systems trap analyses in point releases.

  12. With the tidyverse and pipes, can you examine intermediate output?

    Being able to view intermediate output is one reason why I like a series of function calls to an R object.
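Intermediate output can in fact be inspected in a couple of ways: break the chain into named steps, or keep the pipe and peek with magrittr's tee operator. A base-only sketch (the %T>% line is commented out, as it assumes magrittr):

```r
# Breaking a pipeline into named steps keeps every intermediate
# result inspectable:
step1 <- subset(mtcars, cyl == 4)
step2 <- aggregate(mpg ~ gear, data = step1, FUN = mean)
step2  # inspect the per-gear means

# Or keep the pipe and peek mid-stream with magrittr's tee operator,
# which prints its input and passes it on unchanged:
# mtcars %>% subset(cyl == 4) %T>% print() %>% nrow()
```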

  13. I totally agree with you.
    It feels like the grammar of graphics was applied to everything else.

    It’s OK for graphics, but I’m not convinced for the rest.

    I think dplyr ultimately encourages long pipe chains that are like tunnels with no escapes.
    The group_by() and later ungroup() calls that you sometimes see in long chains are emblematic of a convoluted construction to make things fit the grammar.

    To make the code maintainable I think ten lines functions and multiple dispatch is the best way.

    And data.table does feel better, despite its sometimes cryptic syntax.

  14. temp %>% group_by() %>% summarise() %>% ggplot()

    It may look sore to the eyes, but I do not have to create a variable for each step. Eventually, you run out of meaningful names and have to do something like:

    my.data.1
    my.data.2
    my.data.3
    

    If you consider that you are not going to use my.data.1 and my.data.2 for any purpose except holding temporary values, you are thankful to have the pipe operator.

    I must admit, your argument is logical, only very inconvenient.

    I just hate to populate my space with variables which only have a transient purpose.

      1. Yes, that would be it, except that `temp` has gone from one object type to another. The problem may be that, after years of using C#, it is unacceptable to my eyes that temp switches types.

  15. Hi, I came across your blog by accident. I work in a large asset manager (>500Bln USD) and we employ a lot of quants, data science guys. Here is my experience:

    We have guys come in who use Python, R, C++, we don’t care (we had one with Lua!). During the interviews we just want a solution to a problem, discuss an idea, etc. It doesn’t matter if you end up with a solution, or how you get there. We want to see the way you approach the problem.

    Those who come and use C++ will say ‘my code will take some time to be written, be patient, but it will be lightning fast’. We say ‘crack-on’
    Those with Python say ‘my code is structured, object oriented, run as part of a system, easily maintained, etc’. We say ‘crack-on’.
    Those with R are a different beast. They fall into two categories: the ‘tidyverse-obsessed’, and the other ones. I call them the ‘normal ones’.
    The ‘normal ones’ sit down, install a couple of packages, start writing the code, explain intermediate steps, do other analysis as they go along after each step, visualize stuff, etc. During the interview you know every step of the way what is happening. They also tend to show an interest in how to use our functions (the data comes as output from functions we provide) and are more willing to adjust what and how they write to fit.
    The ‘tidyverse-obsessed’ are completely obsessed. Completely. All of them. Always. Before they start, they ask/announce ‘Do you know tidyverse?’. Then they start a rant on how this is a revolution in how R is used/written, etc. We don’t pay attention (we know what the tidyverse is). Then, as they write their code, it becomes a completely unreadable sequence of lines, simply because they want to use this st(@)^&*d magrittr operator. This leads to 10 steps being done in 1 line (OK, with newlines it becomes 10, tidyverse-obsessed guys, I know), after which no one (not even the candidate!) can check what on earth happened. The data we give them are not university-like. They are real-life data, which are not what they would expect. They are not dreamworld data with unicorns and pink clouds. They are real data. After the code is written, we test it (we first ask them to write the code and run it, and then we spend our time discussing or fixing bugs; this makes it more interesting for both parties), and of course everything breaks, because they never checked anything. They want to show off their tidyverse method rather than solve the problem. The next hour is then spent separating the useless 10-lines-in-1-to-show-magrittr-because-it-is-said-to-be-awesome into 10 individual ones so that we can check each step, using matrix or data.table because there are millions of lines and you need to join data, doing actual analysis and not just writing code, etc.

    Having interviewed hundreds of candidates I can say with great confidence that:
    the C++ guys are excellent if we were building a space shuttle (things of this complexity),
    the Python guys are top when it comes to building a complete system,
    the normal-R guys are excellent in data science, and
    the tidyverse-obsessed are a waste of time.

    I am sorry.

    * My conclusions are based solely on skill. Not personality.
    * Of course there are exceptions.
    * Whenever I say ‘guys’ I mean ‘guys and girls’. We are not a sexist institution.

    1. Wow, your viewpoint is even more pronounced than mine! Thank you for sharing your in-depth industry experience (which would be rather funny if it weren’t so sad)… and, yes: this “obsession” you are talking about can also be seen in the comments here…

      1. It is not sad. It is funny.

        It seems to me that people create packages in this tidyverse thingy just to join this cult. 100 people on this planet chose to create a cult. 100K decided to follow. The cult says that “mean(vector)” is less powerful than “vector %>% mean”. Fine. Whatever.

        Data scientists, quants, analysts, learn anything from SQL, to R, C/C++, Python, Cuda, whatever is needed. They are more than willing to say they are wrong, adapt, adopt, change. Whatever is needed.

        The cult thinks that tidyverse can also boil an egg. Fine.

        1. Great comment. That’s more or less the syndrome I had in mind when I wrote about “one-dimensional” programming above. Sometimes tidy verses tell you to reach your right ear with your left arm…

  16. Hi,
    Not a Data Scientist, but I have been using R for years, mainly for graphs and ML. I’m surprised by the strong positions either for or against the Tidyverse. It is just a package like any other. If it is useful for your task, use it; if not, just don’t use it.
    From my experience, I spend 80% of my time just arranging the data. The Tidyverse and the concept of the tibble – just one parameter/feature per column – are easy to explain to my scientists and should reduce the time I spend arranging data.
    And if an updated version of ggplot2 included %>% instead of +, it might be very interesting to test data %>% ggplot2 %>% plotly to obtain nice interactive graphs easily…

  17. I think that in the paragraph right after the code example you meant to write

    .. it starts in the middle with the numeric vector, goes to the right into the mean function (by the pipe operator %>%) and after that to the **LEFT** (not right) into the variable (by the assignment operator <-). It is not only longer but also less clear in my opinion.

  18. From my perspective, the main beauty of the tidyverse is its “functional programming” paradigm, which aims to use pure functions and their compositions as the main building blocks, and to avoid interaction with an object’s state, unlike the imperative paradigm.
    So this “pipelines without assignment on each row” problem is actually a feature: it provides the opportunity to create function compositions on the fly and gives more control over where and when we actually want an assignment that changes the dataframe’s state.
    Without this, you need to interact with state on every single operation, which can lead to bugs when code runs interactively and makes the code more difficult to reason about.
    The functional paradigm can be confusing at first, but it definitely has its beauty and advantages, and in my opinion the tidyverse finds a good balance between functional and imperative styles.

      1. You’re right, of course: R is a functional language in the sense that it has first-class functions and lambdas and therefore supports the functional programming paradigm.

        However, I think that the paradigm itself is not only about first-class functions, but also about avoiding mutable state.
        The tidyverse just pushes the functional capabilities of R further and gives you instruments to treat data.frames/tibbles as if they were immutable. For me, it brings to R a much more functional experience of declarative, high-level code with composable functions and (quasi-)immutable state.

          1. With pleasure!

            Please, consider these two chunks of code:

            
            # Imperative style
            mtcars_copy <- mtcars
            mtcars_copy$mpg_log <- log(mtcars_copy$mpg)
            mtcars_copy <- mtcars_copy[mtcars_copy$cyl == 6,]
            
            # Tidyverse functional style (requires dplyr)
            library(dplyr)
            mtcars_copy2 <- 
              mtcars %>%
              mutate(mpg_log = log(mpg)) %>%
              filter(cyl == 6)
            
            

            Essentially, they do the same set of operations – add a column and subset the result.
            However, the internal mechanics are slightly different. The imperative code changes the state of mtcars_copy two times (one assignment per operation), while the tidyverse code composes the two actions together into one, and only then does the assignment, exactly once.

            It may seem like a minor difference, but it leads to several consequences:

            1. R has an annoying habit of continuing to run the code interpreter even if some statement threw an error.
            For example: if in both chunks you mistype “loh” instead of “log” in the function call, the first one will throw an error but continue to the subsetting and complete it. In the end you will receive your object, but because of the error it will be internally incorrect. Now you will need to create a new copy, and if for some reason you didn’t work with a copy, your original data.frame is damaged; you can’t just fix the typo and run it again.
            Tidyverse code, on the other hand, will throw an error and fail entirely, without any assignment done at all. It is much safer, and it has actually saved me many hours of debugging.

            2. The same may be said about other things that can interrupt the interpretation process, for example manually stopping some heavy computation.
            Even if you prefer manual row-by-row execution, it is still dangerous: some distraction may leave you confused about where and when you stopped, with objects already in some intermediate state.

            3. In the tidyverse we can do things without any assignment at all – play with the data, make some plots, check some summary statistics. In fact, in most cases I write code like the second chunk, run all the rows below the assignment (without it), check their correctness, add something, and only then run it with the assignment. I personally consider this workflow very flexible and comfortable.

            Of course, some of these problems can be avoided by enclosing your imperative code chunks in functions that receive a dataframe and return a dataframe. It’s the right thing to do, but enclosing big chunks will lead to big, overly specific, not-so-flexible functions.
            And splitting them into small, well-parametrised helper functions will lead to, well, the tidyverse.
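            The function-enclosure alternative mentioned above can be sketched in plain base R (the function name is made up for illustration):

            ```r
            # A base R function that receives a data frame and returns a new one,
            # mirroring the mutate-then-filter pipeline above without any packages.
            log_mpg_six_cyl <- function(df) {
              df$mpg_log <- log(df$mpg)  # add the column
              df[df$cyl == 6, ]          # subset and return; the input is untouched
            }

            result <- log_mpg_six_cyl(mtcars)  # mtcars itself is never modified
            ```

            A typo inside the function fails before anything is assigned to `result` – the same safety property the pipe gives you.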

          2. That is very interesting, thank you, Michael! Would you be willing to expand on functional programming with R (with and without tidyverse) in a guest post? I think this would be beneficial to many of our readers! What do you think?

  19. I jumped off the tidyverse train a year ago to get on board with data.table instead. I haven’t looked back: the syntax is concise, it’s fast, and I don’t find the whole verb-like paradigm that helpful for complex tasks. data.table doesn’t create a walled garden around you, which is nice too.

  20. I don’t agree with your last point about stability. If you want stability, use packrat (or better, renv now) or Docker or a similar tool. Python has had many more breaking changes in, say, the last 5 years than base R. I feel breaking changes are needed to bring the language forward. But of course, without tools like virtual environments and dependency management, you can’t make progress without breaking some people’s code.

  21. Love this – it is at least partially my experience in environmental science. “Tidy” code is provided to me that is unreadable, brittle, and difficult to debug. I push everyone to data.table and drop the whole “tidy” idea, as though everything else were somehow “dirty”. I can live with ggplot, although you usually have to unpick the plots to make them publishable.

    I found one article online in support of tidy saying that they were moving to functional ideas and away from old ideas like working with matrices. That seems like skipping learning how to tie your shoes: how can you even contemplate working with data when you have no idea how they are organized? Yet those seem to be the ideals promoted in the tidyverse.

  22. Curious about your thoughts on Julia? Do you know if this newer language has benefitted from hindsight and critiques like your own?

    1. I hear a lot of good things about Julia and more people seem to be pretty positive about it. The problem is the still small community and missing extensions (packages). The other thing is, of course, the effort to learn yet another language (with all its idiosyncrasies). I haven’t had the time for a deeper look yet.

      Do you use it and what are your experiences?

      1. I have played with it some. I really liked it, and coming from someone who has used both R and Python, picking it up seemed easy and natural. There are similarities to both languages: for example, it has list comprehensions similar to Python, but its arrays are indexed from 1 like R. Like you said, the community is small and the extensions are lacking. Of course this wouldn’t be the case if the larger data science community suddenly shifted, but what gains would warrant such a shift? Idk. Thanks for the reply 🙂

  23. I enjoyed reading this. I started with data.table and I’ve never needed to use tidyverse. I love the syntax of data.table and the fact that I typically don’t have to load many packages. Not to mention how much more efficient and fast it is…

    1. data.table is the most impressive tool in R

      I was sad to see Matt Dowle’s face on datacamp.

      Hadley Wickham, in my view, is trying to direct R traffic to his own camps. He is amazing. But also an imperialist.

  24. Is ‘tidy’ becoming a religion?

    I had a recent encounter with people at a workshop who kept slipping the word “tidy” into everything. For example, “is that tidy data?” or “you could do this in a tidy way”. It reminds me of the 1980s cartoon “The Smurfs”, where the little blue characters were always saying things like “they smurfed me from playing” or “he smurfed that big mushroom house in no time”.

    Tidy as an adjective is scary enough, but I am not sure what I will do if they turn it into a verb.

  25. As a rhetorical device, early on you bring in this quote from a dictionary:
    characterized by conceited assertiveness and dogmatism.
    “an arrogant and opinionated man”
    But looking at the book R for Data Science paints a very different picture:
    In the preface, on page xiii under “Python, Julia, and Friends”, the authors write (not literally, but everyone can look this up):
    They acknowledge the usefulness of these tools, but state only their opinion that it is better to learn one tool properly first than many concurrently and shallowly, and they have chosen R (resp. the tidyverse), obviously because they are more familiar with R.

  26. I think this post is spot on. We shouldn’t ignore the fact that there is a ‘commercial/corporate interest’ behind this, and that companies – especially those who operate in the research/consulting space – have been known to promote star players / thought leaders. Some clever PR and social media savvy probably reinforce the legend (and the bottom line). I think it would be naive to assume that all of the packages attributed to Mr Wickham have necessarily been produced entirely through his own efforts. He may have entire teams at his disposal – or at the very least he receives extensive assistance. This wouldn’t be unusual: I’ve encountered corporate leaders who stick their names on conference papers, theses, and other research output and claim it as their own – when in fact the work was produced by others. While RStudio is free to operate as it wishes, I think some healthy opposition is necessary, just in case the tidyverse gets a bit messier in future.

    1. Yes, this is most certainly the case. Wickham is a businessman. Seeing Matt Dowle and Max Kuhn on DataCamp was kind of sad.

  27. I don’t think anyone has mentioned that dplyr/dbplyr provides a transparent front-end for accessing databases, including SQL and Spark. If you start a pipeline with a connection to an SQL database, dplyr translates your processing steps into SQL. The resulting commands are run remotely until you explicitly retrieve the results. There is a vignette illustrating this capability and a year-old chart showing which SQL commands are supported for which database backends. This remarkable capability permits you to run almost identical code on local dataframes and remote databases. There is also a dtplyr package, which translates and executes dplyr pipelines using data.table. It’s not quite as fast as native data.table, but it’s faster than everything else.

    I think that this front-end capability arises because the dplyr verbs correspond to elemental operations (and thus can be translated into other systems relying on similar elemental operations) and the pipeline permits taking advantage of lazy evaluation, so no operations are performed until necessary.

    Personally I find the tidy controversy silly. Don’t use it if you don’t like it. But like it or not, I don’t see how one can have an informed discussion without considering dplyr’s capability to serve as a front-end to other systems.

    As a final comment about the commercial interests behind the tidyverse, I’d suggest reading this blog post about RStudio becoming a Public Benefit Corporation and watching J.J. Allaire’s keynote at the RStudio conference. Yes, RStudio is a commercial entity. But they’re explicitly committed to putting a lot of their profits back into open source. There is nothing to stop anyone else from starting their own for-profit company and creating an open source counter-movement.

    1. Thank you, I did know about the data.table interface but not about the database connectivity. As I said, adding functional add-ons absolutely makes sense. What I am critical of are structural ones that distort the character of a language.

      1. I don’t know how you could create these specific add-ons at low cost without changing the character of the language. Both tidyverse and data.table change the character of the language. One front-end that can encapsulate both SQL and data.table? I’m impressed! Maybe someone who’s more of an expert on R language extensions will jump in and offer their thoughts.

        As an aside, those who aren’t familiar with Matt Dowle may enjoy his history of data.table in a 2014 talk.

  28. I’m not talking about just connecting to a remote data source. Lots of programs, both R and other languages, will let you execute your own SQL code on a remote server. The question is which of those packages write the SQL for you, so that you can use the exact same syntax to manipulate local data and remote data across a range of database servers? That’s what dplyr/dbplyr accomplishes. If I’m overlooking something, I’m curious to know which packages you have in mind.

    1. While it is true that you can use the exact same syntax to manipulate local and remote data this syntax is neither base R nor SQL but something new, which makes it a lot less impressive because you would have to learn a new syntax anyway (if you don’t use the tidyverse already).

  29. Thanks for your post. I don’t think using base R, tidyverse or even data.table are mutually exclusive. They each have their pros and cons.

    When I started learning R in 2015, I used base R but once I learnt about dplyr and other tidyverse packages, I switched to using them primarily as I personally find them easier to use. Base R has a lot of gotchas and inconsistencies, e.g. df[, "a"] returns a vector while df[, c("a", "b")] returns a data.frame. I know there is a drop argument but it’s not obvious.
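    The drop gotcha is easy to demonstrate in base R:

    ```r
    df <- data.frame(a = 1:3, b = 4:6)

    class(df[, "a"])                # "integer" – a single column collapses to a vector
    class(df[, c("a", "b")])        # "data.frame"
    class(df[, "a", drop = FALSE])  # "data.frame" – drop = FALSE keeps the shape
    ```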

    I wish the core team would clean up some of the warts in the language which they seem to be doing gradually, e.g. changing default value of stringsAsFactors to FALSE in R 4.0.

    It would also be really neat if R had type hints like Python got in Python 3.5 as it would help document functions and linters could detect possible type errors.

    1. Thank you for your comment. I agree that there are inconsistencies in R and with the drop argument you have a point (I also wasted hours of debugging by overlooking this small detail!).

      The question is if the remedy should be something that completely changes the character of the language and creates structurally incompatible code. This is what worries me.

      Concerning linters: do you know the lintr package? I personally haven’t used it and would be interested in your opinion.

      1. Breaking changes can be good if done right and the pros outweigh the cons. Allowing junk to accumulate will lead to misery and an eventual decline in use. I am sick of hearing these (bogus) arguments: “Don’t use R because it is not a real programming language” or “It is slow, inconsistent and uses 1-based arrays”. I program in other languages (C#, Python, PowerShell) but prefer to use R for what it is made for: working with data and stats.

        I don’t have much experience with `lintr` but I do use styler to format my code. The default style gets me > 90% there and I then manually tweak results to get the desired format. Since my scripts usually are small, it seems like too much effort to create a custom style given costs and benefits.

  30. Very interesting comments, which have me wondering how I should proceed with my study of R. My goal is just basic data analysis. Right now I am pretty comfortable subsetting data sets in base R. Lately, I’ve been going through Hadley Wickham’s R for Data Science, which is significantly different from another tutorial I had been using.
    It does provide some simple tools for manipulating data, but I wonder if it is worth it to learn this sort of syntax, especially since I am the type with not a great memory who just wants one way of doing something.
    So, are there any good books that cover the basics as R for Data Science does, but with base R instead?
    Does it make sense to combine elements of base R and the Tidyverse?

    1. Yes, combine them. Tidy now encompasses so many experts and resources that it is a waste to ignore it. The double learning of syntax is annoying, but that is in part a challenge with every new package. The Tidyverse hates its users less than base R, but also allows them less freedom.

  31. Might just be me, but tidy pipe calls get me confused pretty quickly, especially if there are multiple maps in one call like

    grandAvers_mytable <- mytable %>% map_df(~map_dbl(.x, ~.$ccfRes$grandAver))
    

    Makes me wonder if I’m still just too new or if this is actually overly complicated.
    Also, I’m not sure where to set the limit for deciding whether to use a longish tidy call or when to write a (readable) function.
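    For what it’s worth, here is one way to pull such a call apart into named helpers, sketched in base R with toy data – the list structure (a ccfRes$grandAver value inside nested lists) and all names are guessed from the snippet above:

    ```r
    get_grand_aver <- function(item) item$ccfRes$grandAver

    extract_grand_avers <- function(tbl) {
      # one named numeric vector per group, instead of one opaque nested map
      lapply(tbl, function(group) vapply(group, get_grand_aver, numeric(1)))
    }

    # Toy data with the assumed structure:
    mytable <- list(
      g1 = list(a = list(ccfRes = list(grandAver = 1.5)),
                b = list(ccfRes = list(grandAver = 2.5))),
      g2 = list(c = list(ccfRes = list(grandAver = 3.0)))
    )

    grand_avers <- extract_grand_avers(mytable)
    ```

    Each helper can now be tested on its own, which is arguably where the limit lies: wherever a chain stops being checkable step by step, it is time for a named function.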
