
R is the worst programming language I've touched. I usually hear in response, "it may be quirky, but it has a lot of well-tested functions useful for data analysis."

This confidence seems misplaced. Experienced developers using better languages with better tooling create bugs all the time. How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Fun fact: a lot of the R source code was machine-translated from Fortran to C, though it has been cleaned up a lot. https://github.com/wch/r-source/search?p=1&q=f2c&unscoped_q=...



R's killer app is interactive analysis. Import some data, graph it, look at the graph, do some clean-up based on what you noticed, graph it again, run some summary statistics, fit a model, graph the residuals from the model, etc. It's amazing for this - arguably the best available open source software for this type of task. Comparable tools for this would be MATLAB, Stata, Mathematica, etc. (all with strengths in slightly different domains).
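That loop, sketched in base R using the built-in airquality dataset so it's self-contained:

```r
# Typical interactive loop: import, graph, clean, graph again, model.
d <- airquality                     # built-in dataset (Ozone has NAs)
plot(d$Temp, d$Ozone)               # eyeball the raw data
d <- d[!is.na(d$Ozone), ]           # clean up what the graph revealed
fit <- lm(Ozone ~ Temp, data = d)   # fit a quick model
plot(residuals(fit))                # graph the residuals
summary(fit)                        # run summary statistics
```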

It's comparatively very weak for "software development" - writing libraries, long-running programs, etc. Even data cleaning/data analysis scripts are a headache beyond a certain length or number of contributors. People tend to have the same complaints about MATLAB, Stata, and Mathematica.

These are related - a lot of the features that make interactive use concise and easy to pick up end up being kind of a mess to handle consistently in a larger program.

Python is at a different point on this spectrum. It's moderately good for interactive analysis, but there are still a lot of tasks that are concise and easy in R that require a bunch of verbose object manipulation in Python. This is a choice - "explicit is better than implicit", etc. Python is also moderately good for "software development" - it has enough consistency and code structuring features that you can write libraries, systems, and infrastructure while collaborating with a small number of other people pretty easily. You still hit a point where Python's dynamic features make things like refactoring harder than you'd like, but it's usable.


Wholeheartedly agree.

I used to do mostly data analysis in my day-to-day work and R was my go-to and absolute favorite language for years in terms of usability for data analysis. Doing actual software development in R is quirky at best, to be honest.

Nowadays I write code for research that requires 'actual' software development, so I've been using python almost exclusively (with pytorch under the hood, which I love.) No doubt, python is a better language for software engineering.

Nevertheless, for analysis I just cannot warm up to numpy/pandas/matplotlib _at all_. When it's time to analyze results of my experiments or produce publication level graphics, I write my python results to disk and use the tidyverse as a last mile solution.
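The last-mile pattern is roughly this (a sketch; the inline tibble stands in for results read back from disk after the python side writes them out):

```r
library(tidyverse)

# `results` stands in for experiment output exported by the python side
results <- tibble(model = c("a", "a", "b", "b"),
                  score = c(0.81, 0.83, 0.90, 0.88))

# Aggregate, then hand straight to ggplot for the publication graphic.
summary_df <- results |>
  group_by(model) |>
  summarise(mean_score = mean(score))

ggplot(summary_df, aes(model, mean_score)) +
  geom_col()
```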


This is exactly why I still use R. Throw data in a file, suck it into R, start playing with slicing and visualization with a few commands. Once I get something interesting, a few more tweaks and I have a nice, descriptive graph.


RE: functions/variables -- you don't quite have this right. Let's walk through when this can occur and when this can't. First, this cannot occur if you're talking about your local environment:

  z <- function() { print("hello world") }
  z <- 3
  z # Outputs 3
  z() # Errors because there is no function z()
The actual quirk you're talking about has to do with attaching other namespaces. So, the way R works is that by default it loads several packages. For instance, the reason you can call `mean()` is that the "base" package is attached when you open a blank session in R. It is absolutely possible to have multiple symbols share the same name across namespaces. Here's an example:

  mean <- 4 # Creates variable called mean in the local namespace
  mean(c(5, 10, 20)) # Uses the mean function in the attached base namespace
  mean # Uses the mean variable in the local namespace
But this is not unique to variables versus functions, it also works with functions:

  mean <- function(x) { sum(x) * 100 }
  mean(c(1, 2, 3)) # Outputs 600
  base::mean(c(1, 2, 3)) # Outputs 2
  rm(mean) # Drop local namespace function
  mean(c(1, 2, 3)) # Outputs 2
You can use sessionInfo() to see the order of attached namespaces that will be searched. In RStudio, you can also press the little dropdown arrow next to "Global Environment" in the environment pane to see the order of attached namespaces -- it'll search them in that order. Alternatively, you can be defensive and always prefix all functions in any namespace with the namespace name (equivalent to always using module.function style function calls in python).

Finally, you can use the built-in function "find" to see the order in which R will try to resolve a symbol, e.g.

  sum <- function() { print("i like sum coding in R!!!") }
  find("sum") # .GlobalEnv first, base second
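Relatedly, `search()` prints the whole attach order in one go (exact output varies by session):

```r
search()       # ".GlobalEnv" first, then attached packages, ... "package:base"
library(MASS)  # attaching a package splices it in near the front
search()       # "package:MASS" now appears ahead of the defaults
```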

I'm not sure this is a quirk. What are your options as a namespaced language? 1) Allow users to import multiple functions/variables with the same name across different namespaces and resolve the conflict via some kind of hierarchical order; 2) Don't allow users to import multiple functions/variables with the same name, so you can never use overloading to monkey patch; 3) Always require namespace prefixes at all times; 4) Make function dispatch a blocking operation that asks REPL users which to use? I dunno, it doesn't seem to me like R's approach is any less sane.

I guess one thing that is different about R's approach is that "built-ins" have no special priority, they're all part of some namespaces that are attached by default but otherwise exactly like third-party libraries (base, stats, graphics, etc.)

In R, the only reserved words that cannot be overloaded are while, repeat, next, if, function, for, and break. (Note: else does not appear here because of a genuinely baffling quirk about how else is implemented in R)
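Both behaviors are easy to see at the console (a small sketch):

```r
# Built-ins have no special status, so they can be shadowed like
# any other binding...
c <- 4
c(1, 2, 3)   # still calls base::c, since function-position lookup
             # skips the numeric binding: 1 2 3
# ...but reserved words cannot be assigned with plain syntax:
# if <- 4    # parse error: unexpected assignment
rm(c)        # drop the shadow again
```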


As much as I love R, there are indeed quirks that I think R allows to happen but shouldn't.

For example:

    mean(2, 4) # outputs 2
    mean(c(2, 4)) # outputs 3
I really don't see why R allows you to enter mean(2,4) without giving an error.
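(What's happening is that positional matching sends the 4 to mean's second parameter, trim, rather than treating it as data, which is arguably even more surprising:)

```r
args(mean.default)   # x, trim = 0, na.rm = FALSE, ...
mean(2, 4)           # x = 2, trim = 4 -> mean of the single value 2
mean(c(2, 4))        # x = c(2, 4)     -> 3
```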


Thanks for the explanation. Yes, your second example is something like what I found in R code I was asked to modify.

What I find distasteful is that when calling mean(), the resolution of this name depends NOT on whether the local variable mean has been defined, but whether it has been assigned a function. This is illustrated by your 2nd and 3rd examples.

Of course if you are used to it, it may not catch you by surprise.


Can an alias be an option to resolve conflict: “import mypackage.mean as mymean”?


No aliases (afaik), but the convention is to explicitly call mypackage::mean in cases where names might even hint at being ambiguous.
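That said, since functions are ordinary values, a plain assignment behaves like an alias:

```r
# Manual aliasing: bind the namespaced function to a new name.
mymean <- base::mean
mymean(c(1, 2, 3))   # 2
```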


On a tangent here, but is there a reason to use <- instead of = for assignment?

Actually, to answer my own question (I know, google), I found this very informative stackoverflow answer [1]. My TLDR: no difference, other than that "<-" is more likely to cause carpal tunnel syndrome.

[1] https://stackoverflow.com/questions/1741820/what-are-the-dif...
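Worth noting the one context where they genuinely diverge, also covered in that answer: at the top level of a function call, `=` names an argument while `<-` performs an assignment in the calling environment:

```r
x <- c(5, 1, 3)
sort(x = c(2, 1))    # x here is sort()'s argument name; outer x untouched
sort(x <- c(2, 1))   # assigns c(2, 1) to x, then sorts it
x                    # now c(2, 1)
```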


> How likely is it that R packages, written by specialists (or grad students) whose main focus is often not programming but some other discipline, in a language full of "quirks," are going to be really reliable?

Do you have the data to back that up, or is it just a question that you don't have an answer to?

I think the statement brushes off the high-quality R packages already in the ecosystem that exist nowhere else. The author of The Elements of Statistical Learning created the glmnet package, which took Python years to even get ported. There are many other packages out there that Python does not have.

I am not going to argue it's the prettiest language.

But there are tons of packages, and many are found nowhere else in any other language's ecosystem.

If you're going to say just use REST/RPC, then that just defeats the purpose of your initial argument.

And just look at Springer or, gosh, the other publisher's name escapes me right now, but these publishers have books on statistical subjects with R packages accompanying them.

It may be ugly, but the packages are maintained by experts in the field of statistics, even if they may not be programmers. And they do dogfood their packages and use datasets to check the results. Likewise, just read up on the ranger R package paper (https://arxiv.org/pdf/1508.04409.pdf). They test their output against the other random forest package.

And it's silly to point this argument at just R when the same happens with Python. The bootstrap function in scikit-learn for the longest time didn't even really do a bootstrap. The linear regression function automatically does shrinkage with no option to turn it off.

No language is perfect. But I believe R has a place, especially in statistics. Many, many wonderful expert statisticians are maintaining and creating R packages (e.g. Dr. Frank E. Harrell Jr.). Many R packages have an accompanying paper published at https://www.jstatsoft.org/index


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The two namespace approach makes it harder to assign 4 to foo and then call it, so I don't understand this comment.

You can assign 4 to foo and try to call it in a language that has one namespace.

  ;; Scheme, one namespace:
  (let ((foo 4))
    (foo)) ;; oops

  ;; Common Lisp, two spaces:

  (let ((foo 4))
    (foo) ;; still calls the foo function,
          ;; not related to or shadowed by the above variable.
    (funcall foo)) ;; oops
The usual valid complaint about two namespaces is that funcall is required all over the place in code that works with higher-order functions, which uglifies the code, and that an operator like (function foo) or its abbreviation #'foo is required to lift references to functions as values instead of just writing foo, which likewise uglifies code.


R to Python convert here (numpy, pandas). Agreed about the general claim of clunkiness, but at least for statistical computing R still wins because of the richness of the long tail of packages (representative example: https://cran.r-project.org/web/packages/poweRlaw/index.html). In my experience, Python equivalents are much less developed and documented, if extant. Many data scientists I know would disagree with this claim, but that's because they tend to stick with things supported by scipy, pymc3, statsmodels, and a few other common libraries.

One solution I have found for small and medium data is to use rpy in Jupyter to let me keep most of my workflow in python, then shuttle stuff to R for exotic tests or to use key packages (ggplot, brms, lme4).


Separate namespaces for function names and data variables is a feature inherited from Lisp which is the language R evolved from. Not really a quirk. Common Lisp and Emacs Lisp also have separate namespaces, while Scheme has a single namespace.

The thing about a lot of R code not being written by programming specialists is a valid point, but then again not many programming specialists are also specialists in statistics, so... the alternative is what?


R might be a bad choice for building software, but for statistics and data analysis I haven't encountered a better tool. It's also trivial to simply write functions in C/C++/Java/Fortran and call them from R.
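With the widely used Rcpp package, for instance, the C++ side can live right in your script (a minimal sketch, assuming Rcpp and a C++ toolchain are installed):

```r
library(Rcpp)

# Compile a C++ function and bind it into the current R session.
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
    return total;
  }
')

sumSquares(c(1, 2, 3))   # 14
```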


Exactly! I use R every day as the head of the data science department at a corporation. Most of my work is medium-sized data analysis projects and nothing can touch R for that level of work.


> functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

The most famous language with this property is Lisp.


Unix shells also compete in this category.

We can have a $foo variable, and a foo command/function.

GNU Make is another one, sort of.

Makefile:

  warning = abc
  $(info abc = $(warning))
  $(warning what?)
Output:

  abc = abc
  Makefile:3: what?
  make: *** No targets.  Stop.
$(warning) is a variable, whereas $(warning args ...) is an operator call.

If a macro is stored in a variable V, it cannot be called as $(V args), but only via $(call V,args). That's analogous to funcall in Common Lisp.

Looks like the Lisp-2 approach is well represented in the famous language scene. :)


You can't really think of R as a programming language in the sense of how we normally think of programming languages. I see it as more of a computational scripting language. It is great at what it does.


At the same time there are companies in which there's R code running in production to serve ML results in real time. No idea how common it is, but I've worked for such a company. My point is that R isn't just for scripting and interactive use even if that's its best use case.


Put a plumber API in front of your ML serving function and run it in a Docker container. There, simple, fast enough. We have multiple R containers in production, it is working fine, and the data scientists are happy that they can keep working in RStudio.
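A minimal plumber file looks like this (a sketch; `predict_fn` is a hypothetical stand-in for your own scoring function):

```r
# plumber.R -- plumber's comment annotations turn plain R
# functions into HTTP endpoints.

#* Score one observation
#* @param x numeric input value
#* @post /predict
function(x) {
  x <- as.numeric(x)
  list(prediction = predict_fn(x))  # predict_fn: your model wrapper
}
```

Serve it from the container with `plumber::plumb("plumber.R")$run(port = 8000)`.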


> One quirk that got me the last time I touched R was that functions and variables live in different namespaces, so you could assign 4 to foo and then call it.

Either I'm greatly misunderstanding or this is just plain wrong - assigning something to the same name as an existing function will just overwrite the function.


You're correct provided the overwrite occurs in the same namespace (e.g. the local environment). He is correct, but for the wrong reason, if you try to shadow something from another environment. For example, the following R code may initially seem weird:

  my_vec <- c(1, 2, 3)
  c <- 4
  print(c)
  my_vec2 <- c(5, 6, 7)
But as I describe in my top level comment, it has nothing to do with variables versus functions, just how namespaces work.


If you've gone from SAS to R, R seems very much improved.


> it has a lot of well-tested functions

I tried to do a simple word cloud based on a Twitter search in R. The whole thing was plagued by mysterious failures due, AFAICT, to weird, rare, non-standard UTF characters in the source data that were crashing some of the R libraries I was using to clean the data.

Having done that before in Python, it was shocking to see how fragile the R ecosystem is.


As another example: I tried to get some R code distributed with MPI using a package that claims to do this. If I remember correctly, the package would generate and execute a shell script to spin up subprocesses. Then it would shuttle code (including huge matrices serialized to ASCII) over a socket, to be eval()'d on the other side. The secret codeword "kreab" would terminate the connection, so this could appear nowhere in the code sent over the socket.

This is clownshoes software quality. And it's used for important scientific research.



