# Morer R

• I have inherited a decent-sized R codebase but it's full of redundant looping and rbinds and joins weird tricks to pass around row and column names, so it's showing how to program around R, and teaching me zippola about how to program in R

So let's try youz guyz:

What's the idiomatic R way to write this? Buttuming that I ab_table is a dataframe in R and an rdbms table in the example below.

"select a, b, count(*), avg(a) from ab_table group by a,b"

This should be absolutely trivial, but I can't find an easy way to do it...
... its like one command in SAS or SPSS... but in R I end with vectors containing one dataframe per element and who knows what... HELP!

paging @dkf, @Mikael_Svahnberg, (m.s,. am I right to page you?)

EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.

Paging @rhywden as well, you are likely to have stumbled upon R in your job...

• I'm not sure what gave you that idea?

• With all that measuring equipment, I buttumed you had also analysed some data, and then it's either matlab or R, I guessed.

• Naw, you're forgetting that I'm a science teacher - stuff like that, while interesting, would be absolute overkill.

I'm happy if my students comprehend the differences in usefulness of average and median, for example. If I want to introduce "advanced" concepts like the standard deviation, I'd have to tread very carefully

• Thanks for thinking about it..ever!

• First, my qualifications for R: zip null.

But I do understand that the count() and avg() functions are called aggregate functions, so I did a search for "aggregate functions R" and came up with this. I hope it helps.

Compute Summary Statistics of Data Subsets

Found another page that actually shows a working demo, I guess:

Aggregating Data

• ITYM morrrrrrrrrrrre R.

I'll be here all week! Trrrrrrry the veal.

Thanks, that looks promising...

BRB - tomorrow - it the end of my work day.

• I'll be here all week! Trrrrrrry the veal.

Be sure to trrrrrrrip your waitress.

No, wait--never mind.

• I did R at university - I studied maths, so we used R, Maple, and also FORTRAN. But I was never very good and I'm damned if I can remember any of it.

• I have a few ideas, but I am AFK (not counting the ipad) until monday and I want to test the ideas first before I say anything. It should probably be easy, but with my limited experience of R it is not. My R workflow is usually to spend half a day to figure out what I want and which function that does it for me, then two more days to figure out how to format the data to fit what the function wants.

My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?

• My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?

The intent is a row for each combination of (a,b) that exists in the data, and the count of how many times the combination (a,b) occurs, and the average of some variable for those records where the combination (a,b) (my generalizing to avg(a) clearly works out to just the value of a in each row - DOH!)

Just summary data grouped by some variables from a dataframe to a dataframe.

So actually, for the record:

select a, b, count(*), avg(c) from ab_table group by a,b

Alternatively I will accept a scolding about how using dataframes is Doing it Wrong.

• What's the idiomatic R way to write this?

Well, while my users are dead keen on R, I don't really do very much with it. So I googled for `r sql` and got this as one of the top hits. (The others were for `sqldf`, which is something else entirely.)

I'm guessing that the code will look something like this (fill in the right `???` values):

``````drv = dbDriver("???")
con = dbConnect(drv, user="???", dbname="???", host="???")
dbGetQuery(con,"select a, b, count(*), avg(a) from ab_table group by a,b")
``````

and that that will fetch every row of the resultset back. Whether that's what you want and whether the SQL is right, I have no idea.

• So you're saying that idiomatic R is making SQL queries? Just like idiomatic Python is writing most important things in C?

• So I googled for r sql and got this as one of the top hits. (The others were for sqldf, which is something else entirely.)

IJIJ:
EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.

thanks for playing!

• So you're saying that idiomatic R is making SQL queries?

If you're querying against an SQL database, then yes.

• Except OP is not.

• That would require me to be paying detailed attention when I'm already in an interesting workshop with my boss…

• interesting workshop with my boss…

This phrase could lead to so many questions...

(Based on avatar): Workshop name?
"Lumber pre-processing: a tool for career advancement."
or
"Calculating Railway Safe-stopping Zones for Fun and Profit"

• Workshop name?

Samples Club. (Yes, we've done “the first rule of Samples Club is …”)

The samples in question are physical things in labs done in systems biology and synthetic biology, with input from biomedical research and the microbial end of ecology.

• This took longer than expected -- as expected. Formatting the data and understanding the functions in R is a bitch. For a long while I was into the `reshape2` package and its `melt` and `dcast` methods. In the end, `dcast` gave me too much of a headache to return more than one result at once. Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed between `reshape` and `reshape2`.

Instead, my search led me to the `plyr` package, which did the trick:

``````> require(plyr)
>
> # Create a data frame
> dfx <- data.frame(
+   group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
+   sex = sample(c("M", "F"), size = 29, replace = TRUE),
+   age = runif(n = 29, min = 18, max = 54)
+ )
>
> # Just to see what it looks like
group sex      age
1     A   F 40.39112
2     A   F 25.97179
3     A   F 42.89477
4     A   F 42.36620
5     A   F 20.84443
6     A   F 46.19999
>
> # The magic.
> # split the dataframe over group and sex
> # call the summarise function on the resulting rows
> # in particular, get the mean and the length.
> ddply(dfx, .(group, sex), summarise,
+         mean=round(mean(age),2),
+         count=length(age))
group sex  mean count
1     A   F 33.83     8
2     B   F 32.73     9
3     B   M 27.37     6
4     C   F 36.45     1
5     C   M 31.04     5
>
>
> # using another example
> # --------------
> dim(mtcars) # 32 observations of 11 variables
[1] 32 11
> head(mtcars) # A quick look at the data
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>
> # Split over cylinders and horsepowers
> # calculate the mean mpg per thusly created group
> # and list the number of observations per group
> ddply(mtcars, .(cyl,hp), summarise, meanMPG=round(mean(mpg),2), count=length(wt))
cyl  hp meanMPG count
1    4  52   30.40     1
2    4  62   24.40     1
3    4  65   33.90     1
4    4  66   29.85     2
5    4  91   26.00     1
6    4  93   22.80     1
7    4  95   22.80     1
8    4  97   21.50     1
9    4 109   21.40     1
10   4 113   30.40     1
11   6 105   18.10     1
12   6 110   21.13     3
13   6 123   18.50     2
14   6 175   19.70     1
15   8 150   15.35     2
16   8 175   18.95     2
17   8 180   16.30     3
18   8 205   10.40     1
19   8 215   10.40     1
20   8 230   14.70     1
21   8 245   13.80     2
22   8 264   15.80     1
23   8 335   15.00     1
``````

HTH.

EDIT the `mtcars` dataset is a builtin sample set. Use `library(help="datasets")` to find out which other builtin sample datasets you have.

• Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed between reshape and reshape2.

That.

All the magical coercion of types and structures within all the nesting of functions seems to be at the root of all evil in R.

• Ever see a banyan tree?

IMHO, that's how the "root of all evil" looks like in R.

• that's how the "root of all evil" looks like in R.

Not this?

``````sqrt(all_evil)
``````

• Time and money

Girls = 2Money

Girls = 2Money