Morer R

ijij

I have inherited a decent-sized R codebase but it's full of redundant looping and rbinds and joins weird tricks to pass around row and column names, so it's showing how to program around R, and teaching me zippola about how to program in R

So let's try youz guyz:

What's the idiomatic R way to write this? Buttuming that I ab_table is a dataframe in R and an rdbms table in the example below.

"select a, b, count(*), avg(a) from ab_table group by a,b"

This should be absolutely trivial, but I can't find an easy way to do it...
... its like one command in SAS or SPSS... but in R I end with vectors containing one dataframe per element and who knows what... HELP!

paging @dkf, @Mikael_Svahnberg, (m.s,. am I right to page you?)

EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.

Mikael_Svahnberg

I'll think about this tomorrow, my brain has shut down for the night.

Paging @rhywden as well, you are likely to have stumbled upon R in your job...

Rhywden

I'm not sure what gave you that idea? :)

Mikael_Svahnberg

With all that measuring equipment, I buttumed you had also analysed some data, and then it's either matlab or R, I guessed.

Rhywden

Naw, you're forgetting that I'm a science teacher - stuff like that, while interesting, would be absolute overkill.

I'm happy if my students comprehend the differences in usefulness of average and median, for example. If I want to introduce "advanced" concepts like the standard deviation, I'd have to tread very carefully :)

ijij

Thanks for thinking about it..ever!

CoyneTheDup

First, my qualifications for R: zip null.

But I do understand that the count() and avg() functions are called aggregate functions, so I did a search for "aggregate functions R" and came up with this. I hope it helps.

Compute Summary Statistics of Data Subsets

Found another page that actually shows a working demo, I guess:

Aggregating Data

lolwhat

ITYM morrrrrrrrrrrre R.

I'll be here all week! Trrrrrrry the veal.

ijij

Demonstrating my poor google-fu...

Thanks, that looks promising...

BRB - tomorrow - it the end of my work day.

FrostCat

@lolwhat said:

I'll be here all week! Trrrrrrry the veal.

Be sure to trrrrrrrip your waitress.

No, wait--never mind.

CarrieVS

I did R at university - I studied maths, so we used R, Maple, and also FORTRAN. But I was never very good and I'm damned if I can remember any of it.

Mikael_Svahnberg

I have a few ideas, but I am AFK (not counting the ipad) until monday and I want to test the ideas first before I say anything. It should probably be easy, but with my limited experience of R it is not. My R workflow is usually to spend half a day to figure out what I want and which function that does it for me, then two more days to figure out how to format the data to fit what the function wants.

My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?

ijij

@Mikael_Svahnberg said:

My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?

The intent is a row for each combination of (a,b) that exists in the data, and the count of how many times the combination (a,b) occurs, and the average of some variable for those records where the combination (a,b) (my generalizing to avg(a) clearly works out to just the value of a in each row - DOH!)

Just summary data grouped by some variables from a dataframe to a dataframe.

So actually, for the record:

select a, b, count(*), avg(c) from ab_table group by a,b

Alternatively I will accept a scolding about how using dataframes is Doing it Wrong.

dkf

@ijij said:

What's the idiomatic R way to write this?

Well, while my users are dead keen on R, I don't really do very much with it. So I googled for r sql and got this as one of the top hits. (The others were for sqldf, which is something else entirely.)

I'm guessing that the code will look something like this (fill in the right ??? values):

drv = dbDriver("???")
con = dbConnect(drv, user="???", dbname="???", host="???")
dbGetQuery(con,"select a, b, count(*), avg(a) from ab_table group by a,b")

and that that will fetch every row of the resultset back. Whether that's what you want and whether the SQL is right, I have no idea.

Gąska

So you're saying that idiomatic R is making SQL queries? Just like idiomatic Python is writing most important things in C?

ijij

@dkf said:

So I googled for r sql and got this as one of the top hits. (The others were for sqldf, which is something else entirely.)

IJIJ:
EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.

thanks for playing!

;)

Filed under: at least I'm getting better answers here than at SO

dkf

@Gaska said:

So you're saying that idiomatic R is making SQL queries?

If you're querying against an SQL database, then yes. ;)

Gąska

Except OP is not.

dkf

That would require me to be paying detailed attention when I'm already in an interesting workshop with my boss…

ijij

@dkf said:

interesting workshop with my boss…

This phrase could lead to so many questions...

(Based on avatar): Workshop name?
"Lumber pre-processing: a tool for career advancement."
or
"Calculating Railway Safe-stopping Zones for Fun and Profit"

dkf

@ijij said:

Workshop name?

Samples Club. (Yes, we've done “the first rule of Samples Club is …”)

The samples in question are physical things in labs done in systems biology and synthetic biology, with input from biomedical research and the microbial end of ecology.

Mikael_Svahnberg

This took longer than expected -- as expected. Formatting the data and understanding the functions in R is a bitch. For a long while I was into the reshape2 package and its melt and dcast methods. In the end, dcast gave me too much of a headache to return more than one result at once. Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed between reshape and reshape2.

Instead, my search led me to the plyr package, which did the trick:

> require(plyr)
> 
> # Create a data frame
> dfx <- data.frame(
+   group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
+   sex = sample(c("M", "F"), size = 29, replace = TRUE),
+   age = runif(n = 29, min = 18, max = 54)
+ )
> 
> # Just to see what it looks like
> head(dfx)
  group sex      age
1     A   F 40.39112
2     A   F 25.97179
3     A   F 42.89477
4     A   F 42.36620
5     A   F 20.84443
6     A   F 46.19999
> 
> # The magic.
> # split the dataframe over group and sex
> # call the summarise function on the resulting rows
> # in particular, get the mean and the length.
> ddply(dfx, .(group, sex), summarise,
+         mean=round(mean(age),2),
+         count=length(age))
  group sex  mean count
1     A   F 33.83     8
2     B   F 32.73     9
3     B   M 27.37     6
4     C   F 36.45     1
5     C   M 31.04     5
> 
> 
> # using another example
> # --------------
> dim(mtcars) # 32 observations of 11 variables
[1] 32 11
> head(mtcars) # A quick look at the data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> # Split over cylinders and horsepowers
> # calculate the mean mpg per thusly created group
> # and list the number of observations per group
> ddply(mtcars, .(cyl,hp), summarise, meanMPG=round(mean(mpg),2), count=length(wt))
   cyl  hp meanMPG count
1    4  52   30.40     1
2    4  62   24.40     1
3    4  65   33.90     1
4    4  66   29.85     2
5    4  91   26.00     1
6    4  93   22.80     1
7    4  95   22.80     1
8    4  97   21.50     1
9    4 109   21.40     1
10   4 113   30.40     1
11   6 105   18.10     1
12   6 110   21.13     3
13   6 123   18.50     2
14   6 175   19.70     1
15   8 150   15.35     2
16   8 175   18.95     2
17   8 180   16.30     3
18   8 205   10.40     1
19   8 215   10.40     1
20   8 230   14.70     1
21   8 245   13.80     2
22   8 264   15.80     1
23   8 335   15.00     1

HTH.

EDIT the mtcars dataset is a builtin sample set. Use library(help="datasets") to find out which other builtin sample datasets you have.

ijij

@Mikael_Svahnberg said:

Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed between reshape and reshape2.

That.

All the magical coercion of types and structures within all the nesting of functions seems to be at the root of all evil in R.

Mikael_Svahnberg

Ever see a banyan tree?

IMHO, that's how the "root of all evil" looks like in R.

dkf

@Mikael_Svahnberg said:

that's how the "root of all evil" looks like in R.

Not this?

sqrt(all_evil)

swayde

http://www.midasupload.com/users/1/evilgirls.jpg

JazzyJosh

Time and money

Indicates addition.

Girls = 2Money

antiquarian

@JazzyJosh said:

Indicates addition.

Girls = 2Money

boomzilla

@JazzyJosh said:

Indicates addition.

Girls = 2Money

But that's just applying the distributive property. They take money for each unit of time. Sure, you can multiply it out and add it up, but that's just math.

PleegWat

Best be careful, before you evolve it into 2 girls 1 money.