Morer R
-
I have inherited a decent-sized R codebase but it's full of redundant looping and rbinds and joins weird tricks to pass around row and column names, so it's showing how to program around R, and teaching me zippola about how to program in R
So let's try youz guyz:
What's the idiomatic R way to write this? Buttuming that I ab_table is a dataframe in R and an rdbms table in the example below.
"select a, b, count(*), avg(a) from ab_table group by a,b"
This should be absolutely trivial, but I can't find an easy way to do it...
... its like one command in SAS or SPSS... but in R I end with vectors containing one dataframe per element and who knows what... HELP!paging @dkf, @Mikael_Svahnberg, (m.s,. am I right to page you?)
EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.
-
I'll think about this tomorrow, my brain has shut down for the night.
Paging @rhywden as well, you are likely to have stumbled upon R in your job...
-
I'm not sure what gave you that idea? :)
-
With all that measuring equipment, I buttumed you had also analysed some data, and then it's either matlab or R, I guessed.
-
Naw, you're forgetting that I'm a science teacher - stuff like that, while interesting, would be absolute overkill.
I'm happy if my students comprehend the differences in usefulness of average and median, for example. If I want to introduce "advanced" concepts like the standard deviation, I'd have to tread very carefully :)
-
Thanks for thinking about it..ever!
-
First, my qualifications for R: zip null.
But I do understand that the count() and avg() functions are called aggregate functions, so I did a search for "aggregate functions R" and came up with this. I hope it helps.
Compute Summary Statistics of Data Subsets
Found another page that actually shows a working demo, I guess:
-
ITYM morrrrrrrrrrrre R.
I'll be here all week! Trrrrrrry the veal.
-
Demonstrating my poor google-fu...
Thanks, that looks promising...
BRB - tomorrow - it the end of my work day.
-
I'll be here all week! Trrrrrrry the veal.
Be sure to trrrrrrrip your waitress.
No, wait--never mind.
-
I did R at university - I studied maths, so we used R, Maple, and also FORTRAN. But I was never very good and I'm damned if I can remember any of it.
-
I have a few ideas, but I am AFK (not counting the ipad) until monday and I want to test the ideas first before I say anything. It should probably be easy, but with my limited experience of R it is not. My R workflow is usually to spend half a day to figure out what I want and which function that does it for me, then two more days to figure out how to format the data to fit what the function wants.
My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?
-
My SQL is a bit rusty, too. Would your select statement not return two columns with x rows, and the count and avg columns with only one row?
The intent is a row for each combination of (a,b) that exists in the data, and the count of how many times the combination (a,b) occurs, and the average of some variable for those records where the combination (a,b) (my generalizing to avg(a) clearly works out to just the value of a in each row - DOH!)
Just summary data grouped by some variables from a dataframe to a dataframe.
So actually, for the record:
select a, b, count(*), avg(c) from ab_table group by a,b
Alternatively I will accept a scolding about how using dataframes is Doing it Wrong.
-
What's the idiomatic R way to write this?
Well, while my users are dead keen on R, I don't really do very much with it. So I googled for
r sql
and got this as one of the top hits. (The others were forsqldf
, which is something else entirely.)I'm guessing that the code will look something like this (fill in the right
???
values):drv = dbDriver("???") con = dbConnect(drv, user="???", dbname="???", host="???") dbGetQuery(con,"select a, b, count(*), avg(a) from ab_table group by a,b")
and that that will fetch every row of the resultset back. Whether that's what you want and whether the SQL is right, I have no idea.
-
So you're saying that idiomatic R is making SQL queries? Just like idiomatic Python is writing most important things in C?
-
So I googled for r sql and got this as one of the top hits. (The others were for sqldf, which is something else entirely.)
IJIJ:
EDIT: points off if you suggest a function that calls SQL... I'm using that as a common-denominator way of expressing the general problem.thanks for playing!
;)
Filed under: at least I'm getting better answers here than at SO
-
So you're saying that idiomatic R is making SQL queries?
If you're querying against an SQL database, then yes. ;)
-
Except OP is not.
-
That would require me to be paying detailed attention when I'm already in an interesting workshop with my boss…
-
interesting workshop with my boss…
This phrase could lead to so many questions...
(Based on avatar): Workshop name?
"Lumber pre-processing: a tool for career advancement."
or
"Calculating Railway Safe-stopping Zones for Fun and Profit"
-
Workshop name?
Samples Club. (Yes, we've done “the first rule of Samples Club is …”)
The samples in question are physical things in labs done in systems biology and synthetic biology, with input from biomedical research and the microbial end of ecology.
-
This took longer than expected -- as expected. Formatting the data and understanding the functions in R is a bitch. For a long while I was into the
reshape2
package and itsmelt
anddcast
methods. In the end,dcast
gave me too much of a headache to return more than one result at once. Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed betweenreshape
andreshape2
.Instead, my search led me to the
plyr
package, which did the trick:> require(plyr) > > # Create a data frame > dfx <- data.frame( + group = c(rep('A', 8), rep('B', 15), rep('C', 6)), + sex = sample(c("M", "F"), size = 29, replace = TRUE), + age = runif(n = 29, min = 18, max = 54) + ) > > # Just to see what it looks like > head(dfx) group sex age 1 A F 40.39112 2 A F 25.97179 3 A F 42.89477 4 A F 42.36620 5 A F 20.84443 6 A F 46.19999 > > # The magic. > # split the dataframe over group and sex > # call the summarise function on the resulting rows > # in particular, get the mean and the length. > ddply(dfx, .(group, sex), summarise, + mean=round(mean(age),2), + count=length(age)) group sex mean count 1 A F 33.83 8 2 B F 32.73 9 3 B M 27.37 6 4 C F 36.45 1 5 C M 31.04 5 > > > # using another example > # -------------- > dim(mtcars) # 32 observations of 11 variables [1] 32 11 > head(mtcars) # A quick look at the data mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > > # Split over cylinders and horsepowers > # calculate the mean mpg per thusly created group > # and list the number of observations per group > ddply(mtcars, .(cyl,hp), summarise, meanMPG=round(mean(mpg),2), count=length(wt)) cyl hp meanMPG count 1 4 52 30.40 1 2 4 62 24.40 1 3 4 65 33.90 1 4 4 66 29.85 2 5 4 91 26.00 1 6 4 93 22.80 1 7 4 95 22.80 1 8 4 97 21.50 1 9 4 109 21.40 1 10 4 113 30.40 1 11 6 105 18.10 1 12 6 110 21.13 3 13 6 123 18.50 2 14 6 175 19.70 1 15 8 150 15.35 2 16 8 175 18.95 2 17 8 180 16.30 3 18 8 205 10.40 1 19 8 215 10.40 1 20 8 230 14.70 1 21 8 245 13.80 2 22 8 264 15.80 1 23 8 335 15.00 1
HTH.
EDIT the
mtcars
dataset is a builtin sample set. Uselibrary(help="datasets")
to find out which other builtin sample datasets you have.
-
Technically, I should be able to apply several aggregation functions on the data and get multiple columns in the output, but it seems as if this was "sort of" removed between reshape and reshape2.
That.
All the magical coercion of types and structures within all the nesting of functions seems to be at the root of all evil in R.
-
Ever see a banyan tree?
IMHO, that's how the "root of all evil" looks like in R.
-
-
-
Time and money
Indicates addition.
Girls = 2Money
-
Indicates addition.
Girls = 2Money
-
Indicates addition.
Girls = 2Money
But that's just applying the distributive property. They take money for each unit of time. Sure, you can multiply it out and add it up, but that's just math.
-
Best be careful, before you evolve it into 2 girls 1 money.