Selecting rows conditioned on values in columns is easy, as for example selecting people aged over 33. What is about selecting rows conditioned on statistics computed on multiple rows of the data frame, as for example selecting pupils in groups by the number of pupils per group?
That is where the very nice dplyr package comes in.
We build and print the data frame:
df <- data.frame(id=1:9, classid=c(1,1,1,2,2,3,3,3,3), math=round(runif(9,1,6),1))
> print(df) id classid math 1 1 1 5.4 2 2 1 4.0 3 3 1 1.1 4 4 2 2.2 5 5 2 3.9 6 6 3 2.7 7 7 3 6.0 8 8 3 2.0 9 9 3 1.6
Now, we want to select – i.e. “filter” in terms of the dplyr package – pupils that are part of groups/classes with more than two pupils per class. In dplyr there are three different syntaxes to achieve this.
# step-by-step df.g <- group_by(df, classid) df.n <- filter(df.g, n()>2) # or nested syntax df.n <- filter(group_by(df, classid), n()>2) # or with %.% operator df.n <- df %.% group_by(classid) %.% filter(n()>2)
The result is the same for all:
> print(df.n) Source: local data frame [7 x 3] Groups: classid id classid math 1 1 1 5.4 2 2 1 4.0 3 3 1 1.1 4 6 3 2.7 5 7 3 6.0 6 8 3 2.0 7 9 3 1.6
Of course, you can do this the pure-R-way,
> df.c <- df[df$classid %in% which(xtabs(~classid, df)>2), ] > print(df.c) id classid math 1 1 1 5.4 2 2 1 4.0 3 3 1 1.1 6 6 3 2.7 7 7 3 6.0 8 8 3 2.0 9 9 3 1.6
but I think with dplyr it looks quite a bit nicer.
Happy dpylr!