Monthly Archives: March 2014

Select Pupils by Number of Pupils per Group

Selecting rows conditioned on values in columns is easy, as for example selecting people aged over 33. What is about selecting rows conditioned on statistics computed on multiple rows of the data frame, as for example selecting pupils in groups by the number of pupils per group?

That is where the very nice dplyr package comes in.

We build and print the data frame:

df <- data.frame(id=1:9, classid=c(1,1,1,2,2,3,3,3,3), math=round(runif(9,1,6),1))
> print(df)
  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
4  4       2  2.2
5  5       2  3.9
6  6       3  2.7
7  7       3  6.0
8  8       3  2.0
9  9       3  1.6

Now, we want to select – i.e. “filter” in terms of the dplyr package – pupils that are part of groups/classes with more than two pupils per class. In dplyr there are three different syntaxes to achieve this.

# step-by-step
df.g <- group_by(df, classid)
df.n <- filter(df.g, n()>2)

# or nested syntax
df.n <- filter(group_by(df, classid), n()>2)

# or with %.% operator
df.n <- df %.% 
  group_by(classid) %.% 
  filter(n()>2)

The result is the same for all:

> print(df.n)
Source: local data frame [7 x 3]
Groups: classid

  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
4  6       3  2.7
5  7       3  6.0
6  8       3  2.0
7  9       3  1.6

Of course, you can do this the pure-R-way,

> df.c <- df[df$classid %in% which(xtabs(~classid, df)>2), ]
> print(df.c)
  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
6  6       3  2.7
7  7       3  6.0
8  8       3  2.0
9  9       3  1.6

but I think with dplyr it looks quite a bit nicer.

Happy dpylr!