Category Archives: R

Recursive R function to find intersection of two or more given sets

The intersect(x, y) function from the R “base” package works for no more than two vectors x and y. The following function accepts as many vectors as you like and returns the intersection of them all:

## recursive intersect function
intersect.rec <- function(x, ...) {
    a <- list(...)
    if(length(a) == 0) stop("Two or more arguments are needed.")
    if(is.list(a[[1]])) a <- a[[1]]
    if(length(a) == 1) return(intersect(x, a[[1]]))
    else intersect.rec(intersect(x, a[[1]]), a[-1])
}

Sample usage:

x <- letters[1:5]
y <- letters[2:5]
z <- letters[3:5]
intersect.rec(x, y, z)
## returns
## [1] "c" "d" "e"

This is also a basic example of an R function that uses the dots argument and recursively iterates over it.

A Simple Method of Sample Size Calculation for Logistic Regression

This posting is based on the paper “A simple method of sample size calculation for linear and logistic regression” by F. Y. Hsieh et al., which can be found at http://dx.doi.org/10.1002/(SICI)1097-0258(19980730)17:14<1623::AID-SIM871>3.0.CO;2-S .

We consider the case where we want to calculate the sample size for a multiple logistic regression with a binary response variable and continuous covariates.

Formula (1) in the paper computes the required sample size for a simple logistic regression, given the effect size to be tested, the event rate at the mean of the (single) covariate, the level of significance, and the required power of the test. This formula is implemented in the function SSizeLogisticCon() from the R package “powerMediation” and can easily be applied.

For the multiple case, Hsieh et al. introduce the variance inflation factor (VIF), with which the sample size for the simple case can be inflated to get the sample size for the multiple case. I have implemented it as an R function:

## p1: the event rate at the mean of the predictor X
## OR: expected odds ratio. log(OR) is the change in log odds 
##     for an increase of one unit in X.
##     beta*=log(OR) is the effect size to be tested
## r2: r2 = rho^2 = R^2, for X_1 ~ X_2 + ... + X_p
## requires the "powerMediation" package for SSizeLogisticCon()
ssize.multi <- function(p1, OR, r2, alpha=0.05, power=0.8) {
  n1 <- SSizeLogisticCon(p1, OR, alpha, power)
  np <- n1 / (1-r2)
  return(np)
}
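For illustration, with made-up values for p1, OR, and r2 (not taken from the paper), a call could look like this:

library(powerMediation)
ssize.multi(p1 = 0.4, OR = 1.5, r2 = 0.2)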

Another approximation for the simple case is given in Formula (4), which is based on formulae given by A. Whittemore in “Sample size for logistic regression with small response probability”. I have implemented the simple case,

## p1: as above
## p2: event rate at one SD above the mean of X
ssize.whittemore <- function (p1, p2, alpha = 0.05, power = 0.8) {
    beta.star <- log(p2*(1-p1)/(p1*(1-p2)))
    za <- qnorm(1 - alpha/2)
    zb <- qnorm(power)
    V0 <- 1
    Vb <- exp(-beta.star^2 / 2)
    delta <- (1+(1+beta.star^2)*exp(5*beta.star^2 / 4)) * (1+exp(-beta.star^2 / 4))^(-1)
    n <- (V0^(1/2)*za + Vb^(1/2)*zb)^2 * (1+2*p1*delta) / (p1*beta.star^2)
    n.int <- ceiling(n)
    return(n.int)
}

and the multiple case.

## all parameters as above
ssize.whittemore.multi <- function(p1, p2, r2, alpha=0.05, power=0.8) {
  n1 <- ssize.whittemore(p1, p2, alpha, power)
  np <- n1 / (1-r2)
  return(np)
}
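Again with made-up values for p1, p2, and r2 (purely illustrative), the two functions can be called like this:

ssize.whittemore(p1 = 0.3, p2 = 0.4)
ssize.whittemore.multi(p1 = 0.3, p2 = 0.4, r2 = 0.2)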

The complete R script, including the examples from the paper, can be found at http://rpubs.com/candrea/ssizelogreg .

 

New R Package “betas”

In social science it is often required to compute standardized regression coefficients – called beta coefficients, or simply betas. These betas can be interpreted as effects, and thus are independent of the original scale.

There are two approaches to obtaining betas: you Z-standardize the data before fitting, or you compute the betas after fitting the model.

The first approach has flaws if you have non-numeric variables in your data or if you intend to include interaction terms in your model.

The second approach is much more convenient, but until now there was no R package that helps you compute betas for as many kinds of models as you might need. For example, with lm.beta() from the “QuantPsyc” package you cannot handle models with factors that have more than two levels.

The features of the “betas” R package (as of v0.1.1) are:

Compute standardized beta coefficients and corresponding standard errors for the following models:

  • linear regression models with numerical covariates only
  • linear regression models with numerical and factorial covariates
  • weighted linear regression models
  • all these linear regression models with interaction terms
  • robust linear regression models with numerical covariates only

You can install the package from CRAN (http://cran.r-project.org/web/packages/betas/):

install.packages("betas")
library(betas)

The package is maintained on GitHub: https://github.com/andreaphsz/betas .

Feel free to report issues: https://github.com/andreaphsz/betas/issues .

Enjoy hassle-free computations of betas in R!

Robust Standardized Beta Coefficients

Standardized beta coefficients are defined as

beta = b * sd_x/sd_y

where b are the coefficients from an OLS linear regression, and sd_x and sd_y are the standard deviations of each x variable and of y, respectively.
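As a minimal sketch, assuming an ordinary lm() fit and using the built-in mtcars data purely for illustration, the formula can be applied by hand:

fit <- lm(mpg ~ wt + hp, data = mtcars)
b  <- coef(fit)[-1]                      # drop the intercept
sx <- sapply(mtcars[c("wt", "hp")], sd)  # sd of each x
sy <- sd(mtcars$mpg)                     # sd of y
beta <- b * sx / sy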

When you perform a robust linear regression, sd_x and sd_y no longer seem very meaningful, because variances, and hence standard deviations, are not robust. The R package “robust” provides the function covRob() to compute a robust covariance estimator.

I have written the following function to compute standardized beta coefficients for a robust linear regression. Setting the parameter classic=TRUE gives you the classic estimation of the beta coefficients. For very bad data, the covRob() function cannot compute the covariance due to singularities. In that case the classical estimator is returned instead.

## requires the "robust" package for lmRob(), covRob(), covClassic()
my.lmr.beta <- function (object, classic = FALSE) {
  if(!inherits(object, "lmRob"))
    stop("Object must be of class 'lmRob'")
  model <- object$model
  num <- sapply(model, is.numeric)  # numeric vars only
  b <- object$coefficients[num][-1]  # final coefficients w/o intercept
  ## compute robust covariance
  covr <- NULL
  try(covr <- diag(covRob(model[num])$cov), silent = TRUE)
  if(is.null(covr) && classic == FALSE)
    warning("covRob() could not be computed, covClassic() was applied instead.")
  ## compute classic covariance if robust failed
  if(is.null(covr) || classic == TRUE)
    covr <- diag(covClassic(model[num])$cov)
  sx <- sqrt(covr[-1])  # standard deviations of the x's
  sy <- sqrt(covr[1])   # standard deviation of y
  beta <- b * sx/sy
  return(beta)
}

 UPDATE — 2014-07-23

Computing standard deviations for factors makes sense as well, because the variance of a dummy variable is defined via the binomial distribution. So I have removed the num variable.

## requires the "robust" package for lmRob(), covRob(), covClassic()
my.lmr.beta <- function (object, classic = FALSE) {
  if(!inherits(object, "lmRob"))
    stop("Object must be of class 'lmRob'")
  model <- object$model
  #num <- sapply(model, is.numeric)  # numeric vars only (no longer needed)
  b <- object$coefficients[-1]  # final coefficients w/o intercept
  ## compute robust covariance
  covr <- NULL
  try(covr <- diag(covRob(model)$cov), silent = TRUE)
  if(is.null(covr) && classic == FALSE)
    warning("covRob() could not be computed, covClassic() was applied instead.")
  ## compute classic covariance if robust failed
  if(is.null(covr) || classic == TRUE)
    covr <- diag(covClassic(sapply(model, as.numeric))$cov)
  sx <- sqrt(covr[-1])  # standard deviations of the x's
  sy <- sqrt(covr[1])   # standard deviation of y
  beta <- b * sx/sy
  return(beta)
}
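For illustration, assuming the “robust” package is installed, a call could look like this (the built-in stackloss data is used only as an example):

library(robust)
fit.rob <- lmRob(stack.loss ~ ., data = stackloss)
my.lmr.beta(fit.rob)                  # robust standardized betas
my.lmr.beta(fit.rob, classic = TRUE)  # classic estimation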

 

Select Pupils by Number of Pupils per Group

Selecting rows conditioned on values in columns is easy, for example selecting people aged over 33. But what about selecting rows conditioned on statistics computed over multiple rows of the data frame, for example selecting pupils by the number of pupils per group?

That is where the very nice dplyr package comes in.

We build and print the data frame:

df <- data.frame(id=1:9, classid=c(1,1,1,2,2,3,3,3,3), math=round(runif(9,1,6),1))
> print(df)
  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
4  4       2  2.2
5  5       2  3.9
6  6       3  2.7
7  7       3  6.0
8  8       3  2.0
9  9       3  1.6

Now, we want to select – i.e. “filter” in terms of the dplyr package – pupils that are part of groups/classes with more than two pupils per class. In dplyr there are three different syntaxes to achieve this.

# step-by-step
df.g <- group_by(df, classid)
df.n <- filter(df.g, n()>2)

# or nested syntax
df.n <- filter(group_by(df, classid), n()>2)

# or with %.% operator
df.n <- df %.% 
  group_by(classid) %.% 
  filter(n()>2)

The result is the same for all:

> print(df.n)
Source: local data frame [7 x 3]
Groups: classid

  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
4  6       3  2.7
5  7       3  6.0
6  8       3  2.0
7  9       3  1.6

Of course, you can do this the pure-R way,

> df.c <- df[df$classid %in% names(which(xtabs(~classid, df) > 2)), ]
> print(df.c)
  id classid math
1  1       1  5.4
2  2       1  4.0
3  3       1  1.1
6  6       3  2.7
7  7       3  6.0
8  8       3  2.0
9  9       3  1.6

but I think with dplyr it looks quite a bit nicer.

Happy dplyr!

Compute SPSS-like mean index variables

Consider the problem described at theanalysisfactor.com.

The point is to compute meaningful mean index variables while missing values are present. In R you have the switch na.rm to tell a function – here the mean() function – what to do with missing values. Setting it to TRUE – mean(…, na.rm=TRUE) – makes the function use all non-missing values, even if there is only one. In the other case – mean(…, na.rm=FALSE) – the function returns NA even if there is only one missing value.
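A quick illustration of the difference:

x <- c(2, NA, 4)
mean(x, na.rm = TRUE)   # returns 3
mean(x, na.rm = FALSE)  # returns NA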

To handle this situation I have written a very handy function that works like the MEAN.{X}() function in SPSS, where {X} denotes the minimal number of variables that must be non-missing to be incorporated in computing the mean value.

My single-line R function looks like this:

spss.row.means <- function(vars, not.na=0) {
  apply(vars, 1, function(x) ifelse(sum(!is.na(x)) >= not.na, mean(x, na.rm=TRUE), NA))
}

As the first argument you pass the variables (in columns); the second argument is the minimal number of variables that must be non-missing.
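For example, with three items and the requirement of at least two non-missing values per row:

items <- data.frame(v1 = c(1, 2, NA),
                    v2 = c(3, NA, NA),
                    v3 = c(5, 4, 6))
spss.row.means(items, not.na = 2)
## returns
## [1]  3  3 NA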

Have fun(ction)!

 

The Beauty of Unique Alphanumeric Random Codes

My challenge was to generate random codes of variable length and size. The codes should all be unique and built of alphanumeric symbols. I wrote the following function using the built-in function sample():

alpha.rnd <- function(length, size) {
  if(size > 36^length) stop("\n size cannot be greater than 36^length")
  b <- NULL
  repeat {
    a <- NULL
    for (i in 1:length) {
      a <- paste0(sample(c(letters, 0:9), size, replace = TRUE), a)
    }
    b <- c(a, b)
    if(sum(duplicated(b)) == 0) break
    size <- sum(duplicated(b))
    b <- b[!duplicated(b)]
  }
  return(b)
}

Example:

> set.seed(77)
> alpha.rnd(3,5)
[1] "5qk" "94z" "f45" "au8" "qq0"

Enjoy!

Compiling R v3.1.0 with MKL Support on OS X v10.8.4

Inspired by Compiling R 3.0.1 with MKL support, I compiled R v3.1.0 with MKL (Intel® Math Kernel Library) support on OS X v10.8.4, wondering whether I would see any increase in performance without the need to use parallelism.

First of all you have to download and install MKL for OS X. Unfortunately there is no single package that includes only the library; instead you have to download either Intel® C++ Composer XE 2011 for Mac OS X or Intel® Fortran Composer XE 2011 for Mac OS X (see also “How can I download the Intel® IPP and Intel® MKL for Mac OS* X?“). Neither is freely available, but you can download and install the “Free 30-Day Evaluation Version”, as I did with Intel® C++ Composer XE 2011. The installation procedure works fine, and I trusted the final screen of the installation process, which told me that the installation had completed successfully.

Then the big nightmare began: trying to compile/build R from source…

Before you start, you have to install all the developer tools, such as Xcode and a Fortran compiler.

My first attempt was to compile R v3.0.1. After a few tries I ended up with these configure parameters:

./configure --enable-R-shlib --enable-threads=posix --with-lapack --with-blas="-fopenmp -m64 -I$MKLROOT/include -L$MKLROOT/lib/intel64 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lm" r_arch=x86_64 SHELL="/bin/bash" CC="gcc -arch x86_64 -std=gnu99" CXX="g++ -arch x86_64" OBJC="gcc -arch x86_64" F77="gfortran -arch x86_64" FC="gfortran -arch x86_64" --with-system-zlib

Both variables, MKLROOT and LD_LIBRARY_PATH, must be defined beforehand, e.g., with the following command:

source /opt/intel/mkl/bin/mklvars.sh intel64

“configure” went through fine, but then “make” failed with the following errors:

** testing if installed package can be loaded
*** arch - R
ERROR: sub-architecture 'R' is not installed
*** arch - x86_64
ERROR: loading failed for ‘R’
* removing ‘/Users/phsz/Downloads/R-3.0.1/library/MASS’
make[2]: *** [MASS.ts] Error 1
make[1]: *** [recommended-packages] Error 2
make: *** [stamp-recommended] Error 2

After googling for a while I found the hint to try R-devel from svn/trunk, i.e., R v3.1.0. Here everything went fine with “make”. Errors still occurred on “make install”; nevertheless I found that R was working when run with the “–arch” option, i.e., by executing

R --arch=x86_64

First, I ran the R benchmark, which showed impressive results:

R 3.1.0 with MKL:

Total time for all 15 tests_________________________ (sec):  7.47700000000002
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.443668541312943

R 3.0.1 without MKL:

Total time for all 15 tests_________________________ (sec):  34.8980000000001
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.34762633295107

My unspoken hope was to increase the performance of functions like unidimTest(). But in hindsight it clearly could not perform better, because of the chains and loops involved…

Have fun and be prepared for a few more surprises while compiling!

BTW uninstalling MKL is easy:

sudo /opt/intel/composer_xe_2013.3.171/uninstall_ccompxe.sh

 

Parallel computing unidimTest in IRT

In the R package ltm by Dimitris Rizopoulos there is a function called unidimTest(). This function is computationally very expensive due to the Monte Carlo procedure used inside. Without parallelization, only one core of your (surely) multicore computer is used for this computation. A simple modification of unidimTest() makes it possible to use the easy-to-use R package foreach for parallel computing, together with the R packages parallel and doParallel. See also parallel-r-loops-for-windows-and-linux by Vik Paruchuri.

I have changed these lines of unidimTest()

for (b in 1:B) {
  if (!missing(object)) 
    z.vals <- rnorm(n, ablts, se.ablts)
  data.new <- rmvlogis(n, parms, IRT = IRT, z.vals = z.vals)
  T.boot[b, ] <- eigenRho(data.new)$ev
}

to

T.boot[,] <- foreach(b=1:B, .combine="rbind", .errorhandling="pass") %dopar%
{
  #if (!missing(object)) 
  z.vals <- rnorm(n, ablts, se.ablts)
  data.new <- rmvlogis(n, parms, IRT = IRT, z.vals = z.vals)
  eigenRho(data.new)$ev   	
}

and named the function unidimTest.p().

The function is called like this:

library(foreach)
library(parallel)
library(doParallel)

registerDoParallel(cores=4) # set how many cores you want to use

set.seed(666)
proc <- proc.time()
uni.p <- unidimTest.p(fit.rasch.u, B=100)
proc.time() - proc

and works at least in a Mac OS X (10.8.3) environment with R version 3.0.1.