## Thursday, 12 September 2013

### A technique for doing parameterized unit test in R: Case study with stock price data analysis

Ensuring the quality and correctness of statistical or scientific software in general constitute as one for the main responsibilities of scientific software developers and scientists who provide a code to solve a specific computational task. Sometimes tasks could be mission critical. For example, in drug trails, clinical research or designing an aviation related component,  a wrong outcome would risk human life. ACM award holder John Chambers, creator of S language, which R project is inspired from, stated in his brilliant book
Both the data analyst and the software provider therefore have a strong responsibility to produce a result that is trustworthy, and, if possible, one that can be shown to be trustworthy.
One of the tools in fulfilling this responsibility in practise is implementing unit tests for the functionality of our R code. Usually in a package setting. Aim of unit tests are to ensure that installed or the code in use are producing what is designed for in the smallest individual atomic unit, i.e., functions. RUnit package on CRAN provides set of tools to archive this. Set of assertions can be tested to see if they are all TRUE, leading to a passing test. Instead of talking about what coverage of unit tests are sufficient and what to test in the first place. I would like to ask a different question: What happens if our assertions on the outcome of the function are based on the data? For example, a numerical value that represents a quantity extracted from data by the function in test. Normally, we hard code that numeric value in our unit tests. But if we have lots of different data inputs we may end up re-writing, essentially multiple copies of the same unit test. One way to avoid doing this is utilising what is called in general parametrised unit testing (PUT). In this post I would like to use PUT in a strategy  mixed with meta-programming. See my recent post on how to do basic meta-programming with R.

The essential idea of PUT is to write a function in R that produces the unit tests on the fly based on the parameters, like the name of the data and the value of the function. Let's have one trivial example that demonstrates this idea. Imagine that we would like to check the volatility of a stock price in the given time interval. A naive way of doing this is to check stock price standard deviation using R's own sd function, So we are unit testing sd function here. And remember that our unit test must not  contain any numerical value hard coded in its body, while that is the thing we would like to avoid. Let's use the following as a 'template' unit test function:
vecSD &lt;- function(vec, expectedSD) {
sdComputed &lt;- sd(vec)
checkEquals(sdComputed, expectedSD, tol=1e-6)
}

A metaprogramming exercise will come into picture. Let's assume we would like to check volatility of Google, Microsoft and Apple's stock prices from the beginning of 2005 till the end of 2012.  So to speak we need to re-write the  vecSD function with 3 different parameter sets. An important note that, we can not  call vecSD as a unit test function. Recall that unit tests do not contain any arguments. They are only executed for pass or error, c.f., Runit vignette. We need to generate 3 unit test functions that re-writes vecSD function using Google, Microsoft and Apple stock prices. Let's make this a generic exercise, and write a function that generates n such functions:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 vecSDmeta <- function(vecList, expectedSDvec) { nFunctions <- length(expectedSDvec) for(i in 1:nFunctions) { fName <- paste("vecSDunit", i, sep="") print(fName) assign(fName, eval( substitute( function() { sdComputed <- sd(vec) checkEquals(sdComputed, expectedSD, tol=1e-3) }, list(vec=vecList[[i]], expectedSD=expectedSDvec[i]) ) ), envir=parent.frame() ) } } 
We have used the following functions provided by R, substitute and assign. One can think of using body as well but substitution should be performed similarly. Let's use this function with real data, we mentioned above. First we get the stock price data using quandmod and put them in a form which is suitable to call vecSDmeta:

  1 2 3 4 5 6 7 8 9 10 11 12 13 require('quantmod') require('RUnit') data.env <- new.env() getSymbols(c("MSFT", "GOOG", "AAPL"), src='yahoo', env=data.env, from '2005-01-01', to='2012-12-31'); openGoog <- data.env$GOOG[1:dim(data.env$GOOG)[1], 1] openMS <- data.env$MSFT[1:dim(data.env$MSFT)[1], 1] openAP <- data.env$AAPL[1:dim(data.env$AAPL)[1], 1] vecList <- list( vGoog = as.vector(openGoog), vMS = as.vector(openMS), vAP = as.vector(openAP) ) expectedSDvec <- c(126.5066, 3.391194, 169.4987) # this is expected 

Now if we run vecSDmeta, 3 new unit test functions will be defined automatically. And if we run those generated unit test functions, they should produce all TRUE, otherwise you have a problem with either the volatility value or the data itself.

  1 2 3 4 5 6 7 8 9 10 > vecSDmeta(vecList, expectedSDvec) [1] "vecSDunit1" [1] "vecSDunit2" [1] "vecSDunit3" > vecSDunit1() [1] TRUE > vecSDunit2() [1] TRUE > vecSDunit3() [1] TRUE 
Note that vecSDunitX functions has no parameters and they will be available only after vecSDmeta execution. Which is suitable to use with RUnit. One nice extension to this exercise would be to add a parameter that tells us a file path of the data, if it is read from a file. A more complex case is to have a description file that maps parameters, data and unit test generator can be established, for example using XML, YAML or JSON  formats for descriptions. Of course the approach shown here appears to be   quite primitive  compare to templates provided by C++. Moreover, symbolic manipulation we made using substitute can be extended and put in to general framework, similar to .Net PUT, so we would only provide the template function.

In summary, the technique I have described above would save a lot of time if you deal with testing lots of data and specially in often changing code base. You could be sure that the code does not make funny things if you use many different datasets. Take home message is: if you are testing your functions with a data, don't use hard-coded unit tests, those unit tests will surely not last much longer if you have incoming datasets to test and code base is in constant review.

## Sunday, 8 September 2013

### Weighted Histograms: MATLAB or Octave

Histograms are widely used in exploratory data analysis.  In some applications, such as Monte Carlo simulations, sometimes the data or the distribution manifest with corresponding weights, for example in the analysis of  umbrella sampling results. In those situations weights of each data point should be taken into account in constructing histograms out of data points.  If you are programming with R,  weights package would help you to do this. Recently, I have added small utility function in mathworks file exchange called Generate Weighted Histograms to plot weighted histograms. In this post, I would like to demonstrate how these function can be used. Let's generate some synthetic data, which contains values and corresponding weights and find the weighted histogram.

 1 2 3 4 5 6 7 % Generate synthetic data myData.values = rand(100,1); myData.weights = rand(100,1); % Generate weighted histogram [histw, vinterval] = histwc(myData.values, myData.weights, 10); % Visualize bar(vinterval, histw) 
This trivial example demonstrate the usage of histwc function. One can adjust the bins as the third argument of the function.

## Thursday, 5 September 2013

### Metaprogramming in R with an example: Beating lazy evaluation

Functional languages allows us to treat functions as types. This brings us a distinct advantage of being able to write a code that generates further code, this practise is generally known as metaprogramming. As a functional language R project provides tools to perform well structured code generation. In this post, I will present a simple example that generates functions on the fly based on different parametrisation in the function body. Consider the following simple function taking a vector as an argument and returning the number of element that are higher than a given threshold.
 1 2 3 4 myFun <- function(vec) { numElements <- length(which(vec > threshold)) numElements } 
If somehow we need to have a different threshold value within the body, for a moment accept that it is a requirement rather than proposing to have an other argument in the function definition. Instead of rewriting the function by hand we will write a function that generates all these functions in our work space. Problematic bit of this exercise will be to beat lazy evalution.  Here is the function that produces losts of myFun type functions:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 genMyFuns <- function(thresholds) { ll <- length(thresholds) print("Generating functions:") for(i in 1:ll) { fName <- paste("myFun.", i, sep="") print(fName) assign(fName, eval( substitute( function(vec) { numElements <- length(which(vec > tt)); numElements; }, list(tt=thresholds[i]) ) ), envir=parent.frame() ) } } 
Let's shortly analyse  this function. If we don't use substitute explicitly there, due to lazy evalution our value for the threshold will not be assigned at the loop value but the  last value of thresholds[i]. Here is one numeric example on the R CLI session:

  1 2 3 4 5 6 7 8 9 10 11 12 > genMyFuns(c(7, 9, 10)) [1] "Generating functions:" [1] "myFun.1" [1] "myFun.2" [1] "myFun.3" > myFun.1(1:20) [1] 13 > myFun.2(1:20) [1] 11 > myFun.3(1:20) [1] 10 > 
To be able to generate code is very powerful tool. However, a caution should be taken in practicing code generation in a large project. This may bring more problems in debugging. Every powerful method comes with a hidden cost.