Thursday 12 September 2013

A technique for parameterized unit testing in R: A case study with stock price data analysis

Ensuring the quality and correctness of statistical, or more generally scientific, software is one of the main responsibilities of scientific software developers and of scientists who provide code to solve a specific computational task. Some tasks are mission critical: in drug trials, clinical research or the design of aviation-related components, a wrong outcome can put human lives at risk. John Chambers, recipient of the ACM Software System Award and creator of the S language from which R is derived, stated in his brilliant book Software for Data Analysis: Programming with R:
 Both the data analyst and the software provider therefore have a strong responsibility to produce a result that is trustworthy, and, if possible, one that can be shown to be trustworthy.
One of the tools for fulfilling this responsibility in practice is implementing unit tests for the functionality of our R code, usually in a package setting. The aim of unit tests is to ensure that the installed code, or the code in use, produces what it is designed to produce at the level of the smallest atomic units, i.e., functions. The RUnit package on CRAN provides a set of tools to achieve this: a set of assertions can be tested to see whether they are all TRUE, leading to a passing test. Instead of discussing what unit test coverage is sufficient and what to test in the first place, I would like to ask a different question: what happens if our assertions on the outcome of a function are based on data? For example, a numerical value that represents a quantity extracted from the data by the function under test. Normally, we hard-code that numeric value in our unit tests. But if we have many different data inputs, we may end up rewriting essentially multiple copies of the same unit test. One way to avoid this is to utilise what is generally called parameterized unit testing (PUT). In this post I would like to use PUT in a strategy mixed with metaprogramming; see my recent post on how to do basic metaprogramming with R.

The essential idea of PUT is to write a function in R that produces the unit tests on the fly based on parameters, such as the name of the data and the expected value of the function. Let's look at one trivial example that demonstrates this idea. Imagine that we would like to check the volatility of a stock price over a given time interval. A naive way of doing this is to check the standard deviation of the stock price using R's own sd function, so in effect we are unit testing the sd function here. Remember that our unit test must not contain any hard-coded numerical value in its body, since that is exactly what we want to avoid. Let's use the following as a 'template' unit test function:
vecSD <- function(vec, expectedSD) {
  sdComputed <- sd(vec)
  checkEquals(sdComputed, expectedSD, tol=1e-6)
}
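As a quick sanity check (my addition, not part of the original post), the template can be exercised directly with a vector whose standard deviation is known in closed form:
require('RUnit')
v <- c(1, 3, 5, 7, 9)
vecSD(v, sqrt(10))   # sd(v) is exactly sqrt(10), so checkEquals returns TRUE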
Here a metaprogramming exercise comes into the picture. Let's assume we would like to check the volatility of Google, Microsoft and Apple stock prices from the beginning of 2005 until the end of 2012. So to speak, we need to rewrite the vecSD function with three different parameter sets. An important note: we cannot use vecSD itself as a unit test function. Recall that unit test functions take no arguments; they are simply executed and either pass or raise an error, cf. the RUnit vignette. We need to generate three unit test functions that rewrite vecSD using the Google, Microsoft and Apple stock prices. Let's make this a generic exercise and write a function that generates n such functions:
vecSDmeta <- function(vecList, expectedSDvec) {
  nFunctions <- length(expectedSDvec)
  for(i in 1:nFunctions) {
    fName <- paste("vecSDunit", i, sep="")
    print(fName)
    assign(fName, eval(
                       substitute(
                                  function() {
                                    sdComputed <- sd(vec)
                                    checkEquals(sdComputed, expectedSD, tol=1e-3)
                                  },
                                  list(vec=vecList[[i]], expectedSD=expectedSDvec[i])
                                 )
                      ),
             envir=parent.frame()
           )
  }
}
We have used two functions provided by base R: substitute and assign. One could reach for body<- as well, but the substitution would still need to be performed in much the same way; a sketch of that variant is given below.
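Here is a minimal sketch of that body-based variant (my addition, not from the original post; vecSDunitFromTemplate is a hypothetical helper name). It builds a zero-argument unit test and splices the data into its body with substitute:
vecSDunitFromTemplate <- function(vec, expectedSD) {
  f <- function() NULL                      # empty zero-argument unit test skeleton
  body(f) <- substitute({
    sdComputed <- sd(v)
    checkEquals(sdComputed, e, tol=1e-3)
  }, list(v=vec, e=expectedSD))
  f                                         # return the generated unit test function
}
# u <- vecSDunitFromTemplate(1:5, sqrt(2.5)); u()  # TRUE

Now let's use vecSDmeta with the real data mentioned above. First we fetch the stock price data with quantmod and put it into a form suitable for calling vecSDmeta: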

require('quantmod')
require('RUnit')
data.env <- new.env()
getSymbols(c("MSFT", "GOOG", "AAPL"), src='yahoo', env=data.env, from='2005-01-01', to='2012-12-31');
openGoog <- data.env$GOOG[1:dim(data.env$GOOG)[1], 1]
openMS   <- data.env$MSFT[1:dim(data.env$MSFT)[1], 1]
openAP   <- data.env$AAPL[1:dim(data.env$AAPL)[1], 1]
vecList  <- list(
                 vGoog = as.vector(openGoog),
                 vMS   = as.vector(openMS),
                 vAP   = as.vector(openAP)
                )
expectedSDvec <- c(126.5066, 3.391194, 169.4987) # these are the expected values

Now if we run vecSDmeta, three new unit test functions will be defined automatically. If we then run those generated unit test functions, they should all produce TRUE; otherwise there is a problem with either the expected volatility values or the data itself.

>  vecSDmeta(vecList, expectedSDvec)
[1] "vecSDunit1"
[1] "vecSDunit2"
[1] "vecSDunit3"
> vecSDunit1()
[1] TRUE
> vecSDunit2()
[1] TRUE
> vecSDunit3()
[1] TRUE
Note that the vecSDunitX functions take no parameters and become available only after vecSDmeta has been executed, which makes them suitable for use with RUnit. One nice extension to this exercise would be to add a parameter giving the file path of the data, if the data is read from a file. A more complex setup would use a description file that maps parameters and data to the unit test generator, for example in XML, YAML or JSON format; a sketch of the JSON case is given below. Of course, the approach shown here is quite primitive compared to the templates provided by C++. Moreover, the symbolic manipulation we performed with substitute could be extended into a general framework, similar to parameterized unit testing in .NET, so that we would only need to provide the template function.
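For illustration, here is a hedged sketch of the JSON variant (entirely my addition: the file name sd_tests.json, its fields and the CSV layout are hypothetical, and the jsonlite package is assumed). Each entry would look like {"csv": "goog_open.csv", "expectedSD": 126.5066}:
require('jsonlite')

genTestsFromDescription <- function(descFile) {
  desc    <- fromJSON(descFile)                  # data frame with columns csv and expectedSD
  vecList <- lapply(desc$csv, function(f) read.csv(f)[[1]])
  vecSDmeta(vecList, desc$expectedSD)            # generate the unit tests as before
}
# genTestsFromDescription("sd_tests.json")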

In summary, the technique described above can save a lot of time if you test against lots of data, especially in a frequently changing code base. You can be more confident that the code does not do anything odd when it is exercised with many different datasets. The take-home message is: if you are testing your functions against data, don't hard-code values into your unit tests; such tests will not last long when new datasets keep arriving and the code base is under constant review.

Sunday 8 September 2013

Weighted Histograms: MATLAB or Octave

Histograms are widely used in exploratory data analysis. In some applications, such as Monte Carlo simulations, the data or the distribution comes with corresponding weights, for example in the analysis of umbrella sampling results. In those situations the weight of each data point should be taken into account when constructing a histogram. If you are programming in R, the weights package can help you do this. Recently, I added a small utility function called Generate Weighted Histograms to the MathWorks File Exchange for plotting weighted histograms. In this post, I would like to demonstrate how this function can be used. Let's generate some synthetic data containing values and corresponding weights, and compute the weighted histogram.

% Generate synthetic data
myData.values = rand(100,1); 
myData.weights = rand(100,1);
% Generate weighted histogram
[histw, vinterval] = histwc(myData.values, myData.weights, 10);
% Visualize
bar(vinterval, histw) 
This trivial example demonstrates the usage of the histwc function; the number of bins can be adjusted via its third argument. An R sketch of the same exercise is given below.
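For R users, the analogous exercise might look like the following hedged sketch (my addition), assuming the wtd.hist function from the weights package mentioned above:
require('weights')
values <- runif(100)          # synthetic values
wts    <- runif(100)          # corresponding weights
wtd.hist(values, breaks=10, weight=wts)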


Thursday 5 September 2013

Metaprogramming in R with an example: Beating lazy evaluation

Functional languages allow us to treat functions as first-class values. This brings the distinct advantage of being able to write code that generates further code, a practice generally known as metaprogramming. As a functional language, R provides tools for performing well-structured code generation. In this post, I will present a simple example that generates functions on the fly, based on different parametrisations of the function body. Consider the following simple function, which takes a vector as an argument and returns the number of elements that are larger than a given threshold.
myFun <- function(vec) {
  numElements <- length(which(vec > threshold))
  numElements
}
Suppose we need a different threshold value within the body; for the moment, accept this as a requirement rather than proposing another argument in the function definition. Instead of rewriting the function by hand, we will write a function that generates all of these functions in our workspace. The problematic bit of this exercise is to beat lazy evaluation. Here is the function that produces lots of myFun-type functions:

genMyFuns <- function(thresholds) {
  ll <- length(thresholds)
  print("Generating functions:")
  for(i in 1:ll) {
    fName <- paste("myFun.", i, sep="")
    print(fName)
    assign(fName, eval(
                       substitute(
                                  function(vec) {
                                    numElements <- length(which(vec > tt));
                                    numElements;
                                  }, 
                                  list(tt=thresholds[i])
                                 )
                      ),
             envir=parent.frame()
           )
  }
}
Let's briefly analyse this function. If we did not use substitute explicitly there, then due to lazy evaluation the threshold would not be fixed at the current loop value; every generated function would end up seeing the last value of thresholds[i]. Here is a numeric example from an R session:

>  genMyFuns(c(7, 9, 10))
[1] "Generating functions:"
[1] "myFun.1"
[1] "myFun.2"
[1] "myFun.3"
>  myFun.1(1:20)
[1] 13
>  myFun.2(1:20)
[1] 11
>  myFun.3(1:20)
[1] 10
> 
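To see concretely what goes wrong without substitute, here is a hedged illustration (my addition, not from the original post; genMyFunsNaive is a hypothetical name). In the naive generator below, i and thresholds are resolved only when a generated function is eventually called, i.e. after the loop has finished:
genMyFunsNaive <- function(thresholds) {
  for(i in 1:length(thresholds)) {
    assign(paste("myFunNaive.", i, sep=""),
           function(vec) length(which(vec > thresholds[i])),  # thresholds[i] is not fixed here
           envir=parent.frame())
  }
}
genMyFunsNaive(c(7, 9, 10))
myFunNaive.1(1:20)  # 10, not 13: every generated function sees the last threshold
myFunNaive.3(1:20)  # 10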
Being able to generate code is a very powerful tool. However, caution should be exercised when practicing code generation in a large project, as it may bring additional problems in debugging. Every powerful method comes with a hidden cost.

Thursday 13 June 2013

Practicing static typing in R: The Prime Directive on trusting our functions with object-oriented programming

John Chambers, the creator of the S language from which R is derived, wrote in his book Software for Data Analysis: Programming with R:
...This places an obligation on all creators of software to program in such a
way that the computations can be understood and trusted. This obligation
I label the Prime Directive.
He was referring to the Prime Directive from Star Trek. One practice in this direction is to have proper checks in place for the types we use: we can then trust that if we pass, for example, a wrong type to our function, it will fail gracefully. The type system of a programming language is therefore quite important in mission-critical numerical computations. Since R is a dynamically typed language, similar to Perl, Python or MATLAB/Octave, most R users rarely, if ever, place type checks in their functions. For example, take the following function that accepts a matrix, a vector and a function. It applies the given function to each of the matrix columns listed in the vector. Assuming the function returns a single number, our function will return a vector of numbers.

myMatrixOperation  <-  function(A, v, fName) {
  sliceA <-  A[, v];
  apply(sliceA, 2, fName);
}

One obvious way is to put an if statement for each argument in our function, so that it looks like:

myMatrixOperation <- function(A, v, fName) {
  if(!is.matrix(A)) {
   stop("A is not a matrix");
  }
  if(!is.vector(v)) {
   stop("v is not a vector");
  }
  if(!is.function(fName)) {
   stop("fName is not a function");
  }
  sliceA <- A[, v];
  apply(sliceA, 2, fName);
}
The problem with this approach is that it is too verbose: if the same pattern of arguments repeats across many functions, we would copy and paste this code many times. That not only looks ugly but also wastes our time. Luckily, there is a mechanism to address this: the S4 class system. Let's define an S4 class for our set of arguments, followed by an example instantiation.

setClass("mySlice", representation(A="matrix", v="vector", fName="function"))

myS <- new("mySlice", A=matrix(rnorm(9),3,3), v=c(1,2), fName=mean)
str(myS)
Formal class 'mySlice' [package ".GlobalEnv"] with 3 slots
  ..@ A    : num [1:3, 1:3] 0.356 -0.34 -0.642 -0.466 2.915 ...
  ..@ v    : num [1:2] 1 2
  ..@ fName:function (x, ...)

Now we rewrite the function to use our S4 class, so that type checking reduces to verifying the class of a single passed object:
is.mySlice <- function(obj) {
  l <- FALSE 
  if(class(obj)[1] == "mySlice") { l <- TRUE } 
  l 
} 

myMatrixOperation <- function(mySliceObject) { 
  if(!is.mySlice(mySliceObject)) { 
    stop("argument is not class of mySlice") 
  }  
  sliceA <- mySliceObject@A[, mySliceObject@v]; 
  apply(sliceA, 2, mySliceObject@fName); 
} 
This simple example demonstrates how we can introduce good organization into our R code, in a way that obeys the Prime Directive. If you practice this kind of approach in your R code, then I can only say: live long and prosper. A further, more modern approach to object orientation introduced by John Chambers is Reference classes; a rough sketch of the same example in that style follows.
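As an illustration of that direction, here is a hedged sketch (my addition; mySliceRef is a hypothetical class name) of the same slicing example written with Reference classes via setRefClass:
mySliceRef <- setRefClass("mySliceRef",
  fields  = list(A="matrix", v="numeric", fName="function"),
  methods = list(
    sliceApply = function() {
      apply(A[, v, drop=FALSE], 2, fName)   # fields are type-checked on assignment
    }
  )
)
myR <- mySliceRef$new(A=matrix(rnorm(9), 3, 3), v=c(1, 2), fName=mean)
myR$sliceApply()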

Monday 10 June 2013

Ripley Facts

Normally, this blog contains only technical and scientific posts. But this time I would like to share a very interesting phenomenon I came across on the R mailing lists. I call it 'Ripley Facts', after the prolific statistician, educator, academic, author and R core developer Professor Brian Ripley [here]. The 'facts' are his replies to questions. I have listed some of my favourite quotes from those replies below; the year of each post is given at the end of the quote and linked to the archive:
  • Once you appreciate that you have seriously misread the page, things will become a lot clearer. (2005)
  • You will need to do your homework a lot more carefully, as it seems you don't have enough knowledge to recognise the errors you are making. (2007)
  • Well, don't try to use a Makefile as you do not know what you are doing. (2013)
  • It is user lack-of-understanding: there is no error here. (2013)
So be careful if you decide to post something not well formed to the R mailing lists: you will surely be grilled. Actually, that's not the point. The main take-home message here is to practise self-criticism and the ability to find answers independently before asking for technical help and possibly wasting other people's time. Although in some cultures this sort of reply may come across as offensive, those cultures may have no idea about brilliant British humour. If you have a favourite quote from Prof. Ripley, please do let me know.

Monday 6 May 2013

Logarithmic Density Plots in MATLAB or Octave: Two Gaussians at Different Scales

3D data is commonly visualized with surface and mesh plots. The data may contain values that differ by several orders of magnitude, if not more. In this post, I demonstrate how to transform 3D data values to a logarithmic scale so that this magnitude discrepancy is not washed out by linear scaling. I give a code sample in MATLAB, but the example should be applicable to Octave as well, and the procedure is generic enough to be written easily in any other language of your choice.

The most common 3D data sets appear as values over spatial dimensions; for example, the third dimension could be a probability density. Let's say we sum two multivariate Gaussians with different means and covariances using mvnpdf. In the first figure only a small region is visible to us; there are data points that are washed out due to the different scale. In the code below we overwrite the (near-)zeros of the data matrix F with a very small value, say 1e-14, so that we can convert the values to logarithmic scale; remember that log(0) is -Inf and not plottable. The resulting figure gives us more information about our data.


% Two Gaussians on the plane
mu1    = [0 0];
sigma1 = [.25 .3; .3 1];
mu2    = [1, 1];
sigma2 = [.0025 .003; .003 0.1];
x      = -4:.1:4; 
y      = -4:.1:4;
[X, Y] = meshgrid(x,y);
F      = mvnpdf([X(:) Y(:)], mu1, sigma1) + mvnpdf([X(:) Y(:)], mu2, sigma2);
F      = reshape(F,length(x),length(y));
figure(1)
pcolor(X, Y, F);
shading flat 
xlabel('x');
ylabel('y');
c=colorbar; 
ylabel(c,'Probability Density') ;
figure(2)
% Fill (near-)zeros with a small number so that log10 is defined everywhere
smallN = 1e-14;
F(F < smallN) = smallN;
%
pcolor(X, Y, log10(F));
shading flat 
xlabel('x');
ylabel('y');
c=colorbar; 
ylabel(c,'Probability Density') ;
% 
clear all
close all
[x y] = meshgrid(-5:0.5:5, -5:0.5:5);
z     = - (1 - x.^2 -y.^2) .* exp(-(x.^2 + y.^2)).* exp(-(0.5*x.^2 - y.^2));
surf(z);


Figure 1: Two Gaussians in linear scale.
Figure 2: Two Gaussians in logarithmic scale (see the code above).

Wednesday 3 April 2013

Reading Binary Data Files Written in C: Python - Numpy Case Study

One of the major components of scientific computation is writing results to a file. Usually a bunch of numbers, often structured, is written into so-called data files. Writing binary files as output is common practice; it is chosen for its speed and/or memory efficiency compared to plain ASCII files. However, care must be taken to document endianness for good portability.

There are many popular languages for achieving the above task. For speed and efficiency reasons, usually C or Fortran is used to write out a binary file. Here is an example of writing one integer (42) and three doubles (0.01, 1.01, 2.01) into a binary file in C:

#include <stdio.h>
#include <stdlib.h>

int main() {
  FILE * myF;
  int i, j;
  double *numbers, kk;

  myF     = fopen("my.bindata", "wb");
  numbers = malloc(3*sizeof(double));
  i = 42;
  fwrite(&i, sizeof(int), 1, myF);
  for(j=0; j<3; j++) {
    kk = (double)j + 1e-2;
    numbers[j] = kk;
  }
  fwrite(numbers, sizeof(double), 3, myF);
  fclose(myF);
  free(numbers);
  return(0);
}

This code produces a binary file called my.bindata. Our aim is to read it into Python so we can post-process the results, e.g. for visualisation or further data analysis. The core idea is to use a higher-level language to process the outputs directly instead of writing further C code, thereby avoiding one more step in our workflow and the cumbersome compilation of extra C code.

To read files byte by byte, the Python standard library provides a module called struct. This module packs and unpacks data to and from binary sources; in this case study our source is a file. However, using it on a custom binary file whose format mixes different types is tedious and error prone, or at least requires some effort. At this point our friends are the NumPy facilities, especially two of them: numpy.dtype and numpy.fromfile. The former provides an easy way to define our file's format, in a syntax reminiscent of Fortran, by creating a data type object, as its name suggests. The latter reads the binary file in one go and returns a Python object containing all the information present in the data file. Here is the NumPy code that reads the binary file created by the above C code.

import numpy as np
dt      = np.dtype("i4, (3)f8")
myArray = np.fromfile('my.bindata', dtype=dt)
myArray[0] 
#(42, [0.01, 1.01, 2.01])
myArray[0][1] 
#array([ 0.01,  1.01,  2.01])
myArray[0][0] 
#42
myArray[0][1][1] 
#1.01
 
I tested this case study on a GNU/Linux PC, so the binary file is little-endian, hence the writing and reading patterns. Ideally, a generic wrapper around this Python code would help to simplify things.

Tuesday 5 March 2013

Big-O notation: Care with asymptotic behaviour

Big-O notation tells us how a given algorithm's run time scales with respect to input size. However, one has to be careful in interpreting this scaling (limiting) behaviour. First of all, when we talk about $O(N^2)$, it doesn't mean that the scaling holds exactly for any input size. Let's illustrate how the asymptotic behaviour expressed by big-O notation can be misleading for smaller inputs. Take the operation count $T(N) = 4N^2 - 2N$. If we considered only $T'(N) = N^2$, then $T'(4)$ would be exactly 4 times $T'(2)$; but using $T(N)$ itself, $T(4)$ is $T(4)/T(2) = 56/12 \approx 4.666667$ times $T(2)$.
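A quick numerical check of that arithmetic in R (my addition; Tn is just an illustrative name, chosen to avoid masking R's T shorthand for TRUE):
Tn <- function(N) 4*N^2 - 2*N
Tn(4) / Tn(2)   # 4.666667, not the factor of 4 suggested by the N^2 term alone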

A good example in this direction is N-body force calculations (or here). An $O(N^2)$ algorithm can perform better than $O(N \log N)$ ones up to a break-even point, which can be as high as 50K particles. A good paper explaining this is Pringle, Gavin J., "Comparison of an O(N) and an O(N log N) N-body solver" [pdf].

Friday 1 February 2013

Multicore Run in MATLAB via Python: Generation of Hénon Maps

Hénon map for a=1.4 and b=0.3.
Multi-core processors are now a de facto standard. In scientific computing it is common to face the problem of generating data from a single piece of functionality with many different parameters; this can be thought of as "single instruction, multiple parameter sets" computation. It is quite natural to utilise all the cores available on your host machine. Many people use MATLAB for rapid prototyping, so here I show how to generate many Hénon maps with different initial conditions (ICs) using MATLAB and Python. We will use Python to drive the computations, with the "single instruction" being a function in MATLAB.

Let's first briefly recall the definition of the Hénon map:

$ x_{n+1} =  y_{n} + 1 - \alpha x_{n}^2 $
$ y_{n+1} = \beta x_{n}  $

It is known that for the parameter ranges $\alpha \in [1.16, 1.41]$ and $\beta \in [0.2, 0.3]$ the map generates chaotic behaviour, i.e. sensitive dependence on initial conditions.

Here is a MATLAB function that generates a PNG file for given parameters, initial condition values, a file name and an upper bound on the number of iterations.

function hennonMap(a, b, xn, yn, upper, filename)
% HENNONMAP generate hennon map at given parameters and initial conditions
%
% Mehmet Suzen
% msuzen on gmail
% (c) 2013
% GPLv3
%
% a \in [1.16 1.41]
% b \in [0.2, 0.3]
%
% Example of running this in the command line
%  > matlab  -nodesktop -nosplash -logfile my.log -r "hennonMap(1.4, 0.3, -1.0, -0.4, 1e4, 'hennonA1.4B0.3.png'); exit;"
%
  Xval = zeros(upper, 1);
  Yval = zeros(upper, 1);
  for i=1:upper
    Xval(i) = xn;
    Yval(i) = yn;
    x  = yn + 1 - a*xn^2;
    y  = b * xn;
    xn = x;
    yn = y;
  end
  h = figure
  plot(Xval, Yval,'o')
  title(['Henon Map at ', sprintf(' a=%0.2f', a), sprintf(' b=%0.2f', b)])
  xlabel('x')
  ylabel('y')
  print(h, '-dpng', filename)
end

Running this function from the command line, as described in the function help, generates the figure shown above. Now imagine that we need to generate many plots with different ICs. It would be easy to open many shells and run the command line many times, but that is a manual task, and we developers do not like manual tasks at all! Python helps us here, specifically its multiprocessing module, to spawn the processes from a single script. The concept is closely related to threading, and the multiprocessing module mimics the threading API.

For example, let's say we have 16 cores and would like to generate Hénon maps for 16 different initial conditions. Here is Python code that runs the above MATLAB function by invoking 16 MATLAB instances (at most 12 at a time, per the pool size below), each using different ICs:

#
# Mehmet Suzen
# msuzen on gmail
# (c) 2013
# GPLv3
#  
#  Run Hennon Map on multi-process (spawning processes)
#

from multiprocessing import Pool
import commands
import numpy
import itertools

def f(argF):
        # argF holds one pair of initial conditions (xn, yn)
        xn           = '%0.1f' % argF[0]
        yn           = '%0.1f' % argF[1]
        filename     = "hennonA" + xn + "B" + yn + ".png"
        commandLine  = "matlab  -nodesktop -nosplash -logfile my.log -r \"hennonMap(1.4, 0.3," +  xn + "," + yn + ", 1e4, '" + filename + "'); exit;\""
        print commandLine
        return commands.getstatusoutput(commandLine)

if __name__ == '__main__':
        pool     = Pool(processes=12)              # start 12 worker processes
        xns      = list(numpy.linspace(-1.0, 0.5, 4))
        yns      = list(numpy.linspace(-0.4, 1.1, 4))
        print pool.map(f, list(itertools.product(xns, yns)))

It would be interesting to do the same thing using only R; a rough sketch of how that might look is given below. Maybe in a future post, I'll show it in full.
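As a teaser, here is a hedged sketch with R's parallel package (my addition; it assumes the same hennonMap.m is on MATLAB's path and a Unix-like system where mclapply can fork worker processes):
require('parallel')

runHenon <- function(xn, yn) {
  xs  <- sprintf("%0.1f", xn)
  ys  <- sprintf("%0.1f", yn)
  fn  <- paste("hennonA", xs, "B", ys, ".png", sep="")
  cmd <- paste("matlab -nodesktop -nosplash -logfile my.log -r \"hennonMap(1.4, 0.3, ",
               xs, ", ", ys, ", 1e4, '", fn, "'); exit;\"", sep="")
  system(cmd)   # launch one MATLAB instance for this pair of ICs
}

ics <- expand.grid(xn=seq(-1.0, 0.5, length.out=4), yn=seq(-0.4, 1.1, length.out=4))
invisible(mclapply(seq_len(nrow(ics)),
                   function(i) runHenon(ics$xn[i], ics$yn[i]),
                   mc.cores=12))   # 12 worker processes, as in the Python version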

Wednesday 30 January 2013

R's range and loop behaviour: Zero, One, NULL

One of the most common patterns in programming languages is the ability to iterate over a given set (usually a vector) using 'for' loops. In most modern scripting languages, ranges are built-in data structures and trivial to use with 'for' loops. In this post, I would like to discuss R's behaviour when the upper bound of a range is zero. For example:
> length(0:0)
[1] 1

> for(i in 0:0) { print(i) }
[1] 0
In Python, the behaviour is somewhat more intuitive (well, it depends on your interpretation of zero):
>>> len(range(0,0))
0
>>> for i in range(0,0): print i
...
>>>
So when you use 'for' loops in R (which you should avoid as much as you can, preferring 'apply'-type operations instead), do not assume that the loop body will be skipped when the upper bound is zero. This can be important when the upper bound comes from 'length' of a variable-length vector. A similar issue is discussed in section 8.1.60 of The R Inferno, where 'seq_along' is suggested as a replacement. However, this will also fail to catch the case where the intended sequence should be empty, because 0:0 is itself a vector of length one:
> for(i in seq_along(0:0)) { print(i) }
[1] 1
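As an aside (my addition, not in the original post): seq_len builds the sequence 1, ..., n directly from the intended upper bound and returns a zero-length vector when n is 0, so the loop body is skipped:
> for(i in seq_len(0)) { print(i) }
> 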
Hence, using 'length' or 'seq_along' on the vector to determine the upper bound is not recommended here, and neither is generating the sequence with 'seq' or the range operator ':'. To avoid this kind of bug completely, a wrapper for sequence generation can be used. This is a rather conservative approach, but it avoids any confusion if your vectors or sequences have variable length and there is a good chance that their upper bound will be zero at run time. Here is one simple wrapper:
> seqWrapper <- function(lb, ub, by=1) {
+   s <- c()
+   if(!ub <= lb) s <- seq(lb,ub, by=by)
+   return(s)
+ }
> seqWrapper(0,0)
NULL
Sometimes being conservative in coding helps to avoid a simple run-time bug.


(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License