Saturday, 5 August 2017

Post-statistics: Lies, damned lies and data science patents

Figure: US Patent (Wikipedia)
Statistics is such an important field in our daily lives that the roughly 50-year-old discipline of data science, now applied to almost every human activity, can be seen as a kind of post-statistics, in the spirit of post-rock: it fuses operations research, data mining, software and performance engineering and, of course, a multitude of fields from statistics to machine learning. Even though the reputation of statistics is a bit shaky due to quotes like "Lies, damned lies, and statistics", post-statistics is still emerging and drives innovation in almost every industry that runs on data.
One of the most important characteristics of data science appears to be its shared ideals with the open source movement, i.e., free software. Note that "free" here means the freedom to use the source code and to share recipes, i.e., workflows and combinations of algorithms. The innovation in data science we have witnessed over the last 5 years or so is fundamentally driven by this attitude, embraced by giants like Microsoft, Google and IBM and supported by a huge number of enthusiastic individuals from industry and academia. These technology giants open source their workflows and tools to the entire community, Tensorflow being one example, and support the community through events or by investing in research that partly goes into the public domain. On the other hand, patents are traditionally designed to encourage a culture of innovation and invention: a kind of gift and natural right granted to the innovator, so that for a certain time frame he, she or the organisation can reap some benefits.


A recent patent on predicting data science project outcomes, unfortunately, does not entirely serve this purpose: Data science project automated outcome prediction, US 9710767 B1. Even though the patent is very well written and its scope reads as very restricted at first glance, the core principle is identical to the standard workflow a data science professional applies in the daily routine: producing an automated outcome prediction. The interpretation of 'data science project' is open to any activity involving outcome prediction. I am of course no legal expert, but based on this patent, which claims to invent an outcome prediction pipeline for a 'data science project', scikit-learn's workflow manager, Pipeline, could be taken to court, since it facilitates exactly the same outcome prediction pipeline this patent claims to invent. It does not matter how enforceable this is; it gives the patent holder an opportunity to sue anyone doing automated data science outcome prediction.

This patent, US 9710767 B1, is a tremendous disservice to the entire data science community and damaging to an industry and to professionals who are trying to use data for outcome prediction, for the greater good of society and to solve real problems. We certainly do not claim that data science is the solution to our problems in general, but it will help us tackle important problems in industry and society. So maybe in the post-statistics world we have to yell: lies, damned lies and data science patents. Holders of such patents look more like patent sharks or trolls than parties with a genuine intention of innovating or inventing.


Sunday, 18 June 2017

Pitfalls in pseudo-random number sampling at scale with Apache Spark

With kind contributions from Gregory Piatetsky-Shapiro, KDnuggets

In many data science applications and in academic research, techniques involving Bayesian inference are now commonly used. One of the basic operations in Bayesian inference is drawing instances from a given statistical distribution. This is, of course, well-known pseudo-random number sampling. The most commonly used methods first generate a uniform random number and then map it to the distribution of interest, either via its inverse cumulative distribution function (CDF), i.e., inverse transform sampling, or via a specialised mapping such as the Box-Muller transform for the Gaussian case.
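As a minimal sketch of the idea, here is the Box-Muller transform in plain Scala (no Spark needed); the helper boxMuller is introduced here only for illustration:

// Box-Muller transform: two independent Uniform(0,1) draws are mapped
// to two independent standard Gaussian draws.
import scala.util.Random

def boxMuller(rng: Random): (Double, Double) = {
  // nextDouble is in [0, 1); guard against log(0) for this sketch
  val u1 = math.max(rng.nextDouble(), Double.MinPositiveValue)
  val u2 = rng.nextDouble()
  val r  = math.sqrt(-2.0 * math.log(u1))
  (r * math.cos(2.0 * math.Pi * u2), r * math.sin(2.0 * math.Pi * u2))
}

val rng = new Random(4242)
val (z0, z1) = boxMuller(rng) // two standard normal samples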

Large-scale simulations are now possible thanks to highly stable computational frameworks that scale well. One such framework is Apache Spark, notable for its fault-tolerant distributed data structure, the Resilient Distributed Dataset (RDD). Here is a simple way to generate one million Gaussian random numbers and turn them into an RDD:

// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)                                        // fix the seed for reproducibility
val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian) // local collection on the driver
val ngauss_rdd = sc.parallelize(ngauss)                     // distribute it as an RDD
ngauss_rdd.count // 1 million

One unrealistic part of the above code example is that you may want to generate a huge number of samples that will not fit in a single machine's memory, unlike the ngauss variable above. Luckily, there is a set of library functions in MLlib that generates random data directly as an RDD, see RandomRDDs. But for the remainder of this post, we will use our home-made random RDD.
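For completeness, a small sketch of that route, assuming a running spark-shell where sc is the SparkContext (the value ngaussBig is introduced here only for illustration):

import org.apache.spark.mllib.random.RandomRDDs

// 10 million standard normal draws generated directly on the executors,
// spread over 8 partitions, with a fixed seed
val ngaussBig = RandomRDDs.normalRDD(sc, 10000000L, 8, 4242L)
ngaussBig.count // 10 million, never materialised on the driver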
Figure: Scaling of execution time with increasing data size, with or without re-partitioning.

Concept of Partitions

As RDDs are distributed data structures, the concept of a partition comes into play (link). So, you need to be careful about the size and number of partitions in your RDDs. Recently I posed a question about this on the Apache Spark mailing list (link)(gist). If an operation reduces the data size, take good care that your partitioning reflects this, so as to avoid a huge performance penalty. Unfortunately, Spark does not provide an out-of-the-box solution that optimises partition size automatically. The number of data items may shrink during your analysis pipeline, but a reduced RDD inherits the partitioning of its parent, and this can become a limiting issue.
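A quick way to see this behaviour in spark-shell, sketched by reusing ngauss_rdd from the snippet above (the names ngauss_filtered and ngauss_compact are introduced here only for illustration):

// A filtered RDD keeps its parent's partitioning, even if almost all
// elements are dropped, leaving many nearly empty partitions.
val ngauss_filtered = ngauss_rdd.filter(x => x > 4.0) // very few survivors
ngauss_rdd.getNumPartitions      // default partitioning of the parent
ngauss_filtered.getNumPartitions // identical, despite the tiny data size
val ngauss_compact = ngauss_filtered.repartition(1)   // shrink to match the data
ngauss_compact.getNumPartitions  // 1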

As you might have already guessed, RDDs are a great tool for doing large-scale analysis, but they won't give you a free lunch. Let's do a small experiment.

Hands-on Experiment

Going back to our original problem of using Spark in Bayesian inference algorithms, it is common to operate on samples via a certain procedure, let's say an operator, and it is highly likely that this operator will reduce the number of elements in the sample. One example is applying a cut-off, i.e., a constraint expressed through the CDF, which is essentially its definition: the probability that the random variable exceeds a threshold, $x > x_{0}$. As seen in the Figure, we have generated random RDDs of up to 10 million numbers and measured the wall-clock time of the `count` operation, which occurs after a filter operation that reduces the number of items considerably. See the code in the Appendix; a minimal inline sketch also follows the list below. As a result, we have identified 3 different regions in the Figure, depending on the data size:

  1.  Small Data: Re-partitioning does not play a role in the performance.
  2.  Medium Size: Re-partitioning gives up to an order of magnitude better performance.
  3.  Big Data: Re-partitioning gives a constant performance improvement of up to 3 times, and the improvement keeps growing, meaning it becomes more significant the larger the data size.
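The inline sketch mentioned above, reusing ngauss_rdd and introducing a tiny timing helper timeIt purely for illustration; the full benchmark, which produced the Figure, is in the Appendix:

// Tail cut-off followed by count, timed with and without repartitioning.
val survivors = ngauss_rdd.filter(x => x > 4.0)

def timeIt(block: => Unit): Double = {
  val t0 = System.nanoTime()
  block
  (System.nanoTime() - t0) / 1e9 // seconds
}

val tParent    = timeIt { survivors.count }   // parent partitioning retained
val compacted  = survivors.repartition(1)
val tCompacted = timeIt { compacted.count }   // after repartitioning
println(s"no repartition: $tParent s, with repartition: $tCompacted s")
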
Conclusion

Spark provides a superb API for developing high-quality data science solutions. However, programming with Spark and designing algorithms requires optimising different aspects of the RDD workflow. Here, we demonstrated only the dramatic effect of re-partitioning on wall-clock time after a simple operation. Hence, it is advisable to run a benchmark identifying under which circumstances your data pipeline produces different wall-clock behaviour before going into production.

Appendix

The entire code base can be cloned from GitHub (here).

Spark Benchmark

/*
     Generating Gaussian numbers : Performance with repartition
     RDD partition count is retained from the parent

     (c) 2017
     GPLv3

     Author: Mehmet Suzen (suzen at acm dot org)

     Run this script on spark-shell
     spark-shell --executor-cores 4
     spark-shell> :load gaussian_random_filter.scala

 */

import util.Random     // normal
import breeze.linalg._ // Linear Algebra objects and csvwrite
import java.io.File    // File io

/*

   Generate gaussian random numbers manually without mllib
   Benchmark repartition

 */
// Random numbers to generate
val Ns = Array(1e3, 1e4, 5e4, 1e5, 5e5, 1e6, 2e6, 4e6, 8e6, 1e7)
val benchMat = DenseMatrix.zeros[Double](10,3)
Random.setSeed(4242)
for(i <- 0 to Ns.size-1) {
   println("running for " + Ns(i))
   // Generate random RDD size Ns
   var ngauss     = (1 to Ns(i).toInt).map(x=>Random.nextGaussian)
   var ngauss_rdd = sc.parallelize(ngauss)
   var ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
   // An operation without repartition
   var t0 = System.nanoTime()
   var c1 = ngauss_rdd2.count
   var t1 = System.nanoTime()
   var e1 = (t1 - t0)/1e9 // seconds
   // An operation with repartition
   var ngauss_rdd3 = ngauss_rdd2.repartition(1)
   t0 = System.nanoTime()
   var c2 = ngauss_rdd3.count
   t1 = System.nanoTime()
   var e2 = (t1 - t0)/1e9
   benchMat(i,::) := DenseVector[Double](Ns(i), e1, e2).t
}

/* Record the benchmark results */
csvwrite(new File("bench.csv"), benchMat)

Plotting code

# 
#  Plot the benchmark from Spark Gaussian Random numbers
# 
#  (c) 2017
#  GPLv3
#
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
bench_df           <- read.csv2("bench.csv", header=FALSE, sep=",")
colnames(bench_df) <- c("N", "no", "yes")
bench_df2          <- reshape2::melt(bench_df,
                                     measure.vars=c("no","yes"))
colnames(bench_df2) <- c("N", "repartition", "time")
bench_df2$N    <-  as.numeric(as.vector(bench_df2$N))
bench_df2$time <-  as.numeric(as.vector(bench_df2$time))
gt <-  theme(
             panel.background = element_blank(), 
             axis.text.x      = element_text(face="bold", color="#000000", size=11),
             axis.text.y      = element_text(face="bold", color="#000000", size=11),
             axis.title.x     = element_text(face="bold", color="#000000", size=11),
             axis.title.y     = element_text(face="bold", color="#000000", size=11)
            )  
p1                 <- ggplot(bench_df2,
    aes(x=N, y=time, colour=repartition)) + 
               geom_smooth(formula=y~x, span=0.3) + xlab("Number of random draws") + ylab("Wall Clock (Seconds)") +
                             ggtitle("Effect of repartition in count: Gaussian Random Numbers") + 
                             gt
grid.newpage()
footnote <- "(c) 2017, Mehmet Suzen : http://memosisland.blogspot.de/"
g <- arrangeGrob(p1, bottom = textGrob(footnote, x = 0, hjust = -0.1, vjust=0.1, gp = gpar(fontface = "italic", fontsize = 12)))

png(file="spark_repartition_random.png")
grid.draw(g)
dev.off()

Saturday, 7 January 2017

Practical Kullback-Leibler (KL) Divergence: Discrete Case

KL divergence (KullbackLeibler57), or KL distance, is a non-symmetric measure of the difference between two probability distributions. It is related to mutual information and can be used to measure the association between two random variables.
Figure: Distance between two distributions. (Wikipedia)

In this short tutorial, I show how to compute KL divergence and mutual information for two categorical variables, interpreted as discrete random variables.

${\bf Definition}$: Kullback-Leibler (KL) Distance on Discrete Distributions

Given two discrete probability distributions ${\it P}(A)$ and ${\it Q}(B)$ over discrete random variates $A$ and $B$, having realisations $A=a_{j}$ and $B=b_{j}$ over $n$ singletons $j=1,...,n$, the KL divergence or distance $D_{KL}$ between $P$ and $Q$ is defined as follows:

$D_{KL} = D_{KL}\big({\it P}(A) || {\it Q}(B)\big)=\sum_{j=1}^{n} {\it P}(A=a_{j}) \log \Big(  \cfrac{{\it P}(A=a_{j})}{{\it Q}(B=b_{j})} \Big)$

$\log$ is in base $e$.
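As a quick worked illustration on two hypothetical distributions over the same two singletons, say $P=(0.5, 0.5)$ and $Q=(0.9, 0.1)$:

$D_{KL}\big({\it P} || {\it Q}\big) = 0.5 \log \Big( \cfrac{0.5}{0.9} \Big) + 0.5 \log \Big( \cfrac{0.5}{0.1} \Big) \approx -0.294 + 0.805 \approx 0.511$

Reversing the roles of $P$ and $Q$ gives $\approx 0.368$, illustrating that $D_{KL}$ is not symmetric and hence not a true metric.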

${\bf Definition}$: Mutual Information on Discrete Random Variates

Two discrete random variables $X$ and $Y$ are given, having realisations $X=x_{k}$ and $Y=y_{l}$ over $n$ and $m$ singletons, $k=1,...,n$ and $l=1,...,m$ respectively. Mutual information, $I(X;Y)$, is defined as follows,

$I(X;Y) = \sum_{k=1}^{n} \sum_{l=1}^{m} {\it R}(X=x_{k}, Y=y_{l}) \log \Big(\cfrac{  {\it R}(X=x_{k}, Y=y_{l})  }{  {\it R}(X=x_{k})  {\it R}(Y=y_{l})} \Big)$

$\log$ is in base $e$ and ${\it R}$ denotes the joint and marginal probability distributions.

${\bf Definition}$: Mutual Information on Discrete Random Variates with KL distance

Given two discrete random variables $X$ and $Y$ as above, the mutual information $I(X;Y)$ can equivalently be written as the KL distance between the joint distribution ${\it R}(X, Y)$ and the product of the marginals ${\it R}(X) {\it R}(Y)$,

$I(X;Y) = D_{KL} \Big( {\it R}(X, Y) || {\it R}(X) {\it R}(Y) \Big)$

${\bf Problem: Example}$: Given two discrete random variables $X$ and $Y$ defined on the sample spaces $(a,b,c)$ and $(d,e)$ respectively, write down the expression for the mutual information $I(X;Y)$, expanding the summation.

${\bf Solution: Example}$:

$I(X;Y)  = {\it R}(X=a, Y=d) \log \Big(\cfrac{ {\it R}(X=a, Y=d) }{  {\it R}(X=a)  {\it R}(Y=d)} \Big) +$

${\it R}(X=b, Y=d) \log \Big(\cfrac{  {\it R}(X=b, Y=d)  }{  {\it R}(X=b)  {\it R}(Y=d)} \Big)+$

${\it R}(X=c, Y=d) \log \Big(\cfrac{  {\it R}(X=c, Y=d)  }{  {\it R}(X=c)  {\it R}(Y=d)} \Big)+$

${\it R}(X=a, Y=e) \log \Big(\cfrac{  {\it R}(X=a, Y=e)  }{  {\it R}(X=a)  {\it R}(Y=e)} \Big)+$

${\it R}(X=b, Y=e) \log \Big(\cfrac{  {\it R}(X=b, Y=e)  }{  {\it R}(X=b)  {\it R}(Y=e)} \Big)+$

${\it R}(X=c, Y=e) \log \Big(\cfrac{  {\it R}(X=c, Y=e)  }{  {\it R}(X=c)  {\it R}(Y=e)} \Big)$


${\bf Definition}$: Two discrete random variables $X$ and $Y$ are given, having realisations $X=x_{k}$ and $Y=y_{l}$ over $n$ and $m$ singletons, $k=1,...,n$ and $l=1,...,m$ respectively. A two-way contingency table of the realisations of $X$ and $Y$ over the same $N$ measured data points can be constructed by counting co-occurrences, i.e., by cross-classifying the factors. Normalising the resulting frequency table produces the joint and marginal probabilities.

                Y=y_{1}              ...  Y=y_{l}              R(X)

    X=x_{1}    R(X=x_{1}, Y=y_{1})   ...  R(X=x_{1}, Y=y_{l})  R(X=x_{1})
    X=x_{2}    R(X=x_{2}, Y=y_{1})   ...  R(X=x_{2}, Y=y_{l})  R(X=x_{2})
    ...             ...              ...        ...               ...
    X=x_{k}    R(X=x_{k}, Y=y_{1})   ...  R(X=x_{k}, Y=y_{l})  R(X=x_{k})

    R(Y)       R(Y=y_{1})            ...  R(Y=y_{l})
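In symbols, if $n_{kl}$ denotes the co-occurrence count in cell $(k,l)$ and $N=\sum_{k,l} n_{kl}$ is the total number of observations, the entries of the table above are

${\it R}(X=x_{k}, Y=y_{l}) = \cfrac{n_{kl}}{N}, \quad {\it R}(X=x_{k}) = \sum_{l=1}^{m} \cfrac{n_{kl}}{N}, \quad {\it R}(Y=y_{l}) = \sum_{k=1}^{n} \cfrac{n_{kl}}{N}$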


Simulated example with R


set.seed(3951824)
# Table of counts (contingency)
X <- sample(letters[1:3], 100, 1)
Y <- sample(letters[4:5], 100, 1)
t  <- table(X,Y)
tp  <- t/100 # proportions
tn  <- tp/sum(tp)     # normalized, joints
p_x <- rowSums(tn)    # marginals
p_y <- colSums(tn)

P <- tn 
Q <- p_x %o% p_y 

# P(X, Y)   : bin frequency: P_i
# P(X) P(Y) : bin frequency, Q_i 
mi <- sum(P*log(P/Q)) # KL divergence of joint vs product of marginals (nats)
library(entropy)
mi.empirical(t) == mi # cross-check against the entropy package

References

(KullbackLeibler57) Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. 


(c) Copyright 2008-2017 Mehmet Suzen (suzen at acm dot org)

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License.