Minggu, 26 Juli 2015

PyMC Tutorial #1: Bayesian Parameter Estimation for Bernoulli Distribution

Suppose we have a Coin which consists of two sides, namely Head (H) and Tail (T). All of you might know that we can model a toss of a Coin using Bernoulli distribution, which takes the value of \(1\) (if H appears) with probability \(\theta\) and \(0\) (if T appears) with probability \(1 - \theta\). In this case, \(\theta\) is also called as the parameter of a Bernoulli distribution since knowing the value of \(\theta\) is sufficient for determining \(P(H)\) and \(P(T)\). For a fair Coin, \(\theta\) is set to \(0.5\), which means that we have equal degree of belief for both sides.

This time, we aim at estimating the parameter \(\theta\) of a particular Coin. To do that, first, we need to collect the data sample, which serves as our evidence, from an experiment. Second, we use that data to estimate the parameter \(\theta\). Suppose, to collect the data, we toss the Coin 10 times and record the outcomes. We get a sequence of \(\{H, H, T, H, ..., T\}\) which consists of 10 elements, in which each element represents the outcome a single coin tossing. By assuming that the previous data sample is independent and identically distributed (often referred to as i.i.d), we then perform statistical computation to determine the estimate of \(\theta\).

There are two broad categories of estimating the parameter of a known probability distribution. The first one is so called Maximum Likelihood Estimation (MLE) and the second one is Bayesian parameter estimation. We will examine both methods briefly in this post. In the end, we will focus on Bayesian parameter estimation and show the usage of PyMC (Python library for MCMC framework) to estimate the parameter of a Bernoulli distribution.

Maximum Likelihood Estimation (MLE)

Please do not be afraid when you hear the name of this method ! Eventhough the name of this method is somewhat “long-and-complicated”, but the opposite situation actually happens. MLE often involves basic counting of events on our data. As an example, MLE estimates the paramater θ of the Coin using the following, “surprisingly simple”, statistic

\[\hat{\theta} = \frac{\# Heads}{\# Heads + \# Tails}\]

Because of that, people usually refer MLE as a “Frequentist approach”. In general, MLE aims at seeking a set of parameters which maximizes the likelihood of seeing our data.

\[\hat{\theta} = \substack{argmax \\ \theta} P(x_1, x_2, ..., x_n|\theta)\]

Now, let us try to implement MLE for estimating the parameter of a Bernoulli distribution (using Python programming language). We simulate the experiment of tossing a Coin N times using a list of integer values, in which 1 and 0 represents Head and Tail, respectively. Each value is generated randomly from a Bernoulli distribution.

\[P(H) = P(1) = \theta\]
\[P(T) = P(0) = 1 - \theta\]

We use Bernoulli-like distribution provided by Scipy library. So, we need to import this library as the first step.

 from scipy.stats import bernoulli  

Next, we generate a sample data using the following code.

 sampleSize = 20  
 theta = 0.2  
 def generateSample(t, s):  
  return bernoulli.rvs(t, size=s)  
 data = generateSample(theta, sampleSize)  

The preceding code will assign “data” with the following value


We can see that assigning theta to 0.2 makes the number of 0’s much more than the number of 1’s, which means that Tail has higher probability to occur compared to Head.

Now, we pretend that we do not know the parameter \(\theta\) and we only know the data. Given that data, we are going to estimate the value of \(\theta\), which is unknown to us. We use MLE, which means that we need to implement the aforementioned statistic

 def thetaMLE(data):  
   count = 0  
   for i in data:  
   return count/float(len(data))  

Now, let see several estimates when we use different sample size.

 def showSeveralEstimates(sampleSizes):  
  for size in sampleSizes:  
    estimate = thetaMLE(generateSample(0.2, size))  
    print("using sample with size %i : theta = %f" % (size,estimate))  

The preceding code yields the following results (the results may differ each time you run this program since it involves random sampling):

 using sample with size 10 : theta = 0.3  
 using sample with size 100 : theta = 0.23  
 using sample with size 1000 : theta = 0.194  
 using sample with size 2000 : theta = 0.1965  
 using sample with size 5000 : theta = 0.1982  
 using sample with size 10000 : theta = 0.2006  

Look ! we can see that as the size of data increase, the estimate is getting closer to the real value of \(\theta\), that is \(0.2\). This concludes that if you want to obtain better estimate, you need to increase your data.

Bayesian Parameter Estimation

Although MLE is often easy to prepare as well as to compute, it has several limitations. One of them is that MLE can not leverage prior information or knowledge regarding the parameter itself. For example, based on our experience, we are really certain that a Coin is fair. Unfortunately, when we try to estimate the parameter using MLE, we cannot incorporate such knowledge to our computation. On the other hand, Bayesian Parameter Estimation takes into account prior knowledge regarding the parameter, which makes Bayesian Parameter Estimation provides more realistic and accurate estimation [1][2]. Sometimes we have prior belief about something before the observance of the data or evidence. But, once we finally see the evidence or data about that, we may change our belief [1][2].

Instead of directly estimating \(P(data|parameter)\), Bayesian Parameter Estimation estimates \(P(parameter|data)\).  Here, prior information about parameter \(\theta\) is encoded as a probability distribution \(P(\theta)\), which means that we consider \(\theta\) as a value of a random variable. When we quantify uncertainty about \(\theta\), it will be easy for us to encode our prior belief. After we observe our data, we then change our prior belief towards \(\theta\), into our posterior belief, denoted as \(P(\theta|X)\).

\[P(\theta|x_1, x_2, ..., x_n) \propto P(\theta) P(x_1, x_2, ..., x_n|\theta) \]

In the preceding formula, \(P(\theta)\) is the prior distribution of \(\theta\). \(P(X|\theta)\) is the likelihood of our observed data. The likelihood represents how likely that we will see the observed data when we know already the parameter \(\theta\). \(P(\theta |X)\) is the posterior distribution that represents the belief about \(\theta\) after taking both the data and prior knowledge into account (after we see our data).

We usually use the expected value to give the best estimate of \(\theta\). In other words, given the data \(X\), the estimate of \(\theta\) is obtained by calculating \(E[\theta |X]\). \(E[\theta |X]\) is usually called as the Bayes estimator.

\[\hat{\theta} = E[\theta |x_1, x_2, ..., x_n] \]

Hierarchical Bayesian Model

The prior distribution \(P(\theta)\) may be estimated using the so called hyperprior distributions. This kind of model is known as Hierarchical Bayesian Model. Furthermore, we can also estimate the hyperprior distribution itself, using hyper-hyperprior distribution, and so on. The reason behind using hyperprior distribution is that, instead we use directly the distribution \(\theta\), which may be available (from previous experiment), why don’t we let the “present data tell us about \(\theta\) by themselves” [2].

Let us see the previous example, in which we try to estimate the parameter of Bernoulli parameter \(\theta\), given the data collected by conducting several tosses of a Coin \(\{H, T, H, H, H, T,  ..., T\}.\) Suppose \(x_i\) represents the value of a single Coin toss.

\[x_i \sim Ber(\theta)\]

Now, we can model the parameter \(\theta\) using Beta distribution. In other words, \(\theta\) is a random variable that follows Beta distribution with parameter \(\alpha\) and \(\beta\). \(\alpha\) and \(\beta\) are called hyper-parameters. We use Beta distribution since it is the prior conjugate of a Bernoulli distribution. We will not elaborate more on the notion of conjugacy in this post. However, there are several mathematical reasons behind the use of conjugate prior distributions. One of them is that conjugate prior distribution will make our computation easier.

\[\theta \sim Beta(\alpha, \beta)\]

The posterior distribution of \(\theta\) can be then denoted as follows

\[P(\theta |x_1, ..., x_n, \alpha, \beta) \propto P(\theta |\alpha, \beta) P(x_1, ..., x_n |\theta, \alpha, \beta)\]

We can also represent the preceding model using the well-known plate notation as follows

Where \(N\) represents the number of tosses that we perform (the size of sample data). We get back to our main goal: estimating \(\theta\) (the posterior distribution of \(\theta\)) using Bayesian parameter estimation. We have just learnt that the estimation task involves computing the expectation value of \(\theta\) (\(E[\theta |X]\)), which means that we might need to perform a number of integrations. Unfortunately, in some cases, performing integrations will not be feasible, or at least it will be difficult to achieve a specified accuracy. Thus, we need to think of any approximation methods to back up our plan.

There are many types of numerical approximations for Bayesian parameter estimation. One of them (the most common) is Markov Chain Monte Carlo (MCMC). MCMC estimates the posterior distribution of \(\theta\) by performing a number of iterations or sampling. In each iteration, we improve the quality of our target distribution by leveraging the sampled data, and hoping that it will eventually arrive at the “true” posterior distribution of \(\theta\).

PyMC: A Python Library for MCMC Framework

Now, we are ready to play with the programming problem. Python has a library that provides MCMC framework for our problem. This library is called PyMC. You can go directly to its official website, if you want to know more about it.

First, let’s import several libraries that we need, including PyMC and pymc.Matplot for drawing histogram.

 import pymc as pc  
 import pymc.Matplot as pt  
 import numpy as np  
 from scipy.stats import bernoulli  

Next, we need to create our model.

 def model(data):  
   theta_prior = pc.Beta('theta_prior', alpha=1.0, beta=1.0)  
   coin = pc.Bernoulli('coin', p=theta_prior, value=data, observed=True)  
   mod = pc.Model([theta_prior, coin])  
   return mod  

In the preceding code, we represent \(\theta\) as “theta_prior”, which follows Beta distribution with parameter \(\alpha\) and \(\beta\). Here, we set both \(\alpha\) and \(\beta\) with 1.0. “coin” represents a sequence of coin tosses (NOT a single toss), in which each toss follows Bernoulli distribution (This corresponds to \(X\) in the preceding plate notation). We set “observed=True” since this is our observed data. “p=theta_prior” means that the parameter of “coin” is “theta_prior”. Here, our goal is to estimate the expected value of “theta_prior”, which is unknown. MCMC will perform several iterations to generate the sample from “theta_prior”, in which each iteration will improve the quality of the sample. Finally, we wrap all of our random variables using class Model.

Like the previous one, we need a modul that can generate our toy-sample:

 def generateSample(t, s):  
   return bernoulli.rvs(t, size=s)  

Suppose, we have already generated a sample, and pretend that we do not know the parameter of the distribution where it comes from. We then use the generated sample to estimate \(\theta\).

 def mcmcTraces(data):  
   mod = model(data)  
   mc = pc.MCMC(mod)  
   mc.sample(iter=5000, burn=1000)  
   return mc.trace('theta_prior')[:]  

The preceding procedure/function will produce traces, or MCMC samples generated by a number of interations. Based on that code, MCMC will iterate 5000 times. “burn” specifies the minimum iteration that we need before we are sure that we have achieved the “true” posterior distribution of \(\theta\). The function yields the traces of MCMC (except the sample generated before the burn-in period).

Now, let’s perform the MCMC run on our model, and plot the posterior distribution of \(\theta\) on a histogram.

 sample = generateSample(0.7, 100)  
 trs = mcmcTraces(sample)  
 pt.histogram(trs, "theta prior; size=100", datarange=(0.2,0.9))  

Suppose the data is generated from a Bernoulli distribution with parameter \(\theta = 0.7\) (size = 100). If we draw the traces of \(\theta\) using Histogram, we will get the following figure.

We can see that the distribution is centered in the area 0.65 – 0.80. We are most likely happy with this result (since the prediction somehow close to 0.70), yet the variance is still very high. Now, let see what will happened when we increase the size of our observable data !

The following histogram was generated when we set size to 500:

The following histogram was generated when we set size to 5000:

See ! when we increase the size of our data, then the variance of the distribution gets lower. Thus, we are more confident about our prediction !

If we need the estimate value of \(\theta\), we can use the expected value (mean) of that distribution. We can use numpy library to get the mean of our sample.

 #estimated theta  
 est_theta = np.mean(trs)  

Main References:
[1] Building Probabilistic Graphical Models with Python, Kiran R. Karkera, PACKT Publishing 2014
[2] Bayesian Inference, Byron Hall (STATISTICAT, LLC)

Alfan Farizki Wicaksono
(firstname [at] cs [dot] ui [dot] ac [dot] id)
Fakultas Ilmu Komputer, UI
Ditulis di Tambun, Bekasi, 26 Juli 2015

Tidak ada komentar:

Posting Komentar