Bayesian Statistics, MLE/MAP and Markov Chain Monte Carlo

Content

1. Bayesian Statistics

2. Maximum Likelihood/a Posteriori Estimation

3. Markov Chain Monte Carlo for Bayesian Inference

1. Bayesian Statistics

1.1 Frequentist and Bayesian

There are two very different approaches to statistical inference for probabilistic systems - frequentist and Bayesian - with very different philosophies [QuantStart Team, 1].

Frequentist statistics tries to eliminate uncertainty by providing point estimates, and interprets probabilities as the long-run frequencies of random events in repeated trials. Bayesian statistics tries to preserve and refine uncertainty by adjusting individual beliefs in light of new evidence.

Thus in the Bayesian interpretation a probability is a summary of an individual's opinion. A key point is that different (intelligent) individuals can have different opinions (and thus different prior beliefs), since they have differing access to data and ways of interpreting it. In this respect the philosophy resembles a random forest, which does not give one definitive value but a collection of values (a comparison drawn in Nando de Freitas' MCMC lecture [Nando de Freitas]). However, as these individuals come across new data that they both have access to, their (potentially differing) prior beliefs will lead to posterior beliefs that begin converging towards each other, under the rational updating procedure of Bayesian inference.

In order to carry out Bayesian inference, we need to utilise a famous theorem in probability known as Bayes' rule and interpret it in the correct fashion.

1.2 Bayes' Rule for Bayesian Inference

An example: Coin flips

A natural example question to ask is "What is the probability of seeing 3 heads in 8 flips (8 Bernoulli trials), given a fair coin?".

The coin flip obeys a binomial distribution, which describes a collection of Bernoulli trials. A Bernoulli trial is a random experiment with only two outcomes, usually labelled "success" or "failure", in which the probability of success is given by $\theta$, i.e. $P(\textrm{Head}) = \theta$, where $0 \le \theta \le 1$, for each trial. A fair coin has $\theta = 0.5$; any other value means the coin is unfair. The value of $\theta$ itself follows a distribution (e.g. a Gaussian), $P(\theta)$, called a prior.

Through coin flip experiments (by repeated Bernoulli trials) we will generate some data, $D$, about heads or tails. We are interested in the probability distribution which reflects our belief about different possible values of $\theta$, given that we have observed some data $D$, denoted by $P(\theta|D)$. The probability of seeing data $D$ given $\theta$ is $P(D|\theta)$.

Bayes' rule

The way to calculate $P(\theta|D)$, called posterior, is

$$P(\theta |D) = \frac{P(D| \theta)P(\theta)}{P(D)} = \frac{\textrm{likelihood} \times \textrm{prior}}{\textrm{evidence}}$$

where:

  • $P(\theta)$ is the prior: This is the strength of our belief in $\theta$ without considering the evidence $D$ - our prior view on how fair the coin is.

  • $P(\theta|D)$ is the posterior: This is the (refined) strength of our belief in $\theta$ once the evidence $D$ has been taken into account. After seeing 4 heads out of 8 flips, say, this is our updated view on the fairness of the coin.

  • $P(D|\theta)$ is the likelihood: This is the probability of seeing the data $D$ as generated by a model with parameter $\theta$. If we knew the coin was fair, this tells us the probability of seeing a number of heads in a particular number of flips.

  • $P(D)$ is the evidence: This is the probability of the data as determined by summing (or integrating) across all possible values of $\theta$, weighted by how strongly we believe in those particular values of $\theta$, i.e.

Note that since $\theta$ is continuous, we use an integral rather than a summation:

$$P(D) = \int_{\theta} P(D|\theta)P(\theta)d\theta$$

If we had multiple views of what the fairness of the coin is (but didn't know for sure), then this tells us the probability of seeing a certain sequence of flips for all possibilities of our belief in the coin's fairness.
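To make the evidence integral concrete, here is a small numerical check (a sketch using `scipy.integrate.quad`; the sequence length and the Beta(2, 2) prior are illustrative assumptions). For a specific sequence with $k$ heads in $n$ flips, $P(D) = \int_0^1 \theta^k (1-\theta)^{n-k} P(\theta)\, d\theta$:

```python
from scipy import stats
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def evidence(k, n, prior_pdf):
    """P(D) for a specific sequence with k heads in n flips,
    marginalised over theta under the given prior density."""
    integrand = lambda t: t**k * (1 - t)**(n - k) * prior_pdf(t)
    value, _ = quad(integrand, 0.0, 1.0)
    return value

# Uniform prior: the evidence has the closed form B(k+1, n-k+1)
p_d = evidence(3, 8, lambda t: 1.0)
print(p_d, beta_fn(4, 6))  # the two values agree

# A different prior changes the evidence -- here Beta(2, 2), an illustrative choice
print(evidence(3, 8, stats.beta(2, 2).pdf))
```

The same integral is what appears in the denominator of Bayes' rule; for conjugate priors it has a closed form, which is exactly what the Beta prior exploits below.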

Coin-flipping example I: A fair coin [QuantStart Team, 1]

In this example we are going to consider multiple flips of a coin with unknown fairness. Suppose we have data $D$ generated by 1000 Bernoulli trials with $\theta = 0.5$ (but keep in mind we don't actually know that $\theta = 0.5$):

```python
from scipy import stats

data = stats.bernoulli.rvs(0.5, size=1000)
```

Our goal is to learn the posterior density from the data. We use a Beta distribution as the prior. The Beta prior is called a conjugate prior: the posterior distribution $P(\theta|D)$ is then in the same probability distribution family as the prior $P(\theta)$, and the prior and posterior are called conjugate distributions.

Once we know the number of heads $N_H$ and tails $N_T$ within the number of trials, the posterior $P(\theta|D)$ under a uniform prior can be modelled as a Beta distribution

$$P(\theta|D) = \textrm{Beta}(\theta;\, N_H+1,\, N_T+1) \propto \theta^{N_H} (1-\theta)^{N_T}$$

Using SciPy, we can evaluate this posterior density as

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 1, 100)
y = stats.beta.pdf(x, 1 + heads, 1 + N - heads)
```

(here `heads` and `N` are the number of heads and the number of trials so far).

Before any trials (0 trials and 0 heads), the posterior density is a uniform distribution. We can interpret this as follows: since the fairness is unknown, the uniform distribution states that every level of fairness (every value of $\theta$) is equally likely.

[Figure 1: posterior densities of $\theta$ as the number of trials increases]

The next panel shows 2 trials and 1 head. With this data the posterior density is symmetric, but $\theta = P(H) = 0.2$ and $\theta = 0.8$ still have densities comparable to that at $\theta = 0.5$: the uncertainty in $\theta$ is wide. As more and more trials accumulate, the posterior density develops a prominent peak at $\theta = 0.5$, and the uncertainty shrinks.
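The sequential updating just described can be reproduced with a short, self-contained script (a sketch; the particular trial counts chosen for the panels and the random seed are assumptions):

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = stats.bernoulli.rvs(0.5, size=1000)  # 1000 flips of a fair coin

x = np.linspace(0, 1, 1000)
for n in [0, 2, 10, 50, 500, 1000]:
    heads = data[:n].sum()
    # Posterior under a uniform Beta(1, 1) prior: Beta(heads+1, n-heads+1)
    y = stats.beta.pdf(x, 1 + heads, 1 + n - heads)
    peak = x[np.argmax(y)]
    print(f"n={n:5d}  heads={heads:4d}  posterior peak at theta={peak:.3f}")
```

As `n` grows, the printed peak settles near the true value $\theta = 0.5$.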

The code here is borrowed from the blog [QuantStart Team, 1].

Coin-flipping example II: An unfair coin

Now, what will we see if the data implicitly hints that the coin is unfair? Suppose we have data generated from 1000 Bernoulli trials with $\theta = 0.8$:

```python
data = stats.bernoulli.rvs(0.8, size=1000)
```

By the same procedure, we can see that the learned posterior peaks near $\theta = 0.8$:

[Figure 2: posterior densities for the unfair coin, peaking near $\theta = 0.8$]

2. Maximum Likelihood/a Posteriori Estimation

In the Bayesian framework, Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are both methods for estimating some variable in the setting of probability distributions or graphical models [Agustinus Kristiadi].

In short, an MLE model is mathematically determined by

$$\theta_{\textrm{MLE}}= \underset{\theta}{\textrm{argmax}}P(D| \theta)$$

whereas for MAP, a model is determined by

$$\theta_{\textrm{MAP}}= \underset{\theta}{\textrm{argmax}}P( \theta |D)$$

2.1 Maximum Likelihood Estimation (MLE)

We actually use MLE in everyday practice without realising it. For example, when fitting a Gaussian to a dataset, we immediately take the sample mean and sample variance and use them as the parameters of the Gaussian. This is MLE: if we take the derivatives of the Gaussian log-likelihood with respect to the mean and variance and set them to zero, we recover exactly these estimates [Agustinus Kristiadi]. This step maximizes the likelihood. Another example is the Naive Bayes spam filter. We can compute the likelihood of a specific word appearing in spam; then, given an email, the probability of the email being spam is the naive multiplication of the individual likelihoods of the words appearing in that email.
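As a quick numerical check of the Gaussian claim above (a sketch; the synthetic dataset and the choice to optimise over $\log \sigma$ for numerical stability are assumptions), maximizing the Gaussian likelihood recovers exactly the sample mean and the biased sample variance:

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
data = np.random.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # optimise log(sigma) so sigma stays positive
    # -log L up to an additive constant
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + len(data) * log_sigma

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_mle, sigma_mle = res.x[0], np.exp(res.x[1])

print(mu_mle, data.mean())        # the two agree
print(sigma_mle**2, data.var())   # MLE variance is the biased (1/n) variance
```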

Most optimization in machine learning and deep learning (neural nets, etc.) can be interpreted as MLE. Given model parameters $\theta$, the probability of obtaining the data or dataset $D$ is the likelihood $P(D|\theta)$, and the parameters are determined by maximizing the likelihood function. Written out, MLE is

$$\theta_{\textrm{MLE}}= \underset{\theta}{\textrm{argmax}}P(D| \theta) = \underset{\theta}{\textrm{argmax}}\prod^{n}_{i=1}P(x_i, y_i|\theta)$$

which gives us a point estimate of $\theta$. On the other hand, MAP is

$$\theta_{\textrm{MAP}} = \underset{\theta}{\textrm{argmax}}P(\theta |D) = \underset{\theta}{\textrm{argmax}} P(\theta) \prod^{n}_{i=1}P(x_i, y_i|\theta)$$

which gives us the mode of the posterior distribution of $\theta$.

MLE Example: Linear regression

To provide a more concrete example: in linear regression, we assume the likelihood is a normal distribution (both $\theta$ and $x$ are multi-dimensional)

$$P(\symbf{x}_i, y_i|\theta) \sim \exp{ \Big( \frac{-(y_i-\symbf{\theta}^T \symbf{x}_i)^2}{2\sigma^2}} \Big)$$

then the likelihood is

$$L(\symbf{\theta}) = P(\symbf{X}|\theta) = \prod^n_{i=1} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\Big( \frac{-(y_i -\symbf{\theta}^T \symbf{x}_i)^2}{2\sigma^2}} \Big)$$

Note that maximizing the likelihood is equivalent to maximizing its logarithm. Therefore, we usually write MLE as

$$\theta_{\textrm{MLE}} = \underset{\theta}{\textrm{argmax}}\log \Big( \prod_{i}P(\symbf{x}_i, y_i| \theta) \Big) = \underset{\theta}{\textrm{argmax}} \sum_i \log P(\symbf{x}_i, y_i| \theta) = \underset{\theta}{\textrm{argmin}} \sum_i \Big[ -\log P(\symbf{x}_i, y_i| \theta) \Big]$$

Thus the problem is equivalent to minimizing the cost functions

$$C(D) = \sum^n_{i=1} (y_i - \symbf{\theta}^T \symbf{x}_i)^2$$
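This equivalence can be verified numerically (a sketch with a synthetic dataset; the feature matrix, true parameters and noise level are illustrative assumptions): the closed-form least-squares solution and a direct numerical minimization of the squared-error cost coincide.

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(1)
n = 200
X = np.column_stack([np.ones(n), np.random.randn(n)])  # bias column + one feature
true_theta = np.array([1.0, -2.0])
y = X @ true_theta + 0.5 * np.random.randn(n)

# Closed-form least-squares solution via the normal equations
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically maximising the Gaussian likelihood = minimising the squared error
nll = lambda theta: np.sum((y - X @ theta) ** 2)
theta_mle = minimize(nll, x0=np.zeros(2)).x

print(theta_ls, theta_mle)  # the two estimates coincide
```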

MLE Example: Logistic regression

In logistic regression (binary classification), on the other hand, the likelihood is a Bernoulli distribution. Define the hypothesis function $h_{\theta}$ as

$$ h_{\theta} = p(y=1 \vert \theta, \symbf{x}_i) = \frac{1}{1+e^{-\symbf{\theta}^T \symbf{x}_i}}, $$

which is the probability of $y_i = 1$ given $\symbf{x}_i$ and $\theta$. A Bernoulli variable with success probability $p$ has probability mass

$$ P(y) = p^{y} (1 - p)^{1-y} .$$

Substituting $h_{\theta}$ for $p$, the likelihood of each sample is

$$ P( \symbf{x}_i, y_i \vert \theta) = ( h_{\theta} )^{y_i} (1-h_{\theta})^{1-y_i} = \Big( \frac{1}{1+e^{-\symbf{\theta}^T \symbf{x}_i}}\Big)^{y_i}\Big( 1- \frac{1}{1+e^{-\symbf{\theta}^T \symbf{x}_i}} \Big)^{1-y_i}.$$

Then, similarly, maximizing the log-likelihood is converted into minimizing the cost function

$$C(D) = -\sum^n_{i=1} \left[ y_i \log \big( \frac{1}{1+e^{-\symbf{\theta}^T \symbf{x}_i}}\big) + (1-y_i) \log \big( 1- \frac{1}{1+e^{-\symbf{\theta}^T \symbf{x}_i}} \big) \right]$$
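A minimal numerical sketch of this MLE (the synthetic data and the "true" parameters $(0, 3)$ are illustrative assumptions): minimizing the cross-entropy with `scipy.optimize.minimize` recovers a slope of the right sign and magnitude. The objective uses the identity $-\log L = \sum_i \big[\log(1+e^{z_i}) - y_i z_i\big]$ with $z_i = \theta^T \symbf{x}_i$, written via `np.logaddexp` for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

np.random.seed(2)
n = 500
x = np.random.randn(n)
X = np.column_stack([np.ones(n), x])
# Labels drawn from a "true" model theta = (0, 3) -- an illustrative assumption
p = expit(X @ np.array([0.0, 3.0]))
y = (np.random.rand(n) < p).astype(float)

def cross_entropy(theta):
    z = X @ theta
    # negative Bernoulli log-likelihood: sum of log(1 + e^z) - y*z
    return np.sum(np.logaddexp(0, z) - y * z)

theta_hat = minimize(cross_entropy, x0=np.zeros(2)).x
print(theta_hat)  # slope positive, intercept near zero
```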

2.2 Maximum A Posteriori (MAP)

MAP, on the other hand, usually comes up in a Bayesian setting. It works on the posterior distribution $P(\theta|D)$, not only the likelihood $P(D|\theta)$:

$$\theta_{\textrm{MAP}}= \underset{\theta}{\textrm{argmax}}P( \theta |D) = \underset{\theta}{\textrm{argmax}}\frac{P(D|\theta)P(\theta)}{P(D)} \propto \underset{\theta}{\textrm{argmax}}P(D|\theta)P(\theta)$$

Using the log trick, we can rewrite the above expression as

$$\theta_{\textrm{MAP}} = \underset{\theta}{\textrm{argmax}} \sum^n_{i=1} \Big(\log P(\symbf{x}_i, y_i| \theta) + \log P(\theta) \Big)$$

What this means is that the likelihood is now weighted by the prior [Agustinus Kristiadi]. When the prior is uniform, i.e. we assign equal weight to all possible values of $\theta$, MLE and MAP give the same estimate. When we know something about the parameter distribution, MAP can be more helpful [Mary Mcglohon].

Prior plays a role as regularization

If instead we use a Gaussian distribution for the prior $P(\theta)$ [Brian Keng],

$$P(\theta) \propto \exp{\Big( \frac{-(\theta - \mu_{\theta})^2}{2\sigma^2}} \Big)$$

then $-\log P(\theta)$ can be identified with an L2 (Ridge) regularization term when $\mu_{\theta}=0$ (recall that maximizing the likelihood is equivalent to minimizing the negative log-likelihood):

$$-\log P(\theta) = -\log \Big( \exp{\Big[ \frac{-(\theta - \mu_{\theta})^2}{2\sigma^2}} \Big] \Big) = \frac{\theta^2}{2\sigma^2}$$

which is commonly seen in regression [Nando de Freitas]; $1/\sigma^2$ corresponds to the regularization strength.
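This correspondence can be checked numerically (a sketch; the noise and prior scales are illustrative assumptions): the MAP estimate under a zero-mean Gaussian prior equals the closed-form ridge solution with $\lambda = \sigma^2_{\textrm{noise}} / \sigma^2_{\textrm{prior}}$.

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(3)
n, sigma_noise, sigma_prior = 100, 0.5, 1.0   # illustrative values
X = np.random.randn(n, 3)
y = X @ np.array([1.0, 0.0, -1.0]) + sigma_noise * np.random.randn(n)

# MAP objective: Gaussian negative log-likelihood + Gaussian negative log-prior
def neg_log_posterior(theta):
    return (np.sum((y - X @ theta) ** 2) / (2 * sigma_noise**2)
            + np.sum(theta**2) / (2 * sigma_prior**2))

theta_map = minimize(neg_log_posterior, x0=np.zeros(3)).x

# Closed-form ridge solution with lambda = sigma_noise^2 / sigma_prior^2
lam = sigma_noise**2 / sigma_prior**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(theta_map, theta_ridge)  # identical up to optimiser tolerance
```

A stronger prior (smaller `sigma_prior`) increases `lam` and shrinks the estimate harder, mirroring the bias-variance discussion below.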

On the other hand, the L1 (Lasso) estimate is the posterior mode for θ under a double-exponential (Laplace) prior [Stathis Kamperis].

$$P(\theta) \propto \exp{\Big( \frac{-|\theta - \mu_{\theta}|}{\sigma}} \Big)$$

Below (figure credit: the book An Introduction to Statistical Learning): left, a Gaussian prior (for ridge); right, a double-exponential prior (for lasso).

[Figure: Gaussian prior (ridge) vs. double-exponential prior (lasso)]

A drawback of MLE is that it tends to overfit the data: the variance of the parameter estimates is high, or, put another way, the outcome of the parameter estimate is sensitive to random variations in the data (James McInerney, [Quora]). MAP estimation can be regarded as adding regularisation to MLE.

The prior beliefs about the parameters determine what this random process looks like. If the prior beliefs are strong, then the observed data have relatively little impact on the parameter estimates, (i.e., low variance but high bias), while if the prior beliefs are weak, then the outcome is more like standard MLE (i.e., low bias but high variance).

2.3 Examples of MLE and MAP

a. Binomial outcome

First let's come back to the coin-flip problem; the outcome is either heads or tails. The likelihood of each sample is a Bernoulli distribution (and the entire dataset follows a binomial distribution) [Barnabás Póczos & Aarti Singh]

$$P(D|\theta) = \prod_{i}P(d_i|\theta) = \theta^{N_H}(1-\theta)^{N_T}$$

Question: suppose we observed data $D = \{1,1,1,1,1,1\}$ (six heads), where the samples come from an iid Bernoulli distribution. What is a good guess of $\theta$?

MLE

Here we denote by $N_H$ the number of heads, and $\theta$ is the probability of heads. We look for the $\theta$ which maximizes the probability of the observed data:

$$\theta_{\textrm{MLE}} = \underset{\theta}{\textrm{argmax}} P(D|\theta) = \underset{\theta}{\textrm{argmax}} \Big[ \theta^{N_H}(1-\theta)^{N_T} \Big] = \underset{\theta}{\textrm{argmax}} \ L(\theta)$$

To do this, we differentiate $\log L(\theta)$ with respect to $\theta$ and set the derivative to zero:

$$\log L(\theta) = N_H \log(\theta) + N_T \log (1-\theta); \quad \frac{\partial \log L(\theta)}{\partial \theta} = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0$$

Then we obtain the MLE estimate [Cross Validated: How to derive the likelihood function for binomial distribution for parameter estimation?]

$$\theta_{\textrm{MLE}} = \frac{N_H}{N_H+N_T} = \frac{6}{6} = 1$$

which is consistent with our intuition.

MAP

For MAP, we need to look for Bayesian posterior $P(\theta|D)$:

$$\theta_{\textrm{MAP}}= \underset{\theta}{\textrm{argmax}}P( \theta |D) \propto \underset{\theta}{\textrm{argmax}}P(D|\theta)P(\theta)$$

If we use a Beta distribution (the conjugate prior for the binomial distribution) as the prior,

$$P(\theta)= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}\theta^{\alpha-1} (1-\theta)^{\beta-1} = \frac{1}{B(\alpha, \beta)}\theta^{\alpha-1} (1-\theta)^{\beta-1}$$

By probability normalization, we have

$$\int_{\theta} P(\theta)\, d\theta = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}\int_{\theta} \theta^{\alpha-1} (1-\theta)^{\beta-1}\,d\theta = 1, \quad \textrm{i.e. } \int_{\theta} \theta^{\alpha-1} (1-\theta)^{\beta-1}\,d\theta = B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$

Then we can see that the posterior and prior have the same form:

$$P(\theta|D) \propto P(\theta)P(D|\theta) = \Big( \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1} (1-\theta)^{\beta-1} \Big) \theta^{N_H} (1-\theta)^{N_T},$$

which can be simply rewritten as

$$P(\theta|D) = \frac{\Gamma(\alpha^{\prime}+\beta^{\prime})}{\Gamma(\alpha^{\prime})\Gamma(\beta^{\prime})} \theta^{\alpha^{\prime}-1}(1-\theta)^{\beta^{\prime}-1}, \textrm{ where } \alpha^{\prime}=\alpha+N_H, \ \beta^{\prime}=\beta+N_T.$$

Given the data, the posterior expectation of $\theta$ is

$$\bar{\theta} = \int_{\theta} \theta P(\theta|D) d\theta= \frac{\Gamma(\alpha^{\prime} + \beta^{\prime})}{\Gamma(\alpha^{\prime}) \Gamma(\beta^{\prime})}\int_{\theta} \theta^{\alpha^{\prime}} (1-\theta)^{\beta^{\prime}-1}d\theta = \frac{\alpha^{\prime}}{ \alpha^{\prime}+\beta^{\prime}}$$

With $N_H = 6$ and $N_T = 0$, this gives

$$\bar{\theta} = \frac{6+\alpha}{6 + \alpha+\beta}$$

From the above, using different Beta priors:

  1. $\textrm{Beta}(2, 2)$: $\bar{\theta} = 8/10$.
  2. $\textrm{Beta}(1, 1)$: $\bar{\theta} = 7/8$.
  3. $\textrm{Beta}(1, 0.01)$: $\bar{\theta} = 7/7.01$.
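These posterior means can be confirmed with SciPy (a sketch; `stats.beta` is used only to evaluate the conjugate posterior $\textrm{Beta}(\alpha + N_H, \beta + N_T)$ for the data in the text):

```python
from scipy import stats

N_H, N_T = 6, 0  # six heads, no tails, as in the text

for alpha, beta in [(2, 2), (1, 1), (1, 0.01)]:
    posterior = stats.beta(alpha + N_H, beta + N_T)
    print(f"Beta({alpha}, {beta}) prior -> posterior mean = {posterior.mean():.4f}")
```

The stronger the prior pull towards tails (larger $\beta$), the further the posterior mean moves away from the MLE value of 1.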

The blog [Suzanna Sia] has more description on Bayesian inference for binomial distributions with Python code.

b. Multinomial outcome

The second example extends the binomial outcome to a multinomial one. Say we now have a die; then the likelihood function becomes

$$P(D|\theta) =\theta^{N_1}_1 \theta^{N_2}_2 \cdots \theta^{N_6}_6$$

If we use a Dirichlet distribution as the prior,

$$P(\theta) = \frac{\theta^{\beta_1-1}_1 \theta^{\beta_2-1}_2 \cdots \theta^{\beta_6-1}_6}{B(\beta_1, \cdots, \beta_6)} \sim \textrm{Dirichlet}(\beta_1, \cdots, \beta_6),$$

then we have the posterior as Dirichlet distribution

$$P(\theta|D) \sim \textrm{Dirichlet}(N_1+\beta_1, \cdots, N_6+\beta_6),$$

and we can still maximize this expression to estimate $\theta$.
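A short sketch of this conjugate update for a die (the loaded-die probabilities and the symmetric Dirichlet(1, ..., 1) prior are illustrative assumptions): the posterior parameters are simply the prior parameters plus the observed counts.

```python
import numpy as np
from scipy import stats

np.random.seed(4)
true_p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])  # a loaded die -- an assumption
rolls = np.random.choice(6, size=3000, p=true_p)
counts = np.bincount(rolls, minlength=6)

beta_prior = np.ones(6)                 # symmetric Dirichlet(1, ..., 1) prior
posterior_alpha = beta_prior + counts   # conjugate update: just add the counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

# scipy's Dirichlet agrees with the hand-computed posterior mean
print(posterior_mean)
print(stats.dirichlet.mean(posterior_alpha))
```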

3. Markov Chain Monte Carlo for Bayesian Inference

In the coin-flip example above, we inferred a binomial proportion using the concept of conjugate priors. However, not all models admit conjugate priors, and in such cases the posterior distribution must be approximated numerically. Recall

$$P(D) = \int_{\theta} P(D, \theta)d\theta = \int_{\theta} P(D|\theta)P(\theta)d\theta$$

To infer the fairness of a coin flip, we had a single variable and used a conjugate prior, so that the prior and posterior share the same distributional form.

But in reality, both $P(D|\theta)$ and $P(\theta)$ can be high-dimensional, so the integral is difficult to evaluate. This is the situation often described as the Curse of Dimensionality: the volume of a high-dimensional space is so vast that any available data become extremely sparse within it, which leads to problems of statistical significance.

To be more concrete, let's look at Bayesian logistic regression [Nando de Freitas].

3.1 Bayesian logistic regression

Given data $D = (\symbf{X}, y)$, the logistic regression model specifies the probability of a binary output $y_i = 0, 1$ given input $\symbf{X}$:

$$P(y|X, \theta) = \prod_{i=1}^{n} \textrm{Ber}(y_i |\textrm{sigmoid}(\theta^T \symbf{x_i} )) = \prod_{i=1}^n \Big( \frac{1}{1+e^{-\theta^T \symbf{x}_i}}\Big)^{y_i}\Big( 1- \frac{1}{1+e^{-\theta^T \symbf{x}_i}} \Big)^{1-y_i},$$

here $P(y|\theta,X)$ is the likelihood. We can assume the prior is a Gaussian:

$$P(\theta) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp{\Big(-\frac{\theta^T \theta}{2\sigma^2} \Big)}.$$

Our goal is to find the optimal values of $\theta$, given the posterior density

$$P(\theta|\symbf{X},y) = \frac{P(y|\symbf{X}, \theta)P(\theta)}{\int_{\theta} P(y|\symbf{X}, \theta)P(\theta)\,d\theta}.$$

The optimal value of $\theta$ can be found by maximizing the log posterior $\log P(\theta|\symbf{X},y)$:

$$L(\theta, \symbf{X}, y) = \log P(\theta|\symbf{X},y) = \sum^n_{i=1} \Big( y_i \log{\big( \frac{1}{1+e^{-\theta^T \symbf{x}_i} }\big)} + (1-y_i) \log{\big( 1-\frac{1}{1+e^{-\theta^T \symbf{x}_i} }\big)} \Big) - \frac{\theta^T \theta}{2\sigma^2} + \textrm{const},$$

which is equivalent to minimizing the L2-regularized cross-entropy loss function with gradient descent in logistic regression.

On the other hand, we can imagine that the normalizing integral becomes very difficult to solve: with 100 features, there are 100 variables in the integral! In the Bayesian framework we are also interested in evaluating expectation values such as $\mathbb{E}[f(\theta)] = \int_{\theta} f(\theta) P(\theta|D)\, d\theta$, which involve the same kind of high-dimensional integral.

3.2 Markov Chain Monte Carlo Algorithms

As the example above shows, performing high-dimensional integrals is difficult. The motivation behind Markov Chain Monte Carlo (MCMC) methods is that they perform an intelligent search of the parameter space of $\theta$, so that Bayesian models in high dimensions become tractable. The idea is to sample from the posterior distribution by combining a "random search" (the Monte Carlo aspect) with a mechanism for intelligently "jumping" around, in a manner that ultimately doesn't depend on where we started from (the Markov Chain aspect). Hence Markov Chain Monte Carlo methods are memoryless searches performed with intelligent jumps [QuantStart Team, 2]. Ben Shaver provides a good introduction [Ben Shaver].

Markov Chain Monte Carlo is a family of algorithms rather than one particular method: Metropolis, Metropolis-Hastings, the Gibbs sampler, Hamiltonian MCMC and the No-U-Turn Sampler (NUTS), among others. The main difference among the MCMC algorithms is how the jump to the next position in parameter space is proposed and accepted.

The basic recipes for most MCMC algorithms tend to follow this pattern:

1. Begin the algorithm at the current position $\theta_{\textrm{current}}$ in parameter space

2. Propose a "jump" to a new position $\theta_{\textrm{new}}$ in parameter space

3. Accept or reject the jump probabilistically using the prior information and available data

     If the jump is accepted, move to the new position ($\theta_{\textrm{current}} = \theta_{\textrm{new}}$) and return to step 1

     If the jump is rejected, stay at $\theta_{\textrm{current}}$ and return to step 1

4. After a set number of jumps have occurred, return all of the accepted positions

The Metropolis Algorithm

The Metropolis algorithm uses a normal distribution to propose a jump. This normal distribution has mean $\mu$ equal to the current position and a standard deviation $\sigma$ that acts as the updating step size (the blog [QuantStart Team, 2] calls it the "proposal width").

The updating step is a parameter of the Metropolis algorithm and has a significant impact on convergence: a larger step jumps further in parameter space but can easily overshoot the higher-probability regions, while a smaller step makes convergence slower.
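The recipe above, with the normal proposal of the Metropolis algorithm, can be sketched in a few lines (the data counts, proposal width and burn-in length are illustrative assumptions; the target is the coin-flip posterior under a uniform prior):

```python
import numpy as np

np.random.seed(5)
N_H, N_T = 60, 40  # observed heads and tails -- an illustrative assumption

def log_posterior(theta):
    """Unnormalised log posterior: binomial likelihood x uniform prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf  # zero density outside (0, 1)
    return N_H * np.log(theta) + N_T * np.log(1 - theta)

def metropolis(n_samples, step=0.1, theta0=0.5):
    samples, theta = [], theta0
    for _ in range(n_samples):
        # Normal proposal centred at the current position, sd = step
        proposal = theta + step * np.random.randn()
        # Accept with probability min(1, posterior ratio)
        if np.log(np.random.rand()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples.append(theta)
    return np.array(samples)

samples = metropolis(20000)[2000:]  # discard burn-in
print(samples.mean())  # the analytic posterior mean is (N_H+1)/(N+2) = 61/102
```

Note that no normalizing constant is needed: only the ratio of posterior densities enters the acceptance step, which is exactly why MCMC sidesteps the intractable evidence integral.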

References

[Agustinus Kristiadi] MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation

[Barnabás Póczos & Aarti Singh] MLE, MAP, Bayes classifications

[Ben Shaver] A Zero-Math Introduction to Markov Chain Monte Carlo Methods

[Brian Keng] A Probabilistic Interpretation of Regularization

[Cross Validated: How to derive the likelihood function for binomial distribution for parameter estimation?] How to derive the likelihood function for binomial distribution for parameter estimation?

[Mary Mcglohon] MLE, MAP, AND NAIVE BAYES

[Nando de Freitas] Machine learning - Importance sampling and MCMC I

[QuantStart Team, 1] Bayesian Statistics: A Beginner's Guide

[QuantStart Team, 2] Markov Chain Monte Carlo for Bayesian Inference - The Metropolis Algorithm

[Quora] What is the difference between Maximum Likelihood (ML) and Maximum a Posteriori (MAP) estimation?

[Stathis Kamperis] Bayesian connection to LASSO and ridge regression

[Suzanna Sia] Closed form Bayesian Inference for Binomial distributions