# Markov Chain Monte Carlo (MCMC)

Recently I have been revising Data Science, having grown bored with Big Data. When one relearns a subject, one understands it from a different perspective. The following is my short note.

1. Result/Example: MCMC word decryption using the Metropolis variation, scored with word unigrams, word bigrams, and letter trigrams.
2. Motivation: Given a known equation with a set of parameters, one could a) get the probability of a set of parameters (and hence approximate each parameter’s distribution), and b) find the best set of parameters.
3. Problem: There are infinitely many parameter combinations, so computing the normalizing denominator is intractable.
4. Solution: MCMC
5. Intuition:
a) Monte Carlo – sampling methods that follow a general format: i) identify the parameters, ii) sample values from a distribution, iii) compute the output, iv) aggregate the results.
b) Markov Chain – localized dependency: each new sample depends on the previous draw, which reduces the search space by introducing constraints that vary by algorithm.
6. Common variation (algorithm):
a) Metropolis – used when i) the proposal distribution of the random walk is symmetric, i.e. equal probability in either direction, and ii) the full conditional distributions are unknown, meaning the distributions of the parameters are not known.
b) Metropolis-Hastings – same as a), but also handles an asymmetric proposal distribution of the random walk by correcting the acceptance ratio.
c) Gibbs Sampling – used when the full conditional distributions are known. The joint distribution can still be recovered, as shown by the Hammersley–Clifford theorem.
7. MCMC parameters to improve convergence:
a) Burn-in – discard the initial records from the sampled set, as the chain moves erratically before it settles.
b) Sampling interval (thinning) – keep only every k-th draw so the retained samples are closer to i.i.d.
c) Number of samples – enough draws to approximate the intended distribution.
d) Perturbation (step) size – the magnitude of the random walk.
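The pieces above (symmetric random walk, acceptance ratio without the normalizing denominator, burn-in, thinning, step size) can be put together in a minimal random-walk Metropolis sketch. This is my own illustration, not the word-decryption example: the target here is an unnormalized standard normal density, and all function names and tuning values are assumptions.

```python
# Minimal random-walk Metropolis sketch (illustrative; names and settings are my own).
# The target is the unnormalized density exp(-x^2/2) of a standard normal,
# so the intractable normalizing denominator is never needed.
import math
import random

random.seed(0)

def unnormalized_target(x):
    # Only the numerator of the distribution is required by the acceptance ratio.
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, step_size=1.0, burn_in=500, thin=5):
    x = 0.0                       # arbitrary starting point
    kept = []
    total = burn_in + n_samples * thin
    for i in range(total):
        # Symmetric random walk: a step of equal probability in either direction,
        # so no Metropolis-Hastings correction is needed.
        proposal = x + random.uniform(-step_size, step_size)
        if random.random() < min(1.0, unnormalized_target(proposal) / unnormalized_target(x)):
            x = proposal
        # Discard burn-in, then keep every `thin`-th draw to reduce autocorrelation.
        if i >= burn_in and (i - burn_in) % thin == 0:
            kept.append(x)
    return kept

samples = metropolis(5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should be near 0 and 1 for a standard normal
```

Increasing `step_size` lowers the acceptance rate but explores faster; shrinking it does the opposite, which is the trade-off the perturbation-size parameter controls.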
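For contrast with Metropolis, here is a minimal Gibbs sampling sketch, again my own illustration under assumed settings: a bivariate normal with correlation rho, where each full conditional is a known one-dimensional normal, so the sampler alternates exact draws from p(x|y) and p(y|x) with no accept/reject step.

```python
# Minimal Gibbs sampling sketch (illustrative; rho and sample counts are assumptions).
# For a standard bivariate normal with correlation rho, the full conditionals are
# x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2).
import random

random.seed(0)
rho = 0.8

def gibbs(n_samples, burn_in=500):
    x, y = 0.0, 0.0
    pairs = []
    cond_sd = (1 - rho * rho) ** 0.5   # standard deviation of each full conditional
    for i in range(burn_in + n_samples):
        x = random.gauss(rho * y, cond_sd)   # exact draw from p(x | y)
        y = random.gauss(rho * x, cond_sd)   # exact draw from p(y | x)
        if i >= burn_in:
            pairs.append((x, y))
    return pairs

pairs = gibbs(20000)
est_exy = sum(x * y for x, y in pairs) / len(pairs)
print(round(est_exy, 2))  # E[xy] should approach rho = 0.8
```

Because each conditional draw is exact, every iteration is "accepted", which is why Gibbs sampling is preferred whenever the full conditionals are available in closed form.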
