Recently I have been revising Data Science, as I had grown a bit bored with Big Data. When you relearn something, you tend to understand it from a different angle. Below is my short note.
- Result/Example: MCMC word decryption using the Metropolis variant, scored with word unigrams, word bigrams, and letter trigrams (a toy sketch of the idea appears at the end of this note).
- Motivation: Given a known model with a set of parameters, one would like to a) get the probability of a set of parameters (and hence approximate each parameter's distribution), and b) find the best set of parameters.
- Problem: There are infinitely many parameter combinations, so the denominator (the normalizing constant that sums or integrates over all of them) is intractable to compute.
- Solution: MCMC
- Intuition: a) Monte Carlo – sampling methods that follow a general recipe: i) identify the parameters, ii) draw values from a distribution, iii) compute the output, iv) aggregate the results. b) Markov Chain – localized dependency: each new sample depends only on the previous draw, which narrows the search space by imposing constraints that vary from algorithm to algorithm.
- Common variations (algorithms):
a) Metropolis – used when i) the proposal distribution of the random walk is symmetric, i.e. equally likely in either direction, and ii) the full conditional distributions of the parameters are not known (a minimal sketch follows this list).
b) Metropolis-Hastings – same as a), but it also handles an asymmetric random-walk proposal by correcting the acceptance ratio (see the second sketch below).
c) Gibbs sampling – used when the full conditional distributions are known. The joint distribution can still be recovered from them, as guaranteed by the Hammersley-Clifford theorem (a bivariate example is sketched below).
- MCMC knobs to improve convergence: a) Burn-in – discard the initial samples, which are still erratic before the chain settles into the target distribution. b) Sampling interval (thinning) – keep only every k-th sample so the retained draws are closer to i.i.d. c) Number of samples – enough draws to approximate the intended distribution. d) Perturbation size – the magnitude of the random-walk step (see the post-processing sketch below).
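
A minimal sketch of the Metropolis algorithm described above, on a toy problem of my own choosing: estimating the mean of a normal distribution with known variance under a flat prior, so the unnormalized log-posterior is just the log-likelihood. The step size and sample counts are illustrative, not tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data drawn from N(3, 1); we sample the posterior of mu with a flat prior.
data = rng.normal(loc=3.0, scale=1.0, size=50)

def log_posterior(mu):
    # Log-likelihood of the data under N(mu, 1), up to an additive constant
    return -0.5 * np.sum((data - mu) ** 2)

def metropolis(log_post, n_samples=10000, step=0.5, init=0.0):
    samples = np.empty(n_samples)
    current = init
    current_lp = log_post(current)
    for i in range(n_samples):
        # Symmetric random-walk proposal: Gaussian centered on the current value
        proposal = current + rng.normal(scale=step)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

samples = metropolis(log_posterior)
print("posterior mean of mu ~", samples[2000:].mean())  # crude burn-in of 2000 draws
```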
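Metropolis-Hastings differs only in the acceptance ratio: with an asymmetric proposal density q, the ratio is multiplied by q(current | proposal) / q(proposal | current). A sketch under an assumed setup of mine: an Exponential(1) target with a log-normal multiplicative random walk, which is asymmetric and keeps the parameter positive.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Unnormalized Exponential(1) target on x > 0
    return -x if x > 0 else -np.inf

def log_q(x_to, x_from, sigma=0.5):
    # Log density of proposing x_to from x_from under a log-normal random walk
    z = (np.log(x_to) - np.log(x_from)) / sigma
    return -0.5 * z ** 2 - np.log(x_to * sigma * np.sqrt(2 * np.pi))

def metropolis_hastings(n_samples=20000, sigma=0.5, init=1.0):
    samples = np.empty(n_samples)
    current = init
    for i in range(n_samples):
        proposal = current * np.exp(sigma * rng.normal())
        # Hastings correction: add log q(current | proposal) - log q(proposal | current)
        log_ratio = (log_target(proposal) - log_target(current)
                     + log_q(current, proposal, sigma) - log_q(proposal, current, sigma))
        if np.log(rng.uniform()) < log_ratio:
            current = proposal
        samples[i] = current
    return samples

samples = metropolis_hastings()
print("mean ~", samples[5000:].mean())  # Exponential(1) has mean 1
```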
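For Gibbs sampling, here is a minimal sketch assuming the classic textbook case of a standard bivariate normal with known correlation rho, where both full conditionals are themselves normal and can be sampled directly.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_bivariate_normal(rho=0.8, n_samples=10000):
    # For a standard bivariate normal with correlation rho, the full conditionals
    # are known exactly: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    samples = np.empty((n_samples, 2))
    x, y = 0.0, 0.0
    sd = np.sqrt(1 - rho ** 2)
    for i in range(n_samples):
        x = rng.normal(loc=rho * y, scale=sd)  # draw x from p(x | y)
        y = rng.normal(loc=rho * x, scale=sd)  # draw y from p(y | x)
        samples[i] = (x, y)
    return samples

samples = gibbs_bivariate_normal()
print("empirical correlation ~", np.corrcoef(samples[1000:].T)[0, 1])
```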
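The convergence knobs above are mostly post-processing and tuning decisions. A small sketch of how burn-in and the sampling interval (thinning) might be applied to a raw chain; the cutoff values are arbitrary illustrations. The perturbation (step) size itself is usually tuned so the acceptance rate of the random walk is neither near 0 nor near 1; something in the rough range of 20–50% is a common rule of thumb.

```python
import numpy as np

def postprocess(chain, burn_in=1000, thin=10):
    # Drop the first `burn_in` erratic draws, then keep every `thin`-th sample
    # so the retained draws are closer to independent.
    return np.asarray(chain)[burn_in::thin]

# e.g. kept = postprocess(samples, burn_in=2000, thin=5)
```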
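Finally, a compact sketch of the decryption example mentioned at the top of this note, reduced to a substitution cipher scored with letter bigrams only (the full version would add word unigrams/bigrams and letter trigrams). Everything here, including the reference-text scoring and the two-letter swap proposal, is an illustrative assumption rather than the original code; the swap proposal is symmetric, so plain Metropolis acceptance applies.

```python
import string
import numpy as np

rng = np.random.default_rng(3)
ALPHABET = string.ascii_lowercase + " "

def bigram_log_probs(reference_text):
    # Letter-bigram log-probabilities with add-one smoothing, learned from a reference text
    idx = {c: i for i, c in enumerate(ALPHABET)}
    counts = np.ones((len(ALPHABET), len(ALPHABET)))
    text = "".join(c for c in reference_text.lower() if c in idx)
    for a, b in zip(text, text[1:]):
        counts[idx[a], idx[b]] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True)), idx

def score(text, logp, idx):
    # Plausibility of a candidate plaintext under the bigram model
    return sum(logp[idx[a], idx[b]] for a, b in zip(text, text[1:])
               if a in idx and b in idx)

def decrypt(ciphertext, key):
    # key maps cipher letters back to plaintext letters; other characters pass through
    return "".join(key.get(c, c) for c in ciphertext)

def metropolis_decrypt(ciphertext, logp, idx, n_iters=20000):
    letters = list(string.ascii_lowercase)
    key = dict(zip(letters, letters))  # start from the identity key
    current_score = score(decrypt(ciphertext, key), logp, idx)
    best_key, best_score = dict(key), current_score
    for _ in range(n_iters):
        a, b = rng.choice(letters, size=2, replace=False)
        proposal = dict(key)
        proposal[a], proposal[b] = key[b], key[a]  # symmetric proposal: swap two letters
        prop_score = score(decrypt(ciphertext, proposal), logp, idx)
        if np.log(rng.uniform()) < prop_score - current_score:  # Metropolis acceptance
            key, current_score = proposal, prop_score
            if current_score > best_score:
                best_key, best_score = dict(key), current_score
    return best_key

# Usage sketch: fit the bigram model on a long English reference text, then
#   logp, idx = bigram_log_probs(open("reference.txt").read())
#   key = metropolis_decrypt(ciphertext, logp, idx)
#   print(decrypt(ciphertext, key))
```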