
EM ALGORITHM

The EM Algorithm
The EM algorithm is a general method for finding maximum likelihood estimates of the parameters of an underlying distribution from observed data when the data are "incomplete" or have "missing values".
The "E" stands for "Expectation".
The "M" stands for "Maximization".
To set up the EM algorithm successfully, one has to come up with a way of relating the unobserved complete data to the observed incomplete data, so that the complete data have a "nice" likelihood as a function of the unknown parameters and maximum likelihood estimation is easy.
The EM Algorithm Setup
Let Y be the observed data and let X be the unobserved
complete data
The EM algorithm requires that:
observed data Y can be written as a function of the complete
data X . That is, there is some function t(X ) = Y that
collapses or projects X onto Y .
complete data X has a probability density or distribution
f (X |θ) for some parameter vector θ
We want to find θ̂ such that f (X |θ) is maximized, i.e., we
want to find the MLE (Maximum Likelihood Estimator).
If X were known, to find θ̂ we would generally take the log of
the likelihood function first
l(X|θ) = ln(f(X|θ))
We then maximize l(X|θ) with respect to θ
But since X is not observed, it is not possible to maximize l(X|θ)
The EM algorithm works around this by maximizing the conditional expectation of l(X|θ) given the observed data Y.
In the E step (expectation step) of the algorithm, we calculate the following conditional expectation
Q(θ|θ0) = E[l(X|θ) | Y, θ0] = E[ln(f(X|θ)) | Y, θ0]
where θ0 is some initial value of θ
Q is the expected complete data log-likelihood
The M step of the algorithm finds the θ̂ that maximizes Q(θ|θ0).
Then set θ1 = θ̂, where θ1 is now your current estimate.
Return to the E step and start the process over again by calculating Q(θ|θ1) and maximizing it with respect to θ.
Repeat this process until convergence, i.e., until |θn − θn−1| ≤ ε, where ε is some small number (e.g., 0.0001).
The essence of the EM algorithm is that maximizing Q(θ|θi) at each iteration i leads to an increase in the log likelihood of the observed data Y.
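The iteration just described (E step, M step, convergence check) can be sketched in code. The two-component normal mixture below is purely illustrative and not from these slides: the unobserved complete data X would include each point's component label, and the E step computes the posterior probability of that label.

```python
import math

def em(y, theta0, e_step, m_step, eps=1e-4, max_iter=500):
    """Generic EM loop: alternate E and M steps until the largest
    parameter change is at most eps (the stopping rule above)."""
    theta = theta0
    for _ in range(max_iter):
        theta_new = m_step(y, e_step(y, theta))
        if max(abs(a - b) for a, b in zip(theta_new, theta)) <= eps:
            return theta_new
        theta = theta_new
    return theta

# Illustrative model: equal-weight mixture of two unit-variance normals
# with unknown means (mu1, mu2); the labels are the missing data.
def e_step(y, theta):
    mu1, mu2 = theta
    resp = []
    for v in y:
        w1 = math.exp(-0.5 * (v - mu1) ** 2)
        w2 = math.exp(-0.5 * (v - mu2) ** 2)
        resp.append(w1 / (w1 + w2))  # P(label = 1 | v, theta)
    return resp

def m_step(y, resp):
    # Weighted means maximize the expected complete-data log-likelihood
    n1 = sum(resp)
    mu1 = sum(r * v for r, v in zip(resp, y)) / n1
    mu2 = sum((1 - r) * v for r, v in zip(resp, y)) / (len(y) - n1)
    return (mu1, mu2)
```

EM climbs the observed-data likelihood but only to a local maximum, so the starting value theta0 matters.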
ABO Blood Type Example
The locus corresponding to the ABO blood group has three alleles, A, B, and O, and is located on chromosome 9q34. Alleles A and B are co-dominant, and both are dominant to O. This leads to the following genotypes and phenotypes:

Genotype    Blood Type
AA, AO      A
BB, BO      B
AB          AB
OO          O
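Under Hardy-Weinberg equilibrium (assumed later in these slides), allele frequencies determine the blood-type frequencies by pooling genotypes as in the table above. A minimal sketch; the function name and example frequencies are my own:

```python
def phenotype_probs(pA, pB, pO):
    """Blood-type probabilities from allele frequencies under
    Hardy-Weinberg equilibrium: genotype AA has probability pA^2,
    AO has 2*pA*pO, etc.; each phenotype pools its genotypes."""
    return {
        "A":  pA**2 + 2 * pA * pO,   # genotypes AA and AO
        "B":  pB**2 + 2 * pB * pO,   # genotypes BB and BO
        "AB": 2 * pA * pB,           # genotype AB
        "O":  pO**2,                 # genotype OO
    }
```

Since pA + pB + pO = 1, the four phenotype probabilities always sum to (pA + pB + pO)^2 = 1.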
EM: ABO Blood Type Example
From a sample of 521, the following blood types were observed:

Blood Type    Total Number
A             186
B             38
AB            13
O             284
We want to estimate pA , pB , and pO , the frequency of alleles
A, B, and O, respectively. How can we do this?
Note that θ = (pA , pB , pO ).
What is the observed data?
What is the complete data?
Let N be the number of people in the study. The complete
data is X = (nA/A , nA/O , nB/B , nB/O , nA/B , nO/O ), where
nA/A is the number of people with A/A genotype, nA/O is the
number of people with the A/O genotype, etc...
The observed data is Y = (nA , nB , nAB , nO ), where nA is the
number of people with blood type A, nB is the number of
people with blood type B, etc...
What is N, the total number of people in the sample, in terms
of the unobserved complete data?
What is nA in terms of the unobserved complete data?
What is nB in terms of the unobserved complete data?
What is nAB in terms of the unobserved complete data?
What is nO in terms of the unobserved complete data?
What is the complete data likelihood? Assume HWE.
What is the complete data log-likelihood? Assume HWE.
nA = nA/A + nA/O
nB = nB/B + nB/O
nAB = nA/B
nO = nO/O
If the genotype data at the ABO gene were observed, then the likelihood function would have the following multinomial distribution:
f(X|θ) = (N choose nA/A, nA/O, nB/B, nB/O, nA/B, nO/O) × (pA^2)^nA/A (2 pA pO)^nA/O (pB^2)^nB/B (2 pB pO)^nB/O (2 pA pB)^nA/B (pO^2)^nO/O
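This complete data likelihood can be evaluated numerically on the log scale; the sketch below uses `math.lgamma` for the multinomial coefficient, and the genotype counts used in any example call are hypothetical:

```python
import math

def complete_loglik(counts, pA, pB, pO):
    """ln f(X|theta) for hypothetical genotype counts
    X = (nAA, nAO, nBB, nBO, nAB, nOO) under HWE; the multinomial
    coefficient is computed with lgamma to avoid huge factorials."""
    N = sum(counts)
    coef = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    probs = [pA**2, 2*pA*pO, pB**2, 2*pB*pO, 2*pA*pB, pO**2]
    # Zero-count terms contribute nothing, so skip them
    return coef + sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)
```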
Expectation Step of EM Algorithm
The complete data log-likelihood function is
ln(f(X|θ)) = ln(N choose nA/A, nA/O, nB/B, nB/O, nA/B, nO/O) + nA/A ln(pA^2) + nA/O ln(2 pA pO) + nB/B ln(pB^2) + nB/O ln(2 pB pO) + nA/B ln(2 pA pB) + nO/O ln(pO^2)
Remember that Y = (nA, nB, nAB, nO). For the initial iteration of the EM algorithm, the E step calculates
Q(θ|θ0) = E[ln(f(X|θ)) | Y, θ0]
So θ0 = (pA^0, pB^0, pO^0), and we want to calculate Q(pA, pB, pO | pA^0, pB^0, pO^0).
What is nA/A^0 = E[nA/A | Y, pA^0, pB^0, pO^0]?
nA/A^0 = nA P(A/A genotype | A blood type) = nA (pA^0)^2 / ((pA^0)^2 + 2 pA^0 pO^0)
What is nA/O^0 = E[nA/O | Y, pA^0, pB^0, pO^0]?
What is nA/B^0 = E[nA/B | Y, pA^0, pB^0, pO^0]?
What is nB/O^0 = E[nB/O | Y, pA^0, pB^0, pO^0]?
What is nO/O^0 = E[nO/O | Y, pA^0, pB^0, pO^0]?
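The remaining expectations follow the same conditioning pattern as the one for the A/A count: split each observed blood-type count according to the posterior genotype probabilities under HWE. A sketch (the function name and argument order are my own):

```python
def e_step_counts(nA, nB, nAB, nO, pA, pB, pO):
    """E step for the ABO example: expected genotype counts given the
    observed blood-type counts and current allele frequencies.
    For blood type A, a person is A/A with probability
    pA^2 / (pA^2 + 2*pA*pO), so E[nAA | Y, theta] = nA times that."""
    pA_denom = pA**2 + 2 * pA * pO
    pB_denom = pB**2 + 2 * pB * pO
    nAA = nA * pA**2 / pA_denom
    nAO = nA * 2 * pA * pO / pA_denom
    nBB = nB * pB**2 / pB_denom
    nBO = nB * 2 * pB * pO / pB_denom
    # AB and O blood types identify the genotype exactly,
    # so those counts are observed directly
    return nAA, nAO, nBB, nBO, nAB, nO
```

Note that the expected A/A and A/O counts always add back up to nA, and likewise for B.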
Q(θ|θ0) = nA/A^0 ln(pA^2) + nA/O^0 ln(2 pA pO) + nB/B^0 ln(pB^2) + nB/O^0 ln(2 pB pO) + nAB ln(2 pA pB) + nO ln(pO^2) + g(X)
where g(X) comes from the multinomial coefficient (N choose nA/A, nA/O, nB/B, nB/O, nA/B, nO/O) and is not a function of the parameters pA, pB, and pO
For the M step, we want θ̂ = (p̂A , p̂B , p̂O ) that maximizes Q
How would we do this?
Maximization Step of EM Algorithm
The M step involves maximizing Q, the expected complete data log-likelihood obtained in the E step, with respect to θ = (pA, pB, pO).
The MLEs are:
p̂A = (2 nA/A^0 + nA/O^0 + nAB) / (2N)
p̂B = (2 nB/B^0 + nB/O^0 + nAB) / (2N)
p̂O = (2 nO + nA/O^0 + nB/O^0) / (2N)
The next step is to set pA^1 = p̂A, pB^1 = p̂B, pO^1 = p̂O.
Then return to the E step of the algorithm and compute Q(θ|θ1), where θ1 = (pA^1, pB^1, pO^1).
Continue iterating between the E and the M step until the θi
values converge.
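Putting the E and M steps together for the observed counts (186, 38, 13, 284) gives a short working iteration. The uniform starting value and the tolerance are my own illustrative choices:

```python
def abo_em(nA=186, nB=38, nAB=13, nO=284, eps=1e-6):
    """EM for ABO allele frequencies from the observed blood-type
    counts in the slides, starting from uniform frequencies."""
    N = nA + nB + nAB + nO
    pA = pB = pO = 1.0 / 3.0
    while True:
        # E step: expected genotype counts given current frequencies
        nAA = nA * pA**2 / (pA**2 + 2 * pA * pO)
        nAO = nA - nAA
        nBB = nB * pB**2 / (pB**2 + 2 * pB * pO)
        nBO = nB - nBB
        # M step: allele counting from the expected genotype counts
        pA_new = (2 * nAA + nAO + nAB) / (2 * N)
        pB_new = (2 * nBB + nBO + nAB) / (2 * N)
        pO_new = (2 * nO + nAO + nBO) / (2 * N)
        if max(abs(pA_new - pA), abs(pB_new - pB), abs(pO_new - pO)) <= eps:
            return pA_new, pB_new, pO_new
        pA, pB, pO = pA_new, pB_new, pO_new
```

By construction the three updated frequencies sum to exactly 1 at every iteration, since each of the 2N alleles is counted once.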