Monday, 27 July 2015

PyMC Tutorial #3: Latent Dirichlet Allocation (LDA) using PyMC

Before you read this post, we suggest you read our previous post on the Naïve Bayes (NB) topic model, since the code presented here is just a modification of the code from that post. Some terms used below were explained there.

Previously, we explained that the Naïve Bayes (NB) model assumes that each document in the collection is drawn from a single topic. Unlike Naïve Bayes, Latent Dirichlet Allocation (LDA) assumes that a single document is a mixture of several topics [1][2]. Moreover, we will use the smoothed version of LDA, which is described in the original paper by Blei et al. [1]. Smoothing was proposed to tackle the sparsity problem in the collection: a document might contain words that do not appear in any other document, and without smoothing such a document is assigned zero probability, so the model can never generate it. Smoothed LDA overcomes this kind of problem [1].
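As a tiny illustration of why this matters (this is not part of the LDA code; the counts and vocabulary below are made up), compare an unsmoothed word distribution with one that adds Dirichlet-style pseudo-counts. Any document containing a word with zero probability gets an overall probability of zero:

 import numpy as np

 # toy word counts for one topic over a 4-word vocabulary;
 # the last word was never observed, so its count is 0
 counts = np.array([5.0, 3.0, 2.0, 0.0])

 phi_unsmoothed = counts / counts.sum()
 phi_smoothed = (counts + 1.0) / (counts + 1.0).sum()   # add-one pseudo-counts

 doc = [0, 1, 3]   # a document that contains the unseen word (id 3)

 print(np.prod(phi_unsmoothed[doc]))   # 0.0 -> this document can never be generated
 print(np.prod(phi_smoothed[doc]))     # small, but strictly positive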


The “generative aspect” of LDA model

We will just modify the NB model from the previous post a little bit!

Suppose \(D\) is the number of documents in our collection, \(N_d\) is the number of words in the \(d\)-th document, and \(K\) is the number of predefined topics. In this case, the number of topics is not inferred automatically; instead, we set the value of \(K\) manually based on our intuition.

For each topic \(k\), we draw its word distribution, which is denoted as \(\phi_k\). Since we are going to model the word distribution using a multinomial/categorical distribution, we use its conjugate prior, the Dirichlet distribution, to model its parameter vector.

\[\phi_k \sim Dir(\beta), 1 \leq k \leq K\]

After that, for each document \(d\), we draw a topic distribution, which is denoted as \(\theta_d\)

\[\theta_d \sim Dir(\alpha), 1 \leq d \leq D\]

See! The preceding notation differs from the NB model, since here a document might contain more than one topic (with corresponding proportions)!

The hyperparameters \(\alpha\) and \(\beta\) are assumed to be fixed in our case.

Then, for each word \(n\) in document \(d\), we draw a topic for that word. This topic is denoted as \(z_{d,n}\)

\[z_{d,n} \sim Categorical(\theta_d), 1 \leq d \leq D, 1 \leq n \leq N_d\]

In the NB model, we have seen that a single topic is associated with a whole document. LDA, on the other hand, associates a single topic with each individual word. Hence, drawing a topic must be performed at the word level, not the document level.

Finally, once we know the topic of a word (\(z_{d,n}\)), we draw the actual word itself from the word distribution associated with that topic. Each word is denoted as \(w_{d,n}\).

\[w_{d,n} \sim Categorical(\phi_{z_{d,n}}), 1 \leq d \leq D, 1 \leq n \leq N_d\]
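To make this generative story concrete, here is a minimal NumPy sketch of ancestral sampling from the model. This is only an illustration, not the PyMC code below; the values of \(K\), \(V\), \(D\), and \(N_d\) are toy assumptions:

 import numpy as np

 np.random.seed(0)

 K, V, D = 3, 5, 4              # topics, vocabulary size, documents (toy values)
 Nd = [7, 7, 6, 4]              # number of words in each document (toy values)
 alpha, beta = np.ones(K), np.ones(V)

 # phi_k : word distribution of each topic
 phi = np.random.dirichlet(beta, size=K)

 # theta_d : topic distribution of each document
 theta = np.random.dirichlet(alpha, size=D)

 docs = []
 for d in range(D):
     # z_{d,n} : a topic for every word position in document d
     z = np.random.choice(K, size=Nd[d], p=theta[d])
     # w_{d,n} : a word drawn from phi_{z_{d,n}}
     w = [int(np.random.choice(V, p=phi[zdn])) for zdn in z]
     docs.append(w)

 print(docs)   # each document is a mixture of topics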

The following figure is the plate notation for LDA model (smoothed):




Now, it is time to build the code using PyMC. We just need to modify the previous code :)

-code1-
 import numpy as np
 import pymc as pc


 def wordDict(collection):
     """Assign an integer id to every distinct word in the collection."""
     word_id = {}
     idCounter = 0
     for d in collection:
         for w in d:
             if w not in word_id:
                 word_id[w] = idCounter
                 idCounter += 1
     return word_id


 def toNpArray(word_id, collection):
     """Convert each document into an array of word ids."""
     ds = []
     for d in collection:
         ws = []
         for w in d:
             ws.append(word_id.get(w, 0))
         ds.append(ws)
     return np.array(ds)

 ###################################################

 # doc1, doc2, ..., doc7
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],
         ["sepak","bola","sepak","bola","sepak","sepak"],
         ["ekonomi","ekonomi","uang","uang"],
         ["sepak","uang","ekonomi"],
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],
         ["teknologi","komputer","teknologi"]]

 word_dict = wordDict(docs)
 collection = toNpArray(word_dict, docs)

 # number of topics
 K = 3

 # number of words (vocabulary size)
 V = len(word_dict)

 # number of documents
 D = len(collection)

 # symmetric Dirichlet hyperparameter: array([1, 1, ..., 1]), K times
 alpha = np.ones(K)

 # symmetric Dirichlet hyperparameter: array([1, 1, ..., 1]), V times
 beta = np.ones(V)

 # length (number of words) of each document in our collection
 Nd = [len(doc) for doc in collection]


 ######################## LDA model ##################################

 # theta_d : topic distribution of each document
 theta = pc.Container([pc.CompletedDirichlet("theta_%s" % i,
                                             pc.Dirichlet("ptheta_%s" % i, theta=alpha))
                       for i in range(D)])

 # phi_k : word distribution of each topic
 phi = pc.Container([pc.CompletedDirichlet("phi_%s" % j,
                                           pc.Dirichlet("pphi_%s" % j, theta=beta))
                     for j in range(K)])

 # z_{d,n} : topic assignment of every word in document d
 # (please note that this is the tricky part :))
 z = pc.Container([pc.Categorical("z_%i" % d,
                                  p=theta[d],
                                  size=Nd[d],
                                  value=np.random.randint(K, size=Nd[d]))
                   for d in range(D)])

 # w_{d,n} : observed word, generated from phi_{z_{d,n}} given its topic z_{d,n}
 w = pc.Container([pc.Categorical("w_%i_%i" % (d, i),
                                  p=pc.Lambda("phi_z_%i_%i" % (d, i),
                                              lambda z=z[d][i], phi=phi: phi[z]),
                                  value=collection[d][i],
                                  observed=True)
                   for d in range(D) for i in range(Nd[d])])

 ####################################################################

 model = pc.Model([theta, phi, z, w])
 mcmc = pc.MCMC(model)
 mcmc.sample(iter=5000, burn=1000)


 # show the topic assignment of each word, using the last sample in the trace
 # (5000 iterations - 1000 burn-in = 4000 kept samples, indexed 0..3999)
 for d in range(D):
     print(mcmc.trace('z_%i' % d)[3999])
   

On our computer, the output of the preceding code is as follows:

-code2-
 [2 2 2 2 2 2 2]  
 [1 1 1 0 0 1 1]  
 [2 2 2 2 2 2]  
 [1 0 0 0]  
 [1 0 1]  
 [1 2 0 1 1 1]  
 [1 1 1]  
   

What does this mean? Let’s compare the preceding results with our document collection:

-code3-
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],  
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],  
         ["sepak","bola","sepak","bola","sepak","sepak"],  
         ["ekonomi","ekonomi","uang","uang"],  
         ["sepak","uang","ekonomi"],  
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],  
         ["teknologi","komputer","teknologi"]]  
   

Remember that we set \(K = 3\). This value may come from our intuition, or our prior knowledge and experience.

We can see that the word “sepak” in doc #1 is associated with topic 2, the word “ekonomi” in doc #2 is associated with topic 1, and the word “uang” in doc #5 is associated with topic 0. Using these results, we can say that topic 2 is talking about sport, while topic 1 is somehow talking about economics. We cannot really be certain what topic 0 tells us.

We can also read the results as follows: doc #1 consists of only topic 2. Doc #2 is a mixture of topic 1 and topic 0. Doc #6 is generated from 3 topics.
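If we want to double-check these readings, we can also inspect the sampled \(\theta_d\) (per-document topic proportions) and \(\phi_k\) (per-topic word distributions) themselves. The following sketch assumes it runs right after the code above (so that mcmc, word_dict, D, and K are still in scope) and that PyMC keeps traces of the CompletedDirichlet deterministics, which it does by default; np.ravel is used because CompletedDirichlet values carry an extra leading dimension:

 # per-document topic proportions (last sample of theta_d)
 for d in range(D):
     print("doc #%i:" % (d + 1), np.ravel(mcmc.trace('theta_%i' % d)[-1]))

 # per-topic word probabilities (last sample of phi_k), mapped back to the words
 id_word = dict((i, w) for w, i in word_dict.items())
 for k in range(K):
     probs = np.ravel(mcmc.trace('phi_%i' % k)[-1])
     print("topic %i:" % k,
           [(id_word[i], round(float(p), 3)) for i, p in enumerate(probs)])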

We really cannot judge whether or not this LDA model works well here. As we said before, it would be interesting to try it on a real-world dataset to see its behavior :)



Main References:
[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
[2] Bob Carpenter. Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naïve Bayes for Collapsed Gibbs Sampling. LingPipe, Inc., 2010.



Alfan Farizki Wicaksono
(firstname [at] cs [dot] ui [dot] ac [dot] id)
Fakultas Ilmu Komputer, UI
Written in Depok, 27 July 2015






