Before you read this post, we suggest you read our previous post on the PyMC introduction, several parameter estimation techniques, and Bernoulli parameter estimation.
The Naive Bayes (NB) model can be considered a topic model (like Latent Dirichlet Allocation (LDA)), in which each word in a document is generated from a topic-specific multinomial/categorical distribution [1]. In addition, NB assumes that each document is drawn from a single topic (NOT a mixture of topics) [1].
The “generative aspect” of the NB model
Suppose \(D\) is the number of documents in our collection, \(N_d\) is the number of words in the \(d\)-th document, and \(K\) is the number of predefined topics. In this case, the number of topics is not automatically inferred; instead, we set the value of \(K\) manually based on our intuition.
For each topic \(k\), we draw its word distribution, denoted as \(\phi_k\). Since we are going to model the word distribution using a multinomial/categorical distribution, we use the Dirichlet distribution (its conjugate prior) to model its parameter.
\[\phi_k \sim Dir(\beta), 1 \leq k \leq K\]
After that, we draw a global (collection-level) topic distribution, denoted as \(\theta\).
\[\theta \sim Dir(\alpha)\]
The hyperparameters \(\alpha\) and \(\beta\) are assumed to be fixed in our case.
Then, for each document \(d\), we draw a topic describing that document. This topic is denoted as \(z_d\).
\[z_d \sim Categorical(\theta), 1 \leq d \leq D\]
Here, instead of modeling \(z_d\) as a multinomial random variable, we model it using a categorical distribution. There is no particular reason for this, other than that it makes the code easier to build later.
Finally, once we know the topic of a document, we draw its words from the word distribution associated with the selected topic. Each word is denoted as \(w_{d,n}\).
\[w_{d, n} \sim Categorical(\phi_{z_d}), 1 \leq d \leq D, 1 \leq n \leq N_d\]
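Putting these pieces together, the joint distribution implied by this generative story factorizes as
\[p(\theta, \phi, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{d=1}^{D} \Big( p(z_d \mid \theta) \prod_{n=1}^{N_d} p(w_{d,n} \mid \phi_{z_d}) \Big)\]
To make the generative story concrete, here is a minimal NumPy sketch that samples a small synthetic collection from this model (the sizes, document lengths, seed, and the name synthetic_docs are only for illustration):

import numpy as np

np.random.seed(0)                         #for reproducibility of this sketch

K, V, D = 3, 6, 7                         #illustrative numbers of topics, vocabulary size, documents
alpha, beta = np.ones(K), np.ones(V)      #symmetric Dirichlet hyperparameters
Nd = [7, 7, 6, 4, 3, 6, 3]                #illustrative document lengths

phi = np.random.dirichlet(beta, size=K)   #phi_k ~ Dir(beta): one word distribution per topic
theta = np.random.dirichlet(alpha)        #theta ~ Dir(alpha): collection-level topic distribution

synthetic_docs = []
for d in range(D):
    z_d = np.random.choice(K, p=theta)                   #z_d ~ Categorical(theta): one topic per document
    words = np.random.choice(V, size=Nd[d], p=phi[z_d])  #w_{d,n} ~ Categorical(phi_{z_d})
    synthetic_docs.append(words.tolist())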
Estimating the marginal posterior distributions of the hidden variables
Our goal is to estimate the marginal posterior distributions of the hidden variables; in particular, we want to automatically infer the topic of each document, which is currently hidden. All we know is the data, that is, the documents and their words. In other words, our task is essentially a clustering task: we aim to cluster documents that share the same topic.
First, we import the libraries needed for this problem.
-code1-
import numpy as np
import pymc as pc   #PyMC 2.x (this post uses the PyMC 2 API)
import random as rn
Suppose, for our toy implementation, our document collection is represented as follows.
-code2-
#doc1, doc2, ..., doc7
docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],
        ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],
        ["sepak","bola","sepak","bola","sepak","sepak"],
        ["ekonomi","ekonomi","uang","uang"],
        ["sepak","uang","ekonomi"],
        ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],
        ["teknologi","komputer","teknologi"]]
Then, we need to convert the collection into a form that will be easier to process later. In this case, we create a set of words and associate each word with a unique integer (an id). Instead of using arrays of words, we will use arrays of ids.
-code3-
def wordDict(collection):
    word_id = {}
    idCounter = 0
    for d in collection:
        for w in d:
            if (w not in word_id):
                word_id[w] = idCounter
                idCounter += 1
    return word_id

def toNpArray(word_id, collection):
    ds = []
    for d in collection:
        ws = []
        for w in d:
            ws.append(word_id.get(w, 0))
        ds.append(ws)
    #documents have different lengths, so build an object array
    return np.array(ds, dtype=object)
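As a quick usage sketch, applying these helpers to the toy collection above should give a vocabulary of six words, mapped to ids in order of first appearance (assuming Python 3.7+ dictionary ordering for the printed output):

print(wordDict(docs))
#{'sepak': 0, 'bola': 1, 'uang': 2, 'ekonomi': 3, 'komputer': 4, 'teknologi': 5}

print(toNpArray(wordDict(docs), docs)[0])
#[0, 1, 0, 1, 1, 1, 0]  (doc1 represented as ids)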
After that, we need to set several useful variables. Suppose we set the number of topics to 3.
-code4-
word_dict = wordDict(docs)
collection = toNpArray(word_dict,docs)
#number of topics
K = 3
#number of words (vocab)
V = len(word_dict)
#number of documents
D = len(collection)
#array([1, 1, 1, ..., 1]) K times
alpha = np.ones(K)
#array([1, 1, 1, ..., 1]) V times
beta = np.ones(V)
#array containing the information about doc length in our collection
Nd = [len(doc) for doc in collection]
Now, let us see how easy it is to create our own topic model using PyMC :) The following code shows how to represent the NB model using PyMC.
-code5-
#naive bayes model

#word distribution for each topic
phi = pc.Container([pc.CompletedDirichlet("phi_%i" % k,
                                          pc.Dirichlet("pphi_%i" % k, theta=beta))
                    for k in range(K)])

#topic distribution in the collection
theta = pc.CompletedDirichlet("theta",
                              pc.Dirichlet("ptheta", theta=alpha))

#for each document, draw a topic z_d
z = pc.Container([pc.Categorical("z_%i" % d,
                                 p=theta,
                                 value=rn.randint(0, K-1))
                  for d in range(D)])

#for each document, draw its words based on topic z_d
w = pc.Container([pc.Categorical("w_%i_%i" % (d, i),
                                 p=pc.Lambda("phi_z_%i_%i" % (d, i),
                                             lambda z=z[d], phi=phi: phi[z]),
                                 value=collection[d][i],
                                 observed=True)
                  for d in range(D) for i in range(Nd[d])])

model = pc.Model([theta, phi, z, w])
The “Container” class is used when we need to wrap a set of identical random variables (such as the \(\phi_k\)'s) in a single reference. “Dirichlet” is the PyMC class for the Dirichlet distribution. Unfortunately, it only generates \(k-1\) values, since the \(k\)-th value is determined by the fact that all probability values must sum to \(1\); hence, the “CompletedDirichlet” class appends this last value. “Lambda” is the PyMC class for representing a deterministic random variable with a lambda expression. You can read more about all of these on the PyMC homepage.
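To make this concrete, here is a minimal sketch (assuming PyMC 2.x; the names demo and demo_full are just for illustration) showing that “Dirichlet” holds only \(k-1\) free values while “CompletedDirichlet” appends the missing one:

demo = pc.Dirichlet("demo", theta=np.ones(3))         #only k-1 = 2 free values
demo_full = pc.CompletedDirichlet("demo_full", demo)  #appends the 3rd value

print(len(np.ravel(demo.value)))        #2
print(np.ravel(demo_full.value).sum())  #approximately 1.0 (a proper probability vector)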
Next, let's perform MCMC to estimate the parameters of the NB model!
-code6-
mcmc = pc.MCMC(model)
mcmc.sample(iter=5000, burn=1000)
In the preceding code, we run 5000 iterations (to generate samples and improve the quality of the marginal posterior distributions of the parameters), and we assume that the burn-in period is over after the first 1000 iterations. The generated traces do not include the samples drawn before the end of the burn-in period.
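As a quick sanity check (a sketch, assuming the PyMC 2.x trace API), each retained trace should contain iter - burn = 4000 samples, and we can, for example, inspect the posterior mean of \(\theta\):

theta_samples = mcmc.trace('theta')[:]   #all retained samples of theta
print(len(theta_samples))                #4000
print(theta_samples.mean(axis=0))        #posterior mean of the collection topic distribution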
Finally, let us see the estimation results! Remember once again that we previously set \(K = 3\), which means we are quite sure there are 3 topics in our collection, but we do not know which document has which topic!
-code7-
#show z_d after sampling (random walks) has ended
#number of traces = iter - burn = 5000 - 1000 = 4000
for d in range(D):
    print(mcmc.trace('z_%i' % d)[3999])
On our computer, the results are as follows:
-code8-
1
2
1
2
1
0
0
Let us compare the results with our collection once again:
-code9-
docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],                  #doc1
        ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],            #doc2
        ["sepak","bola","sepak","bola","sepak","sepak"],                        #doc3
        ["ekonomi","ekonomi","uang","uang"],                                    #doc4
        ["sepak","uang","ekonomi"],                                             #doc5
        ["komputer","komputer","teknologi","teknologi","komputer","teknologi"], #doc6
        ["teknologi","komputer","teknologi"]]                                   #doc7
We can see that docs #1, #3, and #5 are assigned the same topic (topic 1), docs #2 and #4 share a topic as well (topic 2), and docs #6 and #7 have the same topic (topic 0). The algorithm works reasonably well at automatically clustering documents that belong to the same topic. However, this is just a toy implementation, and we need to interpret the result with care: doc #5, for instance, contains one sports-related word and two economy-related words, so grouping it with docs #1 and #3 rather than with docs #2 and #4 is debatable.
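Note also that the code above reads off only the very last sample of each trace. A slightly more robust readout (still a sketch, using the same PyMC 2.x trace API) is to take the most frequently sampled topic for each document across all retained samples:

#for each document, take the most frequently sampled topic
for d in range(D):
    zs = mcmc.trace('z_%i' % d)[:]                       #4000 sampled topic assignments
    print(np.bincount(zs.astype(int), minlength=K).argmax())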
It will be interesting to try this program on a real-world document collection :)
Main References:
[1] Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naïve Bayes for Collapsed Gibbs Sampling, Bob Carpenter, LingPipe Inc, 2010.
Alfan Farizki Wicaksono
(firstname [at] cs [dot] ui [dot] ac [dot] id)
Fakultas Ilmu Komputer, UI
Written in Depok, 27 July 2015