The Hidden Topic Markov Model

We propose modeling the topics of the words in a document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and that successive sentences are more likely to have the same topic. Since the topics are hidden, this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in modeling documents with only a modest increase in learning and inference complexity.
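The sentence-level Markov assumption can be illustrated with a minimal generative sketch. It is not the C++ implementation linked on this page. The names `theta` (per-document topic proportions), `beta` (per-topic word distributions) and `epsilon` (probability of drawing a fresh topic at a sentence boundary) are assumptions introduced here for illustration, as are the toy numbers.

```python
import random


def sample_document(sentence_lengths, theta, beta, epsilon, rng):
    """Sample (topic, words) pairs for one document, one pair per sentence.

    theta:   topic probabilities for this document (length K), assumed name
    beta:    K word distributions over the vocabulary, assumed name
    epsilon: chance of drawing a fresh topic at a sentence boundary, assumed name
    """
    topics = list(range(len(theta)))
    vocab = list(range(len(beta[0])))
    doc = []
    topic = None
    for length in sentence_lengths:
        # The first sentence always draws a topic from theta. Later sentences
        # keep the previous topic unless a fresh draw is triggered.
        if topic is None or rng.random() < epsilon:
            topic = rng.choices(topics, weights=theta)[0]
        # Every word in the sentence is generated by the same topic.
        words = rng.choices(vocab, weights=beta[topic], k=length)
        doc.append((topic, words))
    return doc


rng = random.Random(0)
theta = [0.7, 0.3]                        # toy document: two topics
beta = [[0.5, 0.5, 0.0, 0.0],             # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]             # topic 1 only emits words 2 and 3
doc = sample_document([3, 2, 4], theta, beta, epsilon=0.2, rng=rng)
for topic, words in doc:
    print(topic, words)
```

With a small `epsilon`, consecutive sentences usually share a topic, which is the dependency described above. With `epsilon = 1` every sentence draws its topic independently from `theta`.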

 

* HTMM: 100 Topics – one EM run result from the NIPS dataset

* LDA: 100 Topics – one sample result from the NIPS dataset

 

Code:

An open-source C++ implementation of EM inference in the Hidden Topic Markov Model. I wrote this code at Google while working with Ashok Popat.


Data:

NIPS HTMM data (check the README file for an explanation of the data format)

NIPS dataset - raw text (taken from Sam Roweis' web site)

Vocabulary of 12113 words (the vocabulary does not contain stop words).


References:
"Hidden Topic Markov Models",
Amit Gruber, Michal Rosen-Zvi and Yair Weiss,
In Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, March 2007.