Latent Topic Hypertext Model

The Latent Topic Hypertext Model (LTHM) is a probabilistic generative model for hypertext document collections that explicitly models the generation of links. Specifically, a link from a word w to a document d depends directly on how frequent the topic of w is in d, in addition to the in-degree of d. We show how to perform EM learning on this model efficiently. By not modeling links as analogous to words, we end up using far fewer free parameters and obtain better link prediction results.
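
To make the link dependence concrete, here is a minimal Python sketch of scoring candidate link targets. It assumes, purely for illustration, that the score of a link from a word to a document d is the product of an in-degree-like weight for d and the proportion of the word's topic in d. The function name `link_scores`, the variable names, and the toy numbers are hypothetical and are not taken from the LTHM source code.

```python
def link_scores(word_topic, topic_props, in_degree_weight):
    """Score each candidate target document for a link from a word.

    word_topic       -- index of the topic assigned to the anchor word
    topic_props      -- dict: document -> list of topic proportions in that document
    in_degree_weight -- dict: document -> weight reflecting how often it is linked to

    Assumption (illustrative only): score = in-degree weight * topic proportion.
    """
    scores = {}
    for doc, theta in topic_props.items():
        scores[doc] = in_degree_weight[doc] * theta[word_topic]
    return scores


# Toy example: two candidate documents, two topics, equal in-degree weights.
topic_props = {"d1": [0.8, 0.2], "d2": [0.1, 0.9]}
in_degree_weight = {"d1": 0.5, "d2": 0.5}

scores = link_scores(0, topic_props, in_degree_weight)
best = max(scores, key=scores.get)  # document ranked first for a topic-0 word
print(best, scores)
```

With equal in-degree weights, the document in which the word's topic is more frequent ranks first; raising a document's in-degree weight raises its score for every topic.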

Below you can find topics learned with this model, compared with topics learned with the LDA model, as well as the exact datasets we used.

 

*      LTHM: 20 Topics – result of one EM run on the Wikipedia dataset

*      LTHM: 20 Topics – Top predicted links 

 


Code:
LTHM source code
The current implementation is research code, built on top of the HTMM implementation with epsilon set to 1. It will soon be replaced with a cleaner and more efficient implementation.

Data:
WebKB original – 8282 HTML files from CMU

Wikipedia processed – 105 HTML files with 790 links among the files, vocabulary of 2247 terms