Schwartz, Sista and Leek limn their research in Topic Discovery in this white paper from 2001. They take what I believe is the standard statistical model of paper generation: a paper is written by selecting a set of topics and then using an HMM to select words from these topics (plus a General Language topic shared by all). The problem to solve is how to annotate a corpus with topics automatically -- this includes finding the topics and naming them. The authors' solution is to:
This leaves each topic with between 100 and 200 words. Given this set of discovered topics, they then go on to create names for them out of the "most interesting" words in the topic.
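To make the naming step a little more concrete, here is a minimal sketch in Python. I don't know how the authors actually measure "interesting"; as a stand-in, this assumes a word is interesting when it is much more likely under the topic than under the General Language topic. The function `name_topic`, the log-ratio score, and the toy probabilities are all my own illustration, not anything from the paper.

```python
import math


def name_topic(topic_probs, general_probs, n_words=3, floor=1e-6):
    """Pick the n_words words that stand out most against general language.

    Assumption (not from the paper): "interesting" is scored as
    log(P(word | topic) / P(word | general language)).
    """
    def score(word):
        # Words unseen in general language get a small floor probability.
        background = general_probs.get(word, floor)
        return math.log(topic_probs[word] / background)

    # Highest score first; break ties alphabetically so the result is stable.
    ranked = sorted(topic_probs, key=lambda w: (-score(w), w))
    return ranked[:n_words]


# Toy, made-up probabilities for a hypothetical "finance" topic.
finance = {"the": 0.05, "said": 0.03, "bank": 0.02, "interest": 0.015, "loan": 0.012}
general = {"the": 0.06, "said": 0.02, "bank": 0.001, "interest": 0.002, "loan": 0.0005}

print(name_topic(finance, general))  # → ['loan', 'bank', 'interest']
```

Common words like "the" and "said" are frequent in the topic but just as frequent everywhere else, so they fall to the bottom of the ranking, which is roughly what you would want from a topic name.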
The authors point out that their approach suffers from finding very similar topics, some unfocused topics, and some topics that are really combinations of two (or more) separate topics. On the other hand, with only statistics to go on, the algorithm can only do so well given its data.
The paper provides a good high-level summary of their work in three pages and is worth reading, especially for someone from outside the field (like me!).
Copyright -- Gary Warren King, 2004 - 2006