opening it up with Common Lisp


Unsupervised Topic Discovery
Richard Schwartz, Sreenivasa Sista and Timothy Leek, 2001
Sunday, May 9, 2004

Schwartz, Sista and Leek limn their research in Topic Discovery in this white paper from 2001. They take what I believe is the standard statistical model of document generation: a paper is written by selecting a set of topics and then using an HMM to select words from these topics (plus a General Language topic shared by all documents). The problem to solve is how to annotate a corpus with topics automatically -- this includes finding the topics and naming them. The authors' solution is to:

  • Treat each document as a query in order to find similar documents in the corpus (assuming that similar documents share at least one topic).
  • Intersect similar documents: if two documents share a topic, then they probably also share some words related to that topic. This gives us document intersections.
  • Cluster the document intersections (using k-means).
  • Purify the distributions using Expectation Maximization (EM).
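
To make the first three steps concrete, here is a minimal Python sketch. It is not the authors' code: the function names, the use of cosine similarity for the "document as query" step, the choice to intersect each document only with its single best match, and the naive k-means seeding are all assumptions made for illustration. The EM purification step is left out entirely.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    """Cosine similarity between two sparse word-count mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_intersections(docs, stopwords=frozenset()):
    """Steps 1 and 2: use each document as a query, then keep the words
    it shares with its most similar neighbour (assumed: top match only)."""
    bags = [Counter(w for w in tokenize(d) if w not in stopwords) for d in docs]
    if len(bags) < 2:
        return []
    intersections = []
    for i, query in enumerate(bags):
        others = (j for j in range(len(bags)) if j != i)
        best = max(others, key=lambda j: cosine(query, bags[j]))
        shared = query & bags[best]  # Counter intersection keeps minimum counts
        if shared:
            intersections.append(shared)
    return intersections


def kmeans(vectors, k, iterations=10):
    """Step 3: a bare-bones k-means over sparse vectors, using cosine
    similarity and seeding the centroids with the first k vectors
    (an assumption made to keep the sketch deterministic)."""
    k = min(k, len(vectors))
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {w: n / len(members) for w, n in total.items()}
    return assignment, centroids


docs = [
    "stocks fell as markets closed lower",
    "the team won the football match",
    "markets rallied and stocks rose",
    "football fans cheered the team",
]
shared = document_intersections(docs, stopwords=frozenset({"the", "as", "and"}))
assignment, centroids = kmeans(shared, k=2)
print(assignment)
```

Each centroid is a rough word distribution for one candidate topic. In the paper's pipeline, these would then be refined by EM rather than used directly.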

This leaves each topic with between 100 and 200 words. Given this set of discovered topics, they then go on to create names for them out of the "most interesting" words in the topic.
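
The review above does not say how "most interesting" is measured, so the following Python sketch is only one plausible reading: it ranks a topic's words by how much more likely they are in the topic than in general language, and joins the top few into a name. The scoring formula, the add-one smoothing, and the names `name_topic`, `topic_counts` and `background_counts` are all hypothetical.

```python
import math


def name_topic(topic_counts, background_counts, n_words=3):
    """Build a topic name from the words that stand out most against
    a background (general language) word distribution.

    Assumed score: p_topic(w) * log(p_topic(w) / p_background(w)),
    with add-one smoothing on the background counts.
    """
    topic_total = sum(topic_counts.values())
    background_total = sum(background_counts.values())
    vocabulary_size = len(set(topic_counts) | set(background_counts))

    def score(word):
        p_topic = topic_counts[word] / topic_total
        p_background = (background_counts.get(word, 0) + 1) / (
            background_total + vocabulary_size
        )
        return p_topic * math.log(p_topic / p_background)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(topic_counts, key=lambda w: (-score(w), w))
    return " ".join(ranked[:n_words])


topic = {"the": 10, "stocks": 6, "markets": 5, "said": 4}
background = {"the": 1000, "said": 300, "stocks": 5, "markets": 4, "football": 6}
print(name_topic(topic, background, n_words=2))
```

Frequent but uninformative words such as "the" score poorly here because they are just as common in the background, which is the effect a topic name needs.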

The authors point out that their approach has weaknesses: it finds some very similar topics, some unfocused topics and some topics that are really combinations of two (or more) separate topics. On the other hand, with only statistics to go on, the algorithm can only do so well given its data.

The paper provides a good high-level summary of their work in three pages and is worth reading, especially for someone from outside the field (like me!).



Copyright -- Gary Warren King, 2004 - 2006