opening it up with Common Lisp
Book review: Darwinia
Summer reading: Spin
the Omnivoire's Delimma
the Golem's Eye
Enhanced hypertext categorization using hyperlinks
Intuitively, knowing that this thing is related to that thing ought to help me understand both of them better. Ah, but how to put that intuition into practice? Chakrabarti et. al. present one of the early set of answers. The paper is a wealth of ideas that have been mined by many others in recent years.
Suppose I'm trying to classify a set of objects X into categories or classes. Can linking between objects help? Thinking about the automatic classification of papers into topics or patents into categories provides the seeds of hope and cause for doubt. Papers I link to (or that link to me) should contain information about my class. Ah, but they may also contain much that is irrelevant. Indeed, the authors found that trying to use all of the text in linked papers produced no improvement and often makes things worse. So wow can I use the right information and ignore the bad stuff? How do I fix the "noisy neighbor" problem? One answer is that instead of using the text in the links, we instead use the class of the links. Our classifier then has an input the local text plus class labels on some portion of the papers linked to it. Does this help? Yes. It helps a lot.
Now we can move onto a more realistic problem. Instead of classifying one paper, I got bunches of them plus their interlinks. Some have labels, some don't. How should I classify them? One a time via some sequential decision process (and what would the best order be to ask the questions?)? All at once? How? Do I use only adjacent links or should I travel out further to ask questions about my neighbor's neighbors? Chakrabarti shows that we can borrow techniques from image processing (relaxation labeled) to co-classify everything simultaneously and iteratively. I.e., first take the information you have and make your best guesses (in a maximum likelihood or Bayesian sense) to update all the class labels. Then, do it again. With a bit of math and a network that exhibits homophilly (the love of Philadelphia), this will converge to a stable and remarkably accurate answer. It will do much better than the best text alone ever could and it will do well even if most of the class labels are not initially known; even if all of the class labels are unknown! The relational structure of the network leads towards a co-consistent labeling.
In sum, this is an well written paper with a wealth of ideas drawn together from information retrieval, machine learning, data mining, computer vision and statistics. Very good stuff.
Copyright -- Gary Warren King, 2004 - 2006