STREAM: The Stanford Data Stream Management System
When I think of databases, I think of big datasets being altered by transactions and queried for reports. Traditional databases like this continue to grow, but several new members of the database family tweak properties of the standard relational model. One of these is the Data Stream Management System (DSMS), which attempts to corral the moving web of interactions in which we all participate. The goal is similar to that of moving from batch algorithms to incremental and online ones -- we want to compute answers in (near) real time without having to build up big tables that we'll throw away as soon as the query completes (come to think of it, this is analogous to the collect versus map question). We also want answers to be available all of the time.
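The batch-versus-incremental distinction above can be sketched in a few lines of Python. This is my own toy illustration, not anything from the paper: the batch version re-scans all the stored records on every query, while the incremental version keeps a small running summary and has the answer ready at all times.

```python
def batch_mean(records):
    # Batch style: re-read the whole table each time.
    # Fine for stored data, wasteful when records stream past continuously.
    return sum(records) / len(records)

class IncrementalMean:
    # Incremental style: update a constant-size summary per arrival,
    # so the answer is available at every moment with no big table.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count
```

The incremental version does the same arithmetic, but spread across arrivals -- which is exactly the shape of computation a DSMS wants.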
The DSMS discussed in this paper uses a SQL-like language called the Continuous Query Language (CQL). One of its primitives sets the size of the window through which a stream of data is viewed. The window can be determined either by a record count (show me the last 300 records) or by a time period (show me all the records from the last 2 minutes). The fun begins when working on a system with many simultaneous queries running against multiple, often bursty, streams. How can the queries be optimized for time and space? Can approximate answers be provided when the load gets too high? Can the queries be distributed across machines? How do crash protection and recovery change when running in a streaming environment?
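To make the two window types concrete, here's a minimal sketch (my own, not the STREAM implementation) of a sliding window that can be bounded either by row count or by record age, with eviction on each insert:

```python
from collections import deque

class SlidingWindow:
    # Sketch of the two CQL window flavors: a row-based window keeps
    # the last max_rows records; a time-based window keeps records
    # whose timestamps fall within the last max_age seconds.
    def __init__(self, max_rows=None, max_age=None):
        self.max_rows = max_rows
        self.max_age = max_age
        self.items = deque()  # (timestamp, record) pairs, oldest first

    def insert(self, record, now):
        self.items.append((now, record))
        self._evict(now)

    def _evict(self, now):
        if self.max_rows is not None:
            while len(self.items) > self.max_rows:
                self.items.popleft()
        if self.max_age is not None:
            while self.items and now - self.items[0][0] > self.max_age:
                self.items.popleft()

    def contents(self):
        return [record for _, record in self.items]
```

A query then runs over `contents()` rather than over the full history of the stream -- the window is what turns an unbounded stream back into a finite relation that relational operators know how to handle.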
This paper takes on portions of the first two questions and leaves the last two for future work. Stanford has found a number of nice bits and pieces to let queries share the load and to rationally drop records as streams become clogged. They've also built a well-engineered system (at least on the surface) that allows for monitoring and introspection of the system and its queries. The final two questions are interesting because the streaming model is different enough from standard RDBMSs that it's not clear whether all of the usual suspects are invited. For example, it's not clear that ACID transactions are the right model when you're already assuming that the data is streaming by constantly.
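One common way to "rationally drop records" is sampling-based load shedding; the sketch below is a generic illustration of the idea rather than STREAM's actual mechanism. Each record survives with some probability, and aggregate answers can then be scaled up to compensate, trading exactness for bounded resource use:

```python
import random

def shed_load(stream, keep_probability):
    # When arrival rate exceeds capacity, keep each record with the
    # given probability and let downstream aggregates correct for the
    # sampling (e.g., divide counts by keep_probability).
    for record in stream:
        if random.random() < keep_probability:
            yield record
```

The interesting engineering questions -- which STREAM digs into -- are where to place these drop operators in a shared query plan and how to pick the probabilities so that the error across many simultaneous queries stays small.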
This is interesting work and will only become more so over time. As we grow more digitized, there will be a need for systems that monitor multiple data sources in real time. For example, critical medical care could improve by tracking connections between instruments rather than looking at each instrument in isolation. Of course, this technology is a double-edged sword -- some of the streams out there are better left unmonitored. How we can keep them that way is not a technical problem, but it is a problem that needs solving.
Copyright -- Gary Warren King, 2004 - 2006