The journeylism of @yreynhout
On CQRS, DDD(D), ES, …
Your EventStream is a linked list
December 7, 2011
A week ago I had an interesting twonversation with Jérémie Chassaing and Rinat Abdullin about event streams. I mentioned how, in the past, I had been toying with event streams as linked lists of changesets (to the aficionados of Jonathan Oliver’s EventStore, this is very much akin to what he calls Commits). As a way of documenting some of my thoughts on the subject I’m putting up some schematics here.
From the above picture you can deduce that an event stream is really a simple thing: a collection of changesets. A changeset itself is a collection of events that occurred (how you got there is not important at this point). Event streams and changesets both have a form of unique identity. Head marks the identity of the latest known changeset. A changeset is immutable. For the sake of simplicity I’ve omitted any form of headers/properties you can attach to an event stream, a changeset and/or an event.
Each changeset knows its “parent”, i.e. the changeset that it should be appended to. Except for the very first changeset, which obviously does not have a “parent” (strictly speaking you could have an explicit terminator “parent”). Chronologically, changeset 1 came before changeset 2, changeset 2 came before changeset 3, and so on.
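If you were to jot that model down in code, a minimal sketch might look like this (the names Changeset, parent_id and so on are mine, nothing prescriptive; the frozen dataclass reflects the immutability, and a parent of None plays the role of that explicit terminator):

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from uuid import UUID


@dataclass(frozen=True)          # immutable, as a changeset should be
class Changeset:
    changeset_id: UUID           # unique identity of this changeset
    parent_id: Optional[UUID]    # None marks the very first changeset
    events: Tuple                # the events that occurred, in order


@dataclass
class EventStream:
    stream_id: str
    head: Optional[UUID] = None  # identity of the latest known changeset
```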
Looking at the write side of a system that uses event streams for storage, there are two main scenarios:
- Writing a changeset of a given event stream: concerns here are duplicate changeset elimination and conflict detection, besides the act of actually writing.
- Reading the entire event stream: the main concern here is reading all changesets of the event stream as fast as we can, in order.
I’m well aware I’ve omitted other concerns such as automatic event upgrading, event dispatching, snapshotting which, frankly, are distractions at this point.
Reading
Supposing that each changeset is, say, a file on disk, how would I know where to start reading? Various options, really. The picture above illustrates one option where – by using “<streamid>.txt” as a convention – the Stream Head File is loaded to bootstrap “walking the chain”, by virtue of having it point to the latest changeset document (represented as LatestChangesetRef) that makes up that stream. As each Changeset File is read, it provides a reference/pointer to the next Changeset File to read (represented by ParentRef). That reference is really the identity of the next changeset.
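Here’s a minimal sketch of that walk, assuming each changeset file is named after its identity and both kinds of files hold small JSON documents (the JSON layout is my assumption; only LatestChangesetRef and ParentRef come from the picture):

```python
import json
from pathlib import Path


def read_stream(store: Path, stream_id: str) -> list:
    """Walk the chain from the Stream Head File back to the first changeset."""
    head_file = store / f"{stream_id}.txt"    # the "<streamid>.txt" convention
    ref = json.loads(head_file.read_text())["LatestChangesetRef"]

    changesets = []
    while ref is not None:                    # the first changeset has no parent
        doc = json.loads((store / f"{ref}.txt").read_text())
        changesets.append(doc)
        ref = doc["ParentRef"]                # logical identity of the next file

    changesets.reverse()                      # hand them back oldest-first
    return changesets
```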
I hope I don’t need to explain why you need to keep those identifiers logical. Don’t make it a “physical” thing, like the path to the next changeset file. That would be really painful if you were ever to restructure the changeset files on disk. Instead you should delegate the responsibility of translating/resolving a logical identity into its physical location.
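In code that indirection is a single seam (the ChangesetResolver name and interface are my invention):

```python
from pathlib import Path


class ChangesetResolver:
    """Owns the translation from a logical changeset identity to its physical home."""

    def __init__(self, store: Path):
        self._store = store

    def resolve(self, changeset_id: str) -> Path:
        # A flat directory today; sharded folders or a packed file tomorrow.
        # Callers never learn the path, they only hand over the logical identity.
        return self._store / f"{changeset_id}.txt"
```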
Other options for “where to start reading” could be:
- keeping the head of each stream in memory (causing “sticky” streams and dealing with recovery mechanisms).
- storing the head as a record in a database or blob storage with concurrency control.
- …
Now, reading each changeset file could become a bit costly if they’re scattered all over the disk. There’s nothing stopping you from throwing all those changeset documents into one big file, even asynchronously. This is where immutability and resolving identities can really help you out. It’s important to distinguish between what happens at the logical level and what happens at the physical level.
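To picture that packing step, still under the file-per-changeset assumption (the pack format here is entirely made up): copy each document into one file and remember where it landed, so only the resolver’s answer changes, never the identities its callers hold.

```python
from pathlib import Path


def pack_changesets(store: Path, changeset_ids: list, pack_file: Path) -> dict:
    """Concatenate scattered changeset files into one pack file.

    Returns a map of logical identity -> (byte offset, length) so a resolver
    can keep answering lookups after the physical layout changes.
    """
    locations = {}
    with pack_file.open("wb") as pack:
        for cid in changeset_ids:
            data = (store / f"{cid}.txt").read_bytes()
            locations[cid] = (pack.tell(), len(data))  # physical location moved,
            pack.write(data)                           # logical identity did not
    return locations
```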
Yet another approach might be to keep an index file of all changesets (above represented by the Stream Index File) that make up the event stream (in an append-only fashion), thus relieving the changeset documents of having to know their parents.
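Assuming one changeset identity per line in that Stream Index File (the .index.txt naming is made up), reading turns into a straight scan instead of a pointer chase:

```python
import json
from pathlib import Path


def read_stream_via_index(store: Path, stream_id: str) -> list:
    """Read a stream's changesets, oldest first, from its append-only index."""
    index_file = store / f"{stream_id}.index.txt"         # hypothetical naming
    changeset_ids = index_file.read_text().splitlines()   # appended in order
    return [json.loads((store / f"{cid}.txt").read_text())
            for cid in changeset_ids]
```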
Writing
Basically, this operation can be split up into writing the changeset document and updating the head (or index) of the event stream. The advantage here is that storing the changeset document does not require any form of transaction. This allows you to choose from a broader range of data stores, as there really isn’t a requirement beyond the guarantee that they will remember and serve what you asked them to store. Updating the head of the event stream does require you to at least be able to detect that concurrent writes are happening or have happened, depending on how you want to resolve conflicts. As such, there’s no need to store both of them in the same data store. Also notice that the duration of the transaction is reduced by taking the changeset itself out of the equation.
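A sketch of that two-step write, with the same file layout as before; note that the check-then-write on the head file merely stands in for whatever real compare-and-swap or conditional update your head store offers, since plain files give no such guarantee on their own:

```python
import json
import uuid
from pathlib import Path
from typing import Optional


class ConcurrencyConflict(Exception):
    pass


def append_changeset(store: Path, stream_id: str,
                     expected_head: Optional[str], events: list) -> str:
    """Step 1: store the changeset (no transaction needed).
    Step 2: swing the head, detecting concurrent writers."""
    changeset_id = str(uuid.uuid4())    # a fresh identity; see the next paragraph
    doc = {"ParentRef": expected_head, "Events": events}
    (store / f"{changeset_id}.txt").write_text(json.dumps(doc))

    head_file = store / f"{stream_id}.txt"
    current = (json.loads(head_file.read_text())["LatestChangesetRef"]
               if head_file.exists() else None)
    if current != expected_head:              # someone else moved the head first;
        raise ConcurrencyConflict(stream_id)  # our changeset is left dangling
    head_file.write_text(json.dumps({"LatestChangesetRef": changeset_id}))
    return changeset_id
```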
When picking a changeset identity, you might be tempted to reuse the identifier of the message that caused the changes to happen (usually the Command’s message identifier). Don’t. Remember, retrying that same command might produce a different set of changes. How are you going to differentiate between rectifying a previous failure with a retry and some other thread trying to process the same message? It’s best to use identities/identifiers for just one purpose.
What happens when you can’t update the head to your newly inserted changeset identity? You’ll be left with a dangling changeset that didn’t make it into “the circle of trust”. No harm done, except for wasting some storage. If the changesets were blobs in the cloud it might be useful to have a special purpose daemon to hunt these down and remove them (depends on how much storage costs vs the cost of building the daemon). In general you should optimize for having as few “concurrency conflicts” as possible (it’s bad for business).
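Such a daemon boils down to a small reachability sweep over the same file layout (again a sketch, assuming a directory dedicated to a single stream): anything the head can’t reach is garbage.

```python
import json
from pathlib import Path


def sweep_dangling(store: Path, stream_id: str) -> int:
    """Delete changeset files that never made it onto the stream's chain.

    Assumes the directory holds only this stream's head file and changesets.
    """
    head_file = store / f"{stream_id}.txt"
    live = set()
    ref = json.loads(head_file.read_text())["LatestChangesetRef"]
    while ref is not None:                   # collect every reachable identity
        live.add(ref)
        ref = json.loads((store / f"{ref}.txt").read_text())["ParentRef"]

    removed = 0
    for candidate in store.glob("*.txt"):
        if candidate != head_file and candidate.stem not in live:
            candidate.unlink()               # dangling changeset; reclaim storage
            removed += 1
    return removed
```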
Conclusion
I know there are a lot of holes in this story. That’s intentional, I have a lot of unanswered questions myself. I’m especially interested in any comments that can point me to prior art outside the realm of source control systems. I have no intention of using this in production. It’s just a mental exercise.
Acknowledgements
Most of what I mention here exists in one shape or another. By now some of you will have noticed the resemblance to certain aspects of Git’s internal object model, although I’m inclined to say it’s closer to what Mercurial does. The convention-based file storage can be found in the NoDB implementation by Chris Nicola. Concepts such as commits and event streams (and much more) can be found in the EventStore project.
Nice post! I really like the parallels to DVCS and conscious separation between physical and logical levels.
Quick note, a lot of the concurrency-related challenges that you mention could be avoided if you allow streams to be written to from one thread only. But I’m just being too stupid and lazy here to deal with them myself 🙂