How do you handle concurrency with eventual consistency? Or, to put it another way, how do you ensure data integrity with eventual consistency?
With CQRS and event sourcing, eventual consistency means that you put your domain events into a queue and register event handlers, which are projections. Those projections update the read cache asynchronously. Now, if you validate against that read cache, you cannot be sure that the information you base your validation on is still valid: there can be unprocessed (or unprojected) domain events in the queue when you send your command, and those can change the outcome of the validation. So this is just another type of concurrency. What do you think, how should these rare concurrency issues be handled? The domain events are already saved in the storage, so you cannot do anything about them; you cannot just remove them from the event storage (because it is supposed to be written only once) and tell the user in an email that, sorry, we changed our mind and cancelled your request. Or can you?
update:
A possible solution to handle concurrency with an event store:

by write model
    if last-known-aggregate-version < stored-aggregate-version then
        throw error
    else
        execute command on aggregate
        raise domain-event
        store domain-event
        ++stored-aggregate-version (by aggregate-id)

by read model
    process query
    if result contains aggregate-id then
        attach read-cached-aggregate-version

by projection
    process domain-event
    read-cached-aggregate-version = domain-event-related-aggregate-version (by aggregate-id)
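To make the pseudocode above concrete, here is a minimal TypeScript sketch of the same flow using in-memory maps; the names (eventLog, storedVersions, readCachedVersions, and so on) are illustrative and not tied to any particular framework.

type DomainEvent = { aggregateId: string; version: number; payload: unknown };

const eventLog: DomainEvent[] = [];                    // append-only event storage
const storedVersions = new Map<string, number>();      // stored-aggregate-version, by aggregate-id
const readCachedVersions = new Map<string, number>();  // read-cached-aggregate-version, by aggregate-id

// write model: reject the command if the client's last known version is stale
function handleCommand(aggregateId: string, lastKnownVersion: number, payload: unknown): void {
  const storedVersion = storedVersions.get(aggregateId) ?? 0;
  if (lastKnownVersion < storedVersion) {
    throw new Error('concurrency conflict: aggregate has newer events');
  }
  // execute command on aggregate, raise domain-event, store domain-event
  const event: DomainEvent = { aggregateId, version: storedVersion + 1, payload };
  eventLog.push(event);
  storedVersions.set(aggregateId, event.version);      // ++stored-aggregate-version
}

// read model: attach the read-cached version to every result that contains an aggregate-id
function handleQuery(aggregateId: string) {
  return { aggregateId, aggregateVersion: readCachedVersions.get(aggregateId) ?? 0 };
}

// projection: processing a domain event advances the read-cached version
function project(event: DomainEvent): void {
  readCachedVersions.set(event.aggregateId, event.version);
}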
As long as state changes, you cannot assume anything will ever be 100% consistent. Technically, you can ensure that various bits are 100% consistent with what you know.
Your queued domain event scenario is no different from a queue of work on a user's desk that still has to be input into the system.
Any other user performing an action dependent on the system state has no way to know that another user still needs to perform some action that may interfere with their operation.
I guess a lot is based on assuming the data is consistent and developing alternate flows and processes that can deal with these scenarios as they arise.
Related
I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station (whether it is network-connected, whether a car is charging using one of the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error (e.g. under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connectors: whether each one is charging or not, whether it has a problem in the power system, ...
And the Aggregate, whose in-memory representation resides on the server that I control, not in the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station (pseudocode):
class StationAggregate {
  stationClient: StationClient
  URL: string
  connectors: Connector[]

  unlock(connectorId) {
    // guard: the connector must be in a state that allows unlocking
    if (!this.connectors.find(connectorId).isAvailableToBeUnlocked()) {
      return ErrorConnectorNotAvailable
    }
    // side effect: ask the physical station to unlock
    error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
    if (error) {
      return ErrorStationRejectedUnlock
    }
    this.applyEvents([
      StationUnlockedEvent(connectorId, now())
    ])
    return Ok
  }

  receiveHeartbeat(timestamp) {
    this.applyEvents([
      StationSentHeartbeat(timestamp)
    ])
    return Ok
  }
}
I am using optimistic concurrency, which means that I load the Aggregate from a list of events and store the current version of the Aggregate in its in-memory representation: the StationAggregate is at version #2032 and, when a command is successfully processed and event(s) applied, it would then be at version #2033, for example. That way, I can put a unique constraint on the (StationID, Version) tuple in my persistence layer and guarantee that only one of two conflicting events is persisted.
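For illustration only, the uniqueness guarantee could be sketched like this in TypeScript; the in-memory set merely stands in for the database's unique index on (StationID, Version), and every name is made up:

const persistedKeys = new Set<string>();   // stands in for the (StationID, Version) unique index

function appendEvent(stationId: string, expectedVersion: number, event: object): void {
  const key = `${stationId}#${expectedVersion + 1}`;
  if (persistedKeys.has(key)) {
    // a concurrent writer already persisted this version for this station
    throw new Error('version conflict: event rejected by the persistence layer');
  }
  persistedKeys.add(key);
  // ... write the event row itself here
}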
If, by any chance, a Heartbeat message and an Unlock command are received at the same time, both threads would load the StationAggregate at version X. In the case of the Heartbeat there are no side-effects, but in the case of the Unlock command there is a side-effect that tells the physical Charging Station to unlock. However, as I'm using optimistic concurrency, that StationUnlocked event could be rejected by the persistence layer. I don't know how I could handle that, as I can't retry the command, because it is inherently not idempotent (the physical Station would reject the second request).
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.
I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation has the version that is +1 from the version you've got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
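A minimal sketch of that retry loop in TypeScript, assuming a hypothetical loadAggregate/append API and a WrongVersionError thrown by the event store (none of these names come from a real library):

class WrongVersionError extends Error {}

// assumed event-store API, for illustration only
declare function loadAggregate(stationId: string): Promise<{
  aggregate: { decide(cmd: { type: string; connectorId: string }): object[] };
  version: number;
}>;
declare function append(stationId: string, expectedVersion: number, events: object[]): Promise<void>;

async function handleUnlock(stationId: string, connectorId: string, maxRetries = 3): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const { aggregate, version } = await loadAggregate(stationId);    // replay the stream, get latest state
    const events = aggregate.decide({ type: 'Unlock', connectorId }); // may be empty if there is nothing to do
    try {
      await append(stationId, version, events);                       // expected version = what we just read
      return;
    } catch (e) {
      if (e instanceof WrongVersionError) continue;                   // lost the race: re-read and retry
      throw e;
    }
  }
  throw new Error('gave up after repeated version conflicts');
}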
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits since you know what happened. Imagine that by accident you got the Unlock command twice. When you got the "wrong version" back from the store, you can read the last event and decide if the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked, by the same customer), technically (put the command id to the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence in the command handling: ignore the duplicate and return OK instead of failing in the user's face.
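One possible way to get that idempotence, sketched in TypeScript with made-up names: put the command id into the event metadata and, when the store reports a version conflict, check whether the last event already carries that command id.

// assumed shapes; metadata.commandId is the deduplication key
type StoredEvent = { type: string; metadata: { commandId: string } };

declare function readLastEvent(stationId: string): Promise<StoredEvent | undefined>;

async function alreadyHandled(stationId: string, commandId: string): Promise<boolean> {
  const last = await readLastEvent(stationId);
  return last?.metadata.commandId === commandId;
}

// on a version conflict: if the duplicate was already processed, return OK instead of failing
// if (await alreadyHandled(stationId, cmd.commandId)) return 'Ok';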
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got with sending the command to the station at the same time as producing the event, is the variation of two-phase commit. Since it's not executed in a transaction, any of the two operations could fail and lead the system to an inconsistent state. You either don't know if the station got the command to unlock itself if the command failed to send, or you don't know that it's unlocked if the event persistence failed. You only got as far as the second operation, but the first case could happen too.
There are quite a few ways to solve it.
First, solving it entirely technically. With MassTransit, it's quite easy to fix using the Outbox: it will not send any outgoing messages until the consumer of the original message has fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then the retry filter would engage and the whole operation would be executed again; by then you are already out of the race condition, so the operation would complete properly.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved and here Event Sourcing is helpful. You'd need to convert sending the command to the station from the original (user-driven) command consumer to the subscriber. You subscribe to the event stream of StationUnlocked event and let the subscriber send commands to the station. With that, you would only send commands to the station if the event was persisted and you can retry sending the command as many times as you'd need.
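A minimal TypeScript sketch of that subscriber; the subscribe/lookup functions are invented for illustration and are not a MassTransit API. The point is only that the command to the station is sent after the event is durably persisted, so it can be retried independently.

// hypothetical subscription to the persisted event stream
declare function subscribe(
  eventType: string,
  handler: (e: { stationId: string; connectorId: string }) => Promise<void>
): void;
declare function lookupStationUrl(stationId: string): Promise<string>;
declare function sendRemoteStartTransaction(stationUrl: string, connectorId: string): Promise<void>;

subscribe('StationUnlocked', async (event) => {
  const url = await lookupStationUrl(event.stationId);
  // the event is already persisted, so this call can be retried as many times as needed
  await sendRemoteStartTransaction(url, event.connectorId);
});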
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain you can engage a domain service to consult the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)) - you return OK and safely ignore the command. If it's locked - you proceed. If it's unlocked and the session id doesn't match - it's obviously an error and you need to do something else.
In this last option, you would clearly separate telemetry processing from the business so you won't have heartbeats impact your domain model, so you really won't have the versioning issue. You also would always have a place to look at to understand what is the current state of the physical station.
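For illustration only, the decision against such a read-model mirror might look roughly like this in TypeScript; all names (StationMirror, getStationMirror, the session id field) are placeholders I made up for the sketch.

type StationMirror = { locked: boolean; sessionId?: string };

// the mirror is built from telemetry, outside the aggregate
declare function getStationMirror(stationId: string): Promise<StationMirror>;

async function decideUnlock(stationId: string, requestedSessionId: string): Promise<'Ok' | 'Proceed' | 'Error'> {
  const mirror = await getStationMirror(stationId);
  if (!mirror.locked && mirror.sessionId === requestedSessionId) return 'Ok';  // already unlocked by us: ignore
  if (mirror.locked) return 'Proceed';                                         // safe to unlock
  return 'Error';                                 // unlocked, but by someone else: needs investigation
}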
Using akka-typed I'm trying to create an event-sourced application in which a command on an actor can cause effect on another actor. Specifically I have the following situation:
RootActor
BranchActor (it's the representation of a child of the Root)
When RootActor is issued a CreateBranch command, validation happens, and if everything is o.k. the results must be:
RootActor will update the list of children of that actor
BranchActor will be initialized with some contents (previously given in the command)
RootActor replies to the issuer of the command with OperationDone
Right now the only thing I could come up with is: the RootActor processes the Event and, as a side effect, issues a command to the BranchActor, which in turn saves an initialization event and replies to the RootActor, which finally replies to the original issuer.
This looks way too complicated, though, because:
I need to use a pipe to self mechanism, which implies that
I need to manage internal commands as well that allow me to reply to the original issuer
I need to manage the case where that operation might fail, and if this fails, it means that the creation of a branch is not atomic, whereas saving two events is atomic, in the sense that either both are saved or neither is.
I need to issue another command to another actor, but I shouldn't need to do that, because the primary command should take care of everything
The new command should be validated, though it is not necessary because it comes from the system and not an "external" user in this case.
My question then is: can't I just save from the RootActor two events, one for self, and one for a target BranchActor?
Also, as a bonus question: is this even a good practice for event-sourcing?
My question then is: can't I just save from the RootActor two events, one for self, and one for a target BranchActor?
No. Not to sound trite, but the only thing you can do to an actor is to send a message to it. If you must do what you are doing, you are on the right path. (e.g. pipeTo etc.)
is this even a good practice for event-sourcing?
It's not a good practice. Whether it's suboptimal or a flat-out anti-pattern is still debatable. (I feel like I can say this confidently because of this Lightbend Discussion thread where it was debated, with one side arguing "tricky but I have no regrets" and the other side arguing "explicit anti-pattern".)
To quote someone from an internal Slack (I don't want to attribute it to him without his permission, but I saved it because it seemed to so elegantly sum up this kind of scenario):
If an event sourced actor needs to contact another actor to make the decision if it can persist an event, then we are not modeling a consistency boundary anymore. It should only rely on the state that [it has] in scope (own state and incoming command). … all the gymnastics (persist the fact that its awaiting confirmation, stash, pipe to self) to make it work properly is an indication that we are not respecting the consistency boundary.
If you can't fix your aggregates such that one actor is responsible for the entire consistency boundary the better practice is to enrich the command beforehand: essentially building a Saga pattern.
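For illustration, here is a framework-agnostic TypeScript sketch of "enrich the command beforehand": gather whatever the root entity would otherwise have to ask another actor for, then send it a single command it can decide on using only its own state. The branchDirectory and rootEntity names are invented placeholders, not Akka APIs.

type CreateBranch = { rootId: string; branchId: string; contents: string };
type EnrichedCreateBranch = CreateBranch & { branchAlreadyExists: boolean };

// hypothetical collaborators used for the enrichment step
declare const branchDirectory: { exists(branchId: string): Promise<boolean> };
declare const rootEntity: { send(cmd: EnrichedCreateBranch): Promise<void> };

async function enrichAndSend(cmd: CreateBranch): Promise<void> {
  // do the cross-entity lookup *before* the command reaches the event-sourced entity,
  // so the entity's decision stays inside its own consistency boundary
  const branchAlreadyExists = await branchDirectory.exists(cmd.branchId);
  await rootEntity.send({ ...cmd, branchAlreadyExists });
}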
My Dataflow pipeline collates event data into typed per session and per user PCollections output. I have employed GroupByKey for events keyed by session id. Sessions are grouped into parent types keyed by user id and device id using the same pattern at this next level of hierarchy. So a single session might generate many events, but in turn a single user might generate many sessions.
I would now like to summarize this data across each level of the hierarchy. I have used a StateSpec declaration to persist state at the event level. So for example, an event count property can be incremented in my event processing ParDo. (Use Case : generating an error event per session across all users for example.)
But as each ParDo is static, I cannot access the ValueState outside of the ParDo context, even though my understanding is that this state is maintained at the Window scope. (Maybe this is by design.) Is there a way to access this Window-level state using the Beam State persistence lib in another ParDo than where it was originally declared? As if I could declare it at the pipeline level?
I understand that this may introduce some performance overhead as the framework must manage concurrency, but the actual processing seems negligible. (Just incrementing values.) So I would prefer to write this to a window level state field rather than percolate values up via my hierarchy.
State sharing across ParDos is not supported, and it shouldn't even be encouraged, as it introduces dependencies among ParDos that break the simple contract: a ParDo can work on a PCollection independently, which unblocks massive parallelism.
Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes processing ArticleEditCommand messages concurrently, even though the commands will be taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a SQL transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex and can't live solely in SQL. I would like the entire command-processing logic to be protected against concurrent ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to let the user know that there is a pending operation on that article that must be completed before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.
First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking, which can be trivially handled by SQL, contradicts complicated business logic/updates; maybe you could elaborate more?
It's actually quite simple and I did that. Basically, it looks like this (pseudocode):
//message handler
ModelTools.TryUpdateEntity(
    () => {
        var entity = _repo.Get(myId);
        entity.Do(whateverCommand);
        _repo.Save(entity);
    },
    10); //retry 10 times until giving up

//repository
long? _version;

public MyObject Get(Guid id)
{
    //query data and version
    _version = data.version;
    return data.ToMyObject();
}

public void Save(MyObject data)
{
    //update row in db where version = _version.Value
    if (rowsUpdated == 0)
    {
        //things have changed since we've retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools general-purpose library (available on NuGet).
The idea is to try doing things normally; then, if the object version (rowversion/timestamp in SQL) has changed, we retry the whole operation again after waiting a couple of milliseconds. That's exactly what the TryUpdateEntity() method does, and you can tweak how long to wait between tries or how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.
Partition based solution
Achieve node stickiness by routing the incoming commands based on the object's ID (e.g. articleId modulo your number of nodes) to make sure the commands from User1 and User2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one or, if you want to parallelize the execution, partition the commands on something like the ID, odd/even, by country, or similar.
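For illustration, a tiny TypeScript sketch of that routing rule (names and the modulo choice are purely illustrative):

// route every command for the same article to the same node, so commands for
// one article are processed consecutively while different articles stay parallel
function targetNode(articleId: number, numberOfNodes: number): number {
  return articleId % numberOfNodes;
}

// e.g. with 4 nodes, article 42 always goes to node 2:
// targetNode(42, 4) === 2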
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.
The communication is socket-based over a keep-alive connection. Users log in with an account name, and I need to implement a feature where, when two users log in with the same account, the former one is kicked off.
Code that needs to be updated:
void session::login(accountname) // callback when server recv login request
{
boost::shared_ptr<UserData> d = database.get_user_data(accountname);
this->data = d;
this->send(login success);
}
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
// read from db and return the data
}
The simplest way is to improve Database::get_user_data(accountname):
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
    // acquire a boost::unique_lock<> here
    // find a session with the same accountname, or user data with the same accountname in the cache;
    // if found, kick that session offline first, then execute the code below
    // read from db and return the data
}
This modification has 2 problems:
1. The concurrency becomes needlessly bad, because the scenario rarely happens. However, if I need to check whether an account is online or not, I must cache it somewhere (in the user data or the session), which means I need to write to a container that must hold an exclusive lock whether the account is the same or not. So the concurrency can hardly be improved.
2. Kicking the other one off by calling "other_session->offline()" in "this thread" might be concurrent with other operations executing in another thread at the same time.
If I add a lock in offline(), then every other function belonging to session also needs to take that lock, which is obviously not good. Alternatively, I can push an event to other_session and let other_session handle the event, which makes sure "offline" executes in its own thread. But the problem is that this makes "offline" asynchronous, and the code after "kick the other one offline" must only execute after "offline" has run to completion.
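For illustration, here is a framework-agnostic TypeScript sketch of that second option (in boost::asio terms, the per-session queue would be the session's strand): every operation on a session is serialized through its own queue, and the kicker awaits the completion of the queued offline() before continuing. All names are made up.

class SessionQueue {
  private tail: Promise<void> = Promise.resolve();

  // run fn after everything previously queued for this session has finished
  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.then(() => undefined, () => undefined); // keep the chain alive on errors
    return result;
  }
}

// hypothetical lookup of the already-logged-in session for this account
declare function findSessionByAccount(
  accountname: string
): { queue: SessionQueue; offline(): Promise<void> } | undefined;

async function kickThenLogin(accountname: string, login: () => Promise<void>): Promise<void> {
  const other = findSessionByAccount(accountname);
  if (other) {
    // "offline" runs on the other session's own queue; we only continue once it has completed
    await other.queue.run(() => other.offline());
  }
  await login();
}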
I use boost::asio, but I've tried to describe this problem in general terms because I think it is a common problem in server writing. Is there a pattern to solve this? Notice that this problem gets more complex when N sessions log in with the same account at the same time.
If this scenario rarely happens, I wouldn't worry about it. Locking and releasing a mutex are not long actions the user would notice (if you had to do it thousands of times a second, it could be a problem).
In general trying to fix performance issues that are not there is a bad idea.