DDD - Concurrency and Command retrying with side-effects - concurrency

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station(whether it is network-connected, if a car is charging using the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error(under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connector, whether it's charging or not, if it has a problem in the power system, ...
And the Aggregate, which its memory representation resides in the server that I control, not in the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station(pseudocode):
class StationAggregate {
stationClient: StationClient
URL: string
connector: Connector[]
unlock(connectorId) {
if this.connectors.find(connectorId).isAvailableToBeUnlocked() {
return ErrorConnectorNotAvailable
}
error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
if error {
return ErrorStationRejectedUnlock
}
this.applyEvents([
StationUnlockedEvent(connectorId, now())
])
return Ok
}
receiveHeartbeat(timestamp) {
this.applyEvents([
StationSentHeartbeat(timestamp)
])
return Ok
}
}
I am using a optimistic concurrency, which means that, I load the Aggregate from a list of events, and I store the current version of the Aggregate in its memory representation: StationAggregate in version #2032, when a command is successfully processed and event(s) applied, it would the in version #2033, for example. In that way, I can put a unique constraint on the (StationID, Version) tuple on my persistence layer, and guarantee that only one event is persisted.
If by any chance, occurs a receival of a Heartbeat message, and the receival of a Unlock command. In both threads, they would load the StationAggregate and would be both in version X, in the case of the Heartbeat receival, there would be no side-effects, but in the case of the Unlock command, there would be a side-effect that tells the physical Charging Station to be unlocked. However as I'm using optimistic concurrency, that StationUnlocked event could be rejected from the persistence layer. I don't know how I could handle that, as I can't retry the command, because it its inherently not idempotent(as the physical Station would reject the second request)
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.

I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation has the version that is +1 from the version you've got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits since you know what happened. Imagine that by accident you got the Unlock command twice. When you got the "wrong version" back from the store, you can read the last event and decide if the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked, by the same customer), technically (put the command id to the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence of the command handling, ignore the duplicate and return OK instead of failing to the user's face.
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got with sending the command to the station at the same time as producing the event, is the variation of two-phase commit. Since it's not executed in a transaction, any of the two operations could fail and lead the system to an inconsistent state. You either don't know if the station got the command to unlock itself if the command failed to send, or you don't know that it's unlocked if the event persistence failed. You only got as far as the second operation, but the first case could happen too.
There are quite a few ways to solve it.
First, solving it entirely technical. With MassTransit, it's quite easy to fix using the Outbox. It will not send any outgoing messages until the consumer of the original message is fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then, the retry filter would engage and the whole operation would be executed again and you already get out of the race condition, so the operation would be properly completed.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved and here Event Sourcing is helpful. You'd need to convert sending the command to the station from the original (user-driven) command consumer to the subscriber. You subscribe to the event stream of StationUnlocked event and let the subscriber send commands to the station. With that, you would only send commands to the station if the event was persisted and you can retry sending the command as many times as you'd need.
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain, you can engage a domain server to consult with the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)) - you return OK and safely ignore the command. If it's locked - you proceed. If it's unlocked and the session id doesn't match - it's obviously an error and you need to do something else.
In this last option, you would clearly separate telemetry processing from the business so you won't have heartbeats impact your domain model, so you really won't have the versioning issue. You also would always have a place to look at to understand what is the current state of the physical station.

Related

Notifying a task from multiple other tasks without extra work

My application is futures-based with async/await, and has the following structure within one of its components:
a "manager", which is responsible for starting/stopping/restarting "workers", based both on external input and on the current state of "workers";
a dynamic set of "workers", which perform some continuous work, but may fail or be stopped externally.
A worker is just a spawned task which does some I/O work. Internally it is a loop which is intended to be infinite, but it may exit early due to errors or other reasons, and in this case the worker must be restarted from scratch by the manager.
The manager is implemented as a loop which awaits on several channels, including one returned by async_std::stream::interval, which essentially makes the manager into a poller - and indeed, I need this because I do need to poll some Mutex-protected external state. Based on this state, the manager, among everything else, creates or destroys its workers.
Additionally, the manager stores a set of async_std::task::JoinHandles representing live workers, and it uses these handles to check whether any workers has exited, restarting them if so. (BTW, I do this currently using select(handle, future::ready()), which is totally suboptimal because it relies on the select implementation detail, specifically that it polls the left future first. I couldn't find a better way of doing it; something like race() would make more sense, but race() consumes both futures, which won't work for me because I don't want to lose the JoinHandle if it is not ready. This is a matter for another question, though.)
You can see that in this design workers can only be restarted when the next poll "tick" in the manager occurs. However, I don't want to use a too small interval for polling, because in most cases polling just wastes CPU cycles. Large intervals, however, can delay restarting a failed/canceled worker by too much, leading to undesired latencies. Therefore, I though I'd set up another channel of ()s back from each worker to the manager, which I'd add to the main manager loop, so when a worker stops due to an error or otherwise, it will first send a message to its channel, resulting in the manager being woken up earlier than the next poll in order to restart the worker right away.
Unfortunately, with any kinds of channels this might result in more polls than needed, in case two or more workers stop at approximately the same time (which due to the nature of my application, is somewhat likely to happen). In such case it would make sense to only run the manager loop once, handling all of the stopped workers, but with channels it will necessarily result in the number of polls equal to the number of stopped workers, even if additional polls don't do anything.
Therefore, my question is: how do I notify the manager from its workers that they are finished, without resulting in extra polls in the manager? I've tried the following things:
As explained above, regular unbounded channels just won't work.
I thought that maybe bounded channels could work - if I used a channel with capacity 0, and there was a way to try and send a message into it but just drop the message if the channel is full (like the offer() method on Java's BlockingQueue), this seemingly would solve the problem. Unfortunately, the channels API, while providing such a method (try_send() seems to be like it), also has this property of having capacity larger than or equal to the number of senders, which means it can't really be used for such notifications.
Some kind of atomic or a mutex-protected boolean flag also look as if it could work, but there is no atomic or mutex API which would provide a future to wait on, and would also require polling.
Restructure the manager implementation to include JoinHandles into the main select somehow. It might do the trick, but it would result in large refactoring which I'm unwilling to make at this point. If there is a way to do what I want without this refactoring, I'd like to use that first.
I guess some kind of combination of atomics and channels might work, something like setting an atomic flag and sending a message, and then skipping any extra notifications in the manager based on the flag (which is flipped back to off after processing one notification), but this also seems like a complex approach, and I wonder if anything simpler is possible.
I recommend using the FuturesUnordered type from the futures crate. This collection allows you to push many futures of the same type into a collection and wait for any one of them to complete at once.
It implements Stream, so if you import StreamExt, you can use unordered.next() to obtain a future that completes once any future in the collection completes.
If you also need to wait for a timeout or mutex etc., you can use select to create a future that completes once either the timeout or one of the join handles completes. The future returned by next() implements Unpin, so it is usable with select without problems.

CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same
article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes process ArticleEditCommand messages concurrently, even though the commands will be taken of the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a sql transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex, and can't live solely in SQL. I would like my entire logic of the command processing guaranteed to be concurrent between ArticleEditCommand messages that alter that same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With the said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be throw to the user letting them know that there is a current pending operation on that article that must be completed before then can queue the ArticleEditCommand. That way, there is never two ArticleEditCommand messages in the queue that effect the same article.
First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers, but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking that can be trivially handled by SQL contradicts complicated business logic/updates, maybe you could elaborate more?
It's actually quite simple and I did that. Basically, it looks like this ( pseudocode)
//message handler
ModelTools.TryUpdateEntity(
()=>{
var entity= _repo.Get(myId);
entity.Do(whateverCommand);
_repo.Save(entity);
}
10); //retry 10 times until giving up
//repository
long? _version;
public MyObject Get(Guid id)
{
//query data and version
_version=data.version;
return data.ToMyObject();
}
public void Save(MyObject data)
{
//update row in db where version=_version.Value
if (rowsUpdated==0)
{
//things have changed since we've retrieved the object
throw new NewerVersionExistsException();
}
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools generic purpose library (available on Nuget).
The idea is to try doing things normally, then if the object version (rowversion/timestamp in sql) has changed we'll retry the whole operation again after waiting a couple of miliseconds. And that's exactly what the TryUpdateEntity() method does. And you can tweak how much to wait between tries or how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.
Partition based solution
Achieve node stickiness by routing the incoming command based on the object's ID (eg. articleId modulo your-number-of-nodes) to make sure the commands of User1 and User2 ends up on the same node, then process the commands consecutively. You can choose to process all commands one by one or if you want to parallelize the execution, partition the commands on something like ID, odd/even, by country or similar.
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.

How to handle concurrency by eventual consistency?

How to handle concurrency by eventual consistency? Or I could ask how to ensure data integrity by eventual consistency?
By CQRS and event sourcing, eventual consistency means, that you put your domain events into a queue, and you set event handlers which are projections. Those projections update the read cache in an async way. Now if you validate using that read cache, you cannot be sure that the information you base your validation on, is still valid. There can be unprocessed (or unprojected?) domain events in the queue when you send your command, which can change the outcome of the validation. So this is just another type of concurrency... What do you think, how to handle these rare concurrency issues? Domain events are already saved in the storage, so you cannot do anything about them, you cannot just remove them from the event storage (because it supposed to be write only once), and tell the user in an email, that sorry, we made up our mind and cancelled your request. Or can you?
update:
A possible solution to handle concurrency by an event storage:
by write model
if
last-known-aggregate-version < stored-aggregate-version
then
throw error
else
execute command on aggregate
raise domain-event
store domain-event
++stored-aggregate-version (by aggregate-id)
by read model
process query
if
result contains aggregate-id
then
attach read-cached-aggregate-version
by projection
process domain-event
read-cached-aggregate-version = domain-event-related-aggregate-version (by aggregate-id)
As long as state changes you cannot assume anything will ever be 100% consistent. Technically you can ensure that various bits are 100% consistent with what you know.
Your queued domain event scenario is no different from a queue of work on a user's desk that still has to be input into the system.
Any other user performing an action dependent on the system state has no way to know that another user still needs to perform some action that may interfere with their operation.
I guess a lot is based on assuming the data is consistent and developing alternate flows and processes that can deal with these scenarios as they arise.

duplicated account login checking in server

The communication is based on socket and the it is keep-alive connection. User use account name to log in, I need to implement a feature when two user use same account to log in, the former one need to be kicked off.
Codes need to updated:
void session::login(accountname) // callback when server recv login request
{
boost::shared_ptr<UserData> d = database.get_user_data(accountname);
this->data = d;
this->send(login success);
}
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
// read from db and return the data
}
The most simple way is improve Database::get_user_data(accountname)
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
add a boost::unqiue_lock<> here
find session has same accountname or find user data has same accountname in cache,
if found, then kick this session offline first, then execute below codes
// read from db and return the data
}
This modification has 2 problems:
1, too bad concurrency because the scenario happen rarely. However, if I need to check account online or not, I must cache it somewhere(user data or session), that means I need to write to a container which must has exclusive lock whatever the account same or not. So the concurrency can hardly improved.
2, kick other one off by calling "other_session->offline()" in "this thread" that might concurrent with other operations executing in other thread at same time.
If I add lock in offline(), that will result in all others function belong to session also need to add that lock, obviously, not good. Or, I can push a event to other_session, and let other_session handle the event, that will make sure "offline" executing in its own thread. But the problem is that will make offline executing async, codes below "other one offline" must executed after "offline" runs complete.
I use boost::asio, but I try to describe this problem in common because I think this is a common problem in server writing. Is there a pattern to solve this? Notice that this problem gets complex when there are N same account log in at same time
If this scenario rarely happens, I wouldn't worry about it. lock and release of mutex are not long actions the user would notice (if you had to do it thousands of times a second it could be a problem).
In general trying to fix performance issues that are not there is a bad idea.

How to design a state machine in face of non-blocking I/O?

I'm using Qt framework which has by default non-blocking I/O to develop an application navigating through several web pages (online stores) and carrying out different actions on these pages. I'm "mapping" specific web page to a state machine which I use to navigate through this page.
This state machine has these transitions;
Connect, LogIn, Query, LogOut, Disconnect
and these states;
Start, Connecting, Connected, LoggingIn, LoggedIn, Querying, QueryDone, LoggingOut, LoggedOut, Disconnecting, Disconnected
Transitions from *ing to *ed states (Connecting->Connected), are due to LoadFinished asynchronous network events received from network object when currently requested url is loaded. Transitions from *ed to *ing states (Connected->LoggingIn) are due to events send by me.
I want to be able to send several events (commands) to this machine (like Connect, LogIn, Query("productA"), Query("productB"), LogOut, LogIn, Query("productC"), LogOut, Disconnect) at once and have it process them. I don't want to block waiting for the machine to finish processing all events I sent to it. The problem is they have to be interleaved with the above mentioned network events informing machine about the url being downloaded. Without interleaving machine can't advance its state (and process my events) because advancing from *ing to *ed occurs only after receiving network type of event.
How can I achieve my design goal?
EDIT
The state machine I'm using has its own event loop and events are not queued in it so could be missed by machine if they come when the machine is busy.
Network I/O events are not posted directly to neither the state machine nor the event queue I'm using. They are posted to my code (handler) and I have to handle them. I can forward them as I wish but please have in mind remark no. 1.
Take a look at my answer to this question where I described my current design in details. The question is if and how can I improve this design by making it
More robust
Simpler
Sounds like you want the state machine to have an event queue. Queue up the events, start processing the first one, and when that completes pull the next event off the queue and start on that. So instead of the state machine being driven by the client code directly, it's driven by the queue.
This means that any logic which involves using the result of one transition in the next one has to be in the machine. For example, if the "login complete" page tells you where to go next. If that's not possible, then the event could perhaps include a callback which the machine can call, to return whatever it needs to know.
Asking this question I already had a working design which I didn't want to write about not to skew answers in any direction :) I'm going to describe in this pseudo answer what the design I have is.
In addition to the state machine I have a queue of events. Instead of posting events directly to the machine I'm placing them in the queue. There is however problem with network events which are asynchronous and come in any moment. If the queue is not empty and a network event comes I can't place it in the queue because the machine will be stuck waiting for it before processing events already in the queue. And the machine will wait forever because this network event is waiting behind all events placed in the queue earlier.
To overcome this problem I have two types of messages; normal and priority ones. Normal ones are those send by me and priority ones are all network ones. When I get network event I don't place it in the queue but instead I send it directly to the machine. This way it can finish its current task and progress to the next state before pulling the next event from the queue of events.
It works designed this way only because there is exactly 1:1 interleave of my events and network events. Because of this when the machine is waiting for a network event it's not busy doing anything (so it's ready to accept it and does not miss it) and vice versa - when the machine waits for my task it's only waiting for my task and not another network one.
I asked this question in hope for some more simple design than what I have now.
Strictly speaking, you can't. Because you only have state "Connecting", you don't know whether you need top login afterwards. You'd have to introduce a state "ConnectingWithIntentToLogin" to represent the result of a "Connect, then Login" event from the Start state.
Naturally there will be a lot of overlap between the "Connecting" and the "ConnectingWithIntentToLogin" states. This is most easily achieved by a state machine architecture that supports state hierarchies.
--- edit ---
Reading your later reactions, it's now clear what your actual problem is.
You do need extra state, obviously, whether that's ingrained in the FSM or outside it in a separate queue. Let's follow the model you prefer, with extra events in a queue. The rick here is that you're wondering how to "interleave" those queued events vis-a-vis the realtime events. You don't - events from the queue are actively extracted when entering specific states. In your case, those would be the "*ed" states like "Connected". Only when the queue is empty would you stay in the "Connected" state.
If you don't want to block, that means you don't care about the network replies. If on the other hand the replies interest you, you have to block waiting for them. Trying to design your FSM otherwise will quickly lead to your automaton's size reaching infinity.
How about moving the state machine to a different thread, i. e. QThread. I would implent a input queue in the state machine so I could send queries non blocking and a output queue to read the results of the queries. You could even call back a slotted function in your main thread via connect(...) if a result of a query arrives, Qt is thread safe in this regard.
This way your state machine could block as long as it needs without blocking your main program.
Sounds like you just want to do a list of blocking I/O in the background.
So have a thread execute:
while( !commands.empty() )
{
command = command.pop_back();
switch( command )
{
Connect:
DoBlockingConnect();
break;
...
}
}
NotifySenderDone();