CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency - concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same
article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes process ArticleEditCommand messages concurrently, even though the commands will be taken of the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a sql transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex, and can't live solely in SQL. I would like my entire logic of the command processing guaranteed to be concurrent between ArticleEditCommand messages that alter that same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With the said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be throw to the user letting them know that there is a current pending operation on that article that must be completed before then can queue the ArticleEditCommand. That way, there is never two ArticleEditCommand messages in the queue that effect the same article.

First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers, but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking that can be trivially handled by SQL contradicts complicated business logic/updates, maybe you could elaborate more?

It's actually quite simple and I did that. Basically, it looks like this ( pseudocode)
//message handler
ModelTools.TryUpdateEntity(
()=>{
var entity= _repo.Get(myId);
entity.Do(whateverCommand);
_repo.Save(entity);
}
10); //retry 10 times until giving up
//repository
long? _version;
public MyObject Get(Guid id)
{
//query data and version
_version=data.version;
return data.ToMyObject();
}
public void Save(MyObject data)
{
//update row in db where version=_version.Value
if (rowsUpdated==0)
{
//things have changed since we've retrieved the object
throw new NewerVersionExistsException();
}
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools generic purpose library (available on Nuget).
The idea is to try doing things normally, then if the object version (rowversion/timestamp in sql) has changed we'll retry the whole operation again after waiting a couple of miliseconds. And that's exactly what the TryUpdateEntity() method does. And you can tweak how much to wait between tries or how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.

Partition based solution
Achieve node stickiness by routing the incoming command based on the object's ID (eg. articleId modulo your-number-of-nodes) to make sure the commands of User1 and User2 ends up on the same node, then process the commands consecutively. You can choose to process all commands one by one or if you want to parallelize the execution, partition the commands on something like ID, odd/even, by country or similar.
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.

Related

Launching all threads at exactly the same time in C++

I have Rosbag file which contains messages on various topics, each topic has its own frequency. This data has been captured from a hardware device streaming data, and data from all topics would "reach" at the same time to be used for different algorithms.
I wish to simulate this using the rosbag file(think of it as every topic has associated an array of data) and it is imperative that this data streaming process start at the same time so that the data can be in sync.
I do this via launching different publishers on different threads (I am open to other approaches as well, this was the only one I could think of.), but the threads do not start at the same time, by the time thread 3 starts, thread 1 would be considerably ahead.
How may I achieve this?
Edit - I understand that launching at the exact same time is not possible, but maybe I can get away with a launch extremely close to each other as well. Is there any way to ensure this?
Edit2 - Since the main aim is to get the data stream in Sync, I was wondering about the warmup effect of the thread(suppose a thread1 starts from 3.3GHz and reaches to 4.2GHz by the time thread2 starts at 3.2). Would this have a significant effect (I can always warm them up before starting the publishing process, but I am curious whether it would have a pronounced effect)
TIA
As others have stated in the comments you cannot guarantee threads launch at exactly the same time. To address your overall goal: you're going about solving this problem the wrong way, from a ROS perspective. Instead of manually publishing data and trying to get it in sync, you should be using the rosbag api. This way you can actually guarantee messages have the same timestamp. Note that this doesn't guarantee they will be sent out at the exact same time, because they won't. You can put a message into a bag file directly like this
import rosbag
from std_msgs.msg import Int32, String
bag = rosbag.Bag('test.bag', 'w')
try:
s = String()
s.data = 'foo'
i = Int32()
i.data = 42
bag.write('chatter', s)
bag.write('numbers', i)
finally:
bag.close()
For more complex types that include a Header field simply edit the header.stamp portion to keep timestamps consistent

DDD - Concurrency and Command retrying with side-effects

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station(whether it is network-connected, if a car is charging using the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error(under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connector, whether it's charging or not, if it has a problem in the power system, ...
And the Aggregate, which its memory representation resides in the server that I control, not in the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station(pseudocode):
class StationAggregate {
stationClient: StationClient
URL: string
connector: Connector[]
unlock(connectorId) {
if this.connectors.find(connectorId).isAvailableToBeUnlocked() {
return ErrorConnectorNotAvailable
}
error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
if error {
return ErrorStationRejectedUnlock
}
this.applyEvents([
StationUnlockedEvent(connectorId, now())
])
return Ok
}
receiveHeartbeat(timestamp) {
this.applyEvents([
StationSentHeartbeat(timestamp)
])
return Ok
}
}
I am using a optimistic concurrency, which means that, I load the Aggregate from a list of events, and I store the current version of the Aggregate in its memory representation: StationAggregate in version #2032, when a command is successfully processed and event(s) applied, it would the in version #2033, for example. In that way, I can put a unique constraint on the (StationID, Version) tuple on my persistence layer, and guarantee that only one event is persisted.
If by any chance, occurs a receival of a Heartbeat message, and the receival of a Unlock command. In both threads, they would load the StationAggregate and would be both in version X, in the case of the Heartbeat receival, there would be no side-effects, but in the case of the Unlock command, there would be a side-effect that tells the physical Charging Station to be unlocked. However as I'm using optimistic concurrency, that StationUnlocked event could be rejected from the persistence layer. I don't know how I could handle that, as I can't retry the command, because it its inherently not idempotent(as the physical Station would reject the second request)
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.
I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation has the version that is +1 from the version you've got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits since you know what happened. Imagine that by accident you got the Unlock command twice. When you got the "wrong version" back from the store, you can read the last event and decide if the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked, by the same customer), technically (put the command id to the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence of the command handling, ignore the duplicate and return OK instead of failing to the user's face.
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got with sending the command to the station at the same time as producing the event, is the variation of two-phase commit. Since it's not executed in a transaction, any of the two operations could fail and lead the system to an inconsistent state. You either don't know if the station got the command to unlock itself if the command failed to send, or you don't know that it's unlocked if the event persistence failed. You only got as far as the second operation, but the first case could happen too.
There are quite a few ways to solve it.
First, solving it entirely technical. With MassTransit, it's quite easy to fix using the Outbox. It will not send any outgoing messages until the consumer of the original message is fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then, the retry filter would engage and the whole operation would be executed again and you already get out of the race condition, so the operation would be properly completed.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved and here Event Sourcing is helpful. You'd need to convert sending the command to the station from the original (user-driven) command consumer to the subscriber. You subscribe to the event stream of StationUnlocked event and let the subscriber send commands to the station. With that, you would only send commands to the station if the event was persisted and you can retry sending the command as many times as you'd need.
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain, you can engage a domain server to consult with the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)) - you return OK and safely ignore the command. If it's locked - you proceed. If it's unlocked and the session id doesn't match - it's obviously an error and you need to do something else.
In this last option, you would clearly separate telemetry processing from the business so you won't have heartbeats impact your domain model, so you really won't have the versioning issue. You also would always have a place to look at to understand what is the current state of the physical station.

Doubts regarding to table, producer/consumer locking solutions

Context: Let's say we have a service with 500 clients connected that are on constant activity, and I want to log most part of it by inserting them on a MySQL InnoDB based table. The server is working on a simple thread.
From an external program (website), I proceed selecting data from that table, will it cause it to be locked?
In case it does, I assume the server won't be able to proceed on inserting or updating data until the selecting task is finished from the external program.
The first thing that came to my mind was to implement a producer-consumer concurrency, where one will push data into a queue, and another would insert that data into the database; so, in case of selecting data from an external program, the consumer won't proceed and not the whole server.
I've seen some consumer/producer examples where the producer is not able to push data while it is being processed. In this case, is it ok to make two containers, and simply push it on the one that is not being used? Since, if the consumer is procesing the data, the producer won't proceed pushing it into the queue, making me doubt about its efficiency.
Also, I have been looking into this example:
http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular
In case it works as it describes, is there anything I should be worried about? Is there something I'm missing?
In case the select query takes too much and the inserting query returns a time out, would it be recomendable to increase the timeout value or to retry the query in case of failure?
Thanks

How to handle concurrency by eventual consistency?

How to handle concurrency by eventual consistency? Or I could ask how to ensure data integrity by eventual consistency?
By CQRS and event sourcing, eventual consistency means, that you put your domain events into a queue, and you set event handlers which are projections. Those projections update the read cache in an async way. Now if you validate using that read cache, you cannot be sure that the information you base your validation on, is still valid. There can be unprocessed (or unprojected?) domain events in the queue when you send your command, which can change the outcome of the validation. So this is just another type of concurrency... What do you think, how to handle these rare concurrency issues? Domain events are already saved in the storage, so you cannot do anything about them, you cannot just remove them from the event storage (because it supposed to be write only once), and tell the user in an email, that sorry, we made up our mind and cancelled your request. Or can you?
update:
A possible solution to handle concurrency by an event storage:
by write model
if
last-known-aggregate-version < stored-aggregate-version
then
throw error
else
execute command on aggregate
raise domain-event
store domain-event
++stored-aggregate-version (by aggregate-id)
by read model
process query
if
result contains aggregate-id
then
attach read-cached-aggregate-version
by projection
process domain-event
read-cached-aggregate-version = domain-event-related-aggregate-version (by aggregate-id)
As long as state changes you cannot assume anything will ever be 100% consistent. Technically you can ensure that various bits are 100% consistent with what you know.
Your queued domain event scenario is no different from a queue of work on a user's desk that still has to be input into the system.
Any other user performing an action dependent on the system state has no way to know that another user still needs to perform some action that may interfere with their operation.
I guess a lot is based on assuming the data is consistent and developing alternate flows and processes that can deal with these scenarios as they arise.

duplicated account login checking in server

The communication is based on socket and the it is keep-alive connection. User use account name to log in, I need to implement a feature when two user use same account to log in, the former one need to be kicked off.
Codes need to updated:
void session::login(accountname) // callback when server recv login request
{
boost::shared_ptr<UserData> d = database.get_user_data(accountname);
this->data = d;
this->send(login success);
}
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
// read from db and return the data
}
The most simple way is improve Database::get_user_data(accountname)
boost::shared_ptr<UserData> Database::get_user_data(accountname)
{
add a boost::unqiue_lock<> here
find session has same accountname or find user data has same accountname in cache,
if found, then kick this session offline first, then execute below codes
// read from db and return the data
}
This modification has 2 problems:
1, too bad concurrency because the scenario happen rarely. However, if I need to check account online or not, I must cache it somewhere(user data or session), that means I need to write to a container which must has exclusive lock whatever the account same or not. So the concurrency can hardly improved.
2, kick other one off by calling "other_session->offline()" in "this thread" that might concurrent with other operations executing in other thread at same time.
If I add lock in offline(), that will result in all others function belong to session also need to add that lock, obviously, not good. Or, I can push a event to other_session, and let other_session handle the event, that will make sure "offline" executing in its own thread. But the problem is that will make offline executing async, codes below "other one offline" must executed after "offline" runs complete.
I use boost::asio, but I try to describe this problem in common because I think this is a common problem in server writing. Is there a pattern to solve this? Notice that this problem gets complex when there are N same account log in at same time
If this scenario rarely happens, I wouldn't worry about it. lock and release of mutex are not long actions the user would notice (if you had to do it thousands of times a second it could be a problem).
In general trying to fix performance issues that are not there is a bad idea.