Apache Kafka consume events from different topics in specific order

Let's say I have topicA, topicB and topicC, each segregated by event type, based on domain entities: topicA carries only eventA, topicB only eventB, and topicC only eventC. All events relate to each other by business domain, but they are produced by separate microservices and should be processed in a specific order.
The question is how, using Apache Kafka, to consume events in a specific order: receive eventA, then wait for eventB, and then, once eventC is received, consume all of them.
I'd appreciate any feedback; questions are welcome.
Some notes:
Kafka Streams is a good approach, but restricted by company policies.
Also, I've looked through Join Pattern but haven't found any reliable approaches for implementation.

There are probably many ways to solve this problem. Here are a couple that I can suggest:
First approach: introduce a correlation ID that links events from topics A, B and C. Then use the correlation ID in the following manner:
Services A, B and C produce events to their corresponding topics, but related events carry the same correlation ID.
Service D consumes events from all three topics. Each time it receives an event from any topic, service D either inserts the event data into a database keyed by correlation ID, or performs some action if all the data has been received.
For example, when service D receives event C, it first queries the database for a record with the correlation ID from event C:
if there is no record, the incoming event C is stored;
if a record already exists, service D checks whether event C is the last one needed to complete the set, and either performs the final action or inserts event C into the database.
And so on for each consumed event.
Second approach: chain the services that produce events (A, B and C). For example, the chain can be formed in the following manner:
Service A produces an event to topic A.
Service B consumes the event from topic A and produces an event to topic B (possibly aggregating events A and B).
Service C consumes the event from topic B and produces an event to topic C (possibly aggregating events A, B and C).
Finally, service D consumes the event from topic C (possibly aggregated with A and B) and executes the required action.
A variation of this approach (without aggregating events at each intermediate stage) would be to chain the services and listen only for the last event in the chain. When the last event is consumed, poll the corresponding Kafka topics to fetch the events produced by the other services.
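To make the first approach (correlation ID) more concrete, here is a minimal sketch in Python. It is not Kafka code: the `CorrelationStore` class, the `on_event` method and the in-memory dict are hypothetical stand-ins for service D's consumer callback and its database. In a real service the check-and-insert would have to be done atomically (for example, in a database transaction).

```python
from dataclasses import dataclass, field

REQUIRED_TYPES = ("A", "B", "C")  # order in which the final action sees the events


@dataclass
class CorrelationStore:
    """In-memory stand-in for service D's database, keyed by correlation ID."""
    pending: dict = field(default_factory=dict)

    def on_event(self, correlation_id, event_type, payload):
        """Store the event; return the full ordered set once every type has arrived."""
        record = self.pending.setdefault(correlation_id, {})
        record[event_type] = payload
        if all(t in record for t in REQUIRED_TYPES):
            del self.pending[correlation_id]
            return [(t, record[t]) for t in REQUIRED_TYPES]
        return None


store = CorrelationStore()
# Events may arrive in any order across the three topics.
first = store.on_event("order-1", "C", {"c": 3})
second = store.on_event("order-1", "A", {"a": 1})
third = store.on_event("order-1", "B", {"b": 2})
```

Only the third call returns a value: the three events in A, B, C order, regardless of the order in which they arrived.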

If the events are related to each other, they should go to one topic. So microservice-1 should push eventA as a (key, value) pair with a label (eventA). In the same way, microservice-2 and microservice-3 should push their data to the common topic.
This would help you on the consumer side.

Since you're asking about ordering the consumption of messages between different topics, the first option would be to have one consumer produce a message that feeds the next consumer (these consumers may or may not be part of the same process):
consumerA processes message -> consumerA puts new message on a different topic -> consumerB picks up that new message and processes it -> consumerB puts new message on a second topic -> etc.
I would not be surprised if Kafka Streams is essentially doing this or a similar process under the hood. Any other kind of interface for inter-process communication could be used instead: RDP, memory-mapped files, mutex, pipes; take your pick.
Except as a last resort, I would avoid putting different events on the same topic. When you put multiple events on a single queue/topic, you constrain your consumers in a couple of ways:
Your contracts are now tightly coupled for both events. To change the shape of just one of the events on that single topic, your consumers have to dynamically deserialize those events based on metadata (a magic number, key-value, etc.).
Your consumption patterns may be less efficient. What if I'm interested in just one of those events? I have to read each event and then throw it out if it's not the one I'm looking for.
A real-life example of this is in amusement parks. Let's say you have two types of amusement park visitors: Fast-Pass and Standard customers. Your business rules state that Fast-Pass customers get to skip the line ahead of standard customers.
If you merge them into a single queue/topic, how do you do that? The answer is priority queueing: you ask everyone who joins the line whether they have a Fast-Pass, which is error-prone and inefficient (it can work, but it may not be the best solution). Most amusement parks solve this by setting up two separate queues, one for each type of customer (event/message). Now they can feed customers to two separate attendants (one Fast-Pass, one Standard), or they might have one attendant serve both queues, emptying the Fast-Pass queue first.
At the end of the day, it depends on your constraints: is it 10 messages a day or 1 billion? Do you require immediate consistency or eventual consistency? Is it on an IoT device?
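The "one attendant, two queues" variant from the amusement park example can be sketched in a few lines of Python. The `serve` function and the queue contents are made up for illustration; with real topics the two deques would be two separate consumers or subscriptions.

```python
from collections import deque


def serve(fast_pass, standard):
    """Yield visitors, always draining the Fast-Pass queue before the Standard one."""
    while fast_pass or standard:
        if fast_pass:
            yield fast_pass.popleft()
        else:
            yield standard.popleft()


fast_pass = deque(["F1", "F2"])
standard = deque(["S1", "S2"])
order = list(serve(fast_pass, standard))
```

Nobody has to be asked what kind of ticket they hold: the type is implied by the queue they are in.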

Related

Is it possible to define several stop messages in Akka Clustering

I am trying to configure an Akka actor for Cluster Sharding. One thing I am not quite sure about: is it possible to configure several stop messages for an Entity for graceful shutdown?
For example, will an Entity configuration like the following trigger graceful shutdown for both 'onDelete' and 'onExit', or only for 'onExit'?
sharding.init(
  Entity(Actor1Key) { context =>
    ....
  }
    // withStopMessage is a method on Entity, so it is chained inside init(...)
    .withStopMessage(Actor1.onDelete)
    .withStopMessage(Actor1.onExit)
)
If not, do you have any idea how I can achieve this behaviour?
Thanks for any answers.

I think there may be some confusion about the purpose of the stopMessage. There should be no need for multiple stop messages.
The stopMessage is sent by sharding after passivation has been requested by the actor, which is done by sending Passivate from the sharded actor itself.
You can let any of the messages that the actor accepts trigger passivation; the shard will send back the stopMessage when it is safe for the actor to actually stop.
The reason you should passivate rather than just return Behaviors.stopped is that there may be messages already en route to the actor (in its mailbox and, I think, possibly in a buffer in the shard in some circumstances) that were sent before the message that made it decide to stop, and you want to process those first. Passivation allows for that by including a round trip to the shard actor, which is in charge of routing messages to the sharded actor.
A bit more details in the docs here: https://doc.akka.io/docs/akka/current/typed/cluster-sharding.html#passivation

What you have specified would only trigger the stop message for Actor1.onExit. The reason is how a stop message is defined for an Entity:
val stopMessage: Optional[M],
So you see that this is a plain Optional, so it cannot hold multiple elements. You can also check how withStopMessage is implemented:
def withStopMessage(newStopMessage: M): Entity[M, E] =
  copy(stopMessage = Optional.ofNullable(newStopMessage))
So you basically overwrite the message every time you call withStopMessage. Unfortunately, I am not aware of any other way of specifying multiple stop messages (besides combining multiple messages under a common trait, but I think this is not what you are looking for).

How to do queries on a collection of actors

I have an actor system that at the moment accepts commands/messages. The state of these actors is persisted with Akka.Persistence. We now want to build a query system for this actor system. Basically, our problem is that we want a way to get an aggregate/list of the states of all these particular actors. While I'm not strictly subscribing to the CQRS pattern, I think it might be a neat way to go about it.
My initial thought was to have a query actor that holds, as part of its state, an aggregation of the states of the other actors that are doing the "data writes". To do this, this actor would subscribe to the actors it's interested in, and these actors would send the query actor their state whenever they undergo some sort of state change. Is this the way to go about it? Is there a better way to do this?

My recommendation for implementing this type of pattern is to use a combination of pub-sub and push-and-pull messaging for your actors here.
For each "aggregate," this actor should be able to subscribe to events from the individual child actors you want to query. Whenever a child's state changes, a message is pushed into all subscribed aggregates and each aggregate's state is updated automatically.
When a new aggregate comes online and needs to retrieve state it missed (from before it existed), it should be able to pull the current state from each child and use that to build its own current state, then use incremental updates from the children going forward to keep its aggregated view of their state consistent.
This is the pattern I use for this sort of work, and it works well locally out of the box. Over the network, you may have to ensure delivery guarantees, and that's generally easy to do. You can read a bit more on how to do that here: https://petabridge.com/blog/akkadotnet-at-least-once-message-delivery/
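As a rough illustration of the push-and-pull idea, here is a sketch in plain Python. These are ordinary synchronous objects, not Akka.NET actors; `Child`, `Aggregate`, `attach` and `on_state_changed` are hypothetical names, and with real actors both the pull and the push would be messages.

```python
class Child:
    """Stand-in for a persistent actor that owns some state."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.subscribers = []

    def subscribe(self, aggregate):
        self.subscribers.append(aggregate)

    def update(self, new_state):
        self.state = new_state
        for aggregate in self.subscribers:  # push on every state change
            aggregate.on_state_changed(self.name, new_state)


class Aggregate:
    """Read-side view that keeps the latest known state of every child."""

    def __init__(self):
        self.view = {}

    def attach(self, child):
        child.subscribe(self)                # subscribe first so no update is missed
        self.view[child.name] = child.state  # pull the state from before we existed

    def on_state_changed(self, name, state):
        self.view[name] = state


a = Child("a", 1)
b = Child("b", 10)
a.update(2)        # happens before the aggregate exists
agg = Aggregate()
agg.attach(a)      # pull: picks up a's current state
agg.attach(b)
b.update(11)       # push: incremental update
```

After this runs, `agg.view` holds the latest state of both children, including the change to `a` that happened before the aggregate was created.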

Some Akka.Persistence backends (e.g. those working with SQL) also implement something known as Akka.Persistence.Query. It allows you to subscribe to a stream of events as they are produced, and to use it as a source with Akka.Streams semantics.
If you're using SQL journals you'll need the Akka.Persistence.Query.Sql and Akka.Streams packages. From there you can create a live (that is, continuously updated) source of events for a particular actor and use it for any operation you like, e.g. printing them:
using (var system = ActorSystem.Create("system"))
using (var materializer = system.Materializer())
{
    // obtain the read journal through the PersistenceQuery extension
    var queries = PersistenceQuery.Get(system).ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);
    queries.EventsByPersistenceId("<persistence-id>", 0, long.MaxValue)
        .Select(envelope => envelope.Event)
        .RunForeach(e => Console.WriteLine(e), materializer);
}

c++ observer pattern: adding another dimension

I'm trying to implement this pattern in a "smart building" system design (using the STL). Various "sensors" placed in rooms, floors, etc. dispatch signals that are handled by "controllers" (also placed in different rooms, floors, etc.). The problem I'm facing is that a controller's subscription to an event isn't just event based, it is also location based.
For example, controller A can subscribe to a fire signal from room #1 on floor #4 and to a motion signal on floor #5. A floor-based subscription means that controller A will get a motion event for every room on the floor it is subscribed to (assuming the appropriate sensor is placed there). There's also a building-wide subscription for that matter.
The topology of the system is read from a configuration file at start up, so I don't want to map the whole building, just the relevant places that contain sensors and controllers.
What I've managed to think of:
Option 1: A MonitoredArea class that contains the name of the area (Building1, Floor 2, Room 3) and a vector indexed by an enumerated event type, where each element of the vector contains a list of the controllers subscribed to that event. The class will also contain a pointer to a parent MonitoredArea, in case it is a room on a floor or a floor in a building.
A Sensor class will dispatch an Event to a central hub along with the sensor's name. The hub will run it through its sensor-name-to-location map, acquire the matching MonitoredArea and alert all the controllers in the vector.
Cons:
Coupling of the location to the controller
Events are enumerated and are hard coded in the MonitoredArea class, adding future events is difficult.
Option 2:
Keeping all the subscriptions in the Controller class.
Cons:
Very inefficient. Every event will make the control center iterate through all the controllers to find out which of them are subscribed to this particular event.
Option 3:
Event based functionality. Event class (ie. FireEvent) will contain all the locations it can happen in (according to the sensor's setup) and for every location, a list of the controllers that are subscribed to it.
Cons:
A map of maps
Strong data duplication
No way to alert floor-based subscriptions about events in the various rooms.
As you can see, I'm not happy with any of the mentioned solutions. I'm sure I've reached the over-thinking stage and would be happy for a feedback or alternative suggestions as to how I approach this. Thanks.

There is a design pattern (so to speak) used a lot in game development called "Message Bus". It is sometimes used to replace event-based operations.
"A message bus is a connection between one or more senders and/or receivers. Think of it like a connection between computers in a bus topology: Every node can send a message by passing it to the bus, and all connected nodes will receive that message. If the node is processed and if a reply is sent is completely up to each receiver itself.
Having modules connected to a message bus gives us some advantages:
Every module is isolated, it does not need to know of any others.
Every module can react to any message that’s being sent to the bus; that means you get extra flexibility for free, without increasing dependencies at all.
It’s much easier to follow the YAGNI workflow: For example you’re going to add weapons. At first you implement the physics, then you add visuals in the renderer, and then playing sounds. All of those features can be implemented independently at any time, without interrupting each other.
You save yourself from thinking a lot about how to connect certain modules to each other. Sometimes it takes a huge amount of time, including drawing diagrams/dependency graphs."
Sources:
http://gameprogrammingpatterns.com/event-queue.html
http://www.optank.org/2013/04/02/game-development-design-3-message-bus/
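Applied to the smart-building question, a bus that is also location-aware could look roughly like the sketch below. It is written in Python for brevity rather than C++, and `Area`, `Bus`, `subscribe` and `publish` are hypothetical names, not part of any library. Subscriptions are keyed by (area, event type), and publishing walks up the parent chain so that floor-wide and building-wide subscribers are reached too.

```python
from collections import defaultdict


class Area:
    """A node in the building tree: room -> floor -> building."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Bus:
    def __init__(self):
        # (area name, event type) -> list of callbacks
        self.subscriptions = defaultdict(list)

    def subscribe(self, area, event_type, callback):
        self.subscriptions[(area.name, event_type)].append(callback)

    def publish(self, area, event_type):
        """Deliver to subscribers of the area and of every enclosing area."""
        origin = area
        while area is not None:
            for callback in self.subscriptions.get((area.name, event_type), []):
                callback(origin.name, event_type)
            area = area.parent


building = Area("Building1")
floor4 = Area("Floor4", building)
floor5 = Area("Floor5", building)
room1 = Area("Floor4/Room1", floor4)
room3 = Area("Floor5/Room3", floor5)

received = []
bus = Bus()
# Controller A: fire in room 1 on floor 4, motion anywhere on floor 5.
bus.subscribe(room1, "fire", lambda where, what: received.append(("A", where, what)))
bus.subscribe(floor5, "motion", lambda where, what: received.append(("A", where, what)))

bus.publish(room3, "motion")  # reaches the floor-wide subscription
bus.publish(room1, "fire")    # reaches the room subscription
bus.publish(room1, "motion")  # nobody is subscribed to motion on floor 4
```

Event types are plain strings here, so adding a new event does not require touching the area class, and only areas that actually appear in the configuration need an `Area` object.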

CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes processing ArticleEditCommand messages concurrently, then even though the commands will be taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second, due to a spike in CPU or something similar. I could use an SQL transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex and can't live solely in SQL. I would like the entire command-processing logic to be guaranteed safe under concurrency between ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to the user, letting them know that there is a pending operation on that article that must be completed before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.

First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers, but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking that can be trivially handled by SQL contradicts complicated business logic/updates, maybe you could elaborate more?

It's actually quite simple, and I did exactly that. Basically, it looks like this (pseudocode):
// message handler
ModelTools.TryUpdateEntity(
    () =>
    {
        var entity = _repo.Get(myId);
        entity.Do(whateverCommand);
        _repo.Save(entity);
    },
    10); // retry 10 times before giving up
// repository
long? _version;
public MyObject Get(Guid id)
{
    // query data and version
    _version = data.version;
    return data.ToMyObject();
}
public void Save(MyObject data)
{
    // update row in db where version = _version.Value
    if (rowsUpdated == 0)
    {
        // things have changed since we've retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools general-purpose library (available on NuGet).
The idea is to try doing things normally; then, if the object version (rowversion/timestamp in SQL) has changed, we retry the whole operation after waiting a couple of milliseconds. That's exactly what the TryUpdateEntity() method does. You can tweak how long to wait between tries and how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.
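For readers who want to experiment with the retry idea without the library, here is a self-contained sketch in Python. It does not reproduce the CavemanTools API: `Repo`, `try_update` and `NewerVersionExists` are hypothetical names, and the in-memory dict stands in for a table with a version column. In a real database the version check and the write happen atomically in one UPDATE ... WHERE statement; the dict version here is only safe because the example is single-threaded.

```python
import time


class NewerVersionExists(Exception):
    pass


class Repo:
    """In-memory stand-in for a table with a version column."""

    def __init__(self):
        self.rows = {}  # id -> (version, data)

    def get(self, entity_id):
        return self.rows[entity_id]

    def save(self, entity_id, expected_version, data):
        current_version, _ = self.rows[entity_id]
        # Stand-in for: UPDATE ... WHERE id = ? AND version = ?
        if current_version != expected_version:
            raise NewerVersionExists(entity_id)
        self.rows[entity_id] = (expected_version + 1, data)


def try_update(action, retries=10, wait_seconds=0.001):
    """Run action; on a version conflict wait briefly and rerun it from scratch."""
    for attempt in range(retries):
        try:
            return action()
        except NewerVersionExists:
            if attempt == retries - 1:
                raise
            time.sleep(wait_seconds)


repo = Repo()
repo.rows["article-1"] = (1, "draft")
attempts = []


def edit():
    version, text = repo.get("article-1")
    attempts.append(version)
    if len(attempts) == 1:
        # Simulate another node saving between our read and our write.
        repo.save("article-1", version, text + " (user 1)")
    repo.save("article-1", version, text + " (user 2)")


try_update(edit)
```

The first attempt reads version 1 and fails on save because the simulated other writer got there first; the second attempt re-reads version 2, reapplies the change on top of it and succeeds.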

Partition based solution
Achieve node stickiness by routing each incoming command based on the object's ID (e.g. articleId modulo your-number-of-nodes) to make sure the commands of User1 and User2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one or, if you want to parallelize the execution, partition the commands on something like ID, odd/even, country or similar.
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.

How do you process messages in parallel while ensuring FIFO per entity?

Let's say you have an entity, say, "Person" in your system and you want to process events that modify various Person entities. It is important that:
Events for the same Person are processed in FIFO order
Multiple Person event streams be processed in parallel by different threads/processes
We have an implementation that solves this using a shared database and locks. Threads compete to acquire the lock for a Person and then process events in order after acquiring the lock. We'd like to move to a message queue to avoid polling and locking, which we feel would reduce load on the DB and simplify the implementation of the consumer code.
I've done some research into ActiveMQ, RabbitMQ, and HornetQ but I don't see an obvious way to implement this.
ActiveMQ supports consumer subscription wildcards, but I don't see a way to limit the concurrency on each queue to 1. If I could do that, then the solution would be straightforward:
Somehow tell broker to allow a concurrency of 1 for all queues starting with: /queue/person.
Publisher writes event to queue using Person ID in the queue name. e.g.: /queue/person.20
Consumers subscribe to the queue using wildcards: /queue/person.>
Each consumer would receive messages for different person queues. If all person queues were in use, some consumers may sit idle, which is ok
After processing a message, the consumer sends an ACK, which tells the broker it's done with the message, and allows another message for that Person queue to be sent to another consumer (possibly the same one)
ActiveMQ came close: You can do wildcard subscriptions and enable "exclusive consumer", but that combination results in a single consumer receiving all messages sent to all matching queues, reducing your concurrency to 1 across all Persons. I feel like I'm missing something obvious.
Questions:
Is there a way to implement the above approach with any major message queue implementation? We are fairly open to options. The only requirement is that it run on Linux.
Is there a different way to solve the general problem that I'm not considering?
Thanks!

It looks like JMSXGroupID is what I'm looking for. From the ActiveMQ docs:
http://activemq.apache.org/message-groups.html
Their example use case with stock prices is exactly what I'm after. My only concern is what happens if the single consumer dies. Hopefully the broker will detect that and pick another consumer to associate with that group id.

One general way to solve this problem (if I understood it correctly) is to pick some unique property of Person (say, the database-level ID) and use a hash of that property as the index of the FIFO queue to put that Person's events in.
Since the hash can be unwieldily large (you can't afford 2^32 queues/threads), use only the N least significant bits of the hash.
Each FIFO queue should have a dedicated worker that processes it. Voilà, your requirements are satisfied!
This approach has one drawback: your Persons must have well-distributed IDs so that all queues carry a more or less equal load. If you can't guarantee that, consider using a round-robin set of queues and tracking which Persons are currently being processed to ensure sequential processing for the same Person.
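Here is a minimal in-process sketch of the hashed-queue idea in Python, using threads and `queue.Queue` instead of a message broker. The names `queue_index`, `worker` and `processed` are made up for illustration, and `zlib.crc32` is just one convenient stable hash, not a recommendation.

```python
import queue
import threading
import zlib

BITS = 2                   # keep the 2 least significant bits -> 4 queues
NUM_QUEUES = 1 << BITS


def queue_index(person_id):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(person_id).encode()) & (NUM_QUEUES - 1)


queues = [queue.Queue() for _ in range(NUM_QUEUES)]
processed = {}             # person_id -> events in the order they were handled
lock = threading.Lock()


def worker(q):
    """Dedicated worker: the only consumer of its queue, so FIFO order is kept."""
    while True:
        item = q.get()
        if item is None:   # sentinel: stop the worker
            break
        person_id, event = item
        with lock:
            processed.setdefault(person_id, []).append(event)


threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

# A single publisher interleaves events for three Persons.
for seq in range(5):
    for person_id in (20, 21, 22):
        queues[queue_index(person_id)].put((person_id, seq))

for q in queues:
    q.put(None)
for t in threads:
    t.join()
```

Every event for a given Person lands in the same queue and is handled by the same worker, so per-Person order is preserved while different queues are drained in parallel.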

If you already have a system that allows shared locks, why not have a lock for every queue, which consumers must acquire before they read from the queue?
