Performing an asynchronous transformation within a Kafka Stream - concurrency

Assume I have two Kafka topics, A and B. I am trying to develop a system that pulls records from A, applies a transformation to each record, then publishes the transformed records to B. In this case, the transformation involves calling a REST endpoint over HTTP.
Being relatively new to Kafka, I was glad to see that the Kafka Streams project already solved this type of problem (consume-transform-publish). Unfortunately, I discovered that transformations in Kafka Streams are blocking operations. Instinctively, I want to call HTTP endpoints in a non-blocking, asynchronous manner.
Does this mean that Kafka Streams will not work in this situation? Does this mean that I must revert to calling the REST endpoint in a blocking manner? Is this even an acceptable pattern for Kafka Streams? Stream-based data processing is still relatively new to me, so I am not entirely familiar with its concurrency models.

Update: after looking into this further, I am not sure that this is the right answer...
I am new to Kafka and Kafka Streams (hereafter referred to as "Kafka"), but having encountered and considered similar questions, here is my perspective:
Kafka has two salient features:
All parallelism is achieved through the partitioning of topics
Within a partition of a topic, processing is strongly ordered, one-at-a-time.
Many really nice properties fall out of these features. For example, stream-based "transactions" are, I think, one of the coolest.
But whether these properties are actually "features" in the sense that you want them, of course, depends on the application. If you don't want strongly ordered processing with parallelism based on topic partitioning, then you might not want to be using Kafka for that application.
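To make the second point concrete, here is a minimal producer sketch (the topic name "A" and the keys are assumptions, not from the question): records that share a key are hashed to the same partition and are therefore consumed in publish order by whichever single task owns that partition.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records sharing a key are hashed to the same partition of topic "A",
                // so a consumer of that partition sees them in exactly this order.
                producer.send(new ProducerRecord<>("A", "order-42", "created"));
                producer.send(new ProducerRecord<>("A", "order-42", "paid"));
                producer.send(new ProducerRecord<>("A", "order-42", "shipped"));
            }
        }
    }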
So, with regard to:
Does this mean that Kafka Streams will not work in this situation?
It will work, but increased parallelism is achieved through increased partitioning.
Does this mean that I must revert to calling the REST endpoint in a blocking manner?
Yes, I think it does—but I'm not sure why that would be a "reversion". Personally, that's what I like about Kafka: blocking code is simpler. If I want more parallelism, I can run more threads. There's no shared state, after all.
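For concreteness, here is a rough sketch of what that blocking topology could look like, assuming the topic names from the question ("A" and "B") and a hypothetical callRestEndpoint helper standing in for the synchronous HTTP call:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class BlockingTransformSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "a-to-b-transformer");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // More stream threads help only up to the number of partitions of "A".
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("A");
            source
                .mapValues(value -> callRestEndpoint(value)) // blocking HTTP call, one record at a time per partition
                .to("B");

            new KafkaStreams(builder.build(), props).start();
        }

        // Hypothetical helper: a synchronous HTTP call to the transformation endpoint.
        private static String callRestEndpoint(String value) {
            // e.g. java.net.HttpURLConnection or any blocking HTTP client
            return value; // placeholder
        }
    }

Useful parallelism tops out at the partition count of "A", so if the HTTP calls are slow, the lever to pull is more partitions (plus more stream threads or instances), not asynchronous code inside the transform.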

Related

Communication between bounded contexts in an Akka cluster

I'm struggling with the proper design of communication between 2 separate Akka microservices/bounded contexts in an Akka cluster.
Let's say we have 2 microservices in the cluster per node.
Both microservices are based on Akka.
It's clear to me that communication within a particular bounded context will be handled by sending messages from actor to actor, or from an actor on node1 to an actor on node2 (if necessary).
Q: But is it OK to use similar communication between separate Akka applications? E.g. boundedContext1.actor --message--> boundedContext2.actor
Or should it be done via much clearer boundaries: raise an event in bc1, publish it to a broker, and read the event in bc2?
// Edit: we've currently implemented a service registry and we're publishing events to the service registry via Akka Streams.
I think there is no universal answer here.
It's like this: if your BCs are simple enough, you can hold them in one application, or even in one project/library, with very weak boundaries, i.e. just placing them in separate namespaces and providing an API for other BCs.
But if your BCs become more complex, more independent, and require their own deployment cycles, then it is definitely better to construct stronger boundaries and separate microservices that communicate through a message broker.
So, my answer is that you should just "feel" out the right way according to your particular needs. If you don't "feel" it, then follow the KISS principle and start with the easier way, i.e. the built-in Akka communication system. If your BCs become more complex in the future, you will have to refactor them, but that decision will be justified and will not be unnecessary overhead.

Couchbase read / write concurrency

I have a question regarding how Couchbase internally handles concurrency.
I tried researching their documentation, and all I found was that it depends on which locking mechanism you use in your application, the two main ones being:
Optimistic locking
Pessimistic locking
However, both of the above relate to our strategy for saving data, i.e. whether we prefer to lock it or not.
In our case, if we are not using either of those locking mechanisms in our application, how would Couchbase serve the document in the scenario below?
Application A writes document A.
At the very same instant, application B tries to read document A.
My question is: will application B have to queue up to read the document, or will it by default be served the older version? (None of this goes through Sync Gateway; we are using the .NET DLL directly for writing and reading.)
Couchbase version 4.5.0
If you are using the Couchbase SDK and connecting directly to the Data Service, Couchbase is strongly consistent. If application A writes the document and immediately after application B reads it, application B will get that version. The consistency comes from how Couchbase distributes the data and how the client SDK accesses it.
Couchbase distributes each object to one of 1024 active shards (Couchbase calls them vBuckets). There are replicas, but I will not get into that here. When the SDK goes to read/write an object directly, it takes the object ID you give it and passes it through a consistent CRC32 hash. The output of that hash is a number between 0 and 1023: the vBucket number. The SDK then looks into the cluster map (a JSON document distributed by the cluster) and finds where in the cluster that vBucket lives. The SDK then talks directly to that node and vBucket. That is how application A can write an object and application B can read it microseconds later: they are both reading and writing in the same place. Couchbase does not scale reads from replicas; replicas are only for HA.
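As a rough illustration of that key-to-vBucket mapping (a sketch of the idea only; the exact bit twiddling inside the real SDK may differ):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class VBucketSketch {
        // Hash a document key to one of the vBuckets, CRC32-style.
        static int vBucketFor(String documentId, int numVBuckets) {
            CRC32 crc = new CRC32();
            crc.update(documentId.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numVBuckets); // 0..1023 for a 1024-vBucket cluster
        }

        public static void main(String[] args) {
            // Every client computes the same vBucket for the same key,
            // so reads and writes for "user::123" always target the same active node.
            System.out.println(vBucketFor("user::123", 1024));
        }
    }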
Because, as Kirk mentioned in his reply, Couchbase is consistent, both the read and write requests in your scenario will go to the same node and access the same object in the server's memory. However, concepts like "at the same time" get fuzzy when talking about distributed systems with network latency and various IO queues involved. Ultimately, the order of execution of the two "simultaneous" requests will depend on the order in which the server receives them, but there is no deterministic way to predict what it will be. There are too many variables along the way; what if the CLR of one of the clients decides to do garbage collection just then, delaying the request, or one of the client machines experiences momentary network lag, etc. This is one of the reasons the documentation recommends using explicit locking for concurrent writes: to enforce predictable behavior in the face of unpredictable request ordering.
In your scenario, though, there is simply no way to know in which order the write and read will occur, because "at the same time" is not an exact concept. One possible solution in your case might be to use explicit versioning for the data. This can be a timestamp or some sort of revision number that's incremented every time the document is changed. Of course, using a timestamp runs into a similar problem as above, because it's recorded according to the clock of the machine application A runs on, which is most likely different from the clock of the machine application B runs on.
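If you do adopt explicit versioning as suggested above, the usual shape is a compare-and-swap retry loop around a revision number (in Couchbase, this is what the CAS value returned with each read is for). The sketch below deliberately uses a made-up DocumentStore interface rather than any specific SDK's signatures:

    // Hypothetical client interface, standing in for whatever SDK you use.
    interface DocumentStore {
        VersionedDoc get(String id);  // returns content plus its current version (CAS)
        boolean replaceIfVersionMatches(String id, String newContent, long expectedVersion);
    }

    class VersionedDoc {
        final String content;
        final long version;
        VersionedDoc(String content, long version) { this.content = content; this.version = version; }
    }

    class OptimisticUpdate {
        // Retry until our write is based on the version we actually read.
        static void update(DocumentStore store, String id, java.util.function.UnaryOperator<String> change) {
            while (true) {
                VersionedDoc current = store.get(id);
                String updated = change.apply(current.content);
                if (store.replaceIfVersionMatches(id, updated, current.version)) {
                    return; // nobody wrote in between; our update sticks
                }
                // someone else changed the document first; re-read and try again
            }
        }
    }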

Akka concurrent message processing while preserving message order

In our project, we publish and consume messages to/from a JMS broker. We have a PL/SQL producer and a Java consumer. The problem, however, is that the producer is 10 times faster than the consumer. Therefore we want to change the consumer to work with multiple threads while reading and processing the messages.
But we need to preserve the order of the messages as well. That is, the messages shall be sent to the target system in the order they were published to the JMS broker. I'm new to Akka and I'm trying to understand its features. Can we achieve that using Akka dispatchers?
Assuming you want to parallelize the consumption inside a single JVM instance, what you describe is a good case for Akka Streams. It can be solved with plain actors, but you risk running out of memory if the producer is too fast, because you'll need to queue the results for re-ordering.
Akka Streams handles this problem with the introduction of backpressure: if the consumer can't keep up with the producer, it signals this, and the producer reduces its rate. Akka Streams can also maintain the order of the messages.
Akka Streams is 1.0 software, so it's not yet as battle-hardened as pure Akka, but it's built on Akka and comes from the Akka team, so it should be solid and only get better. The documentation is also not yet organized in the best possible way.
It's also important to mention that Akka Streams, while implemented on top of Akka, is quite a different paradigm from the actor model or Future combinators. It's based on the stream-processing paradigm, so you'll have to adjust the way you think about your programs. That might be an issue for some teams.
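As a minimal sketch of that idea with the Akka Streams Java DSL (the JMS source is faked with a plain range, and processMessage is a hypothetical stand-in for the real per-message work): mapAsync runs up to N calls concurrently but emits results in the original order, and backpressure propagates upstream automatically.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.CompletionStage;
    import akka.actor.ActorSystem;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import akka.stream.javadsl.Sink;
    import akka.stream.javadsl.Source;

    public class OrderedParallelSketch {
        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("consumer");
            Materializer mat = ActorMaterializer.create(system);

            Source.range(1, 1000)                                       // stand-in for the JMS source
                .mapAsync(8, OrderedParallelSketch::processMessage)     // up to 8 in flight, output order preserved
                .runWith(Sink.foreach(result -> System.out.println(result)), mat);
        }

        // Hypothetical per-message processing, run on some thread pool.
        static CompletionStage<String> processMessage(Integer msg) {
            return CompletableFuture.supplyAsync(() -> "processed " + msg);
        }
    }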

Is ActiveMQ thread safe?

We would like to run our cms::MessageConsumer and cms::MessageProducer on different threads of the same process.
How do we do this safely?
Would having two cms::Connection objects and two cms::Session objects, one each for consumer and producer, be sufficient to guarantee safety? Is this necessary?
Is there shared state between objects at the static library level that would prevent this type of usage?
You should read the JMS v1.1 specification; it calls out clearly which objects are valid to use from multiple threads and which are not. Namely, the Session, MessageConsumer and MessageProducer are considered unsafe to share amongst threads. We generally try to make them as thread safe as we can, but there are certainly ways in which you can get yourself into trouble. It's generally a good idea to use a single Session in each thread, and in general it's a good idea to use a Session for each MessageConsumer / MessageProducer, since the Session contains a single dispatch thread. That means a Session with many consumers must share its dispatch thread for dispatching messages to each consumer, which can increase latency depending on the scenario.
I'm answering my own question to supplement Tim Bish's answer, which I am accepting as having provided the essential pieces of information.
From http://activemq.apache.org/cms/cms-api-overview.html
What is CMS?
The CMS API is a C++ corollary to the JMS API in Java which is used to send and receive messages from clients spread out across a network or located on the same machine. In CMS we've made every attempt to maintain as much parity with the JMS API as possible, diverging only when a JMS feature depended strongly on features in the Java programming language itself. Even though there are some differences, most are quite minor and for the most part CMS adheres to the JMS spec, so having a firm grasp on how JMS works should make using CMS that much easier.
What does the JMS spec say about thread safety?
Download spec here:
http://download.oracle.com/otndocs/jcp/7195-jms-1.1-fr-spec-oth-JSpec/
2.8 Multithreading
JMS could have required that all its objects support concurrent use. Since support for concurrent access typically adds some overhead and complexity, the JMS design restricts its requirement for concurrent access to those objects that would naturally be shared by a multithreaded client. The remainder are designed to be accessed by one logical thread of control at a time.
JMS defines some specific rules that restrict the concurrent use of Sessions. Since they require more knowledge of JMS specifics than we have presented at this point, they will be described later. Here we will describe the rationale for imposing them.
Table 2-2 JMS Objects that Support Concurrent Use
Destination: YES
ConnectionFactory: YES
Connection: YES
Session: NO
MessageProducer: NO
MessageConsumer: NO
There are two reasons for restricting concurrent access to Sessions. First, Sessions are the JMS entity that supports transactions. It is very difficult to implement transactions that are multithreaded. Second, Sessions support asynchronous message consumption. It is important that JMS not require that client code used for asynchronous message consumption be capable of handling multiple, concurrent messages. In addition, if a Session has been set up with multiple, asynchronous consumers, it is important that the client is not forced to handle the case where these separate consumers are concurrently executing. These restrictions make JMS easier to use for typical clients. More sophisticated clients can get the concurrency they desire by using multiple sessions.
As far as I know from the Java side, the Connection is thread safe (and rather expensive to create), but Session and MessageProducer are not thread safe. Therefore it seems you should create a Session for each of your threads.
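A sketch of that pattern on the Java/JMS side (the broker URL and queue name are assumptions); the same rule applies to CMS: share the Connection, but let each thread create and own its own Session plus producer or consumer.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class SessionPerThreadSketch {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection(); // thread safe, share it
            connection.start();

            Thread producerThread = new Thread(() -> {
                try {
                    // Session and MessageProducer are NOT thread safe: create them in the thread that uses them.
                    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    Queue queue = session.createQueue("example.queue");
                    MessageProducer producer = session.createProducer(queue);
                    producer.send(session.createTextMessage("hello"));
                    session.close();
                } catch (Exception e) { e.printStackTrace(); }
            });

            Thread consumerThread = new Thread(() -> {
                try {
                    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    Queue queue = session.createQueue("example.queue");
                    MessageConsumer consumer = session.createConsumer(queue);
                    System.out.println(consumer.receive(5000));
                    session.close();
                } catch (Exception e) { e.printStackTrace(); }
            });

            producerThread.start();
            consumerThread.start();
            producerThread.join();
            consumerThread.join();
            connection.close();
        }
    }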

Designing an architecture for exchanging data between two systems

I've been tasked with creating an intermediate layer which needs to exchange data (over HTTP) between two independent systems (e.g. Receiver <=> Intermediate Layer (IL) <=> Sender). The Receiver and Sender both expose a set of APIs via web services. Every time a transaction occurs in the Sender system, the IL should know about it (I'm thinking of creating a Windows Service which constantly polls the Sender), massage the data, and then deliver it to the Receiver. The IL can temporarily store the data in a SQL database until it is transferred to the Receiver. I have the following questions -
Can WCF (haven't used it a lot) be used to talk to the Sender and Receiver (both expose web services)?
How do I ensure guaranteed delivery?
How do I ensure security of the messages over the Internet?
What are best practices for handling concurrency issues?
What are best practices for error handling?
How do I ensure reliability of the data (that it is not tampered with along the way)?
How do I ensure the receipt of the data back to the Sender?
What are the constraints that I need to be aware of?
I need to implement this on the MS platform using a custom .NET solution. I was told not to use any middleware like BizTalk. The receiver is an SFDC instance, if that matters.
Any pointers are greatly appreciated. Thank you.
A Windows Service that orchestrates the exchange sounds fine.
Yes, WCF can deal with traditional web services.
How do I ensure guaranteed delivery?
To ensure delivery you can use TransactionScope to handle the passing of data between the Receiver <=> Intermediate Layer and the Intermediate Layer <=> Sender, but I wouldn't try to do them together.
You might want to consider some sort of queuing mechanism to send the data to the receiver; I guess I'm thinking more of a logical queue rather than an actual queuing component. A workflow framework could also be an option.
Make sure you have good logging/auditing in place; make sure it's rock solid, has the right information, and is easy to read. Assuming you write a service, it will execute without supervision, so the operational/support aspects are more demanding.
Think about scenarios:
How do you manage failed deliveries?
What happens if the receiver (or sender) is unavailable for periods of time (and how long is that)? For example, do you need to "escalate" to an operator via email?
How do I ensure security of the messages over the Internet?
HTTPS. Assuming other existing clients make calls to the web services, how do they ensure security? (I'm thinking encryption.)
What are best practices for handling concurrency issues?
Hmm, probably a separate question. You should be able to find information on that easily enough. How much data are we talking about? At what sort of frequency? How many instances of the Windows Service were you thinking of having? If one is enough, why would concurrency be an issue?
What are best practices for error handling?
Same as for concurrency, but I can offer some pointers:
Use an established logging framework; I quite like MS EntLibs, but there are others (re-using whatever's currently in use is probably going to make more sense, if there is anything).
Remember that execution is unattended, so ensure information is complete, clear and unambiguous. I'd be tempted to log more and dial it down once a level of comfort is reached.
Use a top-level handler to ensure nothing gets lost, but don't be afraid to log deep in the application where you can still get useful context (like the metadata of the data being sent/received).
How do I ensure the receipt of the data back to the Sender?
Include it (sending the receipt) as a step that is part of the transaction.
On a different angle: have a look on CodePlex for ESB-type libraries; you might find something useful: http://www.codeplex.com/site/search?query=ESB&ac=8
For example ESBasic, which seems to be a class library you could reuse.