How does Databricks process incoming messages from Event Hubs? - azure-eventhub

Being a novice to real-time continuous data processing, I would like to know how a continuous stream of incoming messages gets processed in Databricks: are the messages processed sequentially, one by one, or in parallel?
Thanks.

One way to achieve this is to use Spark on Databricks to ingest data from Event Hubs. This is done by consuming a message queue: if only one consumer reads from the queue, the messages are processed sequentially; if multiple consumers are used, multiple messages can be processed in parallel.
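For illustration, here is a rough PySpark Structured Streaming sketch of reading from Event Hubs on a Databricks cluster. It assumes the azure-eventhubs-spark connector is installed and uses the notebook's built-in spark/sc handles; the connection string is a placeholder. The connector generally maps each Event Hub partition to its own Spark task, so records within a partition are read in order while different partitions are processed in parallel.

```python
# Rough sketch, not a drop-in solution: assumes the azure-eventhubs-spark
# connector is installed and `spark` / `sc` are the Databricks notebook handles.
conn_str = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<eventhub>;..."

eh_conf = {
    # Recent connector versions expect the connection string to be encrypted.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
}

# One streaming task per Event Hub partition: ordering is preserved within a
# partition, while different partitions are consumed in parallel.
raw = (spark.readStream
       .format("eventhubs")
       .options(**eh_conf)
       .load())

# The message payload arrives as binary; cast it to a string for processing.
messages = raw.selectExpr("CAST(body AS STRING) AS body", "partition", "offset")

(messages.writeStream
 .format("console")
 .outputMode("append")
 .start())
```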
Check out these examples for more info as well:
https://lenadroid.github.io/posts/connecting-spark-and-eventhubs.html
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs

Related

Can you do batch pulls of messages with Google Pub/Sub?

We're trying to optimize our application by doing batch pulling. Pub/Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub/Sub?
Both the Streaming Pull and Pull RPCs only seem to allow the subscriber to consume one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. Pretty new to GCP in general.
The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
Use Pull directly, which allows one to process all of the messages in a batch at a time (a rough sketch of this follows below). Note that using pull effectively requires keeping many pull requests outstanding simultaneously and immediately replacing each request that returns with a new one.
In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them.
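For example, a rough sketch of the first option using the google-cloud-pubsub Python client (the project, subscription, and process_batch handler are placeholders, not from the question):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# A single synchronous pull can return up to max_messages in one response.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 100}
)

# Process the whole batch together, e.g. one bulk write to a database.
batch = [received.message.data for received in response.received_messages]
process_batch(batch)  # hypothetical application-level batch handler

# Acknowledge everything that was handled successfully.
ack_ids = [received.ack_id for received in response.received_messages]
if ack_ids:
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": ack_ids}
    )
```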
You could probably implement that by using Dataflow (Apache Beam). You can have a running streaming job in which you group, window, and transform messages according to your requirements. The results of the processing can be saved in batches or streamed further. This probably makes sense when the number of messages is really big.
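As a rough illustration of that idea with the Beam Python SDK (the subscription name and the write_batch sink are placeholders), a streaming job can window the Pub/Sub messages and hand each window downstream as a batch:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/my-sub")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))   # 60-second windows
     | "Key" >> beam.Map(lambda msg: (None, msg))             # single key per window
     | "Batch" >> beam.GroupByKey()                           # one batch per window
     | "Write" >> beam.Map(lambda kv: write_batch(kv[1])))    # hypothetical sink
```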

I want to know when a batch of messages has completed in an AWS SQS queue

I think this is more of a 'architecture design' question.
I have a Lambda producer that will put ~600 messages on an SQS queue (there are multiple producers) as a batch (so not one message with a body of ~600 messages). A consumer Lambda will take individual messages and deal with them (at scale). What I want to do is run another Lambda when each batch is complete.
My initial idea was to create a 'unique batch number', a 'total batch number', and a 'batch position number' and add them to the message attributes of every message, and then check these in the consumer Lambda to decide whether the batch is complete.
But does that mean I would need to use a FIFO queue, partition on the batch number, and only have one Lambda consumer per batch? Or do I run some sort of state management in DynamoDB (is there a pattern out there for this? please guide me on this)?
Regards, J
It seems like the goal is to achieve Fork-Join capabilities in a distributed system. One way to handle this in AWS is with Step Functions. Assuming a queue service needs to be used, the state of the overall operation will need to be tracked. Some ways to do this are:
Store the state of the overall operation in a DB (a DynamoDB sketch of this approach follows the reference below).
Put a 'termination' message in the queue after all the others and process the queue FIFO.
Create a metadata service which receives 'start' and 'stop' messages for each service and handles them accordingly.
Reference: Fork and Join with Amazon Lambda
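As a rough sketch of the first option (state in a DB), a DynamoDB counter can be decremented atomically by the consumer Lambda; the table and attribute names here are made up for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("batch-progress")  # hypothetical table keyed on batch_id

def on_batch_start(batch_id, total):
    """Producer registers the batch with the number of messages it enqueued."""
    table.put_item(Item={"batch_id": batch_id, "remaining": total})

def on_message_processed(batch_id):
    """Consumer decrements the counter atomically; True when the batch is done."""
    resp = table.update_item(
        Key={"batch_id": batch_id},
        UpdateExpression="SET remaining = remaining - :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return resp["Attributes"]["remaining"] == 0
```

When on_message_processed returns True, the consumer Lambda can invoke the 'batch complete' Lambda (or emit an event) for that batch_id.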

What is the difference between Kinesis and SQS?

I know there is a lot of material online for this question, however I have not found any that explains it quite clearly to a rookie like me... I'd appreciate it if someone could help me understand the key differences between these two services and their use cases, with real-life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's visibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
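A minimal boto3 sketch of that receive / process / delete cycle (the queue URL and the handle function are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

# Request up to 10 messages; received messages become invisible to other workers.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,      # long polling
    VisibilityTimeout=60,    # seconds the messages stay hidden while being processed
)

for msg in resp.get("Messages", []):
    handle(msg["Body"])  # hypothetical processing function

    # Deleting tells SQS the message was processed; otherwise it reappears
    # after the visibility timeout for another worker to retry.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```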
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Then, workers can request a message from the start of the stream, or from a specific spot in the stream. For example, if it has already processed 5 messages, it can ask for the 6th message. The messages are retained in the stream for a period of time (eg 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
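A rough boto3 sketch of reading one Kinesis shard from the beginning (stream and shard names are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# A shard iterator marks where in the shard to start reading: TRIM_HORIZON is the
# oldest retained record; AT_SEQUENCE_NUMBER would resume from a saved checkpoint.
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in resp["Records"]:
    print(record["SequenceNumber"], record["Data"])

# Records are not deleted by reading them: another consumer can read the same
# shard independently, and this consumer can re-read earlier records within
# the retention period.
```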
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
but basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead simple to set up and use, and requires almost no expertise to use it well.
Kinesis takes a lot more time and expertise to set up and use, so unless you need it, don't bother - even though it could be used for many of the same things as SQS.
One big difference: with SQS, if you have multiple consumers reading from the queue, then each consumer will only ever see the messages it consumes, because other consumers are blocked from seeing them. With Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire stream. So SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), whereas with Kinesis multiple consumers can each read the entire stream and do something with ALL of the data in it.
The linked article explains it better than me.
I'll try to give a simple answer based on my practical experience:
Consider SQS as temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector of large stream of real-time data. Use cases:
very, very large streams of data from different sources
back up the data just by enabling Firehose (you get a data lake for free)
get statistics immediately during the collection phase by integrating Kinesis Analytics
keep checkpoints in DynamoDB to track processed/failed records
Note: consider that both services can be integrated with Lambda functions very easily, so there are plenty of use cases that can be solved with either SQS or Kinesis. Anyway, I tried to list some use cases where I found that one of the two performed particularly better than the other. Hope it can be helpful :)

How to ensure once-only processing of data in an AWS serverless architecture?

I have some data that needs to be processed at a point in time.
My current strategy is to pull the data every minute and load it into a queue and process it.
I have two concerns with this strategy:
I can't guarantee that the last minute captures all data so I pull the last two minutes; and
Lambdas, as far as I know, can fire multiple times depending on the trigger (in this case SQS).
I'm trying to avoid writing a flag to the data because of the spiky nature of batch processing.
The only other solution I can think of is using S3 to create a lock-file.
Is there a better way to 'kick off' future events? Is there a strategy outside database and S3 flags?
Have a look at SQS FIFO queues; they are designed to deliver once and only once.
You can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once. ...source
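A minimal boto3 sketch of sending to a FIFO queue (the queue URL and IDs are placeholders): the deduplication ID gives exactly-once enqueue within the 5-minute deduplication window, and the group ID preserves ordering within a group.

```python
import boto3

sqs = boto3.client("sqs")
fifo_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo"

sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody='{"record_id": "abc-123"}',
    MessageGroupId="batch-2024-01-01",    # ordering scope
    MessageDeduplicationId="abc-123",     # duplicates with the same ID are dropped
)
```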

Is there a way to get RabbitMQ to deliver messages in batches, instead of one at a time?

I am using RabbitMQ with Clojure and Langohr, and I want to process messages off the queue in batches rather than one at a time. I could batch the messages myself after they're pulled off the queue, of course, but I'm curious if there's an API call or setting I'm missing that would get RMQ to deliver, say, 500 messages at a time to a consumer. Is this possible?