Getting Data from two different Streams in Kinesis? - amazon-web-services

I am trying to build a Kinesis consumer client. To get started, I went through the Kinesis Developer Guide and the AWS documentation at http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-implementation-app-java.html.
I was wondering whether it is possible to get data from two different streams and process it accordingly.
Say I have two different streams, stream1 and stream2.
Is it possible to get data from both streams and process each one individually?

Why not? Call get_records on both streams.
It is recommended to process each shard with a single worker, so if each of your streams has only a single shard, that worker will see all the events. If your logic needs to join events from different sources/streams, you can implement it with a single worker reading from both streams.
Note that if your streams have multiple shards, each of your workers will see only part of the events. You have the following options:
Both streams have a single shard each - in this case you can read from both streams with a single worker and see all events from both streams. You can add timestamps or other keys to allow you to "join" these events in the worker.
One stream (stream1) has one shard and the second stream (stream2) has multiple shards - in this case every worker can read from stream1 while also processing a single shard of stream2. Each worker will see all the events of stream1 and its share of the events of stream2. Note that there is a limit on how fast you can read from the single shard of stream1 (2 MB/second or 5 reads/second), and if stream2 has many shards, and therefore many workers reading stream1, this can be a real limit.
Both streams have multiple shards - in this case it is more complex to ensure that you can "join" these events, as you need to sync both the writes and the reads to these streams. You can read from all shards of both streams with a single worker, but this is not good practice: it limits your ability to scale, since you no longer have a distributed system. Another option is to use the same partition_key in both streams, with the same number of shards and the same partition definition for both streams. You then have to verify that each worker reads from the "right" shard of each stream, and that this still holds every time a worker fails and restarts, which might be a bit complex.
Another option to consider is to write both types of events to a single stream, again using the same partition_key, and then filter them on the reader side if you need to process them differently (for example, to write them to different log files in S3).
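For the simplest case above (one shard per stream, one worker), here is a minimal sketch of what "do get_records from both streams" could look like. It assumes a boto3-style Kinesis client is passed in and uses only its get_shard_iterator and get_records calls. FakeKinesis, drain_shard, handle and the shard ID are hypothetical names for illustration; the fake client only exists so the sketch runs without AWS access.

```python
from typing import Any, Callable, Dict, List


def drain_shard(client: Any, stream_name: str, shard_id: str,
                handle: Callable[[str, Dict[str, Any]], None],
                max_calls: int = 10) -> int:
    """Read one shard from its oldest record, passing each record to handle()."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
    )["ShardIterator"]
    seen = 0
    for _ in range(max_calls):
        if iterator is None:  # the shard has been closed
            break
        response = client.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(stream_name, record)
            seen += 1
        if not response["Records"]:
            # Simplification: treat an empty batch as "caught up" and stop.
            # A long-running consumer would sleep and keep polling instead.
            break
        iterator = response.get("NextShardIterator")
    return seen


class FakeKinesis:
    """In-memory stand-in exposing the two calls used above (illustration only)."""

    def __init__(self, streams: Dict[str, List[Dict[str, Any]]]) -> None:
        self.streams = streams  # stream name -> list of records

    def get_shard_iterator(self, StreamName, ShardId, ShardIteratorType):
        return {"ShardIterator": f"{StreamName}:0"}

    def get_records(self, ShardIterator, Limit):
        name, position = ShardIterator.rsplit(":", 1)
        start = int(position)
        batch = self.streams[name][start:start + Limit]
        return {"Records": batch,
                "NextShardIterator": f"{name}:{start + len(batch)}"}


processed: Dict[str, List[bytes]] = {"stream1": [], "stream2": []}


def handle(stream_name: str, record: Dict[str, Any]) -> None:
    # Dispatch on the stream name so each stream gets its own processing.
    processed[stream_name].append(record["Data"])


client = FakeKinesis({
    "stream1": [{"Data": b"a1"}, {"Data": b"a2"}],
    "stream2": [{"Data": b"b1"}],
})
for name in ("stream1", "stream2"):
    drain_shard(client, name, "shardId-000000000000", handle)
```

With a real client you would presumably pass boto3.client("kinesis") instead of the fake one and look up the shard IDs first. For anything beyond an experiment, the KCL takes care of the iterator and checkpoint bookkeeping for you.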

Related

What is the difference between Kinesis and SQS?

I know there is a lot of material online about this question, but I have not found any that explains it clearly to a rookie like me. I would appreciate it if someone could help me understand the key differences between these two services, and their use cases, with real-life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's visibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
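The receive / hide / delete cycle described above can be modelled in a few lines. This is a toy in-memory sketch, not the SQS API: ToyQueue and its methods are hypothetical names, and the clock is passed in explicitly so the timeout behaviour is easy to follow.

```python
from typing import Dict, Optional, Tuple


class ToyQueue:
    """In-memory model of receive / delete / visibility timeout (not the SQS API)."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[int, str] = {}           # message id -> body
        self._invisible_until: Dict[int, float] = {}  # message id -> time it reappears
        self._next_id = 0

    def send(self, body: str) -> None:
        self._messages[self._next_id] = body
        self._invisible_until[self._next_id] = 0.0
        self._next_id += 1

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return the first visible message and hide it for the timeout period."""
        for msg_id, body in self._messages.items():
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def extend(self, msg_id: int, now: float) -> None:
        """The worker is still busy: keep the message hidden for another period."""
        self._invisible_until[msg_id] = now + self.visibility_timeout

    def delete(self, msg_id: int) -> None:
        """The worker finished: remove the message for good."""
        del self._messages[msg_id]
        del self._invisible_until[msg_id]


queue = ToyQueue(visibility_timeout=30.0)
queue.send("job-1")

first = queue.receive(now=0.0)    # worker A takes the message
hidden = queue.receive(now=10.0)  # worker B gets nothing: the message is invisible
retry = queue.receive(now=31.0)   # A never deleted it, so it reappears for B
queue.delete(retry[0])            # B finishes and deletes it
empty = queue.receive(now=100.0)  # nothing is left in the queue
```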
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Workers can then request messages from the start of the stream, or from a specific spot in the stream. For example, if a worker has already processed 5 messages, it can ask for the 6th message. The messages are retained in the stream for a period of time (e.g. 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
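To make the film strip picture concrete, here is a toy sketch of an append-only log where reading does not remove anything. ToyStream is a hypothetical in-memory class, not the Kinesis API; the list position plays the role that a sequence number plays in a real shard.

```python
from typing import List


class ToyStream:
    """Append-only log; records stay in place after being read (not the Kinesis API)."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def put(self, data: str) -> int:
        self._records.append(data)
        return len(self._records) - 1  # position of the record just written

    def read(self, position: int, limit: int = 10) -> List[str]:
        # Reading is non-destructive, so any consumer can read any position.
        return self._records[position:position + limit]


stream = ToyStream()
for frame in ["f0", "f1", "f2", "f3", "f4", "f5"]:
    stream.put(frame)

from_start = stream.read(0, limit=3)  # play from the beginning
resume = stream.read(5)               # already processed 5 records, ask for the 6th
replay = stream.read(2, limit=2)      # rewind and read an earlier part again
```

Each consumer only has to remember its own position, which is why several consumers can work through the same stream independently.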
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
But basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead simple to set up and use, and requires almost no expertise to use well.
Kinesis takes a lot more time and expertise to set up and use, so unless you need it, don't bother, even though it could be used for many of the same things as SQS.
One big difference: with SQS, if you have multiple consumers reading from the queue, then each consumer will only ever see the messages it consumes, because other consumers are blocked from seeing them. With Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire stream. So SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), whereas with Kinesis multiple consumers can read the entire stream and each do something with ALL of the data in it.
The linked article explains it better than me.
I'll try to give a simple answer based on my practical experience:
Consider SQS as a temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO queue
Consider Kinesis as a collector for large streams of real-time data. Use cases:
very, very large streams of data from different sources
back up the data just by enabling Firehose (you get a data lake for free)
get statistics during the collection phase by integrating Kinesis Analytics
keep checkpoints in DynamoDB to track which records were processed or failed
Note: both services can be integrated with Lambda functions very easily, so there are plenty of use cases that can be solved with either SQS or Kinesis. I tried to list some use cases where I found that one of the two performed noticeably better than the other. Hope it helps :)

How Databricks processes the incoming messages from EventHub?

As a novice in real-time continuous data processing, I would like to know how a continuous series of incoming messages gets processed by Databricks: are the messages processed sequentially, one by one, or in parallel?
Thanks.
One way to achieve this is to use Spark on Databricks to ingest data from EventHub. This is done by consuming a message queue. If only one consumer is used to read from the queue, the messages will be processed sequentially. However, if multiple consumers are used it is possible to process multiple messages in parallel.
Check out these examples for more info as well:
https://lenadroid.github.io/posts/connecting-spark-and-eventhubs.html
https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs

How to ensure once-only processing of data in an AWS serverless architecture?

I have some data that needs to be processed at a point in time.
My current strategy is to pull the data every minute and load it into a queue and process it.
I have two concerns with this strategy:
I can't guarantee that the last minute captures all data so I pull the last two minutes; and
Lambdas as far as I know can fire multiple times depending on the trigger (in this case SQS.)
I'm trying to avoid writing a flag to the data because of the spiky nature of batch processing.
The only other solution I can think of is using S3 to create a lock-file.
Is there a better way to 'kick off' future events? Is there a strategy outside database and S3 flags?
Have a look at SQS FIFO queues; they are designed to deliver each message once and only once.
You can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once. ...source
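For the overlapping two-minute pull in the question, one possible approach is to derive a deterministic deduplication ID from each item, so that the second copy of an item is recognised as a duplicate. FIFO queues discard a message whose MessageDeduplicationId was already seen within the 5-minute deduplication interval. The sketch below only models that window in memory: ToyFifoQueue, dedup_id and the sample items are hypothetical, and unlike this toy, the real service accepts the duplicate send and simply does not deliver it again.

```python
import hashlib
import json
from typing import Any, Dict, List


def dedup_id(item: Dict[str, Any]) -> str:
    """Deterministic ID: the same item always hashes to the same value."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToyFifoQueue:
    """Models only the deduplication window of a FIFO queue (not the SQS API)."""

    def __init__(self, dedup_window_seconds: float = 300.0) -> None:
        self.dedup_window_seconds = dedup_window_seconds
        self._seen: Dict[str, float] = {}  # dedup id -> time it was first accepted
        self.accepted: List[Dict[str, Any]] = []

    def send(self, item: Dict[str, Any], now: float) -> bool:
        key = dedup_id(item)
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.dedup_window_seconds:
            return False  # duplicate inside the window: not enqueued again
        self._seen[key] = now
        self.accepted.append(item)
        return True


queue = ToyFifoQueue()
minute_1 = [{"id": 1, "due": "12:00"}, {"id": 2, "due": "12:01"}]
minute_2 = [{"id": 2, "due": "12:01"}, {"id": 3, "due": "12:02"}]  # overlaps minute_1

for item in minute_1:
    queue.send(item, now=0.0)
for item in minute_2:
    queue.send(item, now=60.0)

accepted_ids = [item["id"] for item in queue.accepted]
```

Even with a FIFO queue, a message whose handler fails is delivered again, so it is probably still worth keeping the Lambda handler itself idempotent.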

AWS Kinesis Stream as FIFO queue

We currently have an application that receives a large amount of sensor data. Each sensor has its own unique sensor id (e.g. '5834f7718273f92cc326f620') and emits its status at different intervals. The processing order across sensors is not important; for example, a newer message from one sensor can be processed before an older message from another sensor. What does matter, though, is that the messages for a given sensor are processed sequentially, in the order that they arrived in the stream.
I have taken a look at the Kinesis Client Library and understand that KCL pushes messages to a single processor per shard. Does this mean that if a stream has only one shard it will have only one processor and couldn't this create a bottleneck? Or does KCL have more than one processor, and somehow, perhaps using the partition key ensures messages with the same partition key are never processed concurrently?
Note: We have taken a look at SQS FIFO, but ruled it out as the 300 messages per second limit would soon become an issue.
Yes, each shard can only have one processor at a given moment (per application).
But you can use the sensor id as the partition key in your Kinesis put record request. (see here)
This makes sure that all of a given sensor's events go to the same shard, and therefore to the same processor.
If you do that, you can scale out your processors and shards and still have each sensor's events handled by a single processor.
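The reason this works is that Kinesis takes the MD5 hash of the partition key and routes the record to the shard whose hash key range contains that value, so the same key always lands in the same shard. Here is a rough sketch of that mapping. It assumes the shards split the 128-bit hash space into equal ranges (real ranges are whatever your stream's shards report), and even_hash_ranges and shard_for are hypothetical helper names.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def even_hash_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split the hash key space into contiguous, equal-sized ranges (an assumption)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose hash key range contains the key's MD5."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all ranges")


ranges = even_hash_ranges(4)
sensor = "5834f7718273f92cc326f620"
first = shard_for(sensor, ranges)
second = shard_for(sensor, ranges)  # same key, so same shard every time
```

Different sensors spread across the shards, while all records of one sensor stay in one shard, where they are read in order.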

Kinesis Client Library : multiple workers for a stream

I have a .war in which we have a Kinesis Application which processes a stream which contains a single shard. We deploy two instances of the war in production. As a result, I would end up with two workers working on a single stream with a single shard. What is the recommended way to deal with this issue? I tried deploying two wars on my dev machine locally, and it seems to be fine in the sense that each record is being processed only once. I know that AWS recommends one instance per shard. From their docs:
Typically, when you use the KCL, you should ensure that the number of instances does not exceed the number of shards (except for failure standby purposes). Each shard is processed by exactly one KCL worker and has exactly one corresponding record processor, so you never need multiple instances to process one shard.
You're probably fine most of the time.
The KCL will handle this for you.
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-ddb.html
The case you want to check is a worker dying after processing some records, but before checkpointing. In this case, the worker that takes over will reprocess some records (starting from the last checkpoint).
Can your application handle reprocessing of records?
see: http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-duplicates.html
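To illustrate that failure case, here is a small simulation of a worker that dies between checkpoints and a second worker that takes over. It does not use the KCL: run_worker, IdempotentSink and the integer sequence numbers are hypothetical stand-ins. The point is only that keying the output by sequence number makes the replayed records harmless.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (sequence number, payload)


class IdempotentSink:
    """Stores results keyed by sequence number, so a replayed record overwrites itself."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}
        self.writes = 0

    def write(self, record: Record) -> None:
        sequence_number, payload = record
        self.rows[sequence_number] = payload.upper()
        self.writes += 1


def run_worker(shard: List[Record], checkpoint: int, sink: IdempotentSink,
               checkpoint_every: int, crash_after: int = -1) -> int:
    """Process records after `checkpoint` and return the last checkpoint saved.

    crash_after simulates the worker dying once it has processed that many records.
    """
    processed = 0
    for sequence_number, payload in shard:
        if sequence_number <= checkpoint:
            continue  # already covered by a saved checkpoint
        if processed == crash_after:
            return checkpoint  # died before checkpointing the latest records
        sink.write((sequence_number, payload))
        processed += 1
        if processed % checkpoint_every == 0:
            checkpoint = sequence_number
    return checkpoint


shard = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
sink = IdempotentSink()

# Worker 1 processes records 1-3, checkpoints at 2, then dies.
saved = run_worker(shard, checkpoint=0, sink=sink, checkpoint_every=2, crash_after=3)
# Worker 2 resumes from the saved checkpoint and processes record 3 a second time.
final = run_worker(shard, checkpoint=saved, sink=sink, checkpoint_every=2)
```

Record 3 is written twice, but the sink still ends up with exactly one row per record. If your processing has side effects that cannot be overwritten like this (sending an email, incrementing a counter), you need some other way to detect the replay.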