Using a 24 Hour Window for Amazon Kinesis Analytics

I have a Streaming Data use case where for every record inserted into a Kinesis Stream, I would like to calculate a value based on the last 24 hours worth of records.
Now, Kinesis Analytics seems to fit this bill nicely. I can use a WINDOW with a RANGE INTERVAL '1' DAY PRECEDING to get my aggregation. My Kinesis Stream is set to persist records for 26 hours.
My conflict comes from the Best Practices documentation.
Specifically:
In your SQL statement, we recommend that you do not specify time-based window that is longer than one hour for the following reasons:
If an application needs to be restarted, either because you updated the application or for Amazon Kinesis Data Analytics internal reasons, all data included in the window must be read again from the streaming data source. This will take time before Amazon Kinesis Data Analytics can emit output for that window.
Amazon Kinesis Data Analytics must maintain everything related to the application's state, including relevant data, for the duration. This will consume significant Amazon Kinesis Data Analytics processing units.
I want to be sure I understand the consequences of this.
Say I proceed with my plans and implement this using Kinesis Analytics. Then for whatever reason my Kinesis Analytics Application has to be restarted. For context, my application deals with approximately 1 million records a day and each record is approximately 550 bytes.
If my Kinesis Analytics Application restarts, what will I be dealing with, given that I would be ignoring this recommendation:
we recommend that you do not specify time-based window that is longer than one hour
I have also considered skipping Kinesis Analytics and just feeding my rows into RDS (Postgres, single instance) and triggering a Lambda to run my calculations, or doing something with Redis (still exploring those consequences).
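A quick back-of-envelope sketch of what the recommendation implies for the figures above (1 million records/day at ~550 bytes each), using the standard 2 MB/s per-shard read limit. This is only an estimate of the raw record volume, not of Kinesis Analytics' internal state overhead:

```python
# Rough estimate of the state held in a 24-hour window, and the minimum
# time a restart would need to re-read that window from the stream.
RECORDS_PER_DAY = 1_000_000
RECORD_SIZE_BYTES = 550

window_bytes = RECORDS_PER_DAY * RECORD_SIZE_BYTES
window_mb = window_bytes / 1024 / 1024  # ~524 MB of raw record data

# A single shard supports up to 2 MB/s of reads, so replaying the window
# after a restart takes at least this long (one shard, no other readers):
replay_seconds = window_bytes / (2 * 1024 * 1024)

print(f"window state: ~{window_mb:.0f} MB raw")
print(f"minimum replay time after restart: ~{replay_seconds / 60:.1f} minutes")
```

So a restart means several minutes with no output while the window refills, plus whatever processing-unit cost holding that state incurs.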

Related

Firehose datapipeline limitations

My use-case is as follows:
I have JSON data coming in that needs to be stored in S3 in Parquet format. So far so good: I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my Firehose stream. But the data is coming from different "topics", and each topic has a particular "schema". From my understanding I would have to create multiple Firehose streams, as one stream can only have one schema. But I have thousands of such topics with high-volume, high-throughput data incoming. It does not look feasible to create so many Firehose resources (https://docs.aws.amazon.com/firehose/latest/dev/limits.html).
How should I go about building my pipeline?
IMO you can:
ask for an increase of your Firehose limit and do everything with one Firehose stream, adding a Lambda transformation to convert the data into a common schema - IMO not cost-efficient, but you should evaluate it against your load.
create a Lambda for each Kinesis data stream, convert each event to the schema managed by a single Firehose, and then send the events directly to your Firehose stream with the Firehose API https://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecord.html (see "Q: How do I add data to my Amazon Kinesis Data Firehose delivery stream?" here https://aws.amazon.com/kinesis/data-firehose/faqs/) - but check the costs first, because even though your Lambdas are invoked "on demand", you may have a lot of them invoked over long periods of time.
use a data processing framework (Apache Spark, Apache Flink, ...) and read your data from Kinesis in batches of 1 hour, each run resuming where the previous one stopped --> use the available sinks to convert the data and write it in Parquet format. These frameworks use the notion of checkpointing and store the last processed offset in external storage, so if you restart them every hour, they will start reading directly from the last seen entry. - This may be cost-efficient, especially if you consider using spot instances. On the other hand, it requires more coding than the two previous solutions and obviously may have higher latency.
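The second option could be sketched roughly like this. The stream name and envelope fields are assumptions, and the actual Firehose delivery call is commented out so the transformation itself can be tested locally:

```python
import base64
import json

# boto3 is only needed for the actual delivery; the transformation is pure
# Python and testable without AWS.
# import boto3
# firehose = boto3.client("firehose")

def to_common_schema(topic, payload):
    """Wrap a topic-specific event in a hypothetical common envelope so a
    single Firehose stream (and a single Glue schema) can carry all topics."""
    return {
        "topic": topic,    # discriminator for downstream readers
        "event": payload,  # original per-topic body, kept intact
    }

def handler(event, context):
    """Lambda consuming a Kinesis data stream and forwarding to one Firehose."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        envelope = to_common_schema(payload.get("topic", "unknown"), payload)
        # firehose.put_record(
        #     DeliveryStreamName="all-topics-stream",  # hypothetical name
        #     Record={"Data": json.dumps(envelope) + "\n"},
        # )
    return "ok"
```

In a real deployment you would likely batch records with put_record_batch rather than one put_record per event, to reduce API calls.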
Hope that helps. Could you please give feedback about the solution you choose?

Partition Kinesis firehose S3 records by event time

Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
The only option you have is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
It's not an answer to the question, but I would like to explain a little the idea behind storing records according to their arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number that identifies a position in the stream; by specifying this number, one can start reading the stream from a certain event.
Now back to the default S3 Firehose setup... Since the capacity of a Kinesis stream is quite limited, one most probably needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream, one needs these sequential numbers for checkpoints. And these numbers are the records' arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some bunches of events while reading the S3 (stream). So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them into events_by_event_time. Now your events are partitioned by event_time without requiring EMR, Data Pipeline, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
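The scheduled repartitioning step could look roughly like the following sketch. The table names events and events_by_event_time and the arrival-hour predicate are assumptions; in practice you would submit the query on a schedule via the Athena API (start_query_execution):

```python
# Build the Athena INSERT INTO that copies recently arrived events from the
# arrival-time table into the event-time-partitioned table.
def repartition_query(arrival_predicate):
    """arrival_predicate limits the scan to recent arrival-time partitions,
    e.g. "arrival_hour = '2021-08-01-13'" (hypothetical column name)."""
    return (
        "INSERT INTO events_by_event_time\n"
        "SELECT *\n"
        "FROM events\n"
        f"WHERE {arrival_predicate}"
    )

# To run it (requires AWS credentials; shown for shape only):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=repartition_query("arrival_hour = '2021-08-01-13'"),
#     ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
# )
```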
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function that takes your records, processes them, changes the partition key, and then sends them back to Firehose to be added. You would also have to change the Firehose stream to enable this partitioning, and define your custom partition key/prefix/suffix.
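A minimal sketch of such a transformation Lambda, assuming each record carries an ISO timestamp in a hypothetical event_time field:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda that extracts an event-time
    partition key from each record (field names are assumptions)."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Assume an ISO timestamp like "2021-08-01T13:45:00Z";
        # partition on the hour -> "2021-08-01T13".
        event_hour = payload["event_time"][:13]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],  # payload passes through unchanged
            "metadata": {"partitionKeys": {"event_hour": event_hour}},
        })
    return {"records": output}
```

With dynamic partitioning enabled on the delivery stream, the S3 prefix can then reference the key as `!{partitionKeyFromLambda:event_hour}/`.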

Amazon Kinesis Analytics for archival data

Background
I have found that Amazon Kinesis Data Analytics can be used for streaming data as well as data present in an S3 bucket.
However, there are some parts of the Kinesis documentation that make me question whether Amazon Kinesis Analytics can be used for a huge amount of existing data in an S3 bucket:
Authoring Application Code
We recommend the following:
In your SQL statement, don't specify a time-based window that is longer than one hour for the following reasons:
Sometimes an application needs to be restarted, either because you updated the application or for Kinesis Data Analytics internal reasons. When it restarts, all data included in the window must be read again from the streaming data source. This takes time before Kinesis Data Analytics can emit output for that window.
Kinesis Data Analytics must maintain everything related to the application's state, including relevant data, for the duration. This consumes significant Kinesis Data Analytics processing units.
Question
Will Amazon Kinesis Analytics be good for this task?
The primary use case for Amazon Kinesis Analytics is stream data processing. For this reason, you attach an Amazon Kinesis Analytics application to a streaming data source. You can optionally include reference data from S3, which is limited in size to 1 GB at this time. We will load data from an S3 object into a SQL table that you can use to enrich the incoming stream.
It sounds like you want a more general-purpose tool for querying data from S3, not a stream data processing solution. I would recommend looking at Presto and Amazon EMR instead of using Amazon Kinesis Analytics.
Disclaimer: I work for the Amazon Kinesis team.

Why should I use Amazon Kinesis and not SNS-SQS?

I have a use case where there will be a stream of data coming in that I cannot consume at the same pace, so I need a buffer. This can be solved using an SNS-SQS queue. I came to know that Kinesis serves the same purpose, so what is the difference? Why should I prefer (or not prefer) Kinesis?
Keep in mind this answer was correct for Jun 2015
After studying the issue for a while, having the same question in mind, I found that SQS (with SNS) is preferred for most use cases unless the order of the messages is important to you (SQS doesn't guarantee FIFO on messages).
There are 2 main advantages for Kinesis:
you can read the same message from several applications
you can re-read messages in case you need to.
Both advantages can be achieved by using SNS as a fan-out to SQS. That means the producer sends only one message to SNS, and SNS fans the message out to multiple SQS queues, one for each consumer application. In this way you can have as many consumers as you want without thinking about sharding capacity.
Moreover, we added one more SQS queue, subscribed to the SNS topic, that holds messages for 14 days. In the normal case no one reads from this queue, but if a bug makes us want to rewind the data, we can easily read all the messages from it and re-send them to the SNS topic. Kinesis, by comparison, only provides 7 days of retention.
In conclusion, SNS+SQS is much easier and provides most of the capabilities. IMO you need a really strong case to choose Kinesis over it.
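A minimal sketch of that fan-out wiring, with hypothetical topic and queue names. The part people most often get wrong is the queue policy: each SQS queue must explicitly allow the SNS topic to SendMessage to it:

```python
import json

def queue_policy(queue_arn, topic_arn):
    """IAM policy document letting one SNS topic deliver into an SQS queue."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })

# Wiring (requires AWS credentials; shown for shape only):
# import boto3
# sns, sqs = boto3.client("sns"), boto3.client("sqs")
# topic = sns.create_topic(Name="orders")["TopicArn"]
# for name in ["billing", "audit", "replay-14d"]:  # one queue per consumer
#     url = sqs.create_queue(QueueName=name)["QueueUrl"]
#     arn = sqs.get_queue_attributes(
#         QueueUrl=url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]
#     sqs.set_queue_attributes(
#         QueueUrl=url, Attributes={"Policy": queue_policy(arn, topic)})
#     sns.subscribe(TopicArn=topic, Protocol="sqs", Endpoint=arn)
```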
On the surface they are vaguely similar, but your use case will determine which tool is appropriate. IMO, if you can get by with SQS then you should - if it will do what you want, it will be simpler and cheaper, but here is a better explanation from the AWS FAQ which gives examples of appropriate use-cases for both tools to help you decide:
FAQ's
Semantics of these technologies are different because they were designed to support different scenarios:
SNS/SQS: the items in the stream are not related to each other
Kinesis: the items in the stream are related to each other
Let's understand the difference by example.
Suppose we have a stream of orders, for each order we need to reserve some stock and schedule a delivery. Once this is complete, we can safely remove the item from the stream and start processing the next order. We are fully done with the previous order before we start the next one.
Again, we have the same stream of orders, but now our goal is to group orders by destination. Once we have, say, 10 orders to the same place, we want to deliver them together (delivery optimization). Now the story is different: when we get a new item from the stream, we cannot finish processing it; rather, we "wait" for more items to arrive in order to meet our goal. Moreover, if the processor crashes, we must "restore" the state (so no order is lost).
When the processing of one item cannot be separated from the processing of another, we need Kinesis semantics to handle all the cases safely.
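A toy sketch of that second scenario, just to make the hidden state explicit (the order fields are made up):

```python
from collections import defaultdict

BATCH_SIZE = 10  # dispatch once 10 orders share a destination

def process_stream(orders):
    """Group orders by destination, dispatching full batches.
    `pending` is consumer state that must survive a crash -- this is the
    part that makes plain one-message-at-a-time SQS semantics awkward."""
    pending = defaultdict(list)
    dispatched = []
    for order in orders:
        dest = order["destination"]
        pending[dest].append(order)
        if len(pending[dest]) == BATCH_SIZE:
            dispatched.append(pending.pop(dest))
    return dispatched, pending
```

With Kinesis, partitioning on destination routes all orders for one place to the same shard, and checkpointing lets a restarted processor rebuild `pending` by replaying records since its last checkpoint.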
First, Kinesis supports multiple consumers, which means the same data records can be processed at the same time, or at different times within the 24-hour retention window, by different consumers. Similar behavior can be achieved in SQS by writing into multiple queues, with each consumer reading from its own queue; however, writing into multiple queues adds sub-second (a few milliseconds) latency to the system.
Second, Kinesis provides routing capability: you can selectively route data records to different shards using a partition key, so they can be processed by particular EC2 instances, enabling micro-batch calculations (counting and aggregation).
Working with any AWS service is easy, but SQS is the easiest. With Kinesis, you need to provision enough shards ahead of time, dynamically increase the number of shards to manage spike loads, and decrease them again to save cost; all of that has to be managed, and it's a pain. No such thing is required with SQS, which is effectively infinitely scalable.
Excerpt from AWS Documentation:
We recommend Amazon Kinesis Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
The biggest advantage for me is the fact that Kinesis is a replayable queue, and SQS is not. So you can have multiple consumers of the same messages of Kinesis (or the same consumer at different times) where with SQS, once a message has been ack'd, it's gone from that queue.
SQS is better for worker queues because of that.
Another thing: Kinesis can trigger a Lambda, while SQS cannot. So with SQS you either have to provide an EC2 instance to process SQS messages (and deal with it if it fails), or you have to have a scheduled Lambda (which doesn't scale up or down - you get just one per minute).
Edit: This answer is no longer correct. SQS can directly trigger Lambda as of June 2018
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
The pricing models are different, so depending on your use case one or the other may be cheaper. Using the simplest case (not including SNS):
SQS charges per message (each 64 KB counts as one request).
Kinesis charges per shard per hour (1 shard can handle up to 1000 messages or 1 MB/second) and also for the amount of data you put in (every 25 KB).
Plugging in the current prices and not taking into account the free tier, if you send 1 GB of messages per day at the maximum message size, Kinesis will cost much more than SQS ($10.82/month for Kinesis vs. $0.20/month for SQS). But if you send 1 TB per day, Kinesis is somewhat cheaper ($158/month vs. $201/month for SQS).
Details: SQS charges $0.40 per million requests (64 KB each), so $0.00655 per GB. At 1 GB per day, this is just under $0.20 per month; at 1 TB per day, it comes to a little over $201 per month.
Kinesis charges $0.014 per million requests (25 KB each), so $0.00059 per GB. At 1 GB per day, this is less than $0.02 per month; at 1 TB per day, it is about $18 per month. However, Kinesis also charges $0.015 per shard-hour. You need at least 1 shard per 1 MB per second. At 1 GB per day, 1 shard will be plenty, so that will add another $0.36 per day, for a total cost of $10.82 per month. At 1 TB per day, you will need at least 13 shards, which adds another $4.68 per day, for a total cost of $158 per month.
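The arithmetic above can be reproduced as a quick sketch (using the prices quoted in this answer; current prices may differ):

```python
# Reproduce the monthly cost comparison for SQS vs. Kinesis at two volumes.
SQS_PER_MILLION = 0.40       # $ per million requests (64 KB each)
KINESIS_PER_MILLION = 0.014  # $ per million PUT payload units (25 KB each)
SHARD_HOUR = 0.015           # $ per shard-hour

def sqs_monthly(gb_per_day, days=30):
    requests = gb_per_day * 1024 * 1024 / 64       # 64 KB per request
    return requests * SQS_PER_MILLION / 1e6 * days

def kinesis_monthly(gb_per_day, shards, days=30):
    units = gb_per_day * 1024 * 1024 / 25          # 25 KB per PUT unit
    put_cost = units * KINESIS_PER_MILLION / 1e6 * days
    return put_cost + shards * SHARD_HOUR * 24 * days

print(f"1 GB/day: SQS ${sqs_monthly(1):.2f} vs Kinesis ${kinesis_monthly(1, 1):.2f}")
print(f"1 TB/day: SQS ${sqs_monthly(1024):.2f} vs Kinesis ${kinesis_monthly(1024, 13):.2f}")
```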
Kinesis solves the map part of a typical map-reduce scenario for streaming data, while SQS does not guarantee that. If you have streaming data that needs to be aggregated by key, Kinesis ensures that all the data for a given key goes to a specific shard, and that shard can be consumed on a single host, making key-based aggregation easier than with SQS.
Kinesis Use Cases
Log and Event Data Collection
Real-time Analytics
Mobile Data Capture
“Internet of Things” Data Feed
SQS Use Cases
Application integration
Decoupling microservices
Allocate tasks to multiple worker nodes
Decouple live user requests from intensive background work
Batch messages for future processing
I'll add one more thing nobody else has mentioned -- SQS is several orders of magnitude more expensive.
In very simple terms, and keeping costs out of the picture, the real intention of SNS-SQS is to make services loosely coupled. That is the primary reason to use SQS where the order of the messages is not so important and where you want more control over individual messages. If you want a job-queue pattern, SQS is again much better. Kinesis shouldn't be used in such cases because it is difficult to remove individual messages from Kinesis: on error, Kinesis replays the whole batch. You can also use SQS as a dead-letter queue for more control. With Kinesis all of this is possible, but it is unusual unless you have a really strong objection to SQS.
If you want nice partitioning, though, SQS won't be useful.

Amazon Kinesis dynamically stream resize

I am working with the Amazon Kinesis API and the Kinesis Client Library. I have created one producer to put data into a stream and multiple consumer applications to read data from that stream.
I have a scenario where I need to increase and decrease the stream's capacity dynamically, based on the input and output volume and on the number of consumer applications.
I found a useful source on the Amazon website for calculating the number of shards, but I don't understand how to do the calculation. The source URL is:
http://docs.aws.amazon.com/kinesis/latest/dev/how-do-i-size-a-stream.html
I need some help understanding this.
Thanks
AWS support suggests looking at the following open source project. It was created by one of their solution architects.
https://github.com/awslabs/amazon-kinesis-scaling-utils
It can be run manually (cli) or automatically (deployed WAR) to scale up/down with your application.
You could take a look at Themis, a framework that supports autoscaling of Kinesis streams, developed at Atlassian. The tool is very easy to configure, comes with a Web UI, and supports different autoscaling modes (e.g., proactive and reactive autoscaling).
(Apologies for posting in an old thread, but the answer may still be interesting for readers discovering this thread.)
You can dynamically resize a stream using the Amazon CloudWatch service: create alarms on the stream using metrics such as the put and get byte counts, and detect the alarm state.
Then, when an alarm enters the "ALARM" state, increase the capacity of your stream by resharding; you can follow the same scenario to decrease the capacity of your stream.
For more information, visit this link: http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-api-java.html
Since November 2016, you can easily scale your Amazon Kinesis streams using the updateShardCount function, Lambda functions and Amazon Cloud Watch Alarms.
You may find this post really useful.
I have created an npm module that helps with auto-scaling a Kinesis stream.
You can find detailed information at Amazon Kinesis Scaling.
This npm module scales Amazon Kinesis according to current traffic needs: it continuously monitors traffic in the Kinesis stream and splits and merges shards as needed.
For example, if your application needs to handle 5000 req/sec, then you need 5 shards. Since traffic on your application can vary a lot, so does the number of shards needed.
If your application needs to handle 20000 req/sec at peak time, then you need 20 shards, but at other times you may require only 5.
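The sizing rule behind those numbers (one shard per 1000 records/sec of writes, or per 1 MB/sec of write throughput, whichever dominates) can be sketched as a quick calculation. The 1 KB default record size is an assumption to match the req/sec examples above:

```python
import math

def shards_needed(records_per_sec, avg_record_kb=1):
    """Minimum shard count for a given write load: each shard accepts up to
    1000 records/sec and 1 MB/sec of incoming data."""
    by_count = math.ceil(records_per_sec / 1000)
    by_write = math.ceil(records_per_sec * avg_record_kb / 1024)
    return max(1, by_count, by_write)

print(shards_needed(5000))   # 5 shards at 5000 req/sec
print(shards_needed(20000))  # 20 shards at 20000 req/sec peak
```

An autoscaler like the module above effectively re-evaluates this kind of formula against live metrics and splits or merges shards to match.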