Dynamically resize an Amazon Kinesis stream

I am working with the Amazon Kinesis API and the Kinesis Client Library. I have created one producer to put data into a stream and multiple consumer applications to read data from that stream.
I have a scenario where I need to increase and decrease the size of the stream dynamically, based on the incoming and outgoing data rates and on the number of consumer applications.
I found a useful page on the Amazon website for counting the number of shards, but I don't understand how to do the calculation. The source URL is:
http://docs.aws.amazon.com/kinesis/latest/dev/how-do-i-size-a-stream.html
I need some help understanding this.
Thanks

AWS support suggests looking at the following open-source project. It was created by one of their solutions architects.
https://github.com/awslabs/amazon-kinesis-scaling-utils
It can be run manually (CLI) or automatically (as a deployed WAR) to scale your stream up/down alongside your application.

You could take a look at Themis, a framework that supports autoscaling of Kinesis streams, developed at Atlassian. The tool is very easy to configure, comes with a Web UI, and supports different autoscaling modes (e.g., proactive and reactive autoscaling).
(Apologies for posting in an old thread, but the answer may still be interesting for readers discovering this thread.)

You can dynamically resize a stream using the Amazon CloudWatch service: create alarms on the stream's metrics, such as PutRecords.Bytes and GetRecords.Bytes, and check the alarm state.
When an alarm goes into the "ALARM" state, increase the capacity of your stream by resharding; you can apply the same approach to decrease the capacity of your stream.
For more information, visit this link: http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-api-java.html
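A minimal sketch of that alarm-check-then-reshard flow with the AWS SDK for Java v2 (the stream name, alarm name, and the choice to split the first shard are illustrative assumptions, not part of the original answer):

```java
import java.math.BigInteger;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.DescribeAlarmsRequest;
import software.amazon.awssdk.services.cloudwatch.model.MetricAlarm;
import software.amazon.awssdk.services.cloudwatch.model.StateValue;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.DescribeStreamRequest;
import software.amazon.awssdk.services.kinesis.model.Shard;
import software.amazon.awssdk.services.kinesis.model.SplitShardRequest;

public class AlarmDrivenResharding {
    public static void main(String[] args) {
        String streamName = "my-stream";            // hypothetical stream name
        String alarmName = "my-stream-write-alarm"; // hypothetical alarm on PutRecords.Bytes

        try (CloudWatchClient cw = CloudWatchClient.create();
             KinesisClient kinesis = KinesisClient.create()) {

            MetricAlarm alarm = cw.describeAlarms(DescribeAlarmsRequest.builder()
                    .alarmNames(alarmName).build())
                    .metricAlarms().get(0);

            if (alarm.stateValue() == StateValue.ALARM) {
                // For illustration, take the first shard and split it at the midpoint
                // of its hash key range (a real tool would pick the hottest open shard).
                Shard shard = kinesis.describeStream(DescribeStreamRequest.builder()
                        .streamName(streamName).build())
                        .streamDescription().shards().get(0);

                BigInteger start = new BigInteger(shard.hashKeyRange().startingHashKey());
                BigInteger end = new BigInteger(shard.hashKeyRange().endingHashKey());
                BigInteger midpoint = start.add(end).divide(BigInteger.valueOf(2));

                kinesis.splitShard(SplitShardRequest.builder()
                        .streamName(streamName)
                        .shardToSplit(shard.shardId())
                        .newStartingHashKey(midpoint.toString())
                        .build());
            }
            // The scale-down case would call MergeShards on two adjacent shards instead.
        }
    }
}
```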

Since November 2016, you can easily scale your Amazon Kinesis streams using the UpdateShardCount API, Lambda functions, and Amazon CloudWatch alarms.
You may find this post really useful.
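As a rough sketch of what such a Lambda might look like with the AWS SDK for Java v2 (assuming the CloudWatch alarm notifies an SNS topic that triggers the function, and using a simplistic "double the shard count" policy; the stream name and policy are made up):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SNSEvent;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ScalingType;
import software.amazon.awssdk.services.kinesis.model.UpdateShardCountRequest;

public class ScaleUpHandler implements RequestHandler<SNSEvent, String> {
    private static final String STREAM_NAME = "my-stream"; // hypothetical stream name
    private final KinesisClient kinesis = KinesisClient.create();

    @Override
    public String handleRequest(SNSEvent event, Context context) {
        // Look up the current number of open shards, then double it (simplified policy).
        int current = kinesis.describeStreamSummary(r -> r.streamName(STREAM_NAME))
                .streamDescriptionSummary().openShardCount();

        kinesis.updateShardCount(UpdateShardCountRequest.builder()
                .streamName(STREAM_NAME)
                .targetShardCount(current * 2)
                .scalingType(ScalingType.UNIFORM_SCALING)
                .build());
        return "scaled to " + (current * 2) + " shards";
    }
}
```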

I have created an npm module which helps with auto-scaling a Kinesis stream.
You can find detailed information at Amazon Kinesis Scaling.
This npm module scales Amazon Kinesis according to current traffic needs: it continuously monitors traffic in the Kinesis stream and splits and merges shards as needed.
For example, if your application needs to handle 5,000 req/sec, then you need 5 shards. Since the traffic on your application can vary a lot, so does the number of shards:
if your application needs to handle 20,000 req/sec at peak time, then you need 20 shards, but at other times you may require only 5.
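A back-of-the-envelope sizing sketch, assuming the commonly documented per-shard limits of roughly 1,000 records/sec or 1 MB/sec for writes and 2 MB/sec for reads (the method and numbers below are illustrative, not from the original answer):

```java
public class ShardSizing {
    /**
     * Rough shard estimate: take the largest of the write-rate, write-bandwidth
     * and read-bandwidth requirements, given the assumed per-shard limits of
     * 1,000 records/sec and 1 MB/sec in, 2 MB/sec out.
     */
    static int requiredShards(double writeRecordsPerSec, double writeKBPerSec, double readKBPerSec) {
        double byRecordRate     = writeRecordsPerSec / 1000.0;
        double byWriteBandwidth = writeKBPerSec / 1000.0;  // 1 MB/sec write per shard
        double byReadBandwidth  = readKBPerSec / 2000.0;   // 2 MB/sec read per shard
        return (int) Math.ceil(Math.max(byRecordRate, Math.max(byWriteBandwidth, byReadBandwidth)));
    }

    public static void main(String[] args) {
        // 5,000 records/sec of ~1 KB each, one consumer reading everything:
        System.out.println(requiredShards(5_000, 5_000, 5_000));    // -> 5 shards
        // 20,000 records/sec at peak:
        System.out.println(requiredShards(20_000, 20_000, 20_000)); // -> 20 shards
    }
}
```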

Related

What to use: AWS Fargate or AWS Beanstalk?

I have a Java application that reads from an SQS queue, does some business processing, and finally writes the result to a datastore. As the SQS queue grows I want to be able to scale to read more messages and process them. Each SQS message takes about 15 to 20 minutes to process. I was looking at a service like AWS Fargate or AWS Beanstalk to deploy my application. Money is not a concern, but usability is. What would be the best platform?
Fargate would be an ideal solution, as it has the following advantages over Beanstalk:
It's serverless
More fine-grained control for custom application architectures.
No need to write EB extensions.
Build and test the image locally and promote the same image to Fargate.
With application autoscaling, you can scale on the go.
Pricing is per second with a 1-minute minimum
FAQ:
https://aws.amazon.com/fargate/faqs/
Pricing:
https://aws.amazon.com/fargate/pricing/
I've had a very similar use case to this, and I used Batch (which was not available in 2014 when the question was asked).
https://aws.amazon.com/batch/
In my case I was processing audio and video files from the queue.
You can set a Lambda to fire on the SQS queue and have it drop the job onto Batch for processing.
If you have the minimum cluster size set to zero then you will have no servers running when there is no work to do, but you can have them autoscale up to process as much work as you require when the jobs come in.
The advantage compared to Lambda is that the code that executes can be any container, with as much resource as you want to throw at it.
For your use case it will be perfect, but for anything that can complete processing in a few seconds or a minute, it's worth making each job process more than one task per execution, or all of the time will be spent starting up and shutting down containers.
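A minimal sketch of the Lambda-to-Batch hand-off described above, using the AWS SDK for Java v2 (the job queue, job definition, and "payload" parameter names are hypothetical; the Lambda is assumed to have an SQS trigger):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import java.util.Map;
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;

public class QueueToBatchHandler implements RequestHandler<SQSEvent, Void> {
    private final BatchClient batch = BatchClient.create();

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            // Submit one Batch job per queue message; the container receives the
            // message body through the (hypothetical) "payload" job parameter.
            batch.submitJob(SubmitJobRequest.builder()
                    .jobName("process-" + msg.getMessageId())
                    .jobQueue("media-processing-queue")      // hypothetical job queue
                    .jobDefinition("media-processing-job:1") // hypothetical job definition
                    .parameters(Map.of("payload", msg.getBody()))
                    .build());
        }
        return null;
    }
}
```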

SQS or Kinesis: which one is good for queuing?

I have a server which can only process 20 requests at a time. When lots of requests come in, I want to store the request data in a queue, then read a set of requests (i.e. 20) and process them as a batch. What would be the ideal way to do that: SQS, or Kinesis? I'm totally confused.
SQS = Simple Queue Service is for queuing messages in a 1:1 fashion (once a message is consumed, it is removed from the queue).
Kinesis = low-latency, high-volume data streaming ... typically 1:N (many consumers of the messages).
As Kinesis also stores the data for a period of time, the two are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
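For the 20-at-a-time constraint in the question, a rough SQS worker sketch with the AWS SDK for Java v2 (the queue URL is a placeholder; note that a single ReceiveMessage call returns at most 10 messages, so two polls are needed per batch of 20):

```java
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class BatchWorker {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/requests"; // placeholder
        try (SqsClient sqs = SqsClient.create()) {
            List<Message> batch = new ArrayList<>();
            // ReceiveMessage returns at most 10 messages per call, so poll until we have 20.
            while (batch.size() < 20) {
                List<Message> received = sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .maxNumberOfMessages(10)
                        .waitTimeSeconds(20) // long polling
                        .build()).messages();
                if (received.isEmpty()) break; // queue drained; process what we have
                batch.addAll(received);
            }
            for (Message m : batch) {
                process(m.body()); // hand off to the server's 20-slot processing pool
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(m.receiptHandle())
                        .build());
            }
        }
    }

    static void process(String body) { /* application-specific work */ }
}
```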
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Amazon Kinesis data stream (for example, to perform counting, aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS's ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.

Using a 24 Hour Window for Amazon Kinesis Analytics

I have a Streaming Data use case where for every record inserted into a Kinesis Stream, I would like to calculate a value based on the last 24 hours worth of records.
Now, Kinesis Analytics seems to fit this bill nicely. I can use a WINDOW with a RANGE INTERVAL '1' DAY PRECEDING to get my aggregation. My Kinesis Stream is set to persist records for 26 hours.
My conflict comes from the Best Practices documentation.
Specifically:
In your SQL statement, we recommend that you do not specify a time-based window that is longer than one hour for the following reasons:
If an application needs to be restarted, either because you updated the application or for Amazon Kinesis Data Analytics internal reasons, all data included in the window must be read again from the streaming data source. This will take time before Amazon Kinesis Data Analytics can emit output for that window.
Amazon Kinesis Data Analytics must maintain everything related to the application's state, including relevant data, for the duration. This will consume significant Amazon Kinesis Data Analytics processing units.
I want to be sure I understand the consequences of this.
Say I proceed with my plans and implement this using Kinesis Analytics. Then for whatever reason my Kinesis Analytics application has to be restarted. For context, my application deals with approximately 1 million records a day, and each record is approximately 550 bytes, so a 24-hour window holds roughly 1,000,000 × 550 bytes ≈ 550 MB that would have to be re-read from the stream.
If my Kinesis Analytics application restarts, what will I be dealing with, given that I would be ignoring this recommendation:
we recommend that you do not specify a time-based window that is longer than one hour
I have also considered skipping Kinesis Analytics and just feeding my rows into RDS (Postgres, single deployment) and triggering a Lambda to run my calculations, or doing something with Redis (still exploring those consequences).

AWS SQS and other services

My company has a messaging system which sends real-time messages in JSON format, and it's not built on AWS.
Our team is trying to use AWS SQS to receive these messages, and then use DynamoDB to store them.
I'm thinking of using EC2 to read the messages and then save them.
Is there a better solution? Or how should I do it? I don't have much experience.
First of all, EC2 is infrastructure in the cloud; it is similar to a physical machine with an OS in a local setup. If you want to create an application that fetches the data from Amazon SQS (messages in JSON format) and pushes it into DynamoDB (a NoSQL database), your design is correct, as both SQS and DynamoDB have thorough JSON support. Once your application is ready, deploy it on an EC2 machine.
To achieve this, your application must have an async buffered SQS consumer that consumes the messages. The SQS message size limit is 256 KB, so whichever application publishes the messages must keep them below 256 KB.
Please refer to the link below for an SQS consumer:
is putting sqs-consumer to detect receiveMessage event in sqs scalable
Once you have consumed a message from the SQS queue, you need to save it in DynamoDB, which you can easily do using a CRUD repository. With a repository you can directly save the JSON in the DynamoDB table, but please be sure to configure the provisioned write capacity based on the request rate, because the higher the provisioned capacity, the higher the cost. Please refer to the link below for configuring the write capacity of a table:
Dynamodb reading and writing units
In general, you'll have a setup something like this:
The EC2 instances (one or more) will read your queue every few seconds to see if there is anything there. If so, they will write this data to DynamoDB.
Based on what you're saying you'll have less than 1,000,000 reads from SQS in a month so you can start out on the free tier for that. You can have a single EC2 instance initially and that can be a very small instance - a T2.micro should be more than sufficient. And you don't need more than a few writes per second on DynamoDB.
The advantage of SQS is that if for some reason your EC2 instance is temporarily unavailable the messages continue to queue up and you won't lose any of them.
From a coding perspective, you don't mention your development environment but there are AWS libraries available for a pretty wide variety of environments. I develop in Java and the code to do this would be maybe 100 lines. I would guess that other languages would be similar. Make sure you look at long polling in the language you're using - it can help to speed up the processing and save you money.
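As a rough Java sketch of that read-and-write loop (AWS SDK for Java v2; the queue URL, table name, and item attributes are placeholders), with long polling enabled:

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class QueueToDynamo {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/messages"; // placeholder
        String tableName = "Messages";                                                 // placeholder

        try (SqsClient sqs = SqsClient.create();
             DynamoDbClient dynamo = DynamoDbClient.create()) {
            while (true) {
                // Long polling: wait up to 20 seconds for messages instead of hammering the API.
                for (Message msg : sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .maxNumberOfMessages(10)
                        .waitTimeSeconds(20)
                        .build()).messages()) {

                    // Store the raw JSON body keyed by the SQS message id (illustrative schema).
                    dynamo.putItem(PutItemRequest.builder()
                            .tableName(tableName)
                            .item(Map.of(
                                    "id", AttributeValue.builder().s(msg.messageId()).build(),
                                    "body", AttributeValue.builder().s(msg.body()).build()))
                            .build());

                    // Only delete from the queue once the write has succeeded.
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(msg.receiptHandle())
                            .build());
                }
            }
        }
    }
}
```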

Amazon Kinesis Vs EC2

Sorry for the silly question, I am new to cloud development. I am trying to develop a real-time processing app in the cloud which can process the data from a sensor in real time. The data stream has a very low data rate, <50 Kbps per sensor, and probably <10 sensors will be running at once.
I am confused about what the use of Amazon Kinesis is for this application. I can use EC2 directly to receive my stream and process it. Why do I need Kinesis?
Why do I need Kinesis?
Short answer, you don't.
Yes, you can use EC2 - and probably dozens of other technologies.
Here are the first two sentences of the Kinesis product page:
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. You can configure hundreds of thousands of data producers to continuously put data into an Amazon Kinesis stream.
So, if you want to manage the stack yourself, and/or you don't need massive scale, and/or you don't need the ability to scale this processing to hundreds of thousands of simultaneous producers, then Kinesis may be overkill.
On the other hand, if the ingestion of this data is mission critical, and you don't have the time, skills or ability to manage the underlying infrastructure - or there is a chance the scale of your application will grow exponentially, then maybe Kinesis is the right choice - only you can decide based on your requirements.
Along with what E.J. Brennan just said, there are many other ways to solve your problem, as the rate of data is very low.
As far as I know, Amazon Kinesis runs on EC2 under the hood, so maybe your question is really why to use Kinesis as a streaming solution.
For scalability reasons, you might need a streaming solution in the future, as your volume of data grows, the cost of maintaining on-premises resources increases, and the focus shifts from application development to administration.
Kinesis, for that matter, provides a pay-per-use model instead of you having to worry about increasing or reducing your resource stack.