Amazon Kinesis Vs EC2 - amazon-web-services

Sorry for the silly question, I am new to cloud development. I am trying to develop a real-time processing app in the cloud that can process data from a sensor in real time. The data stream has a very low data rate, <50 Kbps per sensor, and probably <10 sensors will be running at once.
I am confused about what use Amazon Kinesis is for this application. I could use EC2 directly to receive my stream and process it. Why do I need Kinesis?

Why do I need Kinesis?
Short answer, you don't.
Yes, you can use EC2 - and probably dozens of other technologies.
Here are the first two sentences of the Kinesis product page:
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. You can configure hundreds of thousands of data producers to continuously put data into an Amazon Kinesis stream.
So, if you want to manage the stack yourself, and/or you don't need massive scale, and/or you don't need the ability to scale this processing to hundreds of thousands of simultaneous producers, then Kinesis may be overkill.
On the other hand, if the ingestion of this data is mission-critical and you don't have the time, skills or ability to manage the underlying infrastructure, or if there is a chance the scale of your application will grow exponentially, then maybe Kinesis is the right choice. Only you can decide based on your requirements.

Along with what E. J. Brennan just said, there are many other ways to solve your problem, since the rate of data is very low.
As far as I know, Amazon Kinesis runs on EC2 under the hood, so maybe your question is really why to use Kinesis as a streaming solution at all.
For scalability reasons, you might need a streaming solution in the future, as your volume of data grows, the cost of maintaining your own resources increases, and the focus shifts from application development to administration.
Kinesis, for that matter, provides a pay-per-use model instead of you worrying about growing and shrinking your resource stack.
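For what it's worth, if you did go with Kinesis, the producer side for a low-rate sensor is only a few lines. Here is a minimal sketch, assuming boto3 with credentials configured and a pre-created stream; the stream name and payload fields are placeholders rather than anything from the question:

    import json
    import time

    import boto3

    # Assumes a stream named "sensor-events" already exists and that
    # credentials/region are configured in the environment.
    kinesis = boto3.client("kinesis")

    def publish_reading(sensor_id, value):
        # The partition key groups records from the same sensor onto the same
        # shard, which preserves per-sensor ordering.
        kinesis.put_record(
            StreamName="sensor-events",
            Data=json.dumps({"sensor_id": sensor_id, "value": value, "ts": time.time()}).encode("utf-8"),
            PartitionKey=str(sensor_id),
        )

At <50 Kbps per sensor and fewer than 10 sensors, a single shard (1 MB/sec or 1,000 records/sec of writes) is far more than enough, which is really the point of both answers above.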

Related

Options to Transfer Logs from S3 to OpenSearch

There are various ways to transfer logs from S3 to OpenSearch:
AWS Glue
Kinesis Data Firehose
Lambda as an event handler
What should be used in which situation? Which is the cheapest? I would imagine that the Kinesis and/or event-handler methods would be the quickest, but they might also put a big load on your cluster, given that many calls would be made very often and there is less opportunity for bulk uploads. With Glue you could, for example, run this operation every 10 minutes and get large bulk uploads, or schedule it in low-usage periods for logs that do not need to reach OpenSearch so quickly. I'd be interested to hear which strategy is used in which situation. I want to minimize the load on my cluster, as I feel that, at the end of the day, putting a higher load on OpenSearch will cost the most.
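For what it's worth, the Lambda-as-event-handler option can still do bulk uploads if each invocation handles a whole S3 object rather than single documents. Below is a rough sketch assuming the opensearch-py client with SigV4 auth; the domain endpoint, index name, region, and the newline-delimited JSON (optionally gzipped) file format are all assumptions, not anything from the question:

    import gzip
    import json

    import boto3
    from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
    from opensearchpy.helpers import bulk

    # Hypothetical endpoint, index and region; replace with your own.
    HOST = "my-domain.us-east-1.es.amazonaws.com"
    INDEX = "app-logs"
    REGION = "us-east-1"

    s3 = boto3.client("s3")
    auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION)
    client = OpenSearch(
        hosts=[{"host": HOST, "port": 443}],
        http_auth=auth,
        use_ssl=True,
        connection_class=RequestsHttpConnection,
    )

    def handler(event, context):
        # S3 ObjectCreated trigger: read one log file and index it with a single
        # _bulk request, instead of one request per log line.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            text = gzip.decompress(raw).decode("utf-8") if key.endswith(".gz") else raw.decode("utf-8")
            actions = (
                {"_index": INDEX, "_source": json.loads(line)}
                for line in text.splitlines()
                if line.strip()
            )
            bulk(client, actions)

Batching at the level of a whole S3 object keeps the number of requests against the cluster low, which is the main cost concern raised above.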

What is the right architecture/design for a JavaScript-client-to-AWS-database website tracking system

We wish to build a data pipeline system which tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it given the following two constraints:
1) the system is on AWS
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints, my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
Website JavaScript client - tracks user interactions and sends them to AWS Firehose.
AWS Firehose data delivery to S3 - the delivery stream eventually writes the events to S3.
AWS Lambda (Python) - a periodic task which pulls the daily events from S3 and loads them into RDS.
The reason I have chosen AWS RDS is its cost-effectiveness for this objective.
I'd appreciate any comments on the implementation above, or any other architecture proposal that you would recommend instead.
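For reference, the periodic Lambda described above could be as simple as the following sketch. Everything here is hypothetical: the bucket, prefix, table schema and environment variables are placeholders, it assumes a MySQL-compatible RDS instance reachable from the Lambda, and pymysql would need to be bundled with the deployment package:

    import json
    import os

    import boto3
    import pymysql  # not in the Lambda runtime; bundled with the deployment package

    s3 = boto3.client("s3")

    def handler(event, context):
        # Scheduled (e.g. daily) job: load the previous day's Firehose output
        # from S3 into a relational table in RDS.
        conn = pymysql.connect(
            host=os.environ["DB_HOST"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            database=os.environ["DB_NAME"],
        )
        try:
            with conn.cursor() as cur:
                pages = s3.get_paginator("list_objects_v2").paginate(
                    Bucket=os.environ["BUCKET"], Prefix=event.get("prefix", "events/")
                )
                for page in pages:
                    for obj in page.get("Contents", []):
                        body = s3.get_object(Bucket=os.environ["BUCKET"], Key=obj["Key"])["Body"]
                        for line in body.iter_lines():
                            e = json.loads(line)
                            cur.execute(
                                "INSERT INTO events (user_id, event_type, ts) VALUES (%s, %s, %s)",
                                (e["user_id"], e["event_type"], e["ts"]),
                            )
            conn.commit()
        finally:
            conn.close()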
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with the above design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
You need to handle load balancing, failure scenarios and other complexities for Lambda
You need to handle data transformation for RDS, as it expects structured data to be ingested into relational tables
The proposal to store the data in S3 through Firehose sounds like a good solution. But keep in mind that the minimum buffering interval for Firehose is one minute, so your application needs to tolerate this minor latency. You could use Kinesis Streams to get millisecond latency, but then you need to manage your own application code and instances to handle the stream.
After ingesting data into Kinesis Firehose or Streams, you may also explore the following alternatives:
Use Kinesis Analytics to track web users' activity in real time, if it's available in your AWS region (it is currently available only in selected regions)
Within Firehose, transform your data using Lambda and store it in S3 in an optimized format for further analysis with AWS Athena (see the sketch after this list)
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
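As referenced in the second alternative above, a Firehose transformation Lambda follows a fixed record contract: each incoming record carries a recordId and base64-encoded data, and must be returned with a result of Ok, Dropped or ProcessingFailed. A minimal sketch (the field names being kept are hypothetical) could look like this:

    import base64
    import json

    def handler(event, context):
        # Kinesis Data Firehose transformation Lambda. Every record must be echoed
        # back with its original recordId, a result of Ok / Dropped / ProcessingFailed,
        # and base64-encoded data.
        output = []
        for record in event["records"]:
            try:
                payload = json.loads(base64.b64decode(record["data"]))
                # Hypothetical transformation: keep only the fields to be queried later.
                slim = {"user_id": payload["user_id"], "event_type": payload["event_type"], "ts": payload["ts"]}
                # Newline-delimited output keeps the resulting S3 objects Athena-friendly.
                data = base64.b64encode((json.dumps(slim) + "\n").encode("utf-8")).decode("utf-8")
                output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
            except (KeyError, ValueError):
                output.append({"recordId": record["recordId"], "result": "ProcessingFailed", "data": record["data"]})
        return {"records": output}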
Though you mentioned that you cannot use Redshift, it may still be the best solution for time-series analysis. Exploring Redshift, Redshift Spectrum and formatted data stored in S3 may still be a cost-effective solution with better capabilities.
Adding a few references from AWS which you may go through before deciding on a solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey folks, this is getting more and more common.
Generally the pattern is click events to Kinesis Streams; then you can monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, as well as incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis Streams in parallel, so this solution might not be as scalable as using Kafka on AWS. Alternatively, run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already has real-time-web-analytics-with-kinesis

Using AWS To Process Large Amounts Of Data With Serverless

I have about 300,000 transactions for each user in my DynamoDB database.
I would like to calculate the taxes based on those transactions in a serverless manner, if that is the cheapest way.
My thought process was that I should use AWS Step Functions to grab all of the transactions, store them in Amazon S3, and then use Step Functions to iterate over each row in the CSV file. The problem is that once I read a row of the CSV, I would have to keep it in memory so that I can use it for later calculations. If the Lambda function runs out of time, I have no way to save the state, so this route is not feasible.
Another route which would be expensive, is to have two copies of each transaction in DynamoDB and perform the operations on the copy Table, keeping the original data untouched. The problem with this is that the DynamoDB table is eventually consistent and there could be a scenario where I read a dirty item.
Serverless is ideal for event-driven processing, but for your batch use case it is probably easier to use an EC2 instance.
An Amazon EC2 t2.nano instance costs under 1 cent per hour, as does a t2.micro instance at spot pricing, and both are billed per second.
There really isn't enough detail here to make a good suggestion. For example, how is the data organized in your DynamoDB table? How often do you plan on running this job? How quickly do you need the job to complete?
You mentioned price so I'm assuming that is the biggest factor for you.
Lambda tends to be cheapest for event-driven processing. The idea is that with any EC2/ECS event-driven system you would need to over-provision by some amount to handle spikes in traffic. The over-provisioned compute power is idle most of the time, but you still pay for it. In the case of Lambda, you pay a little more for the compute power, but you save money by needing less of it, since you don't need to over-provision.
Batch processing systems tend to lend themselves nicely to EC2, since they typically use 100% of the compute power throughout the duration of the job. At the end of the job, you shut down all of the instances and stop paying for them. Also, if you use spot pricing, you can really push the price of your compute power down.
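To illustrate the EC2 batch route, a single script can walk one user's transactions with a paginated DynamoDB query and keep only a running total in memory, so there is no Lambda timeout or checkpointing to worry about. The table, key and attribute names below are hypothetical:

    from decimal import Decimal

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table with partition key "user_id" and a numeric "amount" attribute.
    table = boto3.resource("dynamodb").Table("transactions")

    def taxable_total(user_id):
        # Sum every transaction for one user, following DynamoDB's pagination keys,
        # so the whole job is a single pass with only a running total in memory.
        total = Decimal(0)
        kwargs = {"KeyConditionExpression": Key("user_id").eq(user_id)}
        while True:
            page = table.query(**kwargs)
            total += sum((item["amount"] for item in page["Items"]), Decimal(0))
            if "LastEvaluatedKey" not in page:
                return total
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    if __name__ == "__main__":
        print(taxable_total("user-123"))

Run on a spot instance (or even a t2.nano), 300,000 items per user is a modest, sequential workload.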

Event Sourcing with Kinesis - Replaying and Persistence

I am trying to implement an event-driven architecture using Amazon Kinesis as the central event log of the platform. The idea is pretty much the same as the one presented by Nordstrom with the Hello-Retail project.
I have done similar things with Apache Kafka before, but Kinesis seems to be a cost-effective alternative to Kafka and I decided to give it a shot. I am, however, facing some challenges related to event persistence and replaying. I have two questions:
Are you guys using Kinesis for such a use case, or do you recommend using it?
Since Kinesis is not able to retain events forever (like Kafka can), how do you handle replays from consumers?
I'm currently using a Lambda function (Firehose is also an option) to persist all events to Amazon S3. Then, one could read past events from storage and then start listening to new events coming from the stream. But I'm not happy with this solution. Consumers are not able to use Kinesis' checkpoints (the equivalent of Kafka's consumer offsets). Plus, the Java KCL does not support AFTER_SEQUENCE_NUMBER yet, which would be useful in such an implementation.
First question: yes, I am using Kinesis Streams when I need to process the received log/event data before storing it in S3. When I don't, I use Kinesis Firehose.
Second question: Kinesis Streams can store data for up to seven days. This is not forever, but it should be enough time to process your events, depending on the value of the events being processed.
If I do not need to process the event stream before storing in S3, then I use Kinesis Firehose writing to S3. Now I do not have to worry about event failures, persistence, etc. I then process the data stored in S3 with the best tool. I use Amazon Athena often and Amazon Redshift too.
You don't mention how much data you are processing or how it is being processed. If it is large, multiple MB/sec or higher, then I would definitely use Kinesis Firehose; with Kinesis Streams you have to manage performance yourself.
One issue that I have with Kinesis Streams is that I don't like the client libraries, so I prefer to write everything myself. Kinesis Firehose reduces coding for custom applications as you just store the data in S3 and then process afterwards.
I like to think of S3 as my big data lake. I prefer to throw everything into S3 without preprocessing and then use various tools to pull out the data that I need. By doing this I remove lots of points of failure that need to be managed.
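For readers wondering what the "Lambda persisting events to S3" part can look like, here is a minimal sketch of a Kinesis-triggered Lambda that archives each batch as one newline-delimited JSON object, keyed by shard and first sequence number so past events can be replayed in order. The bucket name and key layout are placeholders; with Firehose you get roughly the same result without writing any code:

    import base64

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-event-archive"  # placeholder

    def handler(event, context):
        # Kinesis-triggered Lambda: archive each batch of events to S3 as one
        # newline-delimited object. Keying by shard and first sequence number keeps
        # the archive ordered, so a consumer can replay history from S3 before
        # switching over to the live stream.
        records = event.get("Records", [])
        if not records:
            return
        lines = [base64.b64decode(r["kinesis"]["data"]).decode("utf-8") for r in records]
        shard = records[0]["eventID"].split(":")[0]  # eventID looks like "shardId-000000000000:<sequenceNumber>"
        first_seq = records[0]["kinesis"]["sequenceNumber"]
        key = "events/{}/{}.jsonl".format(shard, first_seq)
        s3.put_object(Bucket=BUCKET, Key=key, Body=("\n".join(lines) + "\n").encode("utf-8"))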

Amazon Kinesis dynamically stream resize

I am working with the Amazon Kinesis API and the Kinesis Client Library. I have created one producer to put data into a stream and have multiple consumer applications reading data from that stream.
I have a scenario where I need to increase and decrease the size of the stream dynamically, based on the input and output data rates and on the count of consumer applications.
I found a useful source on the Amazon website for counting the number of shards, but I don't understand how to do the calculation. The source URL is:
http://docs.aws.amazon.com/kinesis/latest/dev/how-do-i-size-a-stream.html
Need some understanding on this.
Thanks
AWS support suggests looking at the following open source project. It was created by one of their solution architects.
https://github.com/awslabs/amazon-kinesis-scaling-utils
It can be run manually (CLI) or automatically (as a deployed WAR) to scale your stream up or down with your application.
You could take a look at Themis, a framework that supports autoscaling of Kinesis streams, developed at Atlassian. The tool is very easy to configure, comes with a Web UI, and supports different autoscaling modes (e.g., proactive and reactive autoscaling).
(Apologies for posting in an old thread, but the answer may still be interesting for readers discovering this thread.)
You can dynamically resize a stream using Amazon CloudWatch: create alarms on the stream based on metrics such as PutRecord.Bytes and GetRecords.Bytes and detect the alarm state.
Then, when an alarm goes into the "ALARM" state, increase the capacity of your stream by resharding; you can follow the same approach to decrease the capacity of your stream.
For more information, visit this link: http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-api-java.html
Since November 2016, you can easily scale your Amazon Kinesis streams using the UpdateShardCount API, Lambda functions and Amazon CloudWatch alarms.
You may find this post really useful.
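As a rough sketch of that approach, a Lambda subscribed to the CloudWatch alarm (for example via SNS) can call UpdateShardCount directly. The stream name is a placeholder, and the doubling/halving policy and the way the scale direction is read from the event are simplified assumptions:

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "my-stream"  # placeholder

    def handler(event, context):
        # Invoked when a CloudWatch alarm on IncomingBytes (or similar) fires,
        # e.g. delivered through SNS. UpdateShardCount with UNIFORM_SCALING lets
        # Kinesis perform the necessary shard splits and merges for you.
        summary = kinesis.describe_stream_summary(StreamName=STREAM)["StreamDescriptionSummary"]
        current = summary["OpenShardCount"]
        # Hypothetical: the triggering alarm name tells us which way to scale.
        scale_up = "scale-up" in event.get("alarmName", "")
        target = current * 2 if scale_up else max(1, current // 2)
        if target != current:
            kinesis.update_shard_count(
                StreamName=STREAM,
                TargetShardCount=target,
                ScalingType="UNIFORM_SCALING",
            )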
I have created an npm module which helps with auto-scaling a Kinesis stream.
You can find detailed information at Amazon Kinesis Scaling.
This is an npm module which scales Amazon Kinesis according to current traffic needs. The module continuously monitors traffic in the Kinesis stream and splits and merges shards as needed.
E.g., if your application needs to handle 5,000 req/sec, then you need 5 shards, since a shard accepts up to 1,000 records per second of writes. Since traffic on your application can vary a lot, so does the number of shards.
If your application needs to handle 20,000 req/sec at peak time, then you need 20 shards, but at other times you may only require 5 shards.