Amazon Kinesis Analytics for archival data - amazon-web-services

Background
I have found that Amazon Kinesis Data Analytics can be used for streaming data as well as data present in an S3 bucket.
However, there are some parts of the Kinesis documentation that make me question whether Amazon Kinesis Analytics can be used for a huge amount of existing data in an S3 bucket:
Authoring Application Code
We recommend the following:
In your SQL statement, don't specify a time-based window that is longer than one hour for the following reasons:
Sometimes an application needs to be restarted, either because you updated the application or for Kinesis Data Analytics internal reasons. When it restarts, all data included in the window must be read again from the streaming data source. This takes time before Kinesis Data Analytics can emit output for that window.
Kinesis Data Analytics must maintain everything related to the application's state, including relevant data, for the duration. This consumes significant Kinesis Data Analytics processing units.
Question
Will Amazon Kinesis Analytics be good for this task?

The primary use case for Amazon Kinesis Analytics is stream data processing. For this reason, you attach an Amazon Kinesis Analytics application to a streaming data source. You can optionally include reference data from S3, which is limited in size to 1 GB at this time. We will load data from an S3 object into a SQL table that you can use to enrich the incoming stream.
It sounds like you want a more general-purpose tool for querying data from S3, not a stream data processing solution. I would recommend looking at Presto and Amazon EMR instead of using Amazon Kinesis Analytics.
Disclaimer: I work for the Amazon Kinesis team.

Related

Streaming Data From different Sources to AWS S3

I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to S3 buckets. I know that AWS Kinesis Data Streams offers real-time data streaming and that I can process data using AWS Lambda before sending it to S3. However, it is not clear to me whether we can use AWS Glue Streaming instead of AWS Kinesis Data Streams and AWS Lambda. I have seen some documentation about using AWS Glue Streaming to process real-time data on the fly and send it to S3. So what are the real differences here? Is AWS Glue Streaming ETL a good choice for streaming and processing data in real time and storing it in S3?
A Kinesis data stream with a Lambda consumer will fit as long as the Lambda execution environment limits are sufficient:
15 mins execution time
Memory config
Concurrency limits
When going with a Glue consumer, your Glue jobs can run longer, and Glue also supports Apache Spark for massively parallel processing.
You can also use Kinesis Firehose, which has native integration for delivering data to S3, Elasticsearch, and other destinations, and which doesn't require any changes to the data. You can also have a Lambda function intercept the data and do minimal processing before Firehose delivers it.
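To illustrate the "Lambda intercepting the data before delivery" option, here is a minimal sketch of a Firehose transformation handler. Firehose hands the function a batch of base64-encoded records and expects each one back with the same recordId and a result of Ok, Dropped, or ProcessingFailed. The validation rule (requiring a user_id field) and the JSON payload shape are assumptions made up for this example, not something from the question.

```python
import base64
import json


def lambda_handler(event, context):
    """Validate each Firehose record and return it in the transformation response format."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except (ValueError, TypeError):
            # Unparseable input: mark it as failed so it is not delivered as good data.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        # Hypothetical validation rule: every event must be an object with a "user_id".
        if not isinstance(payload, dict) or "user_id" not in payload:
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        # Newline-terminate so the delivered S3 objects are line-delimited JSON.
        cleaned = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned).decode("utf-8"),
        })
    return {"records": output}
```

Anything heavier than this kind of per-record check (joins, long-running aggregation) is where a Glue streaming job or a custom stream consumer starts to make more sense.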

What's the use cases of Streams and Firehose?

I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, which takes its input from Data Streams or Firehose. But I am having trouble deciding which input method I should use for my system. My requirements are:
It can tolerate latency, but it must not lose data.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis data streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multi consumers
Use with Lambda to insert data in real-time to ElasticSearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
Kinesis Data Streams allows consumers to READ streaming data, and it gives you plenty of options to do so. It is best suited for use cases that require custom processing, a choice of stream processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data to a target destination (S3, Elasticsearch, Splunk, etc). You can also transform streaming data (by using Lambda) before loading it to destination.
Data from failed attempts will be saved to S3.
So, if your goal is only to load data into the Kinesis Data Analytics service with minimal or no pre-processing, then try Kinesis Firehose first.
Please note that you would also need to consider aspects such as cost, development effort, scaling options, and data volume when choosing a service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards), while FH is basically serverless.
KS records are immutable (they persist in stream for its retention period - default 24h), while records in FH are gone from FH the moment they are delivered to destination.
From what you wrote, I think FH should be considered first: you are not concerned about the non-real-time nature of FH, it is much easier to set up and manage, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not losing records if delivery or Lambda processing fails. Consequently, in my view, Firehose addresses your two points well.
You can use Firehose to feed into Analytics, but the question is how Firehose gets its data. You can write your own code to feed it data or use Kinesis Data Streams. Firehose is mainly a delivery system for stream data that can be written to various destinations such as S3, Redshift, or others, with an optional capability to perform data transformation.
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating S3 files from the Kinesis stream but you don't need to clean up those files, then go with the Firehose option. Also, if you don't have a partitioning key requirement that produces many small S3 files, then Firehose is a good solution. If you spend more effort cleaning up the files Firehose produces than you would have spent creating those S3 files yourself, then Firehose isn't a good option.
It also depends on what you do with those S3 files. You need to find out whether using Firehose actually saves you work or money compared with creating the S3 files manually. Remember that you can't reorder the content of the S3 files.

What is the difference/use case for Kinesis services of Firehose, pipeline, data stream

I am confused about the different Kinesis services. I've come across the following terms:
Kinesis streaming data platform
Kinesis Data Stream
Kinesis Data Firehose
Kinesis Video Stream
Kinesis Data Analytics
Kinesis Data Pipeline
Can anyone shed some light on what each of these services is, or whether some are just nicknames? What are their use cases?
Thanks.
There are 4 flavors of Kinesis. Some of the other ones you've presented seem to be aliases, yes. You can confirm this under "Amazon Kinesis capabilities" at https://aws.amazon.com/kinesis/. I've pulled the descriptions from the FAQs.
Data Streams:
Amazon Kinesis Data Streams enables you to build custom applications that process or analyze streaming data for specialized needs.
Data Analytics:
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. (TL;DR, you can process data, in near-realtime using SQL application code)
Video Streams:
Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Data Firehose:
Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today.
Firehose and Data Streams are very similar. The biggest difference is that Firehose will scale for you, whereas Data Streams gives you control over the number of "shards" your stream has. Shards control how much throughput your stream gets.
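To make the "shards control throughput" point concrete, here is a rough sizing sketch. It assumes the documented per-shard limits for a provisioned stream (about 1 MB/s or 1,000 records/s for writes, and about 2 MB/s for reads shared across standard consumers); the function name and parameters are made up for illustration, and the byte constants are approximations.

```python
import math

# Approximate per-shard limits for a provisioned Kinesis data stream.
SHARD_WRITE_RECORDS_PER_SEC = 1000
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # ~1 MB/s ingest
SHARD_READ_BYTES_PER_SEC = 2_000_000    # ~2 MB/s, shared by standard consumers


def estimate_shards(records_per_sec, avg_record_bytes, consumers=1):
    """Return a rough lower bound on the shard count for a given workload."""
    bytes_per_sec = records_per_sec * avg_record_bytes
    by_record_count = records_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    by_write_bytes = bytes_per_sec / SHARD_WRITE_BYTES_PER_SEC
    # Each standard consumer reads the full stream, so read load scales with consumers.
    by_read_bytes = bytes_per_sec * consumers / SHARD_READ_BYTES_PER_SEC
    return max(1, math.ceil(max(by_record_count, by_write_bytes, by_read_bytes)))


if __name__ == "__main__":
    # Many small records: the records-per-second limit dominates.
    print(estimate_shards(records_per_sec=5000, avg_record_bytes=500))
    # Few large records: the bytes-per-second limit dominates.
    print(estimate_shards(records_per_sec=100, avg_record_bytes=50_000))
```

With Firehose you never do this arithmetic yourself, which is most of what "Firehose will scale for you" means in practice.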

What is the right architecture\design to perform javascript-client to aws-database website tracking system

We wish to build a data pipeline system that tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it under the following two constraints:
1) the system runs on Amazon
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints, my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
website javascript client -
aws-firehose data delivery system to S3 - tracks user interactions and loads them into aws-firehose, which eventually writes them to aws-S3.
AWS Lambda (Python) - a periodic task that pulls the daily events from AWS-S3 and loads them into AWS-RDS.
The reason I have chosen AWS-RDS is its cost-effectiveness for this objective.
I would appreciate any comments on the implementation above, or any other architecture you would recommend instead.
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with the above design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tend to grow rapidly
Need to handle load balancing, failure scenarios and other complexities for lambda
You need to handle data transformation for RDS as it expects structured data to be ingested into relational tables
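As a rough sketch of that last point, the Lambda step has to turn raw event text into rows before inserting them. The example below assumes the S3 objects contain newline-delimited JSON with user_id, event_type, and ts fields (all hypothetical names), and it uses an in-memory SQLite database as a stand-in for RDS so that it runs anywhere; a real job would use the driver for your RDS engine instead.

```python
import json
import sqlite3


def parse_events(ndjson_text):
    """Convert newline-delimited JSON into (user_id, event_type, ts) row tuples."""
    rows = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
            rows.append((event["user_id"], event["event_type"], event["ts"]))
        except (ValueError, KeyError, TypeError):
            # Skip malformed or incomplete events; a real job might log them instead.
            continue
    return rows


def load_events(conn, rows):
    """Insert parsed rows into a relational table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, event_type TEXT, ts TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (user_id, event_type, ts) VALUES (?, ?, ?)", rows
    )
    conn.commit()


if __name__ == "__main__":
    sample = '{"user_id": "u1", "event_type": "click", "ts": "2020-01-01T00:00:00Z"}\n'
    conn = sqlite3.connect(":memory:")  # stand-in for the RDS connection
    load_events(conn, parse_events(sample))
    print(conn.execute("SELECT * FROM events").fetchall())
```

Note that Firehose does not add a delimiter between records on its own, so the line-per-event assumption only holds if the producer or a transformation step appends the newline.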
The proposal to store the data in S3 through Firehose sounds like a good solution. But please keep in mind that the minimum buffer interval for Firehose is one minute, so your application needs to tolerate this minor latency. You may use Kinesis Streams to get millisecond latency, but then you need to manage your own application code and instances to handle Streams.
After ingesting data into Kinesis Firehose or Streams, you may also explore the alternatives below:
Use Kinesis Analytics to track web users' activity in real time if it's available in your AWS region. It's only available in selected AWS regions currently
Within Firehose, transform your data using Lambda and store it in S3 in an optimized format for further analysis with AWS Athena
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it still may be the best solution for time series analysis. Exploring Redshift, Redshift Spectrum, and formatted data stored in S3 may still be a cost-effective solution with better capabilities
Here are a few references from AWS that you may want to go through before deciding on the solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey, this is getting more and more common.
Generally, the pattern is to send click events to Kinesis Streams; then you can monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, as well as incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using Kafka on AWS. Alternatively, you could run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already provides: real-time-web-analytics-with-kinesis

what is difference between Kinesis Streams and Kinesis Firehose?

Firehose is fully managed whereas Streams is manually managed.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks..
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
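For a sense of what "write your own code" involves on the Data Streams side, here is a minimal polling sketch for a single shard. It takes the client as a parameter and only relies on the get_shard_iterator and get_records calls as exposed by the boto3 Kinesis client; the function name, the handle callback, and the batch cap are made up for this example. A production consumer would also need checkpointing, pauses between polls, and handling for multiple shards, which is what the Kinesis Client Library or a Lambda event source mapping gives you.

```python
def drain_shard(client, stream_name, shard_id, handle, max_batches=10):
    """Read up to max_batches batches from one shard, passing each payload to handle."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
    )["ShardIterator"]

    processed = 0
    for _ in range(max_batches):
        if iterator is None:
            break  # the shard has been closed and fully read
        response = client.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(record["Data"])  # raw bytes as written by the producer
            processed += 1
        iterator = response.get("NextShardIterator")
    return processed
```

With boto3 installed and credentials configured, the client argument would be boto3.client("kinesis"), and handle would be whatever writes to your destination. Firehose replaces all of this with configuration, as long as your destination is one it supports.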