Amazon Kinesis Firehose Buffering to S3

I'm attempting to price out a streaming data / analytic application deployed to AWS and looking at using Kinesis Firehose to dump the data into S3.
My question is: when pricing out the S3 costs for this, I need to figure out how many PUTs I will need.
I know that Firehose buffers the data and then flushes it out to S3; however, I'm unclear on whether it writes a single "file" containing all of the records accumulated up to that point, or whether it writes each record individually.
So, assuming I set the buffer size / interval to an optimal amount based on record size, does the number of S3 PUTs equal the number of records, or the number of flushes that Firehose performs?

Having read a substantial amount of AWS documentation, I respectfully disagree with the assertion that S3 will not charge you.
You will be billed separately for charges associated with Amazon S3 and Amazon Redshift usage including storage and read/write requests. However, you will not be billed for data transfer charges for the data that Amazon Kinesis Firehose loads into Amazon S3 and Amazon Redshift. For further details, see Amazon S3 pricing and Amazon Redshift pricing. [emphasis mine]
https://aws.amazon.com/kinesis/firehose/pricing/
What they are saying is that you will not be charged anything additional by Kinesis Firehose for the transfers beyond the $0.035/GB ingestion charge, but you will pay for the requests made against your bucket. (Data inbound to a bucket is always free of per-gigabyte transfer charges.)
In the final analysis, though, you appear to be in control of the rough number of PUT requests against your bucket, based on some tunable parameters:
Q: What is buffer size and buffer interval?
Amazon Kinesis Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. You can configure buffer size and buffer interval while creating your delivery stream. Buffer size is in MBs and ranges from 1MB to 128MB. Buffer interval is in seconds and ranges from 60 seconds to 900 seconds.
https://aws.amazon.com/kinesis/firehose/faqs/#creating-delivery-streams
Unless it is collecting and aggregating the records into larger files, I don't see the point of the buffer size and buffer interval... however, without firing up the service and taking it for a spin, I can (unfortunately) only speculate.
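If it helps to see where those knobs live, here is a minimal boto3 sketch of creating a delivery stream with the buffer size and interval set to their maximums (the stream name, role ARN, and bucket ARN are placeholders, not values from the question):

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that buffers up to 128 MB or 900 seconds,
# whichever limit is reached first, before each PUT to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="example-stream",          # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",   # placeholder
        "BucketARN": "arn:aws:s3:::example-bucket",                   # placeholder
        "BufferingHints": {
            "SizeInMBs": 128,
            "IntervalInSeconds": 900,
        },
    },
)
```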

I don't believe you pay anything extra for the write operation to S3 from Firehose.
You will be billed separately for charges associated with Amazon S3 and Amazon Redshift usage including storage and read/write requests. However, you will not be billed for data transfer charges for the data that Amazon Kinesis Firehose loads into Amazon S3 and Amazon Redshift. For further details, see Amazon S3 pricing and Amazon Redshift pricing.
https://aws.amazon.com/kinesis/firehose/pricing/

The cost is one S3 PUT per delivery performed by Kinesis Firehose, not one per record.
So one Firehose flush is one PUT:
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-ingestion-methods.html
https://forums.aws.amazon.com/thread.jspa?threadID=219275&tstart=0
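If one flush equals one PUT, a back-of-the-envelope estimate only needs the flush rate. A rough sketch (the buffer interval and the $0.005 per 1,000 PUTs figure are illustrative assumptions; S3 request pricing varies by region):

```python
# Rough estimate: one Firehose flush = one S3 PUT (per delivery stream).
buffer_interval_s = 300          # assume a flush at most every 5 minutes
seconds_per_month = 30 * 24 * 3600

flushes_per_month = seconds_per_month / buffer_interval_s   # worst case: interval always expires
put_price_per_1000 = 0.005       # USD, standard S3 PUT pricing (region-dependent)

monthly_put_cost = flushes_per_month / 1000 * put_price_per_1000
print(f"~{flushes_per_month:,.0f} PUTs/month ≈ ${monthly_put_cost:.2f}")
# ~8,640 PUTs/month ≈ $0.04
```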

Related

AWS SNS create million files

I have an event stream which sends millions of events through SNS every day. Through a Lambda, these events are then stored in S3, but each in its own file. The total size of these events is not much (less than 1 GB), but moving/deleting a day's worth of files, each only a few bytes in size, becomes a long process. Is there a way I can store these SNS messages in larger files (or even a single file)?
I'd have the Lambda write the events to Kinesis Data Firehose and use that to batch the events up to a certain size threshold or time window, and then have Firehose deliver those to S3.
Here are some resources for that:
S3 Destination for the delivery stream
S3 Destination buffer size & interval
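To complement those resources, here is a minimal sketch of the Lambda side, assuming a delivery stream named sns-events-to-s3 already exists (the name is hypothetical):

```python
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "sns-events-to-s3"   # hypothetical delivery stream name

def handler(event, context):
    """Forward incoming SNS messages to Firehose so they are batched into larger S3 objects."""
    records = [
        {"Data": (record["Sns"]["Message"] + "\n").encode("utf-8")}
        for record in event["Records"]
    ]
    # PutRecordBatch accepts up to 500 records (4 MiB total) per call.
    firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```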
SNS with Kinesis Data Firehose looks like a perfect fit for this use case.
AWS recently announced Kinesis Data Firehose support for SNS; on the Firehose side you can add buffering conditions for S3.
Kinesis Data Firehose buffers incoming data before delivering it (backing it up) to Amazon S3. You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3
If you want to transform or process your events, you can use Lambda as well.
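For reference, a Firehose transformation Lambda follows a fixed contract: it receives base64-encoded records and must return each recordId with a result and the re-encoded data. A minimal sketch (the transform itself is just illustrative):

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: wrap each event as JSON and append a newline."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = json.dumps({"event": payload.strip()}) + "\n"   # illustrative transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",            # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```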

What are the use cases of Streams and Firehose?

I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, which takes its input from either Data Streams or Firehose. But I am having trouble deciding which input method I should use for my system. My requirements are:
It can tolerate latency, but data shouldn't be lost.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis data streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multiple consumers
Use with Lambda to insert data in real-time to ElasticSearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
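To make the "custom code" point in the comparison above concrete, here is a minimal sketch of a Data Streams consumer with boto3 (the stream name is hypothetical); with Firehose none of this polling code exists, because delivery to S3 is handled for you:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "payment-logs"   # hypothetical stream name

# With Data Streams you write (and scale) the consumer yourself, shard by shard.
for shard in kinesis.list_shards(StreamName=STREAM)["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",   # start from the oldest retained record
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            print(record["Data"])           # raw bytes of each event
        if not resp["Records"] and resp["MillisBehindLatest"] == 0:
            break                           # caught up with the tip of the shard
        iterator = resp.get("NextShardIterator")
        time.sleep(0.2)                     # stay under the per-shard read limits
```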
Kinesis Data Streams allows consumers to READ streaming data, and it gives you plenty of options to do so. It is best suited for use cases that require custom processing, a choice of stream-processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data to a target destination (S3, Elasticsearch, Splunk, etc.). You can also transform streaming data (by using Lambda) before loading it to the destination.
Data from failed attempts will be saved to S3.
So, if your goal is only to load data into the Kinesis Data Analytics service with minimal or no pre-processing, then try Kinesis Firehose first.
Please note that you would also need to consider aspects such as cost, development effort, scaling options, and the volume of data when choosing the proper service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards), while FH is basically serverless.
KS records are immutable (they persist in the stream for its retention period, 24h by default), while records in FH are gone from FH the moment they are delivered to the destination.
From what you wrote, I think FH should be considered first: you are not concerned about the non-real-time nature of FH, it is much easier to manage and set up, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not losing records if delivery or Lambda processing fails. Consequently, in my view, Firehose addresses your two points well.
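For what it's worth, here is a hedged sketch of how that backup setting looks when creating the delivery stream with boto3 (the stream name, ARNs, and prefixes are placeholders): S3BackupMode "Enabled" keeps a copy of all source records, while records that fail delivery or Lambda processing land under the error prefix:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="payment-logs",        # hypothetical
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",   # placeholder
        "BucketARN": "arn:aws:s3:::payment-logs-bucket",              # placeholder
        "ErrorOutputPrefix": "errors/",       # failed records end up here
        "S3BackupMode": "Enabled",            # also keep a copy of all source records
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "BucketARN": "arn:aws:s3:::payment-logs-backup",          # placeholder
        },
    },
)
```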
You can use Firehose to feed into Analytics, but the question is how Firehose gets its data. You can write your own code to feed data in, or use Kinesis Data Streams. Firehose is mainly a delivery system for streaming data that can be written to various destinations such as S3, Redshift, or others, with an optional capability to perform data transformation.
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating S3 files from the Kinesis stream and you don't require any cleanup of those S3 files, then go with the Firehose option. Also, if you don't have a partitioning-key requirement that produces many small S3 files, Firehose is a good solution. If you end up doing more cleanup of the Firehose-produced files than if you had created those S3 files yourself, then Firehose isn't a good option.
It also depends on what you do with those S3 files. You need to work out whether you are saving any work/money by using Firehose versus creating the S3 files manually. Remember that you can't reorder the content of the S3 files.

How to save data from a Lambda function into S3 when there is too much incoming data per millisecond?

I have a process that publishes data to AWS IoT Core, which triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records in a few seconds, and when I check the bucket I see I have lost around 10% of the data. If I add a sleep in the Lambda function, it runs beyond the 15-minute limit.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into IoT-Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled due to hitting concurrency limits, and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per region. This could potentially be fixed by requesting an increase in the maximum number of concurrent executions.
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size, and then output multiple messages to a single Amazon S3 object. This can be a good way to handle large volumes of data, and it also makes the resulting objects in S3 easier to work with because fewer objects are created. This makes them faster to query and process later (e.g. with Amazon Athena).
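As a rough sketch of that rules-engine route (the rule name, topic filter, and ARNs are placeholders), an IoT topic rule can forward every matching message straight into a Firehose delivery stream, which then batches them into S3 objects:

```python
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="events_to_firehose",               # hypothetical rule name
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/events'",   # placeholder topic filter
        "actions": [{
            "firehose": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-to-firehose",  # placeholder
                "deliveryStreamName": "iot-events",                            # hypothetical
                "separator": "\n",    # newline-delimit records in the S3 objects
            }
        }],
    },
)
```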
Going from an IoT Core rule directly to a Lambda can be fragile.
You can use Kinesis to buffer the data or Firehose to stream it directly to S3. These are standard patterns that AWS recommend for IoT in the AWS Well-Architected framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).

CloudWatch Logs storage cost vs S3 cost

I have an EC2 instance which is running an Apache application.
I have to store my Apache logs somewhere. For this, I have used two approaches:
CloudWatch Agent to push logs to CloudWatch
A cron job to push the log file to S3
I have used both of these methods and both work fine for me. But I am a little worried about the cost.
Which of these will have minimum cost?
S3 Pricing is basically is based upon three factors:
The amount of storage.
The amount of data transferred every month.
The number of requests made monthly.
The cost for data transfer between S3 and AWS resources within the same region is zero.
According to CloudWatch pricing for logs:
All log types: there is no Data Transfer IN charge for any of CloudWatch. Data Transfer OUT from CloudWatch Logs is charged.
Pricing details for CloudWatch Logs:
Collect (Data Ingestion): $0.50/GB
Store (Archival): $0.03/GB
Analyze (Logs Insights queries): $0.005/GB of data scanned
Refer CloudWatch pricing for more details.
Similarly, according to AWS, S3 pricing differs region-wise.
e.g. for N. Virginia:
S3 Standard Storage
First 50 TB / month: $0.023 per GB
Next 450 TB / month: $0.022 per GB
Over 500 TB / month: $0.021 per GB
Refer S3 pricing for more details.
Hence, we can conclude that sending logs to S3 will be more cost-effective than sending them to CloudWatch.
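As a rough worked example for, say, 100 GB of logs per month (the volume, the hourly upload schedule, and the us-east-1 list prices are assumptions; actual prices vary by region):

```python
gb_per_month = 100                      # example log volume

# CloudWatch Logs: ingestion + storage
cw_ingest  = gb_per_month * 0.50        # $0.50/GB collected
cw_storage = gb_per_month * 0.03        # $0.03/GB-month archived
cw_total   = cw_ingest + cw_storage     # -> $53.00

# S3: storage + a handful of PUT requests (hourly cron upload)
s3_storage = gb_per_month * 0.023       # $0.023/GB-month (first 50 TB tier)
s3_puts    = 24 * 30 / 1000 * 0.005     # ~720 PUTs at $0.005 per 1,000 -> ~$0.004
s3_total   = s3_storage + s3_puts       # -> ~$2.30

print(f"CloudWatch Logs ≈ ${cw_total:.2f}, S3 ≈ ${s3_total:.2f} per month")
```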
They both have similar storage costs, but CloudWatch Logs has an additional ingest charge.
Therefore, it would be lower cost to send straight to Amazon S3.
See: Amazon CloudWatch Pricing – Amazon Web Services (AWS)

Amazon Kinesis - Usage and Shard Hours

My team is using Amazon Kinesis to output the results of queries on other datasets to our own S3 bucket.
Although we were not running the queries often at all, we saw in our billing console that we had already used 24,000 shard hours in December alone.
Does anyone know if Kinesis charges when the shards are actually being used, or simply for them being up and existing?
You are charged for each shard at an hourly rate. If you enable extended data retention you are charged an additional rate for each shard hour. It does not matter whether you are actually using the shard.
Amazon Kinesis Data Streams Pricing
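To make the "just up and existing" point concrete, here is a rough back-of-the-envelope reading of those 24,000 shard hours (the days elapsed and the $0.015 per shard-hour us-east-1 list price are assumptions):

```python
# Shard-hours accrue for every provisioned shard, whether or not you put data through it.
shard_hours_billed = 24_000
days_elapsed = 25                        # e.g. partway through December
hours_elapsed = days_elapsed * 24        # 600 hours

provisioned_shards = shard_hours_billed / hours_elapsed
price_per_shard_hour = 0.015             # USD, us-east-1 list price (region-dependent)

print(f"~{provisioned_shards:.0f} shards provisioned, "
      f"${shard_hours_billed * price_per_shard_hour:,.2f} so far")
# ~40 shards provisioned, $360.00 so far
```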