AWS SNS creating millions of files - amazon-web-services

I have an event stream which sends millions of events through SNS every day. Through a Lambda, these events are then stored in S3, each in its own file. The total size of these events is not much (less than 1 GB), but moving or deleting a day's worth of files, each only a few bytes in size, becomes a long process. Is there a way I can store these SNS events in larger files (or even a single file)?

I'd have the Lambda write the events to Kinesis Data Firehose and use that to batch the events up to a certain size threshold or time window, and then have Firehose deliver those batches to S3.
Here are some resources for that:
S3 Destination for the delivery stream
S3 Destination buffer size & interval
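A minimal sketch of that Lambda, assuming the SNS messages should land in S3 as newline-delimited JSON and that a delivery stream named sns-events-to-s3 (a hypothetical name) already exists:

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "sns-events-to-s3"  # hypothetical delivery stream name


def handler(event, context):
    # Each SNS invocation carries one or more records; forward the message
    # bodies to Firehose as newline-delimited JSON.
    records = [
        {"Data": (record["Sns"]["Message"] + "\n").encode("utf-8")}
        for record in event["Records"]
    ]
    # put_record_batch accepts up to 500 records per call.
    response = firehose.put_record_batch(
        DeliveryStreamName=DELIVERY_STREAM, Records=records
    )
    if response["FailedPutCount"]:
        # In production you would retry just the failed records.
        raise RuntimeError(f"{response['FailedPutCount']} records failed")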

SNS with Kinesis Data Firehose looks like a perfect fit for this use case.
AWS recently announced Kinesis Data Firehose support for SNS subscriptions, and on the Firehose side you can add buffering conditions for S3.
Kinesis Data Firehose buffers incoming data before delivering it (backing it up) to Amazon S3. You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3
If you want to transform or process your events, you can use a Lambda as well.
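If you go that route, the subscription can be created with the SDK; a sketch with boto3, where the topic, delivery stream, and subscription role ARNs are placeholders:

import boto3

sns = boto3.client("sns")

# Placeholder ARNs: your SNS topic, the Firehose delivery stream that buffers
# into S3, and an IAM role that allows SNS to call the Firehose API.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:events",
    Protocol="firehose",
    Endpoint="arn:aws:firehose:us-east-1:123456789012:deliverystream/events-to-s3",
    Attributes={
        "SubscriptionRoleArn": "arn:aws:iam::123456789012:role/sns-firehose-role"
    },
    ReturnSubscriptionArn=True,
)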

Related

Kinesis Firehose delivers data from DynamoDB Stream to S3: why does the number of JSON objects per file differ?

I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution, and I have built the pipeline:
DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3
But I found that the files in S3 have different numbers of JSON objects. Some files have 7 JSON objects, some have 6 or 4. I have done the ETL in Lambda, so S3 only saves REMOVE items, and the JSON has been unmarshalled.
I thought it would be one JSON object per file, since the TTL value is different for each item, and the Lambda would deliver an item immediately when it is deleted by TTL.
Is it because Kinesis Firehose batches the items? (That is, it waits for some time while collecting more items before saving them to a file.) Or is there another reason? Could I estimate how many files it will create if a DynamoDB item is deleted by TTL every 5 minutes?
Thank you in advance.
Kinesis Firehose splits your data based on buffer size or interval.
Let's say you have a buffer size of 1MB and an interval of 1 minute.
If less than 1 MB arrives within the 1-minute interval, Kinesis Firehose will still create a batch file from whatever data it has received.
This is likely what is happening in scenarios where little data is arriving. You can adjust the buffer size and interval to your needs, e.g. you could increase the interval to collect more items within a single batch.
You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
From the AWS Kinesis Firehose Docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html
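As a sketch, the buffering hints can be set when creating the delivery stream; the stream name, role, and bucket ARNs below are placeholders:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="ddb-ttl-archive",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",
        "BucketARN": "arn:aws:s3:::my-archive-bucket",
        # Deliver when 64 MiB has accumulated or 15 minutes have passed,
        # whichever comes first.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 900},
    },
)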

Can Kinesis Firehose receive content uncompressed from CloudWatch Logs subscription?

I'm using Kinesis Firehose to copy application logs from CloudWatch Logs into S3 buckets.
Application logs are written to CloudWatch
A Kinesis subscription on the log group pulls the log events into a Kinesis stream.
A firehose delivery stream uses a Lambda function to decompress and transform the source record.
Firehose writes the transformed record to an S3 destination with GZIP compression enabled.
However, there is a problem with this flow. Often I've noticed that the Lambda transform function fails because the output data exceeds the 6 MiB response payload limit for Lambda synchronous invocation. It makes sense this would happen because the input is compressed but the output is not compressed. Doing it this way seems like the only way to get the file extension and MIME type set correctly on the resultant object in S3.
Is there any way to deliver the input to the Lambda transform function uncompressed?
This would align the input/output sizes. I've already tried reducing the buffer size on the Firehose delivery stream, but the buffer size limit seems to be on compressed data, not raw data.
No, it doesn't seem possible to change whether the input from CloudWatch Logs is compressed. CloudWatch Logs will always push GZIP-compressed payloads onto the Kinesis stream.
For confirmation, take a look at kinesis-firehose-cloudwatch-logs-processor, the AWS reference implementation of the newline handler for CloudWatch Logs. This handler accepts GZIP-compressed input and returns the decompressed messages as output. To work around the 6 MiB limit and avoid "body size is too long" error messages, the reference handler slices the input into two parts: records that fit within the 6 MiB limit, and the remainder, which is re-inserted into Kinesis using PutRecordBatch.
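The core of such a transform looks roughly like the sketch below (simplified; it does not include the re-ingestion of oversized batches):

import base64
import gzip
import json


def handler(event, context):
    output = []
    for record in event["records"]:
        # CloudWatch Logs pushes GZIP-compressed, base64-encoded payloads.
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload["messageType"] != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # Re-emit the log events as newline-delimited text.
        data = "".join(e["message"] + "\n" for e in payload["logEvents"])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}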
CloudWatch Logs always delivers in compressed format, which is a benefit from a cost and performance perspective, but I understand the frustration of not having the correct file extension in S3.
What you could do:
1) Have your lambda uncompress on read and compress on write.
2) Create an S3 event trigger on ObjectCreated that renames the file with the correct extension. Due to the way Firehose writes to S3 you cannot use a suffix filter, so your Lambda will need to check whether it has already renamed the object.
Lambda logic (pseudocode):
if object does not end in .gz
then
    aws s3 mv object object.gz
end if
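A runnable sketch of that rename Lambda, assuming it is triggered by an S3 ObjectCreated event (S3 has no real rename, so it is a copy followed by a delete):

import urllib.parse
import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Skip objects that were already renamed, otherwise this Lambda
        # would keep re-triggering itself.
        if key.endswith(".gz"):
            continue
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=key + ".gz",
        )
        s3.delete_object(Bucket=bucket, Key=key)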

Call Kinesis Firehose vs Kinesis Stream directly from Lambda

I need to push some data to S3 from a Lambda. The data coming into the Lambda is from DynamoDB Streams. For pushing to an S3 bucket, Firehose is considered best, as it batches and buffers the data before pushing to S3 and provides a retry strategy, so I am using Firehose instead of writing directly to S3.
But I observe that a lot of people push data from Lambda to a Kinesis Stream, from which the data is pushed into Kinesis Firehose, instead of pushing to Firehose directly from AWS Lambda. Is there any reason for doing it this way? Any benefits? What are the drawbacks of pushing to Kinesis Firehose directly?
If Amazon Kinesis Data Firehose meets your needs, then definitely use it! It takes care of most of the work for you, compared to normal Kinesis Streams.
The only time you would not use Firehose is when you have a different destination (e.g. you want to process the data on Amazon EC2 instances) or you want more control over the streams and shards (e.g. to process certain producers on specific shards to retain ordering on a per-shard basis).

Is Kinesis Firehose a replacement to Kinesis Streams?

Kinesis Firehose, as well as Kinesis Streams, is used to load streaming data, as per the details mentioned in the AWS blogs. There is no concept of shards or maintenance in the case of Firehose. In that case, is Kinesis Firehose a replacement for Kinesis Streams?
Amazon Kinesis Firehose is an easy way to create a stream where data is sent to one of:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
You can also create a Lambda function that can manipulate the data on the way through.
If the above suits your needs, then Firehose could be considered a replacement for Kinesis Streams. However, Kinesis Streams offers more flexibility so it is not an exact replacement.
Kinesis Firehose is not a replacement for Kinesis Streams, although there are several use cases that Kinesis Firehose has taken over since its introduction.
Kinesis Streams is used to buffer streaming data from producers and stream it into custom applications for data processing and analysis, which consume the temporarily buffered stream data.
Data producers push data to Kinesis Streams -> Applications read the data from the stream and process it.
Kinesis Firehose is used to capture and load streaming data into other Amazon services such as S3 and Redshift so that analysis can take place later on.
Data producers push data to Kinesis Firehose -> Data Transformation using Lambda -> Store in S3 or Redshift.
The two can also be used in combination, where Kinesis Streams streams the data into Kinesis Firehose so that it can be persisted after processing.
Things to take into account when choosing which service to use are the limits and scalability of each solution.
AWS Firehose has a fixed limit of 5 MB/sec or 5,000 records/sec (details here), although it can be increased by contacting AWS through a request form.
On the other hand, AWS Kinesis can be scaled easily by increasing the number of shards for each stream (up to 500 shards by default). The main issue here is that each shard has its own cost, and a single resharding operation can at most double (or halve) the current number of shards.
As Ashan said, these services serve different purposes, but you can use each one on its own or combine them according to your needs. The main advantage here is that a Kinesis Stream can be consumed by many consumers and fed by many producers. A Firehose delivery stream, on the other hand, acts as a consumer for another source of data (such as a Kinesis Stream) and can output data to only one destination (S3, Redshift, Elasticsearch, Splunk).
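As an illustration of the resharding constraint mentioned above, a stream's shard count is changed with UpdateShardCount, and the target must stay within 2x of the current count per call; a sketch with a placeholder stream name:

import boto3

kinesis = boto3.client("kinesis")

# Look up the current open shard count, then double it. A single
# UpdateShardCount call only accepts a target within 2x of the current count.
summary = kinesis.describe_stream_summary(StreamName="my-stream")
shards = summary["StreamDescriptionSummary"]["OpenShardCount"]
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=shards * 2,
    ScalingType="UNIFORM_SCALING",
)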
I'm not sure how it would be a replacement, since there is no persistence of data within Kinesis Firehose itself, unless you mean it in the context where there is no need for data persistence, or perhaps it's an issue of cost. In that case your option would be to analyze the data as soon as it comes in, which is what Kinesis Firehose does, eventually storing it in S3 or an Elasticsearch cluster.
No, just different purposes.
With Kinesis Streams, you build applications: producers use the Kinesis Producer Library to put data into a stream, an application using the Kinesis Client Library processes it, and the Kinesis Connector Library sends the processed data to S3, Redshift, DynamoDB, or Elasticsearch.
With Kinesis Firehose it's a bit simpler: you create the delivery stream and send the data to S3, Redshift, or Elasticsearch directly (using the Kinesis Agent or the API), and it is stored in those services.
Kinesis Streams, on the other hand, can store the data for up to 7 days.
You may use Kinesis Streams if you want to do custom processing of the streaming data. With Kinesis Firehose you are simply ingesting it into S3, Redshift, or Elasticsearch.
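The API difference reflects this: a Kinesis stream record needs a partition key (which determines the shard), while a Firehose record is just handed to the delivery stream. A sketch with placeholder stream names:

import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

payload = b'{"event": "example"}\n'

# Kinesis Streams: you choose a partition key, which controls shard placement
# and therefore per-shard ordering for your consumers.
kinesis.put_record(StreamName="my-stream", Data=payload, PartitionKey="user-42")

# Kinesis Firehose: no shards or partition keys; Firehose buffers and delivers
# to the configured destination (S3, Redshift, Elasticsearch, ...).
firehose.put_record(DeliveryStreamName="my-delivery-stream", Record={"Data": payload})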

Is there a way to put data into Kinesis Firehose from S3 bucket?

I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute) and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose, because Firehose will simply batch it up, store it in temporary S3 files, and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, or every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a simple, loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
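A rough sketch of that Lambda, assuming the S3 notifications land in an SQS queue and that the Redshift Data API is used to issue the COPY; the queue URL, bucket, table, cluster, and role below are placeholders, and error handling is omitted:

import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # placeholder
MANIFEST_BUCKET = "my-manifest-bucket"  # placeholder


def handler(event, context):
    # Drain the queue and collect the S3 object URLs reported by the bucket events.
    entries = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            for rec in body.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if not entries:
        return

    # Write a Redshift manifest listing every file to load in this batch.
    manifest_key = f"manifests/{context.aws_request_id}.manifest"
    s3.put_object(
        Bucket=MANIFEST_BUCKET,
        Key=manifest_key,
        Body=json.dumps({"entries": entries}).encode("utf-8"),
    )

    # Trigger the COPY via the Redshift Data API (placeholder cluster/table/role).
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="loader",
        Sql=(
            f"COPY events FROM 's3://{MANIFEST_BUCKET}/{manifest_key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "MANIFEST FORMAT AS JSON 'auto';"
        ),
    )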
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can send data to other services such as Redshift and Elasticsearch Service.
I don't know whether this will solve your problem, but you can use COPY from S3 to Redshift.
Hope it helps!