I'm using Kinesis Firehose to copy application logs from CloudWatch Logs into S3 buckets.
1) Application logs are written to CloudWatch Logs.
2) A Kinesis subscription on the log group pulls the log events into a Kinesis stream.
3) A Firehose delivery stream uses a Lambda function to decompress and transform the source record.
4) Firehose writes the transformed record to an S3 destination with GZIP compression enabled.
However, there is a problem with this flow. I've often noticed that the Lambda transform function fails because the output data exceeds the 6 MiB response payload limit for Lambda synchronous invocation. It makes sense that this happens: the input is compressed but the output is not. Still, decompressing in the transform seems like the only way to get the file extension and MIME type set correctly on the resulting object in S3.
Is there any way to deliver the input to the Lambda transform function uncompressed?
This would align the input/output sizes. I've already tried reducing the buffer size on the Firehose delivery stream, but the buffer size limit seems to be on compressed data, not raw data.
No, it doesn't seem possible to change whether the input from CloudWatch Logs is compressed. CloudWatch Logs will always push GZIP-compressed payloads onto the Kinesis stream.
For confirmation, take a look at the AWS reference implementation kinesis-firehose-cloudwatch-logs-processor, the transform handler for CloudWatch Logs that emits newline-delimited output. This handler accepts GZIP-compressed input and returns the decompressed messages as output. To work around the 6 MiB limit and avoid "body size is too long" errors, the reference handler slices the input into two parts: the records that fit within the 6 MiB limit and the remainder, which is re-inserted into Kinesis using PutRecordBatch.
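For illustration, here is a minimal Python sketch of the decompression step such a transform handler performs (the re-ingestion of oversized output via PutRecordBatch is omitted here); the event shape follows the standard Firehose transformation contract:

import base64
import gzip
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        # CloudWatch Logs pushes GZIP-compressed JSON payloads onto the stream
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # Emit the raw log messages, newline-delimited and uncompressed
        text = "\n".join(e["message"] for e in payload["logEvents"]) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(text.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}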
CloudWatch Logs always delivers in compressed format, which is a benefit from a cost and performance perspective. But I understand your frustration about not having the correct file extension in S3.
What you could do:
1) Have your lambda uncompress on read and compress on write.
2) Create an S3 event trigger on ObjectCreated that renames the file with the correct extension. Due to the way Firehose writes to S3 you cannot use a suffix filter, so your Lambda will need to check whether it has already done the rename (a boto3 sketch follows the pseudocode below).
lambda logic
if object key does not end in .gz
then
aws s3 mv object object.gz
end if
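A rough boto3 version of that logic, assuming the function is subscribed to ObjectCreated events on the destination bucket (the copy-then-delete is what aws s3 mv does under the hood):

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by s3:ObjectCreated:* on the Firehose destination bucket
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = unquote_plus(rec["s3"]["object"]["key"])
        if key.endswith(".gz"):
            continue  # already renamed by a previous invocation
        # S3 has no native rename, so copy to the new key and delete the original
        s3.copy_object(Bucket=bucket, Key=key + ".gz",
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)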
Related
I have some fairly large datasets (upwards of 60k rows in a CSV file) that I need to ingest into Elasticsearch on a daily basis (to keep the data updated).
I currently have two lambda functions handling this.
Lambda 1:
A Python Lambda (Node.js would run out of memory doing this task) is triggered when a .csv file is added to S3 (this file could have upwards of 60k rows). The Lambda converts it to JSON and saves it to another S3 bucket.
Lambda 2:
A Node.js Lambda that is triggered by the .json files generated by Lambda 1. This Lambda uses the Elasticsearch Bulk API to try to insert all of the data into ES.
However, because of the large amount of data, we hit the ES API rate limit and fail to insert much of the data.
I have tried splitting the data and uploading smaller amounts at a time, but that makes for a very long-running Lambda function.
I have also looked at adding the data to a Kinesis stream, but even that has a limit on how much data you can add in each operation.
I am wondering what the best solution may be for inserting large amounts of data like this into ES. My next thought is splitting the .json files into multiple smaller .json files and triggering the Lambda that adds data to ES once for each smaller file. However, I am concerned that I would still just hit the rate limit of the ES domain.
Edit: Looking into the Kinesis Firehose option, this seems like the best choice, as I can set the buffer size to a maximum of 5 MB (the ES Bulk API limit).
However, Firehose has a 1 MB limit per record, so I'd still need some processing in the Lambda that pushes to Firehose to split up the data before pushing.
I'd suggest designing the application to use SQS for queuing your messages rather than Firehose, which is more expensive and perhaps not the best option for your use case. Amazon SQS provides a lightweight queueing solution and is cheaper than Firehose (https://aws.amazon.com/sqs/pricing/).
Below is how it can work -
Lambda 1 converts each row to JSON and posts each JSON to SQS.
(Assuming each JSON is less than 256KB)
The SQS queue acts as an event source for Lambda 2 and triggers it in batches of, say, 5000 messages.
(Ref - https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
Lambda 2 uses the payload received from SQS to insert into Elasticsearch using the Bulk API (a sketch follows at the end of this answer).
Illustration - S3 → Lambda 1 → SQS → Lambda 2 → Elasticsearch
The batch size can be adjusted based on your observation of how the Lambda is performing. Make sure to adjust the visibility timeout and set up a DLQ so it runs reliably.
This also reduces S3 costs by not storing the intermediate JSON in S3 for the second Lambda to pick up; the data ends up in Elasticsearch anyway, so duplicating it should be avoided.
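A minimal sketch of Lambda 2 (written in Python here for consistency with the other examples, though the original is Node.js), assuming the elasticsearch-py client, an endpoint supplied in a hypothetical ES_ENDPOINT environment variable, and an illustrative index name:

import json
import os
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(os.environ["ES_ENDPOINT"])  # hypothetical env var holding the domain endpoint

def handler(event, context):
    # Each invocation receives up to the configured batch size of SQS messages
    actions = [
        {"_index": "my-index", "_source": json.loads(msg["body"])}  # illustrative index name
        for msg in event["Records"]
    ]
    helpers.bulk(es, actions)  # one Bulk API round-trip per batch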
You could potentially create one record per row and push it to Firehose. When the data in the Firehose stream reaches the configured buffer size, it is flushed to ES. This way only one Lambda is required, which can process the records from the CSV and push them to Firehose.
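A rough sketch of that single Lambda, assuming a hypothetical delivery stream name and CSV rows small enough to stay under Firehose's 1 MB per-record limit:

import csv
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    lines = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8").splitlines()
    rows = [json.dumps(row) + "\n" for row in csv.DictReader(lines)]
    # PutRecordBatch accepts at most 500 records (and 4 MiB) per call
    for i in range(0, len(rows), 500):
        firehose.put_record_batch(
            DeliveryStreamName="csv-to-es",  # hypothetical delivery stream name
            Records=[{"Data": r.encode("utf-8")} for r in rows[i:i + 500]],
        )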
I have an event stream that sends millions of events through SNS every day. Through a Lambda, these events are then stored in S3, but each in its own file. The total size of these events is not much (less than 1 GB), but moving/deleting one day's worth of files, each only a few bytes in size, becomes a long process. Is there a way I can store these SNS messages in larger files (or even a single file)?
I'd have the Lambda write the events to Kinesis Data Firehose and use that to batch the events up to a certain size threshold or time window, then have Firehose deliver those batches to S3 (a rough sketch follows the links below).
Here are some resources for that:
S3 Destination for the delivery stream
S3 Destination buffer size & interval
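A minimal sketch of that Lambda, assuming it is subscribed to the SNS topic and forwards each message to a hypothetical delivery stream:

import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # Forward each SNS message to Firehose; Firehose buffers them into larger S3 objects
    for rec in event["Records"]:
        firehose.put_record(
            DeliveryStreamName="events-to-s3",  # hypothetical delivery stream name
            Record={"Data": (rec["Sns"]["Message"] + "\n").encode("utf-8")},
        )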
SNS with Kinesis Firehose looks like a perfect fit for this use case.
AWS recently announced Kinesis Data Firehose support for SNS; on the Firehose side you can add buffering conditions for S3.
Kinesis Data Firehose buffers incoming data before delivering it (backing it up) to Amazon S3. You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3
In case you want to transform your events or process them you can use lambda as well.
I'm using a Firehose delivery stream to write JSONs to S3. These JSONs represent calls. The stream will often receive a new version of a JSON that brings new info about the represented call.
I would want my Firehose to write each JSON record to a separate S3 object, not grouping them together as it seems to do by default. Each JSON would be written to an S3 key that identifies the call, so that when a new version of a JSON shows up, Firehose replaces its previous version in S3. Is this possible?
I see that I can set up the buffer size that triggers writing to S3, but can I explicitly configure my Firehose stream so it writes exactly one S3 object per record?
There's no Redshift involved.
This is not possible with Amazon Kinesis Data Firehose. It is a simplified service that only has a few configuration options.
Instead, you could use Amazon Kinesis Data Streams:
Send data to the stream
Create an AWS Lambda function that will be triggered whenever data is received by the stream
Code the Lambda function to write the data to the appropriate Amazon S3 object
See: Using AWS Lambda with Amazon Kinesis - AWS Lambda
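A minimal sketch of that Lambda, assuming the bucket name and the callId field in the JSON are placeholders for whatever identifies your calls:

import base64
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for rec in event["Records"]:
        call = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        # One object per record, keyed by the call id so a newer version overwrites the old one
        s3.put_object(
            Bucket="my-calls-bucket",                    # hypothetical bucket
            Key="calls/{}.json".format(call["callId"]),  # hypothetical id field in the JSON
            Body=json.dumps(call).encode("utf-8"),
            ContentType="application/json",
        )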
There are plenty of examples of how data is stored by AWS Firehose in an S3 bucket and passed in parallel to some processing app (like in the picture above).
But I can't find anything about good practices for replaying this data from the S3 bucket in case the processing app crashed and we need to supply it with the historical data we have in S3, which is no longer in Firehose.
I can think of replaying it with Firehose or Lambda, but:
Kinesis Firehose cannot consume from a bucket.
Lambda would need to deserialize the .parquet file to send it to Firehose or a Kinesis Data Stream. And I'm confused by this implicit deserialization, because Firehose serialized it explicitly.
Or maybe there is some other way to put data back from S3 into a stream that I am completely missing?
EDIT: Moreover, if we run a Lambda to push the records to a stream, it will probably have to run for more than 15 minutes. So another option is to run a script on a separate EC2 instance that does it. But this method of extracting data from S3 looks so much more complicated than storing it there with Firehose that it makes me think there should be an easier approach.
The thing that tripped me up was actually that I expected some more advanced serialization than just converting to JSON (Kafka supports Avro, for example).
Regarding replaying records from the S3 bucket: this part of the solution turns out to be significantly more complicated than the part needed for archiving records. While we can archive the stream with the out-of-the-box functions of Firehose, replaying it requires two Lambda functions and two streams.
Lambda 1 (pushes filenames to stream)
Lambda 2 (activated for every filename in the first stream, pushes records from files to second stream)
The first Lambda is triggered manually, scans through all of the S3 bucket's files and writes their names to the first stream. The second Lambda function is triggered by every event in the stream of file names, reads all the records in the named file and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda.
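A rough sketch of Lambda 1, with hypothetical bucket and stream names (Lambda 2 would then read each named file, deserialize the .parquet, and push its records to the second stream):

import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def handler(event, context):
    # Lambda 1: walk the archive bucket and push every object key into the first stream
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="firehose-archive-bucket"):  # hypothetical bucket
        for obj in page.get("Contents", []):
            kinesis.put_record(
                StreamName="replay-file-names",  # hypothetical first stream
                Data=obj["Key"].encode("utf-8"),
                PartitionKey=obj["Key"],
            )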
This solution expects that there are multiple files generated per day, and there are multiple records in every file.
Similar to this solution, but in my case the destination is Kinesis instead of the Dynamo used in the article.
I would like to be able to route data sent to Kinesis Firehose based on the content of the data. For example, if I sent this JSON data:
{
"name": "John",
"id": 345
}
I would like to filter the data based on the id and send it to a subfolder of my S3 bucket, like s3://myS3Bucket/345_2018_03_05. Is this at all possible with Kinesis Firehose or AWS Lambda?
The only way I can think of right now is to create a Kinesis stream for every single one of my possible IDs, point them all at the same bucket, and then send my events to those streams in my application, but I would like to avoid that since there are many possible IDs.
You probably want to use an S3 event notification that gets fired each time Firehose places a new file in your S3 bucket (a PUT). The S3 event notification should call a custom Lambda function that you write, which reads the contents of the S3 file, splits it up, and writes it out to the separate buckets or prefixes (sketched below the link), keeping in mind that each S3 file is likely to contain many records, not just one.
https://aws.amazon.com/blogs/aws/s3-event-notification/
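A rough sketch of such a splitter Lambda, assuming the producers write newline-delimited JSON into Firehose and reusing the bucket name and key pattern from the question as illustrations:

import json
import boto3
from datetime import datetime
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = unquote_plus(rec["s3"]["object"]["key"])
        # A Firehose object concatenates many records; newline-delimited JSON is assumed here
        lines = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8").splitlines()
        grouped = {}
        for line in lines:
            doc = json.loads(line)
            grouped.setdefault(doc["id"], []).append(line)
        for doc_id, docs in grouped.items():
            s3.put_object(
                Bucket="myS3Bucket",
                Key="{}_{:%Y_%m_%d}/{}".format(doc_id, datetime.utcnow(), key.split("/")[-1]),
                Body="\n".join(docs).encode("utf-8"),
            )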
This is not possible out of the box, but here are some ideas...
You can write a Data Transformation Lambda that is triggered by Amazon Kinesis Firehose for every record. You could code the Lambda to save the data to a specific file in S3, rather than having Firehose do it. However, you'd miss out on the record aggregation features of Firehose.
You could use Amazon Kinesis Analytics to look at the record and send the data to a different output stream based on the content. For example, you could have a separate Firehose stream per delivery channel, with Kinesis Analytics queries choosing the destination.
If you use a Lambda to save the data, you would end up with duplicate data in S3: one copy stored by the Lambda and the other stored by Firehose, since the transformation Lambda adds the data back to Firehose. I am not aware of a way to prevent the transformed data returned by the Lambda from being re-added to the stream.