buffer s3 object inputs - amazon-web-services

Does anyone know whether, other than Kinesis Firehose, any other AWS service can catch the S3 ingest event? I am trying to do some analysis on VPC flow logs; the current setup is CloudWatch Logs -> Kinesis Firehose -> S3 -> Athena.
The problem is that Kinesis Firehose can only buffer up to 128 MB, which is too small for me.

Events from Amazon S3 can go to:
AWS Lambda functions
Amazon SNS topic
Amazon SQS queue
So, you could send the messages to an SQS queue and then have a regular process (every hour?) that retrieves many messages and writes them to a single file.
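A minimal sketch of that periodic process follows. It assumes each SQS message body is a standard S3 event notification, so the job reads the objects the messages point to and concatenates them into one larger object. The function names and the `sqs` and `s3` parameters are illustrative; they are expected to behave like boto3 SQS and S3 clients. Messages are deleted only after the merged object has been stored.

```python
import json
from urllib.parse import unquote_plus


def object_refs(message_body):
    """Yield (bucket, key) pairs from one S3 event notification body."""
    for record in json.loads(message_body).get("Records", []):
        s3_info = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def merge_pending_objects(sqs, s3, queue_url, dest_bucket, dest_key, max_batches=50):
    """Concatenate every object referenced in the queue into one S3 object."""
    merged = []
    handled = []
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            for bucket, key in object_refs(msg["Body"]):
                merged.append(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
            handled.append(msg["ReceiptHandle"])
    if merged:
        s3.put_object(Bucket=dest_bucket, Key=dest_key, Body=b"".join(merged))
    # Delete only after the merged object is safely stored.
    for handle in handled:
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=handle)
    return len(merged)
```

This holds everything in memory, which is fine for a sketch but would need streaming or multipart uploads for very large batches.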
Alternatively, you could keep your current setup and run Amazon Athena on a regular basis to join multiple files using CREATE TABLE AS. This selects from the existing files and stores the output in a new table (with a new location). You could even use it to transform the files into a format that is easier to query in Athena (e.g. Snappy-compressed Parquet). The hard part is making sure each input file is included only once in this concatenation process (possibly using SymlinkTextInputFormat).
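As a rough illustration of the CREATE TABLE AS idea, the helper below only builds the query text. The table names, the S3 location, and the optional `where` filter are hypothetical placeholders; the `format`, `write_compression` and `external_location` properties are the Athena CTAS options this sketch relies on.

```python
def build_ctas(source_table, target_table, target_location, where=None):
    """Return an Athena CTAS statement that rewrites a table as Snappy Parquet."""
    query = (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{target_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )
    if where:
        # Optional filter, e.g. to restrict the run to one day's partition.
        query += f"\nWHERE {where}"
    return query


print(build_ctas("flow_logs_raw", "flow_logs_merged", "s3://my-bucket/merged/"))
```

The resulting string could then be submitted to Athena on a schedule, for example from a Lambda function. The target location must be empty before each run, so a new location per run is the simplest choice.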

Related

Is there a way to connect Kinesis Firehose to Lambda and then to S3?

I am creating an architecture that takes data from Amazon Connect --> Kinesis Data Streams --> Kinesis Firehose --> raw data S3 bucket and Lambda for processing --> processed data S3 bucket. I have enabled record pre-processing via the Lambda function in my Firehose configuration, but Firehose does not show up as the trigger for the function. I am also confused about how to send the processed data to S3 from Lambda, as that is not a default option. I know I need to write some sort of code to do so, but I am struggling. Please help.
Does this make sense? Let me know if more information is needed.

Configure Firehose so it writes only one record per S3 object?

I'm using a Firehose delivery stream to write JSONs to S3. These JSONs represent calls. The stream will often receive a new version of a JSON that brings new info about the represented call.
I would want my Firehose to write each JSON record to a separate S3 object, so not grouping them together as it seems to do by default. Each JSON would be written at an S3 key that identifies the call, so that when a new version of a JSON shows up, Firehose replaces its previous version in S3. Is this possible?
I see that I can set up the buffer size that triggers writing to S3, but can I explicitly configure my Firehose stream so it writes exactly one S3 object per record?
There's no Redshift involved.
This is not possible with Amazon Kinesis Data Firehose. It is a simplified service that only has a few configuration options.
Instead, you could use Amazon Kinesis Data Streams:
Send data to the stream
Create an AWS Lambda function that will be triggered whenever data is received by the stream
Code the Lambda function to write the data to the appropriate Amazon S3 object
See: Using AWS Lambda with Amazon Kinesis - AWS Lambda
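The steps above can be sketched roughly as follows. The field name `call_id`, the `calls/` prefix, and the function name are assumptions made for illustration; the `s3` argument is expected to behave like a boto3 S3 client. Kinesis delivers each record's payload base64-encoded under `record["kinesis"]["data"]`.

```python
import base64
import json


def write_calls(event, s3, bucket):
    """Write each Kinesis record to its own S3 object, keyed by call id."""
    written = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        call = json.loads(payload)
        # Hypothetical key layout: one object per call, overwritten by newer versions.
        key = f"calls/{call['call_id']}.json"
        s3.put_object(Bucket=bucket, Key=key, Body=payload)
        written.append(key)
    return written
```

Because the key depends only on the call id, a later version of the same call replaces the earlier object. Using the call id as the Kinesis partition key would keep versions of one call in order within a shard.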

How to transfer data from S3 bucket to Kafka

There are examples and documentation on copying data from Kafka topics to S3 but how do you copy data from S3 to Kafka?
When you read an S3 object, you get a byte stream. And you can send any byte array to Kafka with ByteArraySerializer.
Or you can parse that InputStream to some custom object, then send that using whatever serializer you can configure.
You can find one example of a Kafka Connect process here (which I assume you are comparing to Confluent's S3 Connect writer) - https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/index.html that can be configured to read binary archives or line-delimited text from S3.
Similarly, Apache Spark, Flink, Beam, NiFi, and similar Hadoop-related tools can read from S3 and write events to Kafka as well.
The problem with this approach is that you need to keep track of which files have been read so far, as well as handle partially read files.
Depending on your scenario and how often objects are uploaded, you can either run a Lambda function on each event (e.g. every time a file is uploaded) or run it on a schedule, like a cron job. The Lambda function acts as a producer, using the Kafka API to publish to a topic.
Specifics:
The trigger for the Lambda function can be the S3 object-created event (s3:ObjectCreated:Put), coming either directly from S3 or via CloudWatch Events.
You can run the Lambda function on a schedule if you don't need the objects instantly. An alternative in this case is a cron job on an EC2 instance that has a Kafka producer and permission to read objects from S3, and that keeps pushing them to the Kafka topics.
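A minimal sketch of the event-driven variant, assuming the objects contain line-delimited text and that each line becomes one Kafka message. The function name and parameters are illustrative: `s3` is expected to behave like a boto3 S3 client, and `producer` like a kafka-python `KafkaProducer` (only `send` and `flush` are used).

```python
from urllib.parse import unquote_plus


def publish_new_objects(event, s3, producer, topic):
    """Publish every non-empty line of each newly created S3 object to Kafka."""
    sent = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Reads the whole object into memory; large files would need streaming.
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in data.splitlines():
            if line:
                producer.send(topic, value=line)
                sent += 1
    # Block until buffered messages are delivered before the function returns.
    producer.flush()
    return sent
```

Since the function is invoked once per uploaded object, the bookkeeping of which files were already read is handled by the event delivery rather than by the producer. Retried invocations can still publish the same object twice, so consumers should tolerate duplicates.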

Write to a specific folder in S3 bucket using AWS Kinesis Firehose

I would like to be able to route data sent to Kinesis Firehose based on the content of the data. For example, if I sent this JSON data:
{
"name": "John",
"id": 345
}
I would like to filter the data based on id and send it to a subfolder of my s3 bucket like: S3://myS3Bucket/345_2018_03_05. Is this at all possible with Kinesis Firehose or AWS Lambda?
The only way I can think of right now is to resort to creating a kinesis stream for every single one of my possible IDs and point them to the same bucket and then send my events to those streams in my application, but I would like to avoid that since there are many possible IDs.
You probably want to use an S3 event notification that fires each time Firehose places a new file in your S3 bucket (a PUT). The notification should call a custom Lambda function that you write, which reads the contents of the S3 file, splits it up, and writes the pieces out to the separate buckets. Keep in mind that each S3 file is likely to contain many records, not just one.
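A rough sketch of the splitting step, assuming the producer writes newline-delimited JSON so that each line of a Firehose output file is one record with an `id` field. The function names, the `date_str` and `source_name` parameters, and the `<id>_<date>/` prefix layout (modelled on the example in the question) are illustrative; `s3` is expected to behave like a boto3 S3 client.

```python
import json
from collections import defaultdict


def split_by_id(raw):
    """Group newline-delimited JSON records (bytes) by their 'id' field."""
    groups = defaultdict(list)
    for line in raw.splitlines():
        if not line.strip():
            continue
        groups[json.loads(line)["id"]].append(line)
    return {record_id: b"\n".join(lines) + b"\n" for record_id, lines in groups.items()}


def write_split(s3, bucket, raw, date_str, source_name):
    """Write one object per id under a '<id>_<date>/' prefix; return the keys."""
    keys = []
    for record_id, body in split_by_id(raw).items():
        key = f"{record_id}_{date_str}/{source_name}"
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        keys.append(key)
    return keys
```

Reusing the source file name inside each prefix keeps the output objects unique per input file, so reprocessing the same file overwrites rather than duplicates.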
https://aws.amazon.com/blogs/aws/s3-event-notification/
This is not possible out of the box, but here are some ideas...
You can write a Data Transformation in Lambda that is triggered by Amazon Kinesis Firehose for every record. You could code the Lambda function to save the data to a specific file in S3, rather than having Firehose do it. However, you'd miss out on the record aggregation features of Firehose.
You could use Amazon Kinesis Analytics to look at the record and send the data to a different output stream based on the content. For example, you could have a separate Firehose stream per delivery channel, with Kinesis Analytics queries choosing the destination.
If you use a Lambda function to save the data, you would end up with duplicate data in S3: one copy stored by Lambda and another stored by Firehose, since the transformation Lambda hands the data back to Firehose. That is, unless there is a way to prevent the transformed data from being re-added to the stream. I am not aware of a way to avoid that.
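One possible way around the duplication, sketched under assumptions: a Firehose transformation Lambda returns a `result` for every record, and records marked `Dropped` are not delivered by Firehose. The function below writes each record to S3 itself and then marks it as dropped. The key layout, the function name, and the `s3` parameter (expected to behave like a boto3 S3 client) are illustrative.

```python
import base64
import json


def transform(event, s3, bucket):
    """Store each record in S3 by id, then tell Firehose to drop it."""
    results = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            doc = json.loads(payload)
            # Hypothetical key layout: a prefix per id, one object per record.
            key = f"{doc['id']}/{record['recordId']}.json"
            s3.put_object(Bucket=bucket, Key=key, Body=payload)
            status = "Dropped"  # written by this function, so Firehose skips it
        except (ValueError, KeyError, TypeError):
            status = "ProcessingFailed"
        results.append({
            "recordId": record["recordId"],
            "result": status,
            "data": record["data"],
        })
    return {"records": results}
```

This still gives up Firehose's aggregation, as noted above, since every record becomes its own S3 object.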

Is there a way to put data into Kinesis Firehose from S3 bucket?

I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute) and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose because Kinesis will simply batch it up, store it into temporary S3 files and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, or every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a simple, loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
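The manifest step could look roughly like this. It assumes the SQS message bodies are standard S3 event notifications; the function names, the table name, and the IAM role ARN are hypothetical placeholders. The manifest uses the `entries` / `url` / `mandatory` layout that Redshift's COPY command reads when the MANIFEST option is given.

```python
import json
from urllib.parse import unquote_plus


def build_manifest(message_bodies):
    """Build a Redshift COPY manifest from S3 event notification bodies."""
    entries = []
    for body in message_bodies:
        for record in json.loads(body).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
    return {"entries": entries}


def copy_statement(table, manifest_url, iam_role):
    """Return the COPY command that loads every file listed in the manifest."""
    return f"COPY {table} FROM '{manifest_url}' IAM_ROLE '{iam_role}' MANIFEST;"
```

The Lambda function would upload `json.dumps(build_manifest(bodies))` to S3, run the COPY statement against Redshift, and delete the SQS messages only once the load succeeds. Format options such as CSV or JSON would need to be added to the COPY command to match the files.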
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can also send data to other services such as Redshift and Elasticsearch Service.
I don't know whether this will solve your problem but you can use COPY from S3 to redshift.
Hope it will help!