How to transfer data from S3 bucket to Kafka - amazon-web-services

There are examples and documentation on copying data from Kafka topics to S3, but how do you copy data from S3 to Kafka?

When you read an S3 object, you get back a byte stream, and you can send any byte array to Kafka with ByteArraySerializer. Alternatively, you can parse that InputStream into a custom object and send it using whatever serializer you configure.
You can find one example of a Kafka Connect process here (which I assume you are comparing to Confluent's S3 Connect writer) - https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/index.html - which can be configured to read binary archives or line-delimited text from S3.
Similarly, Apache Spark, Flink, Beam, NiFi, and similar Hadoop-related tools can read from S3 and write events to Kafka as well.
The problem with this approach is that you need to keep track of which files have been read so far, as well as handle partially read files.
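If you go the plain byte-stream route, a minimal Python sketch could look like the following; the bucket, key, topic and broker address are placeholders, and kafka-python is just one of several client libraries you could use:

```python
import boto3
from kafka import KafkaProducer  # pip install kafka-python

s3 = boto3.client("s3")
# With no value_serializer configured, the producer sends raw bytes,
# i.e. the same effect as ByteArraySerializer on the JVM side.
producer = KafkaProducer(bootstrap_servers=["broker1:9092"])  # placeholder broker

def copy_object_to_kafka(bucket: str, key: str, topic: str) -> None:
    """Read one S3 object as bytes and publish it to a Kafka topic."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    producer.send(topic, value=body)
    producer.flush()

copy_object_to_kafka("my-bucket", "exports/part-0000.bin", "s3-import")
```

Keeping track of which objects have already been copied (the problem mentioned above) is still up to you.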

Depending on your scenario, or on how often the objects are uploaded, you can either run a Lambda function on each event (e.g. every time a file is uploaded) or on a cron schedule. The Lambda works as a producer: it uses the Kafka API to publish to a topic.
Specifics:
The trigger for the Lambda function can be the s3:PutObject event, coming either directly from S3 or via CloudWatch Events.
You can run the Lambda as a cron if you don't need the objects instantly. An alternative in that case is to run a cron job on an EC2 instance that has a Kafka producer and permission to read objects from S3, and have it keep pushing them to the Kafka topics.
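As an illustration only, a Lambda handler for the per-event variant might look roughly like this; the broker address, topic name, the assumption of line-delimited content, and the kafka-python client are all mine, not something the answer prescribes:

```python
import boto3
from kafka import KafkaProducer  # bundled into the deployment package

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers=["broker1:9092"])  # placeholder broker

def handler(event, context):
    # One record per uploaded object when triggered by s3:ObjectCreated events.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # note: arrives URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for line in body.iter_lines():       # assumes line-delimited content
            producer.send("s3-import", value=line)
    producer.flush()
```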

Related

How to store data into an S3 bucket using an AWS SQS Queue

I have live streaming data coming my way from different sources. We have created an AWS SQS queue to receive that data. I was directly pushing this incoming data to DynamoDB using a Lambda function, but I have been asked to store the raw files in an S3 bucket first, so that we do not have any data loss and can later use that data for further transformations.
I have an AWS SQS queue and I want to store the messages from this queue in an S3 bucket in JSON or Parquet form. Or is there any alternative other than S3 for storing the raw data in AWS?
I tried searching over the net but couldn't find anything concrete. I am new to this; please help me solve this.
I tried searching over the net but couldn't find anything concrete.
That's correct, because there is no such AWS-provided feature or tool. You have to implement a custom solution for that yourself.
Since you are already using a Lambda function that reads from the queue, you can add code to that function to write the data to S3 before writing to DynamoDB.
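A minimal sketch of that change, assuming an SQS-triggered Lambda with JSON message bodies; the bucket and table names are placeholders:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("incoming-events")  # placeholder table

def handler(event, context):
    for record in event["Records"]:  # SQS batch delivered to the Lambda
        raw = record["body"]
        # 1. Archive the raw payload to S3 first, so nothing is lost.
        s3.put_object(
            Bucket="raw-events-bucket",  # placeholder bucket
            Key=f"raw/{uuid.uuid4()}.json",
            Body=raw.encode("utf-8"),
        )
        # 2. Then continue with the existing DynamoDB write.
        table.put_item(Item=json.loads(raw))
```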

Automatically ingest Data from S3 into Amazon Timestream?

What is the easiest way to automatically ingest CSV data from an S3 bucket into a Timestream database?
I have an S3 bucket that is continuously generating CSV files inside a folder structure. I want to save these files into a Timestream database so I can visualize them in my Grafana instance.
I already tried to do that via a Glue crawler, but that won't work for me. Is there any workaround or tutorial on how to solve this task?
I do this using a Lambda function, an SNS topic and a queue.
New files in my bucket trigger a notification on an SNS topic.
The notification gets added to an SQS queue.
The Lambda function consumes the queue, recovers the bucket and key of the new S3 object, downloads the CSV file, does some processing and ingests the data into Timestream. The Lambda is implemented in Python.
This has been working OK, with the caveat that large files may not ingest fully within the Lambda 15-minute limit. Timestream is not super fast. It gets better by using multi-valued records, as well as by using the "common attributes" feature of the Timestream client in boto3.
(It should be noted that the Lambda can be triggered directly by the S3 bucket, if one prefers. Using a queue allows a bit more flexibility, such as being able to manually add files to the queue for reprocessing.)
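For illustration, a stripped-down version of such a Lambda might look like the sketch below. The CSV layout (time, sensor, value columns with epoch-millisecond timestamps), the database and table names, and the extract_bucket_and_key helper are all hypothetical; the real code depends on how your SNS/SQS envelope is shaped.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
tsw = boto3.client("timestream-write")

def extract_bucket_and_key(sqs_record):
    """Hypothetical helper: unwrap the SNS/SQS envelope to get bucket and key."""
    raise NotImplementedError

def handler(event, context):
    for record in event["Records"]:
        bucket, key = extract_bucket_and_key(record)
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(text)))
        records = [
            {
                "Dimensions": [{"Name": "sensor", "Value": r["sensor"]}],
                "MeasureName": "value",
                "MeasureValue": r["value"],
                "Time": r["time"],  # epoch milliseconds as a string
            }
            for r in rows
        ]
        # Common attributes keep each record small, as mentioned above.
        common = {"MeasureValueType": "DOUBLE", "TimeUnit": "MILLISECONDS"}
        for i in range(0, len(records), 100):  # the API accepts 100 records per call
            tsw.write_records(
                DatabaseName="metrics-db",    # placeholder
                TableName="sensor-data",      # placeholder
                Records=records[i:i + 100],
                CommonAttributes=common,
            )
```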

Buffer S3 object inputs

Does anyone know whether there is any AWS service other than Kinesis Firehose that can catch the S3 object-created event? I am trying to do some analysis on VPC Flow Logs; the current setup is CloudWatch Logs -> Kinesis Firehose -> S3 -> Athena.
The problem is that Kinesis Firehose can only buffer up to 128 MB, which is too small for me.
Events from Amazon S3 can go to:
AWS Lambda functions
Amazon SNS topic
Amazon SQS queue
So, you could send the messages to an SQS queue and then have a regular process (every hour?) that retrieves many messages and writes them to a single file.
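A rough sketch of that scheduled job, assuming the queue receives standard S3 event notifications and using placeholder queue and bucket names (error handling is omitted):

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/flowlog-events"  # placeholder

def combine_once():
    parts, handles = [], []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            for rec in json.loads(m["Body"]).get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                parts.append(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
            handles.append({"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]})
    if parts:
        # Concatenate the referenced objects byte-for-byte into one larger file.
        s3.put_object(Bucket="combined-flowlogs",  # placeholder
                      Key=f"combined/{int(time.time())}",
                      Body=b"".join(parts))
        # Only delete the messages once the combined file is safely written.
        for i in range(0, len(handles), 10):
            sqs.delete_message_batch(QueueUrL=QUEUE_URL, Entries=handles[i:i + 10]) if False else \
                sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=handles[i:i + 10])
```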
Alternatively, you could use your current setup but run Amazon Athena on a regular basis to join multiple files by using CREATE TABLE AS. This would select from the existing files and store the output in a new table (with a new location). You could even use it to transform the files into a format that is easier to query in Athena (e.g. Snappy-compressed Parquet). The hard part is to include each input file only once in this concatenation process (possibly using SymlinkTextInputFormat).
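As a sketch of what that might look like through the Athena API (the database, table names and S3 locations below are placeholders):

```python
import boto3

athena = boto3.client("athena")

# CTAS query that rewrites many small files into Snappy-compressed Parquet.
ctas = """
CREATE TABLE flowlogs_compacted
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-analytics-bucket/flowlogs-compacted/'
) AS
SELECT * FROM flowlogs_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "vpc_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```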

Using S3 instead of SQS for integration purposes

Hi folks.
There's a question I've recently faced which brought some concerns and hesitations. I'm creating an "almost serverless" micro-service using AWS. Here is its workflow: [workflow options diagram]
The thing is, the input message may be large, but AWS SQS limits message size to 256 KB. So I decided to use S3 and S3 notifications to handle inputs: the client PUTs an object, its creation triggers Lambda functions, and so on. That way the 256 KB limit is not relevant, but on the other hand, I'm using a storage service as an integration service. One of the concerns is dead-letter queue handling, for example.
Maybe someone has faced similar problems. One of the goals is to stay "serverless". Are there any good solutions/improvements/advice?
Thanks in advance.
I would recommend combining the two approaches:
Write the data to Amazon S3
Create a message in the Amazon SQS queue that includes a reference to the data in S3
This way, you have the benefits of using a queue, with additional storage.
If all the data you require is already in the file, then you can configure an Amazon S3 Event to create the SQS message directly in the queue. The message will include the name of the bucket and the key of the object. Thus, putting the file in S3 will create the SQS message and trigger the AWS Lambda function. This is more scalable than directly triggering the Lambda function from S3.
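A sketch of wiring that S3 event to the queue, with placeholder bucket, prefix and queue names; the SQS queue policy that allows S3 to send messages is assumed to exist already:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="large-payload-bucket",  # placeholder
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:ingest-queue",
                "Events": ["s3:ObjectCreated:*"],
                # Only objects under this prefix generate messages.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
                },
            }
        ]
    },
)
```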
Have you considered using a Kinesis stream and attaching your Lambda to the stream with a batch size of 1? You could also handle your dead letters with shard timestamps, etc.
Or, if you are able to manipulate the original message, place your timestamp inside the message and you can easily utilize Kinesis Firehose to bulk-load messages. The 1 MB limit on a Kinesis record gives you roughly four times the SQS message size, which you could compress further.
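As a small illustration of the compression idea (the stream name and the message's id field are placeholders):

```python
import gzip
import json
import boto3

kinesis = boto3.client("kinesis")

def put_large_message(message: dict) -> None:
    # Gzip the payload before putting it on the stream to fit more per record.
    payload = gzip.compress(json.dumps(message).encode("utf-8"))
    kinesis.put_record(
        StreamName="ingest-stream",        # placeholder
        Data=payload,                      # must stay under the per-record limit
        PartitionKey=str(message["id"]),   # placeholder partition key
    )
```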

Is there a way to put data into Kinesis Firehose from an S3 bucket?

I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute), and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose, because Kinesis will simply batch it up, store it in temporary S3 files and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a simple, loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
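A loose sketch of that Lambda, assuming the Redshift Data API is available; the queue URL, bucket, cluster, role ARN, table name and CSV format are placeholders, and message deletion plus error handling are omitted:

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
rsd = boto3.client("redshift-data")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # placeholder

def handler(event, context):
    # Drain the queue and collect the S3 URLs of the newly arrived files.
    entries = []
    while True:
        msgs = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10).get("Messages", [])
        if not msgs:
            break
        for m in msgs:
            for rec in json.loads(m["Body"]).get("Records", []):
                url = f"s3://{rec['s3']['bucket']['name']}/{rec['s3']['object']['key']}"
                entries.append({"url": url, "mandatory": True})
    if not entries:
        return
    # Write a COPY manifest listing exactly those files.
    manifest_key = f"manifests/{int(time.time())}.manifest"
    s3.put_object(Bucket="load-staging-bucket",  # placeholder
                  Key=manifest_key,
                  Body=json.dumps({"entries": entries}).encode("utf-8"))
    # Trigger the Redshift COPY via the Data API.
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="loader",
        Sql=(f"COPY target_table "
             f"FROM 's3://load-staging-bucket/{manifest_key}' "
             f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
             f"MANIFEST CSV;"),
    )
```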
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can send data to other services such as Redshift and the Elasticsearch Service.
I don't know whether this will solve your problem, but you can use COPY to load the data from S3 into Redshift.
Hope it helps!