Stream new files in S3 bucket into Kinesis - amazon-web-services

I am trying to evaluate using Kinesis for stream processing log files. A separate process uploads new logs into an S3 bucket - I can't touch that process. I want to know if there's a good way to stream new files that show up in the S3 log bucket into a Kinesis stream for processing. All the documentation I've found so far covers using S3 as an output for the stream.
My current solution is a machine that constantly polls S3 for new files, downloads each new file locally, and streams it in using the Log4j appender. This seems inefficient. Is there a better way?

I realize this is a really old question, but have a look at AWS Lambda. It's perfect for your use case, as illustrated here.
In your case, you would set up the S3 event such that each new object added to the bucket invokes your Lambda function. In the Lambda function you then write a few lines of code that read in the file and send the contents to the PutRecord (or PutRecords, for batches) method of the Kinesis stream.
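A minimal sketch of such a handler, assuming the log files are line-delimited text; the stream name is a placeholder:

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "log-stream"  # hypothetical stream name


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        # Read the newly uploaded log file and split it into individual lines.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        lines = body.decode("utf-8").splitlines()

        # PutRecords accepts at most 500 records per call, so send in chunks.
        for i in range(0, len(lines), 500):
            kinesis.put_records(
                StreamName=STREAM_NAME,
                Records=[
                    {"Data": line.encode("utf-8"), "PartitionKey": key}
                    for line in lines[i : i + 500]
                ],
            )
```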
Not only will this work for your use case, but it's also awesome since it checks off a few buzzwords: "serverless" and "realtime"!

Related

Automatically ingest Data from S3 into Amazon Timestream?

What is the easiest way to automatically ingest CSV data from an S3 bucket into a Timestream database?
I have an S3 bucket which is continuously generating CSV files inside a folder structure. I want to save these files into a Timestream database so I can visualize them in my Grafana instance.
I already tried to do that via a Glue crawler but that won't work for me. Is there any workaround or tutorial on how to solve this task?
I do this using a Lambda function, an SNS topic and a queue.
New files in my bucket trigger a notification on an SNS topic.
The notification gets added to an SQS queue.
The Lambda function consumes the queue, recovers the bucket and key of the new S3 object, downloads the CSV file, does some processing, and ingests the data into Timestream. The Lambda is implemented in Python.
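A trimmed-down sketch of such a handler, assuming the default SNS-to-SQS envelope (raw message delivery disabled) and a made-up CSV layout with a millisecond timestamp column and a single numeric value column:

```python
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
ts_write = boto3.client("timestream-write")

DATABASE = "my_db"    # hypothetical database name
TABLE = "my_table"    # hypothetical table name


def handler(event, context):
    for sqs_record in event["Records"]:
        # SNS -> SQS wraps the original S3 notification twice.
        s3_event = json.loads(json.loads(sqs_record["body"])["Message"])
        for rec in s3_event["Records"]:
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            rows = list(csv.DictReader(io.StringIO(body)))

            # "Common attributes" are sent once per call instead of per record,
            # which keeps payloads small and helps ingestion throughput.
            common = {
                "Dimensions": [{"Name": "source_file", "Value": key}],
                "MeasureValueType": "DOUBLE",
                "TimeUnit": "MILLISECONDS",
            }
            # Column names below ("timestamp_ms", "value") are assumptions.
            records = [
                {"MeasureName": "value", "MeasureValue": row["value"], "Time": row["timestamp_ms"]}
                for row in rows
            ]
            # WriteRecords accepts at most 100 records per call.
            for i in range(0, len(records), 100):
                ts_write.write_records(
                    DatabaseName=DATABASE,
                    TableName=TABLE,
                    CommonAttributes=common,
                    Records=records[i : i + 100],
                )
```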
This has been working ok, with the caveat that large files may not ingest fully within the Lambda 15-minute limit. Timestream is not super fast. It gets better by using multi-valued records, as well as the "common attributes" feature of the Timestream client in boto3.
(It should be noted that the Lambda can be triggered directly by the S3 bucket, if one prefers. Using a queue allows a bit more flexibility, such as being able to manually add files to the queue for reprocessing.)

Write data to file using AWS kinesis-firehose

I have a Node.js app that writes data to S3 through a Firehose delivery stream, using the putRecord method. The objects land in the S3 bucket successfully.
However, instead of raw objects I want the data written as a text file (.txt format).
Is there some way to write from the stream to S3 as a text file, or to update the S3 object from Kinesis Firehose?
Also, Firehose sometimes combines multiple entries into one record; only if I write after an interval of a minute or more does it generate new records. Is there a way to ensure that each entry is stored as a new object, irrespective of intervals?
Kinesis Firehose is the wrong tool for your use case, since it has a minimum buffer interval of 1 minute. If you want single objects, why don't you use the S3 SDK?
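The question's app is in Node, but the idea is the same in any SDK; a minimal boto3 sketch, with the bucket name and key scheme as placeholders:

```python
import time
import boto3

s3 = boto3.client("s3")


def write_single_record(data: str):
    # One object per record, written directly, so no Firehose buffering applies.
    key = f"records/{int(time.time() * 1000)}.txt"  # hypothetical key scheme
    s3.put_object(Bucket="my-output-bucket", Key=key, Body=data.encode("utf-8"))
```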

How to transfer data from S3 bucket to Kafka

There are examples and documentation on copying data from Kafka topics to S3 but how do you copy data from S3 to Kafka?
When you read an S3 object, you get a byte stream, and you can send any byte array to Kafka with the ByteArraySerializer.
Or you can parse that InputStream into some custom object, then send it using whatever serializer you configure.
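A rough sketch of that idea in Python with kafka-python; broker address, bucket, key, and topic name are all placeholders:

```python
import boto3
from kafka import KafkaProducer  # kafka-python

s3 = boto3.client("s3")
# With no serializer configured, the producer sends raw bytes as-is.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

obj = s3.get_object(Bucket="my-bucket", Key="path/to/file")
producer.send("my-topic", value=obj["Body"].read())
producer.flush()
```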
You can find one example of a Kafka Connect process (which I assume you are comparing with Confluent's S3 Connect sink writer) at https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/index.html; it can be configured to read binary archives or line-delimited text from S3.
Similarly, Apache Spark, Flink, Beam, NiFi, and other Hadoop-related tools can read from S3 and write events to Kafka as well.
The problem with this approach is that you need to keep track of which files have been read so far, as well as handle partially read files.
Depending on your scenario and the desired frequency for processing the objects, you can either run a Lambda function on each event (e.g. every time a file is uploaded) or on a cron schedule. The Lambda acts as a producer, using the Kafka API to publish to a topic.
Specifics:
The trigger for the Lambda function can be the s3:PutObject event, coming directly from S3 or via CloudWatch Events.
You can run the Lambda on a cron schedule if you don't need the objects instantly. The alternative in this case is to run a cron job on an EC2 instance that has a Kafka producer and permissions to read objects from S3, and keeps pushing them to the Kafka topics.
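A sketch of the event-driven variant, assuming the broker is reachable from the Lambda (e.g. the function runs in the same VPC as the cluster) and a hypothetical topic name:

```python
import boto3
from urllib.parse import unquote_plus
from kafka import KafkaProducer  # kafka-python

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="broker-1:9092")  # placeholder broker

TOPIC = "s3-uploads"  # hypothetical topic


def handler(event, context):
    # Triggered by s3:ObjectCreated; publish each line of the new object to Kafka.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines():
            producer.send(TOPIC, value=line)
    producer.flush()
```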

How to replay in a stream data pushed to S3 from AWS Firehose?

There are plenty of examples of how data is stored by AWS Firehose in an S3 bucket and passed in parallel to some processing app (like in the picture above).
But I can't find anything about good practices for replaying this data from the S3 bucket if the processing app crashed and we need to supply it with historical data, which we have in S3 but which is no longer in Firehose.
I can think of replaying it with Firehose or Lambda, but:
Kinesis Firehose cannot consume from a bucket.
Lambda would need to deserialize the .parquet files to send them to Firehose or a Kinesis Data Stream, and I'm confused by this implicit deserializing, because Firehose serialized the data explicitly.
Or maybe there is some other way to put the data from S3 back into a stream which I completely miss?
EDIT: Moreover, if we run a Lambda to push records to the stream, it will probably have to run for more than 15 minutes. So another option is to run a script that does this on a separate EC2 instance. But this method of extracting data from S3 looks so much more complicated than storing it there with Firehose that it makes me think there should be some easier approach.
The problem that actually tripped me up was that I expected some more advanced serialization than just converting to JSON (Kafka supports Avro, for example).
Regarding replaying records from the S3 bucket: this part of the solution is significantly more complicated than the one needed for archiving records. While we can archive the stream with out-of-the-box Firehose functionality, replaying it requires two Lambda functions and two streams.
Lambda 1 (pushes filenames to stream)
Lambda 2 (activated for every filename in the first stream, pushes records from files to second stream)
The first Lambda is triggered manually, scans through all the files in the S3 bucket, and writes their names to the first stream. The second Lambda is triggered by each event in the stream of file names, reads all the records in the named file, and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda.
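A sketch of what the two functions could look like; the bucket and stream names are made up, and pyarrow is my assumption for decoding the .parquet files:

```python
import base64
import io
import json
import boto3
import pyarrow.parquet as pq  # assumes a pyarrow Lambda layer is attached

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

BUCKET = "firehose-archive"        # hypothetical archive bucket
FILENAME_STREAM = "replay-files"   # stream 1: file names
RECORD_STREAM = "replay-records"   # stream 2: the one consumers read


def lambda_1(event, context):
    """Triggered manually: push every archived object's key into stream 1."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            kinesis.put_record(
                StreamName=FILENAME_STREAM,
                Data=obj["Key"].encode("utf-8"),
                PartitionKey=obj["Key"],
            )


def lambda_2(event, context):
    """Triggered by stream 1: read each named file and replay its rows into stream 2."""
    for rec in event["Records"]:
        key = base64.b64decode(rec["kinesis"]["data"]).decode("utf-8")
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        rows = pq.read_table(io.BytesIO(body)).to_pylist()  # pyarrow >= 7
        for i in range(0, len(rows), 500):  # PutRecords limit is 500 per call
            kinesis.put_records(
                StreamName=RECORD_STREAM,
                Records=[
                    {"Data": json.dumps(r, default=str).encode("utf-8"), "PartitionKey": key}
                    for r in rows[i : i + 500]
                ],
            )
```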
This solution expects that there are multiple files generated per day, and there are multiple records in every file.
Similar to this solution, but the destination is Kinesis in my case instead of DynamoDB as in the article.

Use AWS lambda to read Kinesis and save to S3

I am very new to AWS. Up to now I have been able to send CSV data to Kinesis streams using the AWS .NET SDK. Now I have to save this data to S3 using Lambda with the S3 Emitter (this is the most common way I found on many websites). When I create a Lambda function for it, it asks for Node.js or Java 8 code.
I don't understand from here what code needs to be uploaded or how to use the S3 Emitter code.
I cannot use Kinesis Firehose because the streaming data is going to EMR for processing.
Please help me here.
If there is any alternate way, please suggest it.
You need to write code that gets the events from the Kinesis stream and writes them to S3 (or, even easier, to Kinesis Firehose). This code should be in one of the programming languages currently supported by Lambda (JavaScript, Java, Python).
Here is a tutorial for reading from Kinesis: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis-example.html
It is relatively easy to read the events and batch them to S3, or even easier to write them to Firehose to get more optimized batches in S3 (larger, compressed, encrypted...).
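A small sketch of the Firehose variant, with the delivery stream name as a placeholder; Firehose then handles the batching, compression, and delivery into S3:

```python
import base64
import boto3

firehose = boto3.client("firehose")

DELIVERY_STREAM = "csv-to-s3"  # hypothetical Firehose delivery stream


def handler(event, context):
    # Kinesis hands records to Lambda base64-encoded; decode and add a newline
    # so the entries stay line-delimited in the objects Firehose writes to S3.
    records = [
        {"Data": base64.b64decode(rec["kinesis"]["data"]) + b"\n"}
        for rec in event["Records"]
    ]
    # PutRecordBatch accepts up to 500 records per call.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i : i + 500],
        )
```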