I have a Node app that writes data to S3 through a Kinesis Firehose delivery stream, using the putRecord method. The objects land in the S3 bucket successfully.
However, instead of separate objects I want the data written to a file (.txt format).
Is there some method to write from the stream to S3 as a text file, i.e. to update the S3 object from Kinesis Firehose?
Also, Firehose sometimes bundles multiple entries into one record; only if I write at intervals of a minute or longer does it generate new records. Is there a way to ensure that each entry is stored as a new object regardless of the interval?
Kinesis Firehose is the wrong tool for your use case, since it has a minimum buffer interval of 1 minute. If you want single objects, why don't you use the S3 SDK directly?
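For example, a minimal sketch with the AWS SDK for JavaScript (v2) that writes each entry as its own .txt object; the bucket name and key scheme are just placeholders:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Write one entry per S3 object as plain text instead of going through Firehose.
async function writeEntry(entry) {
  const key = `entries/${Date.now()}-${Math.random().toString(36).slice(2)}.txt`; // placeholder key scheme
  await s3.putObject({
    Bucket: 'my-bucket', // placeholder bucket name
    Key: key,
    Body: typeof entry === 'string' ? entry : JSON.stringify(entry),
    ContentType: 'text/plain'
  }).promise();
  return key;
}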
Related
I have some fairly large datasets (upwards of 60k rows in a CSV file) that I need to ingest into Elasticsearch on a daily basis (to keep the data updated).
I currently have two lambda functions handling this.
Lambda 1:
A python lambda (nodejs would run out of memory doing this task) is triggered when a .csv file is added to S3 (this file could have upwards of 60k rows). The lambda converts this to JSON and saves to another S3 bucket.
Lambda 2:
A Node.js lambda that is triggered by the .json files generated by Lambda 1. This lambda uses the Elasticsearch bulk API to try to insert all of the data into ES.
However, because of the large amount of data we hit the ES API rate limit and fail to insert much of the data.
I have tried splitting the data and uploading smaller amounts at a time, but that turns into a very long-running lambda function.
I have also looked at adding the data to a Kinesis stream, but even that limits how much data you can add in each operation.
I am wondering what the best way is to insert large amounts of data like this into ES. My next thought is to split the .json files into multiple smaller .json files and trigger the ES-loading lambda for each one, but I am concerned that I would still just hit the rate limit of the ES domain.
Edit: Looking into the Kinesis Firehose option, it seems like the best fit, as I can set the buffer size to a maximum of 5 MB (the ES bulk API limit).
However, Firehose has a 1 MB limit per record, so I'd still need some processing in the lambda that pushes to Firehose to split up the data first.
I'd suggest designing the application to use SQS for queuing your messages rather than using firehose (which is more expensive and perhaps not the best option for your use case). Amazon SQS provides a lightweight queueing solution and is cheaper than Firehose (https://aws.amazon.com/sqs/pricing/)
Below is how it can work -
Lambda 1 converts each row to JSON and posts each JSON to SQS.
(Assuming each JSON is less than 256KB)
The SQS queue acts as an event source for Lambda 2 and triggers it in batches of, say, 5000 messages.
(Ref - https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
Lambda 2 uses the payload received from SQS to insert into Elasticsearch using the Bulk API.
The batch size can be adjusted based on how you observe the lambda performing. Make sure to adjust the visibility timeout and set up a DLQ so it runs reliably.
This also reduces S3 cost by not storing the intermediate JSON in S3 for the second Lambda to pick up; since the data ends up in Elasticsearch anyway, that copy would just be duplication.
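If it helps, here's a rough sketch of both Lambdas in Node (even though your Lambda 1 is Python, this just shows the shape); the queue URL, ES endpoint and index name are placeholders, and an AWS Elasticsearch domain may additionally need request signing:

const AWS = require('aws-sdk');
const { Client } = require('@elastic/elasticsearch');

const sqs = new AWS.SQS();
const es = new Client({ node: process.env.ES_ENDPOINT }); // placeholder endpoint

// Lambda 1: post each parsed CSV row to SQS as JSON, 10 messages per batch
// (the SendMessageBatch maximum).
async function postRowsToSqs(rows) {
  for (let i = 0; i < rows.length; i += 10) {
    const Entries = rows.slice(i, i + 10).map((row, j) => ({
      Id: String(i + j),
      MessageBody: JSON.stringify(row)
    }));
    await sqs.sendMessageBatch({ QueueUrl: process.env.QUEUE_URL, Entries }).promise();
  }
}

// Lambda 2: triggered by the SQS queue; turn the received batch into one
// Bulk API request (assumes the @elastic/elasticsearch v7 client).
exports.handler = async (event) => {
  const body = event.Records.flatMap((r) => [
    { index: { _index: 'my-index' } }, // placeholder index name
    JSON.parse(r.body)
  ]);
  await es.bulk({ body });
};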
You could potentially create one record per row and push it to Firehose. When the data in the Firehose stream reaches the configured buffer size, it is flushed to ES. This way only one lambda is required, which can process the records from the CSV and push them to Firehose.
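A rough sketch of that single-Lambda variant, assuming the rows are already parsed and the delivery stream name is a placeholder:

const AWS = require('aws-sdk');
const firehose = new AWS.Firehose();

// Push parsed CSV rows to Firehose, one record per row, in batches of 500
// (the PutRecordBatch maximum). Firehose then buffers and flushes to ES.
async function pushRows(rows) {
  for (let i = 0; i < rows.length; i += 500) {
    const Records = rows.slice(i, i + 500).map((row) => ({
      Data: JSON.stringify(row) + '\n'
    }));
    const res = await firehose.putRecordBatch({
      DeliveryStreamName: 'csv-to-es-stream', // placeholder stream name
      Records
    }).promise();
    if (res.FailedPutCount > 0) {
      // A real implementation would retry the failed records here.
      console.warn(`${res.FailedPutCount} records failed in this batch`);
    }
  }
}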
There are plenty of examples of how data stored in an S3 bucket by AWS Firehose can in parallel be passed to some processing app.
But I can't find anything about good practice for replaying this data from the S3 bucket when the processing app has crashed and we need to supply it with historical data, which we have in S3 but which is no longer in Firehose.
I can think of replaying it with Firehose or Lambda, but:
Kinesis Firehose cannot consume from a bucket.
Lambda would need to deserialize the .parquet files to send them to Firehose or a Kinesis Data Stream, and I'm uneasy about this implicit deserialization, because Firehose serialized the data explicitly.
Or maybe there is some other way to put data back from S3 into a stream that I am completely missing?
EDIT: Moreover, if we run a lambda to push the records to a stream, it will probably have to run for more than 15 minutes. So another option is a script doing the same thing on a separate EC2 instance. But this method of extracting data from S3 looks so much more complicated than storing it there with Firehose that it makes me think there should be some easier approach.
The thing that tripped me up was that I expected some more advanced serialization than just converting to JSON (the way Kafka supports Avro, for example).
Regarding replaying records from the S3 bucket: this part of the solution really is significantly more complicated than the part needed for archiving records. While we can archive the stream with out-of-the-box Firehose functionality, replaying it requires two lambda functions and two streams.
Lambda 1 (pushes file names to the first stream)
Lambda 2 (triggered for every file name in the first stream, pushes the records from that file to the second stream)
The first lambda is triggered manually, scans through all the files in the S3 bucket and writes their names to the first stream. The second lambda is triggered by every event in the file-name stream, reads all the records in that file and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda.
This solution assumes that multiple files are generated per day and that every file contains multiple records.
It is similar to this solution, except that the destination in my case is Kinesis instead of the Dynamo in the article.
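For what it's worth, a condensed sketch of the two replay Lambdas, assuming the streams are named 'replay-filenames' and 'replay-records' and that the archived objects are newline-delimited JSON (a .parquet archive would need a Parquet reader on top of this):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const kinesis = new AWS.Kinesis();

// Lambda 1 (run manually): list every object in the archive bucket and push
// the keys to the first stream.
exports.listFiles = async () => {
  let ContinuationToken;
  do {
    const page = await s3.listObjectsV2({
      Bucket: 'my-archive-bucket', // placeholder bucket
      ContinuationToken
    }).promise();
    for (const obj of page.Contents) {
      await kinesis.putRecord({
        StreamName: 'replay-filenames', // placeholder stream
        Data: obj.Key,
        PartitionKey: obj.Key
      }).promise();
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
};

// Lambda 2 (triggered by the file-name stream): read each file and push its
// records to the second stream, 500 at a time (the PutRecords limit).
exports.replayFile = async (event) => {
  for (const rec of event.Records) {
    const key = Buffer.from(rec.kinesis.data, 'base64').toString('utf8');
    const obj = await s3.getObject({ Bucket: 'my-archive-bucket', Key: key }).promise();
    const lines = obj.Body.toString('utf8').split('\n').filter(Boolean);
    for (let i = 0; i < lines.length; i += 500) {
      await kinesis.putRecords({
        StreamName: 'replay-records', // placeholder stream
        Records: lines.slice(i, i + 500).map((l) => ({ Data: l, PartitionKey: key }))
      }).promise();
    }
  }
};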
I would like to be able to route data sent to Kinesis Firehose based on the content of the data. For example, if I sent this JSON data:
{
"name": "John",
"id": 345
}
I would like to filter the data based on the id and send it to a subfolder of my S3 bucket, e.g. s3://myS3Bucket/345_2018_03_05. Is this at all possible with Kinesis Firehose or AWS Lambda?
The only way I can think of right now is to create a Kinesis stream for every single one of my possible IDs, point them all at the same bucket, and then send my events to those streams in my application, but I would like to avoid that since there are many possible IDs.
You probably want to use an S3 event notification that gets fired each time Firehose places a new file in your S3 bucket (a PUT). The notification should invoke a custom Lambda function you write that reads the contents of the S3 file, splits it up and writes it out to the separate prefixes, keeping in mind that each S3 file is likely to contain many records, not just one.
https://aws.amazon.com/blogs/aws/s3-event-notification/
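A rough sketch of such a function, assuming Firehose delivers newline-delimited JSON and the destination keys follow your id_date pattern (bucket and key names are placeholders):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Triggered by the S3 PUT that Firehose performs: read the delivered object,
// group its records by id, and write one object per id under an id-based prefix.
exports.handler = async (event) => {
  for (const rec of event.Records) {
    const bucket = rec.s3.bucket.name;
    const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, ' '));
    const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const lines = obj.Body.toString('utf8').split('\n').filter(Boolean);

    const byId = {};
    for (const line of lines) {
      const data = JSON.parse(line);
      (byId[data.id] = byId[data.id] || []).push(line);
    }

    const date = new Date().toISOString().slice(0, 10).replace(/-/g, '_'); // e.g. 2018_03_05
    await Promise.all(Object.keys(byId).map((id) =>
      s3.putObject({
        Bucket: 'myS3Bucket', // placeholder destination bucket
        Key: `${id}_${date}/${key.split('/').pop()}`,
        Body: byId[id].join('\n')
      }).promise()
    ));
  }
};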
This is not possible out of the box, but here are some ideas...
You can write a Data Transformation Lambda that is triggered by Amazon Kinesis Firehose for every record. You could code that Lambda to save the data to a specific file in S3 rather than having Firehose do it (a rough sketch of this is below). However, you'd miss out on the record-aggregation features of Firehose.
You could use Amazon Kinesis Analytics to look at each record and send the data to a different output stream based on its content. For example, you could have a separate Firehose stream per delivery channel, with Kinesis Analytics queries choosing the destination.
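A rough sketch of that first idea, a transformation Lambda that writes each record to an id-based key itself; the bucket name is a placeholder, and Firehose will still deliver the returned records to its own destination as well:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Firehose data-transformation Lambda: decode each record, write it to an
// id-based S3 key, and return the record unchanged.
exports.handler = async (event) => {
  const records = [];
  for (const record of event.records) {
    const payload = Buffer.from(record.data, 'base64').toString('utf8');
    const { id } = JSON.parse(payload);
    const date = new Date().toISOString().slice(0, 10).replace(/-/g, '_'); // e.g. 2018_03_05
    await s3.putObject({
      Bucket: 'myS3Bucket', // placeholder bucket
      Key: `${id}_${date}/${record.recordId}.json`,
      Body: payload
    }).promise();
    records.push({ recordId: record.recordId, result: 'Ok', data: record.data });
  }
  return { records };
};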
If you use a lambda to save the data you end up with duplicates in S3: one copy stored by the lambda and the other stored by Firehose, since the transformation lambda hands the data back to Firehose. Unless there is a way to keep the transformed data from being re-added to the stream; I am not aware of one.
Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
The only option you have is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all the files written in the last hour and publishes them to the correct S3 destinations.
It's not an answer to the question, but I would like to explain a little the idea behind storing records according to event arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming consumption. A checkpoint is just a sequence number that identifies a position in the stream; by specifying this number, one can start reading the stream from a certain event.
Now back to the default Firehose-to-S3 setup... Since the capacity of a Kinesis stream is quite limited, one most likely needs to store the data from Kinesis somewhere to analyze it later, and the Firehose-to-S3 setup does this right out of the box: it simply stores the raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to reliably consume (read) this stream one needs those sequential numbers as checkpoints. Here those numbers are the records' arrival times.
What if I want to read records by creation time? The proper way to accomplish that seems to be to read the S3 "stream" sequentially, dump it into some (time-series) database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some batches of events while reading the S3 stream. So I would not suggest reordering the S3 objects at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them to events_by_event_time and now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
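Roughly, the scheduled step boils down to an INSERT INTO query like the one below, shown here kicked off with the Athena API from Node; the table, column and location names are all assumptions about your schema:

const AWS = require('aws-sdk');
const athena = new AWS.Athena();

// Scheduled job: repartition newly arrived events by their event time.
// Assumes event_time is stored as an ISO-8601 string.
exports.handler = async () => {
  const query = `
    INSERT INTO events_by_event_time
    SELECT *, substr(event_time, 1, 13) AS event_hour  -- e.g. '2021-08-01T12'
    FROM events
    WHERE arrival_date = '2021-08-01'                   -- the partition just ingested
  `;
  await athena.startQueryExecution({
    QueryString: query,
    QueryExecutionContext: { Database: 'my_database' },                 // placeholder
    ResultConfiguration: { OutputLocation: 's3://my-athena-results/' }  // placeholder
  }).promise();
};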
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Take a look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function that takes your records, processes them, sets the partition key and then hands them back to Firehose. You would also have to configure the Firehose stream to enable this partitioning and define your custom partition key/prefix/suffix.
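A rough sketch of such a transformation function, assuming each record carries an ISO-8601 event_time field and the delivery stream's S3 prefix references !{partitionKeyFromLambda:event_hour}:

// Firehose transformation Lambda for dynamic partitioning: derive an
// event_hour partition key from each record's event_time field and return it
// in metadata.partitionKeys.
exports.handler = async (event) => {
  const records = event.records.map((record) => {
    const data = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    // Assumes event_time looks like "2021-08-01T12:34:56Z".
    const eventHour = data.event_time.slice(0, 13).replace('T', '-'); // "2021-08-01-12"
    return {
      recordId: record.recordId,
      result: 'Ok',
      data: record.data,
      metadata: { partitionKeys: { event_hour: eventHour } }
    };
  });
  return { records };
};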
I am trying to evaluate using Kinesis for stream processing log files. There is a separate process that uploads new logs into a S3 bucket - I can't touch that process. I want to know if there's a good way to stream new files that show up in the S3 log bucket into a Kinesis stream for processing. All documentation I've found so far covers using S3 as an output for the stream.
My current solution is to have a machine that constantly polls S3 for new files, downloads the new file to the local machine and streams it in using the Log4j appender. This seems inefficient. Is there a better way?
I realize this is a really old question, but have a look at AWS Lambda. It's perfect for your use case, as illustrated here.
In your case, you would set up the S3 event so that each new object added to the bucket invokes your Lambda function. In the Lambda function you then write a few lines of code that read in the file and send its contents to the Kinesis stream via the PutRecord (or PutRecords for batching) method.
Not only will this work for your use case, but it's also awesome since it checks off a few buzzwords: "serverless" and "realtime"!
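A minimal sketch of that function, assuming the log files are plain text with one record per line and the stream name is a placeholder:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const kinesis = new AWS.Kinesis();

// Triggered by S3 ObjectCreated events: read the new log file and push its
// lines to Kinesis in batches of up to 500 (the PutRecords limit).
exports.handler = async (event) => {
  for (const rec of event.Records) {
    const bucket = rec.s3.bucket.name;
    const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, ' '));
    const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const lines = obj.Body.toString('utf8').split('\n').filter(Boolean);
    for (let i = 0; i < lines.length; i += 500) {
      await kinesis.putRecords({
        StreamName: 'log-stream', // placeholder stream name
        Records: lines.slice(i, i + 500).map((line) => ({
          Data: line,
          PartitionKey: key
        }))
      }).promise();
    }
  }
};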