Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date-based prefix of the final S3 objects is generated.
The only option you have is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
This isn't a direct answer to the question, but I would like to explain a little about the idea behind storing records according to their arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming: one can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming consumption. A checkpoint is just a sequence number that identifies a position in the stream; by specifying this number, one can start reading the stream from a certain event.
Now back to the default Firehose-to-S3 setup. Since a Kinesis stream holds only a limited amount of data for a limited time, you most likely need to store the data from Kinesis somewhere to analyze it later, and the Firehose-to-S3 setup does this right out of the box: it simply stores the raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream you need sequential numbers for checkpoints. Those numbers are the records' arrival times.
What if you want to read records by creation time? The proper way to accomplish this seems to be to read the S3 "stream" sequentially, dump it into some (time-series) database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some batches of events while reading S3 (as a stream). So I would not suggest reorganizing the S3 keys at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say events_by_event_time, which is partitioned by the event_time attribute on your event (or however it's defined in your schema).
Finally, you schedule a process that runs an Athena INSERT INTO query, taking events from events and repartitioning them into events_by_event_time. Now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
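As a rough sketch, the scheduled step can be as small as an hourly job that calls Athena via boto3. The table, column, database, and results-bucket names below are placeholders for whatever your schema actually uses:

```python
import boto3

athena = boto3.client("athena")

# Repartition the last hour of arrival-time data by event time.
# Table, column, database, and output-bucket names are placeholders.
query = """
INSERT INTO events_by_event_time
SELECT user_id, payload, event_time   -- partition column must come last
FROM events
WHERE arrival_time >= date_add('hour', -1, now())
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```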
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
For future readers: Firehose supports Custom Prefixes for Amazon S3 Objects:
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
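As a rough sketch (stream name, role, and bucket ARNs are placeholders), the prefix is set when creating the delivery stream. Note that, per the linked docs, the !{timestamp:...} namespace is evaluated against the record's approximate arrival time, so true event-time partitioning still needs dynamic partitioning or post-processing:

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream; role, bucket, and stream names are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",
        "BucketARN": "arn:aws:s3:::my-events-bucket",
        # Custom prefix expressions, evaluated against the approximate arrival timestamp.
        "Prefix": "events/!{timestamp:yyyy/MM/dd/HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/",
    },
)
```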
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function that takes your records, processes them, sets the partition key, and then returns them to Firehose. You would also have to change the Firehose delivery stream to enable dynamic partitioning and to define your custom partition key/prefix/suffix.
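A minimal sketch of such a transformation Lambda, assuming the records are JSON with an event_time field and that the delivery stream's S3 prefix references the key via !{partitionKeyFromLambda:event_hour}:

```python
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # "event_time" is an assumed field; derive an hour bucket from it,
        # e.g. "2021-08-01T12:34:56Z" -> "2021-08-01T12".
        event_hour = payload["event_time"][:13]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],  # pass the original (base64) payload through unchanged
            "metadata": {"partitionKeys": {"event_hour": event_hour}},
        })
    return {"records": output}
```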
I have some fairly large datasets (upwards of 60k rows in a CSV file) that I need to ingest into Elasticsearch on a daily basis (to keep the data updated).
I currently have two lambda functions handling this.
Lambda 1:
A Python Lambda (Node.js would run out of memory doing this task) is triggered when a .csv file is added to S3 (this file could have upwards of 60k rows). The Lambda converts it to JSON and saves the result to another S3 bucket.
Lambda 2:
A Node.js Lambda that is triggered by the .json files generated by Lambda 1. This Lambda uses the Elasticsearch bulk API to try to insert all of the data into ES.
However, because of the large amount of data, we hit the ES API rate limits and fail to insert much of the data.
I have tried splitting the data and uploading smaller amounts at a time; however, that results in a very long-running Lambda function.
I have also looked at adding the data to a Kinesis stream, but even that has a limit on how much data you can add in each operation.
I am wondering what the best solution might be for inserting large amounts of data like this into ES. My next thought is to split the .json file into multiple smaller .json files and trigger the ES-loading Lambda for each of them. However, I am concerned that I would still just hit the rate limits of the ES domain.
Edit: Looking into the Kinesis Firehose option, this seems like the best choice, as I can set the buffer size to a maximum of 5 MB (the ES bulk API limit).
However, Firehose has a 1 MB limit per ingested record, so I'd still need some processing in the Lambda that pushes to Firehose to split up the data before pushing.
I'd suggest designing the application to use SQS for queuing your messages rather than using Firehose (which is more expensive and perhaps not the best option for your use case). Amazon SQS provides a lightweight queueing solution and is cheaper than Firehose (https://aws.amazon.com/sqs/pricing/).
Below is how it can work -
Lambda 1 converts each row to JSON and posts each JSON to SQS.
(assuming each JSON document is less than 256 KB, the SQS message size limit)
The SQS queue acts as an event source for Lambda 2 and triggers it in batches of, say, 5,000 messages.
(Ref - https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
Lambda 2 uses the payload received from SQS to insert into Elasticsearch using the bulk API.
The batch size can be adjusted based on your observations of how the Lambda is performing. Make sure to adjust the visibility timeout and set up a DLQ so it runs reliably.
This also reduces S3 costs by not storing the JSON in S3 for the second Lambda to pick up. Since the data ends up in Elasticsearch anyway, duplicating it in S3 should be avoided.
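The original Lambda 2 is Node.js, but as a rough Python sketch of the SQS-triggered handler (the endpoint and index name are placeholders, and the elasticsearch client library is assumed to be packaged with the function):

```python
import json
from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint for the managed ES domain.
es = Elasticsearch(["https://my-es-domain.eu-west-1.es.amazonaws.com:443"])

def handler(event, context):
    # Each SQS record body is one JSON document produced by Lambda 1.
    actions = [
        {"_index": "my-index", "_source": json.loads(record["body"])}
        for record in event["Records"]
    ]
    # One bulk call per invocation; failures surface as exceptions, so the
    # batch returns to the queue after the visibility timeout expires.
    helpers.bulk(es, actions)
```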
You could potentially create one record per row and push it to Firehose. When the data in the Firehose stream reaches the configured buffer size, it will be flushed to ES. This way only one Lambda is required, which processes the records from the CSV and pushes them to Firehose.
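A rough sketch of what that single Lambda's core logic could look like, assuming a delivery stream named csv-to-es, reading the file from local disk for simplicity, and ignoring retries for FailedPutCount:

```python
import csv
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "csv-to-es"  # placeholder name

def push_csv(path):
    """Convert each CSV row to JSON and push it to Firehose in batches."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            batch.append({"Data": (json.dumps(row) + "\n").encode()})
            if len(batch) == 500:  # PutRecordBatch accepts at most 500 records per call
                firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)
                batch = []
    if batch:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)
```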
I'm storing some events into DynamoDB. I have to sync (i.e. copy incrementally) the data with Redshift. Ultimately, I want to be able to analyze the data through AWS Quicksight.
I've come across multiple solutions but those are either one-time (using the one-time COPY command) or real-time (a streaming data pipeline using Kinesis Firehose).
The real-time solution seems superior to hourly sync, but I'm worried about performance and complexity. I was wondering if there's an easier way to batch the updates on an hourly basis.
What you are looking for is DynamoDB Streams (official docs). These can flow seamlessly into Kinesis Firehose, as you have correctly pointed out.
This is the optimal approach and provides the best balance between cost, operational overhead, and functionality. Allow me to explain how:
DynamoDB Streams: Streams are triggered whenever any activity happens on the table. This means that, unlike a process that scans the data periodically and consumes read capacity even when there is no update, you are notified of new data as it arrives.
Kinesis Firehose: You can configure Firehose to batch data by size or by time. This means that if you have a good inflow, you can set the stream to batch the records received in each 2-minute interval and then issue just one COPY command to Redshift. The same goes for the size of the data in the stream buffer. Read more about it here.
The ideal way to load data into Redshift is via the COPY command, and Kinesis Firehose does just that. You can also configure it to automatically back up the data to S3.
Remember that a reactive, push-based system is almost always more performant and less costly than a proactive, poll-based one. You save the compute capacity needed to run a cron process and continuously scan for updates.
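For reference, one common way to wire the stream into Firehose is a small forwarding Lambda subscribed to the DynamoDB stream. This is only a sketch: the delivery-stream name is a placeholder, and the NewImage would normally be flattened to match your Redshift table layout before the COPY:

```python
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "dynamo-to-redshift"  # placeholder name

def handler(event, context):
    records = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is in DynamoDB's attribute-value format; flatten it to the
            # plain column layout your Redshift COPY expects before shipping.
            image = record["dynamodb"]["NewImage"]
            records.append({"Data": (json.dumps(image) + "\n").encode()})
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```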
My use-case is as follows:
I have JSON data coming in that needs to be stored in S3 in Parquet format. So far so good: I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my Firehose stream. BUT the data comes from different "topics", and each topic has its own schema. As I understand it, I would have to create multiple Firehose streams, since one stream can only handle one schema. But I have thousands of such topics with very high-volume, high-throughput incoming data, and it does not look feasible to create that many Firehose resources (https://docs.aws.amazon.com/firehose/latest/dev/limits.html).
How should I go about building my pipeline?
IMO you can:
Ask for an increase of your Firehose limit and do everything with one Firehose stream, plus a Lambda transformation to convert the data into a common schema. IMO this is not cost-efficient, but you should check against your load.
Create a Lambda for each Kinesis data stream, convert each event to the schema managed by a single Firehose, and send the events directly to your Firehose stream with the Firehose API https://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecord.html (see "Q: How do I add data to my Amazon Kinesis Data Firehose delivery stream?" at https://aws.amazon.com/kinesis/data-firehose/faqs/); a minimal sketch of this is shown after this list. But check the costs first, because even though your Lambdas are invoked on demand, you may have a lot of them invoked over long periods of time.
Use one of the data processing frameworks (Apache Spark, Apache Flink, ...) and read your data from Kinesis in one-hour batches, each run starting from where the previous one stopped, then use the available sinks to convert the data and write it in Parquet format. These frameworks use the notion of checkpointing and store the last processed offset in external storage, so if you restart them every hour, they will start reading directly from the last seen entry. This may be cost-efficient, especially if you consider using spot instances. On the other hand, it requires more coding than the two previous solutions and obviously may have higher latency.
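A sketch of the second option, with a placeholder mapping function and delivery-stream name:

```python
import base64
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "common-schema-stream"  # placeholder name

def to_common_schema(topic_event):
    # Placeholder: map a topic-specific payload onto the single schema
    # that the Firehose/Glue format conversion is configured for.
    return {"topic": topic_event.get("topic"), "payload": json.dumps(topic_event)}

def handler(event, context):
    # Triggered by one topic's Kinesis data stream.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        firehose.put_record(
            DeliveryStreamName=DELIVERY_STREAM,
            Record={"Data": (json.dumps(to_common_schema(payload)) + "\n").encode()},
        )
```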
Hope that helps. Could you please give feedback about the solution you choose?
Question
I've read this, this, and this article, but they provide contradictory answers to the question: how do I customize partitioning when ingesting data into S3 from a Kinesis stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, data will be processed with EMR Spark.
From time to time I have to handle historical bulk ingest into Kinesis Streams. The issue is that my Spark logic depends heavily on data partitioning and on the order of event handling, but Firehose supports partitioning only by ingestion time (into the Kinesis stream), not by any other custom field (I need event_time).
For example, under Firehose's partition 2018/12/05/12/some-file.gz I can get data spanning the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition the data from the Kinesis stream with the help of a custom Lambda. But this looks more complex and error-prone to me, maybe because I'm not very familiar with AWS Lambda. Moreover, I'm not sure how well it will perform on a bulk load. This article said that the Lambda option is much more expensive than Firehose delivery.
Load the data with Firehose, then launch a Spark EMR job to copy the data to another bucket with the right partitioning. At least that sounds simpler to me (biased, since I'm just starting with AWS Lambda). But it has the drawbacks of a double copy and an additional Spark job.
In one hour I could have up to 1M rows that take up to 40 MB of memory (compressed). From Using AWS Lambda with Amazon Kinesis I know that Kinesis-to-Lambda event sourcing has a limit of 10,000 records per batch. Would it be effective to process such a volume of data with Lambda?
While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema, but is partitioned by event_time.
Now, you can make use of Athena's INSERT INTO capability to repartition the data without needing to write a Hadoop or Spark job, and you get Athena's serverless scaling for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company, and the post linked below goes into more detail on the trade-offs of using EMR or a streaming solution, but the point is that you don't need to introduce any more systems like Lambda or EMR.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena
You can use the Kinesis stream and create the partitions however you want.
You create a producer, and in your consumer you create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/
We want to use a Kinesis stream and Firehose to update an AWS-managed Elasticsearch cluster. We have hundreds of different indexes (corresponding to our DB shards) that need to be updated. When creating the Firehose it requires that I specify the specific index name I want updated. Does that mean I need to create a separate Firehose for each index in my cluster? Or is there a way to configure the Firehose so it knows which index to use based on the content of the data?
Also, we would have 20 or so separate producers that would send data to a Kinesis stream (each of these producers would generate data for 10 different indexes). Would I also need a separate Kinesis stream for each producer?
Summary:
20 producers (EC2 instances) -> each producer sends data for 20 different indexes to a Kinesis stream -> the Kinesis stream then uses a Firehose to update a single cluster, which has 200 indexes in it.
Note: all of the indexes have the same mapping and name template, i.e. index_1, index_2, ..., index_200.
Edit: As we reindex the data, we create new indexes along the lines of index_1-v2. Obviously we don't want to create a new Firehose for each index version as it's created. The new index name can be included in the JSON that's sent to the Kinesis stream.
As you guessed, Firehose is the wrong solution for this problem, at least as stated. It is designed for situations where there's a 1:1 correspondence between stream (not producer!) and index. Things like clickstream data or log aggregation.
For any solution, you'll need to provide a mechanism to identify which index a record belongs to. You could do this by creating a separate Kinesis stream per message type (in which case you could use Firehose), but this would mean that your producers have to decide which stream to write each message to. That may cause unwanted complexity in your producers, and may also increase your costs unacceptably.
So, assuming that you want a single stream for all messages, you need a consumer application and some way to group those messages. You could include a message type (/ index name) in the record itself, or use the partition key for that purpose. The partition key makes for a somewhat easier implementation, as it guarantees that records for the same index will be stored on the same shard, but it means that your producers may be throttled.
For the consumer, you could use an always-on application that runs on EC2, or have the stream invoke a Lambda function.
Using Lambda is nice if you're using the partition key to identify the message type, because each invocation only looks at a single shard (although you may still have multiple partition keys in one invocation). On the downside, Lambda will poll the stream once per second, which may result in throttling if you have multiple stream consumers (with a stand-alone app you can control how often it polls the stream).
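To make the Lambda route concrete, here is a sketch that assumes each record's JSON carries the target index name in an index_name field (as mentioned in the question's edit), with a placeholder ES endpoint and the elasticsearch client library packaged with the function:

```python
import base64
import json
from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint for the managed ES domain.
es = Elasticsearch(["https://my-es-domain.eu-west-1.es.amazonaws.com:443"])

def handler(event, context):
    # One invocation receives a batch of Kinesis records from a single shard.
    actions = []
    for record in event["Records"]:
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Route each document to the index named inside it (e.g. "index_1-v2").
        actions.append({"_index": doc.pop("index_name"), "_source": doc})
    helpers.bulk(es, actions)
```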