I want to move (export) data from DynamoDB to S3.
I have seen this tutorial, but I'm not sure whether the extracted data will be deleted from DynamoDB or will coexist in DynamoDB and S3 at the same time.
What I expect is that the data will be deleted from DynamoDB and stored in S3 (after being stored in DynamoDB for X time).
The main purpose of the project could be similar to this
Is there any way to do this without having to develop a Lambda function?
In summary, I have found these 2 different approaches:
DynamoDB -> Data Pipeline -> S3 (is the DynamoDB data deleted?)
DynamoDB -> DynamoDB TTL + DynamoDB Streams -> Lambda -> Firehose -> S3 (this appears to be more difficult)
Is this post still valid for this purpose?
What would be the simplest and fastest way?
In your first option, by default, data is not removed from DynamoDB. You could design a pipeline to make this work, but I think that is not the best solution.
In your second option, you must evaluate the solution based on your expected data volume:
If the volume of data expiring under the TTL definition is not very large, you can use Lambda to persist the removed data to S3 without Firehose. You can design a simple Lambda function that is triggered by the DynamoDB Stream and persists each stream event as an S3 object (a sketch of such a function is shown below). You can even trigger another Lambda function to consolidate the objects into a single file at the end of the day, week, or month. But again, this depends on your expected volume.
If you have a lot of data expiring at the same time and you must perform transformations on this data, the best solution is to use Firehose. Firehose can transform, encrypt, and compress your data before sending it to S3. If the volume of data is too big, consolidating with functions at the end of the day, week, or month may not be feasible, so it is better to perform all of these steps before persisting it.
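A minimal sketch of the first approach, assuming a Python Lambda subscribed to the table's stream with a NEW_AND_OLD_IMAGES view and a hypothetical archive bucket (both are assumptions, not part of the original answer):

```python
import json
import os

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("ARCHIVE_BUCKET", "my-archive-bucket")  # hypothetical bucket


def handler(event, context):
    for record in event.get("Records", []):
        # Only archive deletions performed by the TTL process itself.
        if record.get("eventName") != "REMOVE":
            continue
        if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
            continue

        # OldImage is present when the stream view type includes old images.
        old_image = record["dynamodb"].get("OldImage", {})

        # Use the stream sequence number as a unique object key.
        key = f"expired/{record['dynamodb']['SequenceNumber']}.json"
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(old_image).encode("utf-8"),
        )
```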
You can use AWS Data Pipeline to dump a DynamoDB table to S3, and the data will not be deleted from DynamoDB.
I have gone through a couple of Stack Overflow questions regarding hourly backups from DynamoDB to S3, where the best solution turned out to be enabling a DynamoDB Stream, subscribing a Lambda function, and pushing to S3.
I am trying to understand whether pushing directly from Lambda to S3 is fine, or whether to go from Lambda to Kinesis Firehose and then to S3. Can someone share what the advantage is of introducing Firehose in between? We trigger Lambda only after a specific batch window anyway, which implies we are already buffering there.
Thanks in advance.
Firehose gives you the ability to convert and compress your data. In addition, you can directly attach a Glue metadata table, so you can query your data with Athena.
You can write a Lambda function that reads a DynamoDB table, gets a result set, encodes the data in some format (e.g., JSON), then places that JSON into an Amazon S3 bucket. You can use scheduled events to fire off the Lambda function on a regular schedule.
Here is an AWS tutorial that shows you how to use scheduled events to invoke a Lambda function:
Creating scheduled events to invoke Lambda functions
This AWS tutorial also shows you how to read data from an Amazon DynamoDB table from a Lambda function.
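As a rough illustration (not taken from the tutorials above), a scheduled Lambda along these lines could do the export; the table and bucket names are placeholders:

```python
import json
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "my-table"        # assumption: replace with your table name
BUCKET = "my-export-bucket"    # assumption: replace with your bucket name


def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)

    # Paginate through the full table; acceptable for small tables only.
    items = []
    response = table.scan()
    items.extend(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    # One timestamped JSON object per run.
    key = datetime.now(timezone.utc).strftime("exports/%Y/%m/%d/%H%M%S.json")
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(items, default=str).encode("utf-8"),
    )
```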
I want to move my DynamoDB table (which has approx. 1000 rows) to S3, and every time the DynamoDB table gets updated, the file in S3 should be updated automatically.
What's a good way to implement this? Is my approach correct?
Initial step:
DynamoDB -> Glue/Data Pipeline -> S3
Updating S3:
DynamoDB (or DynamoDB Stream?) -> Lambda -> S3
Is it better to use Glue or Data Pipeline for moving? And is there a helpful link on how I can write a Lambda function for this case?
As the number of rows is small, it is better to go with a Lambda-based solution to avoid the bootstrap time of Glue/Data Pipeline (which uses EMR) and the associated costs.
Doing CRUD operations on files is difficult, so the Lambda function can do a full table scan every time, write the results to a file, and push it to S3.
This function can run on a schedule with a 1-minute frequency, or less often, as you see fit (see the scheduling sketch below).
Hope this helps!
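For reference, a minimal sketch of wiring up such a schedule with boto3; the function ARN and rule name here are hypothetical, and the console, CloudFormation, or Terraform work just as well:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical ARN of the export function.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:export-to-s3"

# Create (or update) a rule that fires every minute.
rule = events.put_rule(
    Name="export-dynamodb-to-s3",
    ScheduleExpression="rate(1 minute)",
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-eventbridge-export",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the function.
events.put_targets(
    Rule="export-dynamodb-to-s3",
    Targets=[{"Id": "export-lambda", "Arn": FUNCTION_ARN}],
)
```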
There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 through a daily mechanism.
I am interested in copying that data from S3 into my service account to further classify it. The classification needs to happen in my AWS account because it is based on information present in my account and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether data exists in the DynamoDB table, I need to add an additional attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CloudWatch event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CloudWatch event to invoke a Lambda that reads the records in the JSON, compares them with the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to compare against DynamoDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
There is a maximum of 512 MB of temporary storage (/tmp) made available to a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
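As a rough sketch of the second (classification) Lambda, assuming each source object is a JSON-lines file, a hypothetical DynamoDB table keyed on record_id, and a hypothetical destination bucket (all assumptions; adapt to your schema):

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("classification-table")  # assumption
DEST_BUCKET = "bucket-b"                                          # assumption


def handler(event, context):
    # Triggered by an S3 event notification on Bucket A.
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = unquote_plus(notification["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        enriched = []
        for line in body.splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            # Classify based on whether a matching item exists in DynamoDB.
            lookup = table.get_item(Key={"record_id": record["record_id"]})
            record["classified"] = "Item" in lookup
            enriched.append(record)

        s3.put_object(
            Bucket=DEST_BUCKET,
            Key=key,
            Body="\n".join(json.dumps(r) for r in enriched).encode("utf-8"),
        )
```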
Firehose -> S3 uses the current date as a prefix when creating keys in S3, so this partitions the data by the time the record is written. My Firehose stream contains events that have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date portion of the final S3 object keys is generated.
The only option you have is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
This is not a direct answer to the question; however, I would like to explain a little the idea behind storing records according to event arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consumption: one can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number that identifies a position in the stream. By specifying this number, one can start reading the stream from a certain event.
Now back to the default S3 Firehose setup. Since a Kinesis stream can hold data for only a limited time, one most probably needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it simply stores raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream one needs those sequential numbers for checkpoints. And those numbers are the records' arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time-series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some batches of events while reading the S3 (stream). So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say, events_by_event_time, which is partitioned by the event_time attribute on your event, or however it's defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them into events_by_event_time. Now your events are partitioned by event_time without requiring EMR, Data Pipeline, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
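A minimal sketch of the scheduled repartitioning step, assuming hypothetical table and database names, a string arrival_date partition column on the source table, and compatible schemas between the two tables (all assumptions):

```python
import boto3

athena = boto3.client("athena")

# Repartition yesterday's arrivals by event_time; "arrival_date" is an assumed
# string partition column on the source table.
QUERY = """
INSERT INTO events_by_event_time
SELECT *
FROM events
WHERE arrival_date = cast(date_add('day', -1, current_date) as varchar)
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "analytics"},                    # assumption
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumption
)
```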
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function that takes your records, processes them, sets the partition key, and then returns them to Firehose for delivery. You would also have to configure the Firehose delivery stream to enable this partitioning and define your custom partition key/prefix/suffix.
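A minimal sketch of such a transformation Lambda for dynamic partitioning, assuming each record is a JSON object with an ISO-8601 event_time field (an assumption; adapt the parsing to your schema):

```python
import base64
import json


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        event_time = payload["event_time"]  # e.g. "2021-08-01T13:45:00Z" (assumed field)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],  # payload passed through unchanged
            "metadata": {
                "partitionKeys": {
                    # Referenced in the delivery stream's S3 prefix as
                    # !{partitionKeyFromLambda:event_hour}/
                    "event_hour": event_time[:13],  # "YYYY-MM-DDTHH"
                }
            },
        })
    return {"records": output}
```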
I am using Data Pipeline (DP) for daily backups of DynamoDB; however, I would like to do incremental backups of the data that is missed by DP runs (updates between DP runs). To accomplish that, I would like to use DynamoDB Streams + Lambda + S3 to bring real-time DynamoDB updates to S3. I understand how DynamoDB Streams work, but I am struggling with creating a Lambda function that writes to S3 and, say, rolls over to a new file every hour.
Has anyone tried it?
It's about an hour of work. What you need to do is:
Enable the DynamoDB update stream and attach the AWS-provided Lambda function:
https://github.com/awslabs/lambda-streams-to-firehose
Enable a Firehose delivery stream and use the above function to stream records into Firehose.
Configure Firehose to deliver the records to S3.
Done.
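If you prefer a hand-rolled forwarder instead of the awslabs function, a minimal sketch might look like this (the delivery stream name is hypothetical):

```python
import json

import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "ddb-backup-stream"  # assumption: your Firehose delivery stream


def handler(event, context):
    # Forward each DynamoDB Stream record to Firehose as a JSON line.
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode("utf-8")}
        for r in event.get("Records", [])
    ]
    if records:
        # put_record_batch accepts up to 500 records per call.
        firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)
```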