Best strategy to archive specific records from RDS to a cheaper storage in AWS - amazon-web-services

I have the following requirements:
For every deleted record in RDS we need to archive it into somewhere cheaper on AWS.
Reduce storage cost
Not using Glacier
Context oriented (e.g. a file per table)
re-import is not a requirement
I'm not an experienced AWS user, so I'm still a bit lost among the number of options it has to offer, and I'd like to know if you have more ideas to help me clear things up.
Initial thoughts:
The microservice that deletes the record might send it to a broker (e.g. RabbitMQ), and another microservice (let's call it the archiver) would listen to it, write the records into a file, zip it and send it to S3. This approach has some technical challenges though: for it to make sense to create big files, I need to wait for the queue to grow a bit, wrap it into a stream and zip it into S3. The transaction control is very weak as well, since file writing and message acks are signal based, i.e. I'd remove the messages from the broker only after the file is created.
Add a new "deleted (bool)" column to the "archivable" tables and run a separate job that fetches only those records and saves them into S3. Discarded: they don't want a new microservice with access to other services' databases.
Following the same approach as in the first item, but instead of saving into S3, save into a cheaper database. SimpleDB?

Option 1, but instead of RabbitMQ, write the record to a Kinesis Firehose delivery stream and direct that to an S3 location - it doesn't get much cheaper or easier than that.
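A minimal sketch of what the deleting microservice could do under this suggestion, assuming boto3's Firehose client and its `put_record` call. The stream name, the envelope fields (`table`, `deleted_at`, `row`) and both function names are hypothetical, chosen only for illustration. Each record is written as one newline-terminated JSON line so the objects Firehose delivers to S3 stay line-delimited.

```python
import json
from datetime import datetime, timezone


def build_archive_record(table: str, row: dict) -> dict:
    """Wrap a deleted row as one newline-terminated JSON line for Firehose."""
    envelope = {
        "table": table,  # lets a later job split the archive per table
        "deleted_at": datetime.now(timezone.utc).isoformat(),
        "row": row,
    }
    # Firehose concatenates records as-is, so the trailing newline keeps
    # one archived row per line in the delivered S3 object.
    data = (json.dumps(envelope, default=str) + "\n").encode("utf-8")
    return {"Data": data}


def archive_deleted_row(firehose_client, stream_name: str, table: str, row: dict) -> None:
    """Send the deleted row to the delivery stream; Firehose buffers and writes to S3."""
    firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record=build_archive_record(table, row),
    )
```

In a service this might be called as `archive_deleted_row(boto3.client("firehose"), "rds-archive", "orders", row)` right after the delete commits; buffering, compression and the S3 destination are configured on the delivery stream rather than in code.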

Related

Architectural advice for AWS firehose or similar when collecting a lot of events in real-time

I would like some advice on handling many application events on AWS. My application sends a lot of different events about everything a user does, in real time. To collect those events I'm using AWS Kinesis Firehose: I have a few data streams to which I push different events. Some events, before being stored on S3/Redshift, contain data that I want to extract and store in other databases (DynamoDB) or in other S3 files. For that case I'm using a Lambda that is assigned to a specific stream.
My problem is that the business keeps adding new events that they need to collect or act on, and for every new event or group of events I need to create a separate data stream + S3/Redshift/Elasticsearch destination + Lambda for extracting data. Also, events on S3 are stored in one format, and it is not possible to group those events, e.g. by the application's userId or even by the event name, in the stream's file names. Ideally, the S3 layout for those events would look like events/{user_id}/{date}/{event-name}{timestamp}.json.
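To make the desired layout concrete, here is a small sketch of a key builder that a consumer writing its own S3 objects could use. The function name, the date and timestamp formats, and the hyphen between event name and timestamp are assumptions for illustration; the template above does not specify them.

```python
from datetime import datetime, timezone


def event_key(user_id: str, event_name: str, when: datetime) -> str:
    """Build an S3 key shaped like events/{user_id}/{date}/{event-name}{timestamp}.json."""
    # Assumes `when` is timezone-aware; it is normalised to UTC first.
    utc = when.astimezone(timezone.utc)
    day = utc.strftime("%Y-%m-%d")
    stamp = utc.strftime("%Y%m%dT%H%M%SZ")
    return f"events/{user_id}/{day}/{event_name}-{stamp}.json"
```

With keys like this, listing everything for one user on one day becomes a single prefix query on `events/{user_id}/{date}/`.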
Maybe I'm using Firehose wrongly or thinking about it the wrong way for my case; maybe there are other, better services on AWS that would give me more control. Maybe simple SQS + Lambdas as listeners on S3 is a better solution in this case?
Thanks for any advice.
EDIT 12th Nov 2020
This was supposed to be a comment for @Lina, but it was too long for a comment, so I updated my question with the solution I picked.
I resolved my issue by feel, so it may not be a good approach to copy, but: I wrote a Node.js routing application that I connected to Firehose, plus a few microservices to which my routing app sends the data from Firehose. So now I have one Firehose stream taking 10 different event types. When an event arrives, my routing application decides, based on the event type, which microservice should be run with which data (the raw Firehose event is still stored on S3 automatically). This gives me the flexibility I need: I can extract specific data from the event and do whatever I need with it by running any other microservice in the system, while still keeping the raw event in S3 in case I need to go back through the event history.
Some of the events are not passed to any service; they are just stored as raw S3 files, e.g. application logs. I can still do many things with those files via the S3 PUT/CREATE event.
I hope that it will help someone with a similar problem.
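The routing idea described in the edit can be sketched roughly as follows. The author's implementation is a Node.js application that is not shown; this is a Python illustration only, and the `EventRouter` class, the `type` field on events and the handler signature are all assumptions.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch each incoming event to the handlers registered for its type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def route(self, event: dict) -> int:
        """Run every handler for the event's type; return how many ran.

        Events with no registered handler are simply skipped here, which
        mirrors the case where only the raw copy on S3 is kept.
        """
        handlers = self._handlers.get(event.get("type"), [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

Adding a new event type then means registering one more handler rather than creating another stream, destination and Lambda.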

Spark Streaming with S3 vs Kinesis

I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
Have Spark Streaming watch an S3 prefix and pick up new objects as they come in
Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application.
While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points:
The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files.
I posted this to Spark mailing list and got a good answer from Steve Loughran.
There's a slightly-more-optimised streaming source for cloud streams here:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll leave it to you to work out how many there will be in your application, and how much it will cost. And of course, the more LIST calls there are, the longer things take, the bigger your window needs to be.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
Objects written to S3 aren't visible until the upload completes, in an atomic operation. You can write in place and not worry.
The timestamp on S3 artifacts comes from the PUT time. On multipart uploads of many MB/many GB uploads, that's when the first post to initiate the MPU is kicked off. So if the upload starts in time window t1 and completes in window t2, the object won't be visible until t2, but the timestamp will be of t1. Bear that in mind.
The lambda callback probably does have better scalability and resilience; not tried it myself.
Since the number of objects in my scenario is going to be much larger than 5000 and will continue to grow very quickly, S3 to Spark doesn't seem to be a feasible option. I did consider moving/renaming processed objects in Spark Streaming, but the Spark Streaming application code seems to only receive DStreams and no information about which S3 object the data is coming from. So I'm going to go with the Lambda and Kinesis option.
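The "work out how many there will be" arithmetic from the quoted answer can be sketched like this. The 5000-objects-per-LIST figure is taken from the answer; the function names are hypothetical, and the request price is deliberately a parameter because it varies by region and over time.

```python
import math


def list_requests_per_scan(object_count: int, keys_per_request: int = 5000) -> int:
    """LIST calls needed to enumerate every object under the prefix once."""
    # Even an empty prefix costs one request to discover that it is empty.
    return max(1, math.ceil(object_count / keys_per_request))


def daily_list_cost(
    object_count: int,
    scans_per_day: int,
    price_per_1000_requests: float,
    keys_per_request: int = 5000,
) -> float:
    """Estimated daily cost of rescanning the whole prefix on every batch."""
    requests = list_requests_per_scan(object_count, keys_per_request) * scans_per_day
    return requests * price_per_1000_requests / 1000
```

For example, one million objects need 200 LIST calls per scan; with a 10-second batch interval (8640 scans per day) that is 1,728,000 requests per day, and the total keeps growing as objects accumulate under the prefix.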

How long does it take for AWS S3 to save and load an item?

S3 FAQ mentions that "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." However, I don't know how long it takes to get eventual consistency. I tried to search for this but couldn't find an answer in S3 documentation.
Situation:
We have a website that consists of 7 steps. When the user clicks save in each step, we want to save a JSON document (containing the information from all 7 steps) to Amazon S3. Currently we plan to:
Create a single S3 bucket to store all JSON documents.
When the user saves step 1, we create a new item in S3.
When the user saves steps 2-7, we overwrite the existing item.
After the user saves a step and refreshes the page, they should be able to see the information they just saved, i.e. we want to make sure that we always read after write.
The full JSON document (all 7 steps completed) is around 20 KB.
After the user clicks the save button, we can freeze the page for some time so that they cannot make other changes until the save is finished.
Question:
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
Is there a function to calculate save/load time based on item size?
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
I wanted to add to @error2007s's answer.
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
It's not only that you will not find the exact time anywhere - there's actually no such thing as an exact time. That's just what "eventual consistency" is all about: consistency will be achieved eventually. You can't know when.
If somebody gave you an upper bound for how long a system would take to achieve consistency, then you wouldn't call it "eventually consistent" anymore. It would be "consistent within X amount of time".
The problem now becomes, "How do I deal with eventual consistency?" (instead of trying to "beat it")
To really find the answer to that question, you need to first understand what kind of consistency you truly need, and how exactly the eventual consistency of S3 could affect your workflow.
Based on your description, I understand that you would write a total of 7 times to S3, once for each step you have. For the first write, as you correctly cited the FAQs, you get strong consistency for any reads after that. For all the subsequent writes (which are really "replacing" the original object), you might observe eventual consistency - that is, if you try to read the overwritten object, you might get the most recent version, or you might get an older version. This is what is referred to as "eventual consistency" on S3 in this scenario.
A few alternatives for you to consider:
don't write to S3 on every single step; instead, keep the data for each step on the client side, and then only write 1 single object to S3 after the 7th step. This way, there's only 1 write, no "overwrites", so no "eventual consistency". This might or might not be possible for your specific scenario, you need to evaluate that.
alternatively, write to S3 objects with different names for each step. E.g., something like: after step 1, save that to bruno-preferences-step-1.json; then, after step 2, save the results to bruno-preferences-step-2.json; and so on, then save the final preferences file to bruno-preferences.json, or maybe even bruno-preferences-step-7.json, giving yourself the flexibility to add more steps in the future. Note that the idea here is to avoid overwrites, which could cause eventual consistency issues. Using this approach, you only write new objects; you never overwrite them.
finally, you might want to consider Amazon DynamoDB. It's a NoSQL database, you can securely connect to it directly from the browser or from your server. It provides you with replication, automatic scaling, load distribution (just like S3). And you also have the option to tell DynamoDB that you want to perform strongly consistent reads (the default is eventually consistent reads; you have to change a parameter to get strongly consistent reads). DynamoDB is typically used for "small" records, 20kB is definitely within the range -- the maximum size of a record would be 400kB as of today. You might want to check this out: DynamoDB FAQs: What is the consistency model of Amazon DynamoDB?
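The second and third alternatives can be sketched as follows. The key pattern follows the bruno-preferences-step-N.json example above; the function names, the table name and the `user_id` key attribute are assumptions for illustration. The DynamoDB read uses the `ConsistentRead` parameter of boto3's low-level `get_item`, which is the parameter the answer refers to.

```python
def step_key(user_id: str, step: int) -> str:
    """One object per step, so every save is a new key and never an overwrite."""
    return f"{user_id}-preferences-step-{step}.json"


def latest_saved_step(existing_keys, user_id: str, total_steps: int = 7):
    """Return the highest step that already has an object, or None if none do."""
    for step in range(total_steps, 0, -1):
        if step_key(user_id, step) in existing_keys:
            return step
    return None


def read_preferences(dynamodb_client, table_name: str, user_id: str):
    """Fetch the document with a strongly consistent read (not the default)."""
    response = dynamodb_client.get_item(
        TableName=table_name,
        Key={"user_id": {"S": user_id}},  # hypothetical partition key name
        ConsistentRead=True,
    )
    return response.get("Item")
```

On page refresh, the application would look up `latest_saved_step` for the user and load that object, instead of re-reading a single key that was just overwritten.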
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
You will not find the exact time anywhere. If you ask AWS they will give you approximate timings. Your file is 20 KB, so in my experience with S3 the time will be more or less 60-90 sec.
Is there a function to calculate save/load time based on item size?
No, there is no function you can use to calculate this.
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
For Seattle, US West (Oregon) will work with no problem.
You can also take a look at this experiment for comparison https://github.com/andrewgaul/are-we-consistent-yet

How can we efficiently push data from csv file to dynamodb without using aws pipeline?

Considering the fact that there is no data pipeline available in Singapore region, are there any alternatives available to efficiently push csv data to dynamodb?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and the Lambda function would be responsible for loading the data into DynamoDB.
This would work better (because of the limits of Lambda) if the CSV files were not huge, so they could be processed in a reasonable amount of time. The bonus is that, once it is working, the only work that needs to be done is to drop new files into the right bucket - no server required.
Here is a GitHub repository that has a CSV->DynamoDB loader written in Java - it might help get you started.
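A rough Python sketch of the Lambda side of this idea (not the Java loader mentioned above). It assumes the standard S3 notification event shape, boto3's `get_object` on an S3 client and `batch_writer()` on a DynamoDB `Table` resource; the function names are hypothetical. Every CSV value is stored as a string and empty cells are dropped, which is a simplification for illustration.

```python
import csv
import io
from urllib.parse import unquote_plus


def csv_to_items(text: str) -> list:
    """Turn CSV text with a header row into a list of DynamoDB-ready dicts."""
    items = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip surplus columns (key None) and empty or missing cells.
        item = {k: v for k, v in row.items() if k is not None and v not in ("", None)}
        if item:
            items.append(item)
    return items


def handle_s3_event(event: dict, s3_client, table) -> int:
    """Load every CSV object named in an S3 notification; return rows written."""
    written = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        items = csv_to_items(body.decode("utf-8"))
        # batch_writer groups the puts into batched writes behind the scenes.
        with table.batch_writer() as batch:
            for item in items:
                batch.put_item(Item=item)
        written += len(items)
    return written
```

Inside a real Lambda, the entry point would call this with `boto3.client("s3")` and `boto3.resource("dynamodb").Table(...)` created outside the handler so they are reused between invocations.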

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function, which is invoked any time my S3 bucket is written to, and have it check manually whether all the other files have been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems ghetto and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and get back N as the completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
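A minimal sketch of Option 1, assuming boto3's SNS `publish` call on the client side and the standard SNS-to-Lambda event shape on the receiving side. The message fields (`batch_id`, `keys`), the subject line and both function names are hypothetical.

```python
import json


def notify_batch_complete(sns_client, topic_arn: str, batch_id: str, keys) -> str:
    """Client side: announce that all uploads in the batch are finished."""
    message = json.dumps({"batch_id": batch_id, "keys": sorted(keys)})
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="upload-batch-complete",
        Message=message,
    )
    return message


def batches_from_sns_event(event: dict) -> list:
    """Lambda side: extract the batch descriptions from an SNS-triggered event."""
    return [json.loads(record["Sns"]["Message"]) for record in event["Records"]]
```

The client calls `notify_batch_complete` once after its last upload returns and can then exit; it needs to know only the topic, not what happens downstream.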
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The downside to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.
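The atomic counter in Option 3 could look roughly like this, assuming boto3's DynamoDB `Table.update_item` with an `ADD` update expression. The attribute names (`batch_id`, `uploaded_count`, `expected_count`) and the function name are hypothetical. Because the increment happens inside DynamoDB and each call gets back its own post-update value, two concurrent invocations cannot both read N, which addresses the race described in the question.

```python
def record_upload(table, batch_id: str) -> bool:
    """Atomically count one finished upload; return True when the batch is complete."""
    response = table.update_item(
        Key={"batch_id": batch_id},
        # ADD increments server-side, so there is no read-modify-write gap.
        UpdateExpression="ADD uploaded_count :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="ALL_NEW",
    )
    item = response["Attributes"]
    return int(item["uploaded_count"]) == int(item["expected_count"])
```

The S3-triggered Lambda would call this once per object and start the follow-up work only when it returns True. Since S3 notifications can occasionally be delivered more than once, a production version would probably also need to guard against counting the same key twice.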