S3 Incremental Backups - amazon-web-services

I am currently using S3 to store large quantities of account-level data such as images, text files, and other durable content that users upload in my application. I am looking to take an incremental snapshot of this data (once per week) and ship it off to another S3 bucket. I'd like to do this in order to protect against accidental data loss, e.g. one of our engineers accidentally deleting a chunk of data in the S3 browser.
Can anyone suggest a methodology for achieving this? Would we need to host our own backup application on an EC2 instance? Is there an application that will handle this out of the box? The data can go into S3 Glacier and doesn't need to be readily accessible; it's more of an insurance policy than anything else.
EDIT 1
I believe switching on versioning may be the answer (continuing to research this):
http://docs.amazonwebservices.com/AmazonS3/latest/dev/Versioning.html
EDIT 2
For others looking for answers to this question, there's a good thread on ServerFault that I only came across later:
https://serverfault.com/questions/9171/aws-s3-bucket-backups

Enabling versioning on your bucket is the right solution. It protects against both accidental deletes and accidental overwrites.
There's a question on the S3 FAQ, under "Data Protection", that discusses exactly this issue (accidental deletes/overwrites): http://aws.amazon.com/s3/faqs/#Why_should_I_use_Versioning
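For reference, switching versioning on (and inspecting the delete markers it leaves behind) is only a couple of API calls. A minimal boto3 sketch, assuming a hypothetical bucket name:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket name, substitute your own.
    BUCKET = "my-app-user-content"

    # Turn on versioning; from now on every overwrite keeps the old version
    # and every delete just adds a delete marker.
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # After an accidental delete, the old data is still there behind the marker:
    versions = s3.list_object_versions(Bucket=BUCKET, Prefix="images/")
    for marker in versions.get("DeleteMarkers", []):
        print("delete marker:", marker["Key"], marker["VersionId"])

Removing a delete marker (delete_object with its VersionId) restores the object, and a lifecycle rule can push noncurrent versions to Glacier if you want the cheap "insurance policy" storage mentioned in the question.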

Related

Amazon AWS S3 Lifecycle rule exception?

I've got a few S3 buckets that I'm using as a storage backend for Duplicacy, which stores its metadata in chunks right alongside the backup data.
I currently have a lifecycle rule to move all objects with the prefix "chunks/" to Glacier Deep Archive. The problem is, I then can't list the contents of a backup revision, because some of those chunks contain backup metadata that's needed to list, initiate a restore, etc.
The question is: is there a method by which I could apply some tag to certain objects such that, even though they are in the "chunks/" folder, they are exempt from the lifecycle rule?
Looking for a solution to basically the same problem.
I've seen this, which seems consistent with what I'm finding: it can't be done in a straightforward fashion. It's a few years old, though; I'll be disappointed if that's still the answer.
I expected to see the exclude use case in these examples, but no luck.
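One workaround worth noting: lifecycle filters can't express "prefix chunks/ but not tag X", but they can combine a prefix with a tag, so you can invert the logic and only archive objects that carry an explicit tag. A boto3 sketch with hypothetical names (you'd still need some way to tag the data chunks, which may or may not be practical with Duplicacy):

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and tag: only objects under chunks/ that also carry
    # archive=true are transitioned; untagged metadata chunks are left alone.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-duplicacy-backups",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-tagged-chunks",
                    "Status": "Enabled",
                    "Filter": {
                        "And": {
                            "Prefix": "chunks/",
                            "Tags": [{"Key": "archive", "Value": "true"}],
                        }
                    },
                    "Transitions": [
                        {"Days": 30, "StorageClass": "DEEP_ARCHIVE"}
                    ],
                }
            ]
        },
    )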

multiple organizations/accounts and dev environments for S3 buckets

As per best practice, AWS resources should be separated per account (prod, stage, ...), and it's also good to give devs their own accounts with defined limits (budget, region, ...).
I'm now wondering how I can create a fully working dev environment, especially when it comes to S3 buckets.
Most of the services are pay-per-use, so it's totally fine to spin up some Lambdas, SQS, etc. and use the real services for dev.
Now to the real question: what should be done with static assets like pictures, downloads, and so on that are stored in S3 buckets?
Duplicating those buckets for every dev/environment could get expensive, as you pay for storage and/or data transfer.
What I thought was to give each dev's S3 bucket a redirect rule so that when a file is not found (i.e. 404) in the dev bucket, it redirects to the prod bucket and the images etc. are retrieved from there.
I have tested this and it works pretty well, but it solves only part of the problem.
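Roughly, the redirect rule I mean looks like this; a sketch with hypothetical bucket names and region, set up via boto3 here just to illustrate (the dev bucket is configured as a website and 404s bounce to the prod bucket's website endpoint):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_website(
        Bucket="my-assets-dev-alice",  # hypothetical dev bucket
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "RoutingRules": [
                {
                    # When the object is missing in the dev bucket...
                    "Condition": {"HttpErrorCodeReturnedEquals": "404"},
                    # ...send the browser to the prod assets bucket instead.
                    "Redirect": {
                        "HostName": "my-assets-prod.s3-website-eu-central-1.amazonaws.com",
                        "HttpRedirectCode": "302",
                        "Protocol": "http",
                    },
                }
            ],
        },
    )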
The other part is how to replace those files in a convenient way?
Currently, static assets and downloads are also in our Git (maybe not the best idea after all... how do you handle file changes that should go live with new features? Currently it's convenient to have them in Git as well), and when someone changes something they push it and it gets deployed to prod.
We could of course sync the new files uploaded to a dev's S3 bucket back to the prod bucket, but how do you combine this with merge requests and have a good CI/CD experience?
What are your solutions for giving every dev their own S3 buckets so that they can spin up a completely working dev environment with everything available to them?
My experience is that you don't want to complicate things just to save a few dollars. S3 costs are pretty cheap, so if you're just talking about website assets, like HTML, CSS, JavaScript, and some images, then you're probably going to spend more time creating, managing, and troubleshooting a solution than you'll save. Time is, after all, your most precious resource.
If you do have large items that need to be stored to make your system work, then maybe put a lifecycle policy on the S3 bucket for those large items and delete them after some reasonable amount of time. If/when a dev needs such an object, they can retrieve it again from its source and upload it again manually. You could write a script to do that pretty easily.
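A sketch of such a script, assuming hypothetical prod/dev bucket names; it does a server-side copy so nothing has to pass through the dev's machine:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket names, adjust to your own prod/dev layout.
    PROD_BUCKET = "my-app-assets-prod"
    DEV_BUCKET = "my-app-assets-dev-alice"

    def fetch_asset(key: str) -> None:
        """Copy a single large asset from the prod bucket into the dev bucket on demand."""
        s3.copy_object(
            Bucket=DEV_BUCKET,
            Key=key,
            CopySource={"Bucket": PROD_BUCKET, "Key": key},
        )

    if __name__ == "__main__":
        fetch_asset("downloads/big-installer-v2.zip")  # example key

The dev's account needs read access to the prod bucket (a cross-account bucket policy), and copy_object handles objects up to 5 GB in a single call; beyond that you'd need a multipart copy.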

What is the best way to transfer data from AWS SQS to S3?

Here is the case: I have a large dataset, temporarily retained in AWS SQS (around 200 GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration, so it is kinda confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
IMHO a good idea. Indeed, S3 is the best option to retain data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML) can directly use content stored in S3. Most machine learning frameworks can read files, so you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store each message reasonably fast).
If you want to aggregate and/or concatenate the source messages, or processing a message would take too long, you may prefer to write a script that reads and processes the data on a server.
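For the script route, a rough sketch of draining the queue into S3 with boto3 (the queue URL and bucket name are hypothetical; a Lambda version would essentially wrap the inner loop as the handler):

    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    # Hypothetical names, replace with your own queue URL and bucket.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-dataset-queue"
    BUCKET = "my-ml-dataset-bucket"

    def drain_queue():
        """Read messages from SQS in batches and persist each one as an S3 object."""
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,   # up to 10 per call
                WaitTimeSeconds=10,       # long polling
            )
            messages = resp.get("Messages", [])
            if not messages:
                break  # queue drained
            for msg in messages:
                # One object per message, keyed by the message ID.
                s3.put_object(
                    Bucket=BUCKET,
                    Key=f"raw/{msg['MessageId']}.json",
                    Body=msg["Body"].encode("utf-8"),
                )
                # Delete only after the message has been stored durably.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        drain_queue()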
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there?
Well, in theory you can do it on your laptop, but it would mean downloading 200 GB and uploading 200 GB (not counting the overhead and latency).
Your intuition is, IMHO, good: having an EC2 instance in the same region would be the most feasible approach, accessing all the data almost locally.
Amazon has so many different solutions and ways of integration, so it is kinda confusing.
You have many options, feasible for different use cases and often overlapping, so indeed it may look confusing.

Amazon EC2 scaling and upload temporary folder

I have a PHP-based application on one Amazon instance for uploading and transcoding audio files. The application first uploads the file, then transcodes it, and finally puts it in an S3 bucket. At the moment the application shows the progress of the file upload and transcoding based on repeated AJAX requests that monitor the file size in a temporary folder.
I keep wondering what will happen if users rush to my service tomorrow and I need to scale it in whatever way possible in AWS.
A: What will happen to my upload and transcoding technique?
B: If I add more instances, does that mean I have different files in different temporary conversion folders in different physical places?
C: If I want to get the file size by AJAX from http://www.example.com/filesize until the process finishes, do I need the real address of each EC2 instance (I mean IP/DNS), or all of the instances' folders (or one folder)?
D: When we scale, what will happen to the temporary folder? Is it correct that all of the instances, apart from their LAMP stacks, point to one root folder on a main instance?
I have some basic knowledge of scaling with other hosting techniques, but with Amazon these questions are on my mind.
Thanks for the advice.
It is difficult to answer your questions without knowing considerably more about your application architecture, but given that you're using temporary files, here's a guess:
Your ability to scale depends entirely on your architecture, and of course having a wallet deep enough to pay.
Yes. If you're generating temporary files on individual machines, they won't be stored in a shared place the way you currently describe it.
Yes. You need some way to know where the files are stored. You might be able to get around this with an ELB stickiness policy (i.e. traffic through the ELB gets routed to the same instance), but those are kind of a pain and won't necessarily solve your problem.
Not quite sure what the question is here.
As it sounds like you're in the early days of your application, give this tutorial and this tutorial a peek. The first one describes a thumbnailing service built on Amazon SQS, the second a video-processing one. They'll help you design with AWS best practices in mind, and help you avoid many of the issues you're worried about now.
One way you could get around scaling and session stickiness is to have the transcoding process update a database with its current progress. Any returning user checks the database to see the progress of their upload. There is no need to keep track of where the transcoding is taking place, since the progress is stored in a single place.
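A rough sketch of that idea, assuming a hypothetical DynamoDB table named transcode-jobs keyed by job_id (the original app is PHP; this is just to illustrate the pattern in boto3):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("transcode-jobs")  # hypothetical table

    def update_progress(job_id: str, percent: int) -> None:
        # Called by whichever instance is doing the transcoding.
        table.update_item(
            Key={"job_id": job_id},
            UpdateExpression="SET progress = :p",
            ExpressionAttributeValues={":p": percent},
        )

    def get_progress(job_id: str) -> int:
        # Called by the web tier serving the AJAX polling request;
        # it does not need to know which instance holds the file.
        item = table.get_item(Key={"job_id": job_id}).get("Item", {})
        return int(item.get("progress", 0))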
However, like Christopher said, we don't really know anything about your application; any advice we give is really looking from the outside in, and we don't have a good idea of what would be the easiest thing for you to do. This seems like a pretty simple solution, but I could be missing something because I don't know anything about your application or architecture.

where to store 10kb pieces of text in amazon aws?

These will be indexed and randomly accessed in a web app, like SO questions. SimpleDB has a 1024-byte limit per attribute; you could use multiple attributes, but that sounds inelegant.
Examples: blog posts, Facebook status messages, recipes (in a blogging application, a Facebook-like application, a recipe web site).
If I were to build such an application on Amazon AWS, where/how should I store the pieces of text?
With S3, you could put all the actual files in S3, then index them with Amazon RDS, or Postgres on Heroku, or whatever suits you at that time.
Also, you can have the client download the multi-kB text blurbs directly from S3, so your app could just deliver URLs to the messages, thereby creating a massively parallel server: even if the main server is just a single thread on one machine, it only constructs the page from S3 asset URLs. S3 could store all assets, like images, etc.
The advantages are big. This also solves backup, etc. And allows you to play with many indexing and searching schemes. Search could for instance be done using Google...
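A sketch of that pattern, with a hypothetical bucket and key scheme; the RDS (or other) index would simply keep the object key per post:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and key scheme: one object per post, keyed by its DB id.
    BUCKET = "my-app-posts"

    def store_post(post_id: int, body: str) -> str:
        """Store the text blob in S3 and return a URL the client can fetch directly."""
        key = f"posts/{post_id}.txt"
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=body.encode("utf-8"),
            ContentType="text/plain; charset=utf-8",
        )
        # A presigned URL lets the browser pull the text straight from S3,
        # so the app server only hands out links (the index row keeps `key`).
        return s3.generate_presigned_url(
            "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600
        )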
I'd say you would want to look at Amazon RDS, running a relational database like MySQL in the cloud. A single DynamoDB read capacity unit can only (consistently) read a 1 kB item, so that's probably not going to work for you.
Alternatively, you could store the text files in S3 and put pointers to these files in SimpleDB. It depends on a lot of factors which is going to be more cost-effective: how many files you add every day, how often these files are expected to change, how often they are requested, etc.
Personally, I think that using S3 would not be the best approach. If you store all questions and answers in separate text files, you're looking at a large number of requests to display even a simple page, let alone search, which would require you to fetch all the files from S3 and search through them. So for search you need a database anyway.
You could use SDB for keeping an index, but frankly I would just use MySQL on Amazon RDS (there's a free two-month trial period right now, I think), where you can do all the nice things that relational databases can do and which also offers support for full-text search. RDS should be able to scale up to huge numbers of visitors every day: you can easily scale all the way up to a High-Memory Quadruple Extra Large DB Instance with 68 GB of memory and 26 ECUs.
As far as I know, SO is also built on top of a relational database: https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
DynamoDB might be what you want; there is even a forum use case in their documentation: Example Tables and Data in Amazon DynamoDB
There is insufficient information in the question to provide a reasonable answer to "where should I store text that I'm going to use?"
Depending on how you build your application and what the requirements are for speed, redundancy, latency, volume, scalability, size, cost, robustness, reliability, searchability, modifiability, security, etc., the answer could be any of:
Drop the text in files on an EBS volume attached to an instance.
Drop the text into a MySQL or RDS database.
Drop the text into a distributed file system spread across multiple instances.
Upload the text to S3.
Store the text in SimpleDB.
Store the text in DynamoDB.
Cache the text in ElastiCache.
There are also a number of variations on this, like storing the master copy in S3, caching copies in ElastiCache and on the local disk, indexing it with specific keys in DynamoDB, and making it searchable in Amazon CloudSearch.