Backup and restore files in ECS tasks in AWS

Description
I have an ECS cluster that runs multiple tasks. Every task is a WordPress website. These tasks automatically start and stop based on some Lambda functions. To persist the files when a task goes down, I tried using EFS, but that became very slow once the burst credits ran out.
Now I use the Bind Mount volume type (just the normal filesystem, nothing fancy). The websites are a lot faster, but the files are no longer persisted: when an instance goes down, the files of that website are gone. ECS starts the task again, but without the files the website breaks.
First solution
My first solution is to run an extra container in the task that makes a backup once a day and stores it in S3. All files are automatically packed into a .tar.gz and uploaded to S3. This all works fine, but I don't have a way to restore these backups yet. These things should be considered:
When a new task starts: check whether the current task/website already has a backup
If the latest backup should be restored: download the .tar.gz from S3 and unpack it
To implement this, I think it should be a bash script (or something similar) that runs on startup of a task?
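A rough sketch of what such a startup script could look like; the bucket name, environment variable, and paths are hypothetical, and it assumes the AWS CLI is available inside the container:

    #!/bin/bash
    # restore-backup.sh - run as (part of) the container entrypoint
    set -euo pipefail

    SITE_NAME="${SITE_NAME:?must be set per task}"   # hypothetical per-site identifier
    BUCKET="s3://my-backup-bucket/${SITE_NAME}"      # hypothetical bucket layout
    WEB_ROOT="/var/www/html"

    # Only restore into an empty web root so live files are never clobbered.
    if [ -z "$(ls -A "$WEB_ROOT" 2>/dev/null)" ]; then
        # Pick the most recent backup; date-stamped names sort chronologically.
        LATEST=$(aws s3 ls "$BUCKET/" | sort | tail -n 1 | awk '{print $4}')
        if [ -n "$LATEST" ]; then
            aws s3 cp "$BUCKET/$LATEST" /tmp/restore.tar.gz
            tar -xzf /tmp/restore.tar.gz -C "$WEB_ROOT"
            rm /tmp/restore.tar.gz
        fi
    fi

    exec "$@"   # hand off to the normal container command (e.g. apache2-foreground)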
Possible second solution
Another solution I thought about, which I think is a lot cleaner: instead of having an extra container make backups once a day, mount EFS to each task and have it sync data between the bind mount and EFS. This way EFS becomes a backup storage location instead of the working file system for my websites. Other pros: the tasks/websites get more recent backups, and I have more CPU and memory left on the EC2 instances in my ECS cluster for other tasks.
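For what it's worth, the sync in this second solution could be little more than a periodic rsync from the bind mount to the EFS mount. A minimal sketch, assuming the task mounts EFS at /mnt/efs (the paths are hypothetical):

    #!/bin/bash
    # sync-to-efs.sh - run from cron (or a simple loop) in the backup sidecar
    SRC="/var/www/html/"                 # bind-mounted working files
    DST="/mnt/efs/${SITE_NAME:?}/"       # hypothetical per-site folder on EFS

    mkdir -p "$DST"
    # --delete keeps the EFS copy an exact mirror; drop it for extra safety.
    rsync -a --delete "$SRC" "$DST"

Run in the other direction (EFS to bind mount) on task startup when the web root is empty, the same script would also replace the .tar.gz restore step from the first solution.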
Help?
I would like some opinions on the solutions above, some advice on whether the second solution is any good, and some tips on how to implement it. Any other advice would be helpful too!

Related

On-premises file backup to AWS

Use case:
I have one directory on-premises that I want to back up, say, every midnight, and I want to be able to restore it if something goes wrong.
This doesn't seem like a complicated task, but reading through the AWS documentation, even this can be cumbersome and costly. Setting up Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting it up on EC2 is costly too.
What I have done:
Reading through these, plus some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to an S3 bucket, and that's it. My on-premises app would constantly write to this bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for this would be a big overhead.
3. Standalone S3:
Just use a bare S3 bucket and copy my backups there via the AWS API/SDK with a manually built scheduled job.
Solutions:
Using point 1 from above: enable versioning, and the versions of the files will serve as recovery points (enabling versioning is a one-line CLI call, sketched below).
Using point 3
I think I am looking for a mix of the file and volume gateways: working at the file level while making asynchronous scheduled snapshots of the files.
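For reference, enabling versioning on a bucket is a single CLI call (the bucket name is hypothetical):

    aws s3api put-bucket-versioning \
        --bucket my-backup-bucket \
        --versioning-configuration Status=Enabled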
How should this be handled? Isn't there a really easy way to just send a backup of a directory to AWS?
The easiest way to back up a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required, run the aws s3 sync command
For example:
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
If you want to be fancier and keep multiple backups, you could copy to a different target directory each time, create a zip file first and upload that, or even use a backup program like Cloudberry Backup that knows how to use S3 and can do traditional-style backups.
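If you do want dated archives, a minimal sketch of such a nightly job (paths and bucket are hypothetical; assumes the CLI is installed and configured as above):

    #!/bin/bash
    # nightly-backup.sh - e.g. from cron: 0 0 * * * /usr/local/bin/nightly-backup.sh
    SRC="/data/folder1"                    # hypothetical directory to back up
    BUCKET="s3://bucketname/backups"
    STAMP=$(date +%Y-%m-%d)

    tar -czf "/tmp/backup-${STAMP}.tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
    aws s3 cp "/tmp/backup-${STAMP}.tar.gz" "$BUCKET/backup-${STAMP}.tar.gz"
    rm "/tmp/backup-${STAMP}.tar.gz"

An S3 lifecycle rule on the bucket can then expire archives older than, say, 30 days, so old backups don't accumulate forever.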

Python pipeline on AWS Cloud

I have a few Python scripts that need to be executed in sequence on AWS, so what are the best and simplest options? These scripts are proof-of-concept, so a little dirty, but they need to run overnight. Most of the scripts finish within 10 minutes, but a couple can take up to an hour running on a single core.
We do not have any servers like Jenkins, Airflow, etc.; we are planning to use existing AWS services.
Please let me know. Thanks.
1) EC2 instance (manually controlled)
Upload your scripts to an S3 bucket
Use the default VPC
Launch an EC2 instance
Use an SSM remote session to log in
Run the AWS CLI (aws s3 sync) to download the scripts from S3
Run them manually
Stop the instance when done.
To keep it clean, make a shell script (or a master .py file) to do the work; a sketch is shown after this list. If you want it to stop charging you money afterwards, add a command that stops the instance when complete.
This is the least amount of work.
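A minimal sketch of such a master script (the bucket and script names are hypothetical; assumes the instance has an IAM role allowing S3 access and ec2:StopInstances, and that a default region is configured):

    #!/bin/bash
    # run-all.sh - download the scripts, run them in order, then stop the instance
    set -e

    aws s3 sync s3://my-scripts-bucket/poc/ /home/ssm-user/poc/
    cd /home/ssm-user/poc

    python3 step1.py    # hypothetical script names, run in sequence
    python3 step2.py
    python3 step3.py

    # Stop (not terminate) this instance so it stops charging for compute.
    # Assumes IMDSv1; with IMDSv2 you would fetch a session token first.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 stop-instances --instance-ids "$INSTANCE_ID"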
2) If you want to run the scripts daily
- Script out the work above (including modifying the Auto Scaling group at the end so it scales back down to zero)
- Create an EC2 Auto Scaling group and launch it on a cron schedule.
It will start up, do the work, and then shut down and stop charging you.
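For reference, the schedule itself can be a pair of scheduled actions on the Auto Scaling group (group and action names are hypothetical):

    # Scale up to one instance at 02:00 UTC every night...
    aws autoscaling put-scheduled-update-group-action \
        --auto-scaling-group-name poc-batch-asg \
        --scheduled-action-name nightly-start \
        --recurrence "0 2 * * *" \
        --desired-capacity 1

    # ...and back down to zero a few hours later (or let the script scale the group in itself).
    aws autoscaling put-scheduled-update-group-action \
        --auto-scaling-group-name poc-batch-asg \
        --scheduled-action-name nightly-stop \
        --recurrence "0 6 * * *" \
        --desired-capacity 0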
3) Lambda
Pretty much like option 2, but AWS will do most of the work for you.
Either put all your scripts into one Lambda function, or put each script into its own Lambda and have a master function that synchronously invokes each one in the order you want. (Note that a Lambda function can run for at most 15 minutes, so the scripts that take up to an hour would not fit in a single invocation.)
Have a CloudWatch Events rule trigger it daily to do the work.
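A sketch of that daily trigger with the AWS CLI (the rule name and function ARN are hypothetical):

    # Fire once a day at 02:00 UTC.
    aws events put-rule \
        --name nightly-pipeline \
        --schedule-expression "cron(0 2 * * ? *)"

    # Point the rule at the master Lambda function.
    aws events put-targets \
        --rule nightly-pipeline \
        --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:pipeline-master"

You would also need aws lambda add-permission so the rule is allowed to invoke the function.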
I would say that if you are in POC mode, option 1 is the best decision. It is likely closest to how you currently execute the scripts. This is what #jarmod recommended already.
You didn't mention which AWS resources your Python scripts need to access, or even the purpose of the scripts, so it is difficult to recommend a solution.
However, a good option is to use AWS Batch.

How is the /tmp folder managed when using ECS Fargate?

I'm currently running some containers in production using AWS Fargate, with an application that from time to time writes some files to the /tmp folder.
Given that, I want to know what happens to this /tmp folder. Is it managed by Fargate (by the ECS container agent, for example), or is it something I need to manage myself (with a cron job that clears the files, for example)?
NOTE 1: One way to handle this would be to use S3; however, the question is about how Fargate behaves regarding the /tmp folder.
NOTE 2: I don't need the files in the /tmp folder; they just happen to appear there, and I want to know whether I need to remove them or whether ECS will do that for me.
I couldn't find anything about this in the documentation. If someone can point to where the docs cover it, I would be happy to accept that answer.
If I understand your question correctly, it sounds like you want more precise control over temporary storage within your container.
I don't think ECS or Fargate does anything special with the /tmp folder on the filesystem within the container.
However, Docker does have a notion of a tmpfs mount. This lets you designate a path whose contents are kept in memory and never stored on the container's host machine.
https://docs.docker.com/storage/tmpfs/
ECS has since added support for the tmpfs parameter (note that it applies to the EC2 launch type; Fargate tasks do not support tmpfs):
https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-ecs-adds-support-for-shm-size-and-tmpfs-parameters/
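In plain Docker terms the feature looks like this (the size is illustrative); in an ECS task definition the equivalent is the tmpfs list under linuxParameters:

    # Mount an in-memory filesystem at /tmp, capped at 128 MB.
    # Nothing written under /tmp ever touches the host's disk.
    docker run --tmpfs /tmp:rw,noexec,nosuid,size=128m my-image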
If I understand correctly, after your Fargate task stops running, all of its storage goes away.
According to the AWS documentation, a Fargate task receives some storage when provisioned, but it is ephemeral storage.
So unless you are running a long-lived task, you don't need to deal with the temporary files; they will be gone along with the storage.
I hope this helps.

Auto scaling and data replication on EC2

Here is my scenario.
We have an ELB set up with two reserved EC2 instances acting as web servers behind it (Amazon Linux).
There are some rapidly changing files (pdf, xls, jpg, etc.) on the web servers that are consumed by the websites hosted on the EC2 instances. The code files are identical, and we make sure to update both servers manually at the same time with new code as and when needed.
The main problem is the user uploaded content which is stored on the EC2 instance.
What is the best approach to make sure that the uploaded files are available on both servers almost instantly?
Many people have suggested the use of rsync or unison, but this would involve setting up a cron job. I am looking for something like FileSystemWatcher in C#, which is triggered ONLY when the contents of the specified folder change. Moreover, due to the ELB, we are not sure which of the EC2 instances will actually be serving the user when the files are uploaded.
To add to the above, we have one more staging server that pushes certain files to one of the EC2 web servers. We want these files replicated to the other instance too.
I was wondering whether S3 can solve the problem? Will this setup still be good if we decide to enable auto scaling?
I am confused at this stage. Please help.
S3 is the right choice for your case. That way, you don't have to sync files between EC2 instances, and it is probably the best choice if you need to enable auto scaling. You should not keep any data on the EC2 instances; they should be stateless so that you can easily auto scale.
Using S3 requires your application to write to it instead of directly to the local file system. This should be quite easy; there are libraries for every language that can help you store files in S3.
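If changing the application immediately isn't feasible, a stopgap is to watch the uploads folder and push changes to S3 as they happen. A rough sketch, assuming the inotify-tools package and a configured AWS CLI, with hypothetical paths and bucket:

    #!/bin/bash
    # watch-uploads.sh - sync the uploads directory to S3 whenever it changes
    WATCH_DIR="/var/www/uploads"        # hypothetical upload path
    BUCKET="s3://my-uploads-bucket"     # hypothetical bucket

    # inotifywait blocks until something changes, like FileSystemWatcher in C#.
    while inotifywait -r -e create,modify,delete,move "$WATCH_DIR"; do
        aws s3 sync "$WATCH_DIR" "$BUCKET" --delete
    done

The other instance could run a matching sync in the opposite direction, but serving the uploads directly from S3 avoids the replication lag entirely.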

Is s3cmd a safe option for syncing EC2 instances?

I have the following problem: we are working on a project on AWS that will use auto scaling, so EC2 instances will start and die very often. Freezing images, updating the launch configurations, Auto Scaling groups, alarms, etc. takes a while, and several things can go wrong.
I just want new instances to sync the most recent code, so I was thinking about fetching it from S3 using s3cmd once the instance finishes booting, and manually updating the bundle every time we have new code to upload. So my doubts are:
Is it too risky to store the code on S3? How secure are the files there? With the s3cmd encryption password, is it unlikely that someone will be able to decrypt them?
What other options would be good for this? I was thinking about rsync, but then I would need to store the servers' private key on the instances themselves, which I don't think is a good idea.
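For reference, the boot-time fetch described above is typically a few lines of EC2 user data, with credentials coming from an IAM instance role rather than keys or passwords stored on the instance (the bucket, path, and service name are hypothetical):

    #!/bin/bash
    # EC2 user data: pull the latest code bundle on first boot.
    # Assumes the instance profile grants read access to the bucket,
    # so no access keys or s3cmd passwords live on the instance.
    aws s3 sync s3://my-code-bucket/current/ /var/www/app/
    systemctl restart httpd    # hypothetical web server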
Thanks for any advice!
You might be a candidate for Elastic Beanstalk, using a plain vanilla AMI.
Then package your application and use AWS's ebextensions mechanism to customize the instance as it is spun up. ebextensions lets you do anything you like to the image, in place, as it is deploying: change .htaccess, erase a file, place a cron job, whatever. A minimal sketch follows.
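A minimal .ebextensions config sketch, placed inside the application bundle (the file name, cron entry, and command are hypothetical):

    # .ebextensions/01-customize.config
    files:
      "/etc/cron.d/cleanup":
        mode: "000644"
        owner: root
        group: root
        content: |
          0 3 * * * root rm -f /var/app/current/tmp/*.cache

    container_commands:
      01_remove_stale_file:
        command: "rm -f stale-config.php"   # runs in the app directory during deploy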
When you have code updates, package them, upload and do a rolling update.
All instances will use your latest code, including auto-scaled ones.
The key concept here is to never have your real data in the instance, where it might go away if an instance dies or is shut down.
Elastic Beanstalk will allow you to set up the load balancing, auto-scaling, monitoring, etc.