S3 versioning: only copy files if they are new or changed - amazon-web-services

I'm running a cron job on an EC2 instance that backs up a database dump and a folder (with files and subfolders) to an S3 bucket.
I only want to back up new and modified files in order to save costs. Is this possible?
I'm currently using aws s3 cp; maybe there is an argument or another command?
thanks

Use aws s3 sync instead of aws s3 cp and it will do this automatically for you.
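For example, a minimal sketch of such a cron job (the schedule, local paths, and bucket name are placeholders):
# Upload only new or changed files to S3 every night at 02:00
0 2 * * * aws s3 sync /var/backups/dumps s3://my-backup-bucket/dumps
0 2 * * * aws s3 sync /var/www/data s3://my-backup-bucket/data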

Related

AWS CLI "delete-folder" command help needed

I have a bucket with folder s3://mybucket/abc/thisFolder which contains thousands of files inside.
I can use aws s3 rm s3://mybucket/abc/thisFolder --recursive to delete it and all files inside, and it does it fine one by one.
However, there's also a delete-folder command, but the official doc is not very clear to me. Its example says aws workdocs delete-folder --folder-id 26fa8aa4ba2071447c194f7b150b07149dbdb9e1c8a301872dcd93a4735ce65d
I would like to know what is workdocs in example above, and how do I obtain the long --folder-id string for my folder s3://mybucket/abc/thisFolder?
Thank you.
Amazon WorkDocs is a Dropbox-like service.
If you wish to delete objects in Amazon S3, then you should only use AWS CLI commands that start with aws s3 or aws s3api.
Another way to delete folders in Amazon S3 is to configure Amazon S3 object lifecycle management with a rule to delete objects with a given prefix. The objects might take a while to delete (~24 hours), but it will happen automatically rather than one by one.
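For example, a sketch of such a rule (the bucket name and prefix are placeholders, and the expiration must be at least one day):
# lifecycle.json
{
  "Rules": [
    {
      "ID": "expire-thisFolder",
      "Filter": { "Prefix": "abc/thisFolder/" },
      "Status": "Enabled",
      "Expiration": { "Days": 1 }
    }
  ]
}
# Apply the rule to the bucket
aws s3api put-bucket-lifecycle-configuration --bucket mybucket --lifecycle-configuration file://lifecycle.json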

Any automated way to copy data from Amazon EFS to S3?

I am looking for an efficient way to periodically copy data from EFS to S3. I know I can create a cron job and use the S3 CLI to move the data, but I was wondering if there is any existing service or ETL data pipeline on AWS that can copy data from EFS to S3 periodically.
Thanks
You are right; you can create a cron job and use AWS CLI. There is no existing service to do this.
s3 sync: Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
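For example, a sketch of a cron entry that runs the sync hourly (the EFS mount point and bucket name are placeholders):
# Copy new and updated files from the EFS mount to S3 once an hour
0 * * * * aws s3 sync /mnt/efs s3://my-efs-backup-bucket/efs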

Is there a temporary folder that I can access while using AWS Glue?

Is there a temporary folder that I can access to hold files temporarily while running processes within AWS Glue? For example, in Lambda we have access to a /tmp directory for as long as the process is executing. Do we have something similar in AWS Glue where we can store files while the job is executing?
Are you asking about this? There are a number of argument names that are recognized and used by AWS Glue, which you can use to set up the script environment for your Jobs and JobRuns:
--TempDir — Specifies an S3 path to a bucket that can be used as a temporary directory for the Job.
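For example (a minimal sketch with a hypothetical job name, role, script location, and bucket), --TempDir can be supplied through the job's default arguments:
aws glue create-job --name my-glue-job --role MyGlueServiceRole \
    --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/job.py \
    --default-arguments '{"--TempDir": "s3://my-bucket/glue-temp/"}'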
Here is a link you can refer to.
Hope this helps.
Yes, there is a tmp directory which you can use to move files to and from S3.
import boto3

# bucket_name, DATA_DIR and file are assumed to be defined elsewhere in the job
s3 = boto3.resource('s3')
# Download the file from S3 to the local Spark 'tmp/' directory
s3.Bucket(bucket_name).download_file(DATA_DIR + file, 'tmp/' + file)
And you can also upload files from 'tmp/' back to S3.
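For example (a sketch reusing the same placeholder variables from above):
# Upload the processed file from the local 'tmp/' directory back to S3
s3.Bucket(bucket_name).upload_file('tmp/' + file, DATA_DIR + file)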

Why can I not run dynamic content from Amazon S3?

I know that Amazon S3 is a service for storing static files. But what I don't understand is, if I store some PHP files in an S3 bucket, why isn't it possible to have those files executed from an EC2 instance?
Amazon S3 is a data storage service. When a file is requested from S3, it is sent to the requester, regardless of file format. S3 does not process the file in any way, nor does it pass content to Amazon EC2 for execution.
If you want a PHP file executed by a PHP engine, you will need to run a web server on an Amazon EC2 instance.
Running it directly from S3 will never work, because objects stored in S3 aren't presented in a way that your local system can actually execute.
However, the good news is that you can pull the PHP down from S3 to your local system and execute it!
I use this method myself with an instance created by Lambda to do some file processing. Lambda creates the instance, a bash script in the instance UserData does an S3 copy (see below) to pull down the PHP file and the data file that PHP will process, and then PHP is called against my file.
To download a file from S3 with the CLI:
# save as file.php in the current directory
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php .
# or save as a different filename
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php my-file.php
# or save it in a different folder
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php some/directory/path/file.php
You would then pass this file into PHP for execution like any other file.
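For example, assuming the PHP CLI is installed on the instance:
php my-file.php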

AWS Synchronize S3 bucket with EC2 instance

I would like to synchronize an S3 bucket with a single directory on multiple Windows EC2 instances. When a file is uploaded to or deleted from the bucket, I would like it to be immediately pushed to or removed from all of the instances. New instances will be added/removed frequently (multiple times per week). Files will be uploaded/deleted frequently as well. The files could be up to 2 GB in size. What AWS services or features can solve this?
Based on what you've described, I'd propose the following solution to this problem.
You need to create an SNS topic for S3 change notifications. Then you need a script that's going to subscribe to this topic from your machines. This script will update files on your machines based on changes coming from S3. It should support basic CRUD operations.
Run this script, and when an instance starts, sync the contents of your S3 bucket to the machine using the AWS CLI s3 sync command.
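For example, a sketch of wiring the bucket to the SNS topic (the bucket name and topic ARN are placeholders, and the topic's access policy must already allow S3 to publish to it):
# notification.json
{
  "TopicConfigurations": [
    {
      "TopicArn": "arn:aws:sns:us-east-1:123456789012:s3-change-topic",
      "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
    }
  ]
}
# Attach the notification configuration to the bucket
aws s3api put-bucket-notification-configuration --bucket my-bucket --notification-configuration file://notification.json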
Yes, I have used the AWS CLI s3 sync command to keep a local server's content updated with S3 changes. It allows a local target directory's files to be synchronized with a bucket or prefix.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Edit: The following answer is for syncing EC2 with an S3 bucket, source: EC2 & destination: bucket.
If it were for only one instance, then aws s3 sync (with the --delete option) alone would have worked for both putting files to the S3 bucket and deleting them.
But the case here is for multiple instances, so if we use aws s3 sync with the --delete option, there would be a problem.
To explain it simply, consider Instance I1 with files a.jpg & b.jpg to be synced to Bucket.
Now a CRON job has synced the files with the S3 bucket.
Now we have Instance I2 which has files c.jpg & d.jpg.
So when the CRON job of this instance runs, it puts the files c.jpg & d.jpg and also deletes the files a.jpg & b.jpg, because those files don't exist on Instance I2.
So to rectify the problem we have two approaches :
Sync all files across all instances (costly, and it defeats the purpose of S3 altogether).
Sync files without the --delete option, and implement the deletion separately (using aws s3 rm); see the sketch below.
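For example, a sketch of the second approach (the bucket name, local path, and object key are placeholders):
# Upload new and changed files without deleting anything in the bucket
aws s3 sync C:\shared-folder s3://my-shared-bucket/shared
# Remove a specific object explicitly when a file is intentionally deleted
aws s3 rm s3://my-shared-bucket/shared/a.jpg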