s3cmd - delete oldest file if total number of files is more than 5 - amazon-web-services

I wish to create a daily cron job, using s3cmd, to check whether an S3 bucket has more than 5 backup files. If there are more than 5, the oldest ones should be deleted so that 5 remain; if there are 5 or fewer, nothing should be deleted.
That way, the S3 bucket will always keep the 5 most recent backup copies.
How can I achieve this?

You can use S3 lifecycle rules for your use case. This saves you the effort of writing a cron job that runs daily.
S3 Object Lifecycle
If you do want to use cron, in a scenario where you don't back up daily, then use the AWS CLI instead: combine
aws s3 ls
with your own logic and the
aws s3 rm
command to achieve the same.
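If you go the cron route, a minimal sketch of the retention logic could look like the following. The bucket URL and retention count are placeholders to adapt, and it assumes the prefix holds only the backup files themselves (no subdirectories); it relies on s3cmd ls printing the date, time, size and object URL of each file, so a plain sort orders the lines oldest first.
#!/bin/bash
# Hypothetical bucket/prefix and retention count -- adjust to your setup.
BUCKET="s3://my-backup-bucket/backups/"
KEEP=5

# s3cmd ls prints "DATE TIME SIZE s3://..." per object, so sorting the
# lines orders them oldest first.
FILES=$(s3cmd ls "$BUCKET" | sort)
COUNT=$(echo "$FILES" | grep -c "s3://")

if [ "$COUNT" -gt "$KEEP" ]; then
    # Delete everything except the newest $KEEP objects.
    echo "$FILES" | head -n $((COUNT - KEEP)) | awk '{print $4}' | while read -r OBJECT; do
        s3cmd del "$OBJECT"
    done
fi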

Related

Python Script as a Cron on AWS S3 buckets

I have a Python script which copies files from one S3 bucket to another S3 bucket. This script needs to run every Sunday at a specific time. I was reading some articles and answers, so I tried AWS Lambda + CloudWatch Events. The script runs for a minimum of 30 minutes. Would Lambda still be a good fit, given that Lambda can run for a maximum of only 15 minutes? Or is there another way? I could create an EC2 box and run it as a cron job, but that would be expensive. Is there any other standard way?
The more appropriate way would be to use an AWS Glue Python shell job, as it falls under the serverless umbrella and you are charged as you go.
This way you will only be charged for the time your code runs.
Also, you don't need to manage any EC2 instance for this. It is like an extended Lambda.
If the two buckets are supposed to stay in sync, i.e. all files from bucket #1 should eventually be synced to bucket #2, then there are various replication options in S3.
Otherwise look at S3 Batch Operations. You can derive the list of files that you need to copy from S3 Inventory which will give you additional context on the files, such as date/time uploaded, size, storage class etc.
Unfortunately, the Lambda 15-minute execution limit is a hard stop, so it's not suitable for this use case as a single big-bang run.
You could use multiple lambda calls to go through the objects one at a time and move them. However, you would need a DynamoDB table (or something similar) to keep track of what has been moved and what has not.
Another couple of options would be:
S3 Replication which will keep one bucket in sync with the other.
An S3 Batch operation
Or, if these are data files, you can always use AWS Glue.
You can certainly use Amazon EC2 for a long-running batch job.
A t3.micro Linux instance costs $0.0104 per hour, and a t3.nano is half that price, charged per-second.
Just add a command at the end of the User Data script that will shut down the instance:
sudo shutdown now -h
If you launch the instance with Shutdown Behavior = Terminate, then the instance will self-terminate.
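A minimal User Data sketch of that pattern, with placeholder bucket names (it assumes the instance role already grants the needed S3 permissions):
#!/bin/bash
# Hypothetical batch job: copy objects between buckets, then power off.
aws s3 sync s3://source-bucket s3://destination-bucket

# With Shutdown Behavior = Terminate, this makes the instance
# self-terminate once the job is done, so you only pay for the
# seconds it actually ran.
sudo shutdown now -h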

Download millions of records from s3 bucket based on modified date

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern in the filenames, I can rely solely on the modified date to execute multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any inputs would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync command to download files.
Current command:
aws --endpoint-url http://example.com s3 cp s3:/objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong, because --include here refers to the occurrence of 'Jun' within the file name, not to the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
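A rough sketch of that EC2-side workflow, with placeholder bucket, directory and archive names (it assumes one archive per top-level subdirectory is a sensible grouping):
#!/bin/bash
# On an EC2 instance in the same region as the bucket.
aws s3 sync s3://my-bucket/EOB/ /data/eob/

# Zip in several groups rather than one huge archive --
# here, one archive per top-level subdirectory.
mkdir -p /data/archives
cd /data/eob
for dir in */; do
    zip -r "/data/archives/${dir%/}.zip" "$dir"
done

# Then pull the archives from the NAS side, e.g.:
#   scp ec2-user@<instance-ip>:/data/archives/*.zip /nas/target/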
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
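If the bucket is small enough to list directly rather than through S3 Inventory, here is a hedged sketch of the s3api route the asker mentions: list-objects-v2 with a JMESPath filter on LastModified builds the key list, which is then fed to parallel cp commands. Bucket, prefix, date range and destination below are placeholders, and it assumes keys without spaces or other characters that would confuse xargs.
#!/bin/bash
BUCKET="my-bucket"
PREFIX="EOB/"
DEST="/nas/EOB"

# Build the key list, filtered on the LastModified timestamp
# (an ISO8601 string comparison in the JMESPath query).
aws s3api list-objects-v2 \
    --bucket "$BUCKET" --prefix "$PREFIX" \
    --query 'Contents[?LastModified>=`2019-06-01` && LastModified<`2019-07-01`].[Key]' \
    --output text > keys.txt

# Download the filtered keys, several in parallel.
xargs -P 8 -I {} aws s3 cp "s3://$BUCKET/{}" "$DEST/{}" < keys.txt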

How to check if the S3 service is available or not in AWS via CLI?

We have options to :
1. Copy file/object to another S3 location or local path (cp)
2. List S3 objects (ls)
3. Create bucket (mb) and move objects to bucket (mv)
4. Remove a bucket (rb) and remove an object (rm)
5. Sync objects and S3 prefixes
and many more.
But before using the commands, we need to check whether the S3 service is available in the first place. How do we do that?
Is there a command like :
aws S3 -isavailable
and we get a response like
0 - S3 is available, I can go ahead and upload objects/create buckets etc.
1 - S3 is not available, you can't upload objects etc.?
You should assume that Amazon S3 is available. If there is a problem with S3, you will receive an error when making a call with the Amazon CLI.
If you are particularly concerned, then add a simple CLI command first, eg aws s3 ls and throw away the results. But that's really the same concept. Or, you could use the --dry-run option available on many commands that simply indicates whether you would have had sufficient permissions to make the request, but doesn't actually run the request.
It is more likely that you will have an error in your configuration (eg wrong region, credentials not valid) than S3 being down.
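In script form, that cheap probe is just an exit-code check on a throwaway call; the bucket and file names below are placeholders:
#!/bin/bash
# Probe S3 by listing a bucket and discarding the output.
# Exit code 0 means the call succeeded; non-zero usually means bad
# credentials, wrong region, missing permissions, or (rarely) S3 itself.
if aws s3 ls s3://my-bucket/ > /dev/null 2>&1; then
    aws s3 cp ./backup.tar.gz s3://my-bucket/
else
    echo "S3 check failed" >&2
    exit 1
fi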

Fastest way to sync two Amazon S3 buckets

I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (actually, changing the name of the bucket would suffice, but as that is not possible, I need to create a new bucket, move the files there, and remove the old one).
I'm using AWS CLI's s3 sync command and it does the job, but takes a lot of time. I would like to reduce the time so that the dependent system downtime is minimal.
I was trying to run the sync both from my local machine and from EC2 c4.xlarge instance and there isn't much difference in time taken.
I have noticed that the time taken can be somewhat reduced when I split the job in multiple batches using --exclude and --include options and run them in parallel from separate terminal windows, i.e.
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
Is there anything else I can do speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea and is there something like 'optimal' number of sync processes that can run in parallel on the same bucket?
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.
You can use EMR and S3DistCp. I had to sync 153 TB between two buckets and this took about 9 days. Also make sure the buckets are in the same region, because otherwise you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=command-runner.jar,Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://BUCKETNAME","--dest=s3://BUCKETNAME"]
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
40,100 objects (160 GB) were copied/synced in less than 90 seconds.
Follow the steps below:
Step 1: select the source bucket.
Step 2: under the properties of the source bucket, choose advanced settings.
Step 3: enable Transfer Acceleration and note the endpoint.
AWS configuration (one time only, no need to repeat this every time):
aws configure set default.region us-east-1 #set it to your default region
aws configure set default.s3.max_concurrent_requests 2000
aws configure set default.s3.use_accelerate_endpoint true
Options:
--delete : this option deletes files in the destination if they are not present in the source.
AWS command to sync:
aws s3 sync s3://source-test-1992/foldertobesynced/ s3://destination-test-1992/foldertobesynced/ --delete --endpoint-url http://source-test-1992.s3-accelerate.amazonaws.com
transfer acceleration cost
https://aws.amazon.com/s3/pricing/#S3_Transfer_Acceleration_pricing
They have not mentioned the pricing for the case where the buckets are in the same region.
As a variant of what the OP is already doing:
One could create a list of all files to be synced with aws s3 sync --dryrun
aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
# or even
aws s3 ls s3://source-bucket --recursive
Using the list of objects to be synced, split the job into multiple aws s3 cp ... commands. This way, the AWS CLI won't just be hanging there while it gathers the list of sync candidates, as it does when one starts multiple sync jobs with --exclude "*" --include "1?/*" style arguments. See the sketch at the end of this answer.
When all the copy jobs are done, another sync might be worth it for good measure, perhaps with --delete, if objects might get deleted from the source bucket.
If the source and destination buckets are located in different regions, one could enable cross-region bucket replication before starting to sync the buckets.
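A hedged sketch of that variant, with placeholder bucket names and parallelism (it assumes keys without spaces or quotes, which would confuse awk and xargs):
#!/bin/bash
SRC="s3://source-bucket"
DST="s3://destination-bucket"

# 1. Build the list of keys that still need to be copied.
#    Dry-run lines look like: "(dryrun) copy: s3://src/key to s3://dst/key"
aws s3 sync "$SRC" "$DST" --dryrun \
    | awk '{print $3}' | sed "s|^$SRC/||" > to-copy.txt

# 2. Copy them in parallel, e.g. 32 cp processes at a time.
xargs -P 32 -I {} aws s3 cp "$SRC/{}" "$DST/{}" < to-copy.txt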
New option in 2020:
We had to move about 500 terabytes (10 million files) of client data between S3 buckets. Since we only had a month to finish the whole project, and aws s3 sync tops out at about 120 MB/s... we knew right away this was going to be trouble.
I found this stackoverflow thread first, but when I tried most of the options here, they just weren't fast enough. The main problem is they all rely on serial item-listing. In order to solve the problem, I figured out a way to parallelize listing any bucket without any a priori knowledge. Yes, it can be done!
The open source tool is called S3P.
With S3P we were able to sustain copy speeds of 8 gigabytes/second and listing speeds of 20,000 items/second using a single EC2 instance. (It's a bit faster to run S3P on EC2 in the same region as the buckets, but S3P is almost as fast running on a local machine.)
More info:
Blog post on S3P
S3P on NPM
Or just try it out:
# Run in any shell to get command-line help. No installation needed:
npx s3p
(requirements nodejs, aws-cli and valid aws-cli credentials)
Background: The bottlenecks in the sync command is listing objects and copying objects. Listing objects is normally a serial operation, although if you specify a prefix you can list a subset of objects. This is the only trick to parallelizing it. Copying objects can be done in parallel.
Unfortunately, aws s3 sync doesn't do any parallelizing, and it doesn't even support listing by prefix unless the prefix ends in / (ie, it can list by folder). This is why it's so slow.
s3s3mirror (and many similar tools) parallelizes the copying. I don't think it (or any other tools) parallelizes listing objects because this requires a priori knowledge of how the objects are named. However, it does support prefixes and you can invoke it multiple times for each letter of the alphabet (or whatever is appropriate).
You can also roll-your-own using the AWS API.
Lastly, the aws s3 sync command itself (and any tool for that matter) should be a bit faster if you launch it in an instance in the same region as your S3 bucket.
As explained in the recent (May 2020) AWS blog post titled:
Replicating existing objects between S3 buckets
One can also use S3 replication for existing objects. This requires contacting AWS Support to enable the feature:
Customers can copy existing objects to another bucket in the same or different AWS Region by contacting AWS Support to add this functionality to the source bucket.
I used AWS DataSync to migrate 95 TB of data. It took about 2 days. It has all the fancy things for network optimization and parallelization of the jobs. You can even have checks on source and destination to be sure everything transferred as expected.
https://aws.amazon.com/datasync/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
I'm one of the developers of Skyplane, which can copy data across buckets at over 110X speed compared to cloud CLI tools. You can sync two buckets with:
skyplane sync -r s3://bucket-1/ s3://bucket-2/
Under the hood, Skyplane creates ephemeral VM instances which parallelize syncing the data across multiple machines (so you're not bottlenecked by disk bandwidth).

AWS Synchronize S3 bucket with EC2 instance

I would like to synchronize an S3 bucket with a single directory on multiple Windows EC2 instances. When a file is uploaded to or deleted from the bucket, I would like it to be immediately pushed to or removed from all of the instances, respectively. New instances will be added/removed frequently (multiple times per week). Files will be uploaded/deleted frequently as well. The file sizes could be up to 2 GB. What AWS services or features can solve this?
Based on what you've described, I'd propose the following solution to this problem.
You need to create an SNS topic for S3 change notifications. Then you need a script running on your machines that subscribes to this topic. This script will update files on your machines based on changes coming from S3, and it should support basic CRUD operations.
Run this script, and sync the contents of your S3 bucket to the machine when it starts, using the AWS CLI sync command mentioned below.
Yes, I have used the AWS CLI s3 "sync" command to keep a local server's content updated with S3 changes. It allows a local target directory's files to be synchronized with a bucket or prefix.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Edit: the following answer is for syncing EC2 with an S3 bucket; source: EC2, destination: bucket.
If it were for only one instance, then the aws cli sync command alone (with the --delete option) would have worked for both putting files into the S3 bucket and deleting them from it.
But the case here is for Multiple Instances, so if we use aws s3 sync with --delete option, there would be a problem.
To explain it simply, consider Instance I1 with files a.jpg & b.jpg to be synced to Bucket.
Now a CRON job has synced the files with the S3 bucket.
Now we have Instance I2 which has files c.jpg & d.jpg.
So when the CRON job of this instance runs, it puts the files c.jpg & d.jpg and also deletes the files a.jpg & b.jpg, because those files don't exist on Instance I2.
So to rectify the problem we have two approaches :
Sync all files across all instances (costly, and it defeats the purpose of S3 altogether).
Sync files without the --delete option, and implement the deletion separately (using aws s3 rm), as sketched below.
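A minimal sketch of that second approach for each instance's cron job, with placeholder paths and bucket name:
#!/bin/bash
# Push new/changed local files up WITHOUT --delete, so one instance's
# sync never removes files that belong to other instances.
aws s3 sync /var/backups/ s3://shared-bucket/backups/

# Deletions are done explicitly, only for files this instance
# itself retired, e.g.:
#   aws s3 rm s3://shared-bucket/backups/a.jpg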