How does the aws s3 cli figure out which files to sync?

When using the aws s3 cli to sync the files of a web app over to my S3 bucket, I noticed that the sync command always uploads every file, even though the files didn't actually change. What did change is the timestamp of the files. So I was wondering: how does the sync command figure out which files it needs to upload?
Does sync only compare file name and timestamp?

Timestamp and Size.
If you want sync to consider size only:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
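In other words, an object is re-uploaded only if its size differs or the local file's timestamp is newer than the object's LastModified. As a hedged illustration (not the CLI's actual source), here is a minimal Python sketch of that per-file rule, where remote_size and remote_last_modified are assumed to come from a listing of the bucket:

```python
import os
from datetime import datetime, timezone


def needs_upload(local_path, remote_size, remote_last_modified, size_only=False):
    """Rough sketch of the per-file rule `aws s3 sync` applies (illustrative only)."""
    stat = os.stat(local_path)
    if stat.st_size != remote_size:
        return True                       # sizes differ -> upload
    if size_only:
        return False                      # --size-only: sizes match, skip
    local_mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return local_mtime > remote_last_modified  # local file is newer -> upload
```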

Related

How can I decompress ZIP files from S3, recompress them & then move them to an S3 bucket?

I have an S3 bucket with a bunch of zip files. I want to decompress the zip files and, for each decompressed item, create a $file.gz and save it to another S3 bucket. I was thinking of creating a Glue job for it, but I don't know where to begin. Any leads?
Eventually, I would like to terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?
From an architectural point of view, it depends on the file size of your ZIP files - if the process takes less than 15 minutes, then you can use Lambda functions.
If more, you will hit the current 15-minute Lambda timeout, so you'll need to go with a different solution.
However, for your use case of triggering on new files, S3 triggers will allow you to invoke a Lambda function whenever files are created/deleted in the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise you'll also be paying for checks on whether each uploaded file is in your specific "folder", since the Lambda will be triggered for the entire bucket (the cost will be negligible, but it's still worth pointing out). If segregated, you'll know that any file uploaded is a ZIP file.
Your Lambda can then download the file from S3 using download_file (example provided in the Boto3 documentation), unzip it using zipfile & eventually GZIP-compress the file using gzip.
You can then upload the output file to the new bucket using upload_file (example provided in the Boto3 documentation) & then delete the original file from the original bucket using delete_object (a rough sketch is shown below).
Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.
Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.
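As a minimal sketch of such a Lambda handler - assuming hypothetical bucket names, that each ZIP fits within Lambda's /tmp space and time limits, and using the Boto3 calls download_file, upload_file and delete_object mentioned above:

```python
import gzip
import os
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-gzip-output-bucket"  # hypothetical destination bucket


def handler(event, context):
    # Triggered by S3 "ObjectCreated" notifications on the ZIP bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        zip_path = "/tmp/input.zip"
        s3.download_file(bucket, key, zip_path)

        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if name.endswith("/"):          # skip directory entries
                    continue
                extracted = archive.extract(name, "/tmp/extracted")
                gz_path = extracted + ".gz"
                # Re-compress each extracted item with gzip.
                with open(extracted, "rb") as src, gzip.open(gz_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                s3.upload_file(gz_path, OUTPUT_BUCKET, os.path.basename(gz_path))
                os.remove(extracted)
                os.remove(gz_path)

        # Optionally remove the original ZIP once it has been processed.
        s3.delete_object(Bucket=bucket, Key=key)
```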

How to copy only updated files/folders from aws s3 bucket to local machine?

I have a requirement of copying certain files from an S3 bucket to local machine. Below are the important points to note on my requirement:
The files are kept in the S3 bucket in date-based folders.
The files have a csv.gz extension, and I need to change them to csv and copy them to my local machine.
The bucket keeps updating every minute, and I need to copy only the new files and process them. The processed files need not be copied again.
I have tried using sync, but after a file is processed it is renamed, and then the csv.gz file is synced to the local folder again.
I am planning to use some scheduled task to con.
Amazon S3 is a storage service. It cannot 'process' files for you.
If you wish to change the contents of a file (eg converting from .csv.gz to .csv), you would need to do this yourself on your local computer.
The AWS Command-Line Interface (CLI) aws s3 sync command makes it easy to copy files that have been changed/added since the previous sync. However, if you are changing the files locally (unzipping), then you will likely need to write your own program to download from Amazon S3.
There are AWS SDKs available for popular programming languages. You can also do a web search to find sample code for using Amazon S3.
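As a hedged illustration of what such a program might look like, here is a Python/Boto3 sketch that lists objects, skips keys it has already processed (tracked in a local file), downloads the new .csv.gz objects, and unpacks them to .csv. The bucket name, prefix, and tracking-file path are assumptions:

```python
import gzip
import os
import shutil

import boto3

BUCKET = "my-data-bucket"              # assumed bucket name
PREFIX = "2023-09-01/"                 # assumed date-based folder
PROCESSED_LIST = "processed_keys.txt"  # local record of keys already handled

s3 = boto3.client("s3")


def load_processed():
    if not os.path.exists(PROCESSED_LIST):
        return set()
    with open(PROCESSED_LIST) as f:
        return set(line.strip() for line in f)


def main():
    processed = load_processed()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".csv.gz") or key in processed:
                continue
            gz_path = os.path.basename(key)
            s3.download_file(BUCKET, key, gz_path)
            # Decompress .csv.gz -> .csv locally.
            with gzip.open(gz_path, "rb") as src, open(gz_path[:-3], "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(gz_path)
            with open(PROCESSED_LIST, "a") as f:
                f.write(key + "\n")


if __name__ == "__main__":
    main()
```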

Download millions of records from s3 bucket based on modified date

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely solely on the modified date to execute multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any inputs would be highly appreciated!
Someone mentioned using s3api, but I'm not sure how to use s3api with the cp or sync command to download files.
current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong because --include here refers to the inclusion of 'Jun' within the file name, not to the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
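If you do go the list-driven route, here is a minimal Python/Boto3 sketch of building a key list filtered by modified date (bucket name, prefix, and date range are assumptions; for millions of objects an S3 Inventory report will usually be faster than listing the live bucket):

```python
from datetime import datetime, timezone

import boto3

BUCKET = "my-bucket"  # assumed bucket name
# Assumed date range; objects modified inside it are selected.
START = datetime(2021, 6, 1, tzinfo=timezone.utc)
END = datetime(2021, 7, 1, tzinfo=timezone.utc)

s3 = boto3.client("s3")  # add endpoint_url=... here if you need a custom endpoint


def keys_modified_between(start, end):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="EOB/"):
        for obj in page.get("Contents", []):
            if start <= obj["LastModified"] < end:
                yield obj["Key"]


if __name__ == "__main__":
    # Write the keys out; split this file into chunks and feed each chunk
    # to a parallel downloader (e.g. several `aws s3 cp` workers).
    with open("keys_to_download.txt", "w") as f:
        for key in keys_modified_between(START, END):
            f.write(key + "\n")
```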

How does aws s3 sync handle interruptions? Is it possible that files are corrupted?

I want to perform an aws s3 sync to a bucket. What happens with the files if the sync gets aborted manually? Is it possible that there is a corrupt file left behind? AWS says that multipart upload is used for files >5GB and that corrupt files cannot occur there. But what about files smaller than 5GB?
I couldn't find exact information about that in the AWS documentation. I want to use aws s3 sync and not aws s3api.
AWS S3 is not a hierarchical filesystem. It is divided into two significant components, the backing store and the index, which, unlike in a typical filesystem, are separate... so when you're writing an object, you're not really writing it "in place." Uploading an object saves the object to the backing store and then adds it to the bucket's index, which is used by GET and other requests to fetch the stored data and metadata. Hence, in your case, if the sync is aborted, it is AWS's responsibility to clean up that partial upload, and the object would not be indexed.
Coming to multipart uploads: here, too, AWS will not list the complete file until you send the last part of your multipart upload. You can also send an abort request to abort the multipart upload; in that case AWS stops charging you for the partially uploaded parts.
For more information about multipart upload, refer to this document:
S3 multipart upload
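If you want to check for (and clean up) leftover multipart uploads after an aborted sync, a small Boto3 sketch follows, assuming a bucket name; the same cleanup can also be configured with an S3 lifecycle rule for incomplete multipart uploads:

```python
import boto3

BUCKET = "my-bucket"  # assumed bucket name
s3 = boto3.client("s3")

# List any multipart uploads that were started but never completed.
response = s3.list_multipart_uploads(Bucket=BUCKET)
for upload in response.get("Uploads", []):
    print("Aborting", upload["Key"], upload["UploadId"])
    # Aborting frees the already-uploaded parts so you stop paying for them.
    s3.abort_multipart_upload(
        Bucket=BUCKET,
        Key=upload["Key"],
        UploadId=upload["UploadId"],
    )
```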

Output AWS CLI "sync" results to a txt file

I'm new to AWS and specifically to the AWS CLI tool, but so far I seem to be going OK.
I'm using the following commands to connect to AWS S3 and synchronise a local directory to my S3 bucket:
set AWS_ACCESS_KEY_ID=AKIAIMYACCESSKEY
set AWS_SECRET_ACCESS_KEY=NLnfMySecretAccessCode
set AWS_DEFAULT_REGION=ap-southeast-2
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete
This is uploading files OK and displaying the progress and result for each file.
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
I'd like to be able to output the results of each upload (or download in the case of other servers which will be getting a copy of what is being uploaded) to a .txt file on the local computer so that I can use blat.exe to email the contents to someone who will be monitoring the sync.
All of this will be put into a batch file that will be scheduled to run nightly.
Can the output to .txt be done? If so, how?
I haven't tested this myself, but I found some resources that indicate you can redirect output from command-line applications in the Windows command prompt just like you would in Linux.
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete > output.txt
The resources I found are:
https://stackoverflow.com/a/16713357/4471711
https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/redirection.mspx?mfr=true
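If the scheduled job ever moves from a batch file into a script, the same capture can also be done from Python with subprocess. This is a minimal sketch, assuming the paths and bucket from the command above and an output.txt log file:

```python
import subprocess

# Hypothetical paths/bucket, mirroring the batch-file command above.
cmd = [
    "aws", "s3", "sync",
    r"C:\somefolder\Data\Dist", "s3://my.bucket/somefolder/Dist/",
    "--delete",
]

with open("output.txt", "w") as log:
    # Send stdout and stderr to the same log so errors are emailed too.
    subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
```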
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
That is correct, sync will upload either new or modified files as compared to the destination (whether it is an S3 bucket or your local machine).
--delete will remove anything in the destination (not necessarily an S3 bucket) that is not in the source. It should be used carefully: if you have downloaded, modified, and synced only one file, your local machine does not have ALL of the files, so the --delete flag would then delete all the other files at the destination.