Upload only newly modified files to S3 bucket using Golang aws-sdk - amazon-web-services

I'm trying to implement a backup mechanism to S3 bucket in my code.
Each time a condition is met I need to upload an entire directory's contents to an S3 bucket.
I am using this code example:
https://github.com/aws/aws-sdk-go/tree/c20265cfc5e05297cb245e5c7db54eed1468beb8/example/service/s3/sync
It creates an iterator over the directory's contents and then uses s3manager.Uploader.UploadWithIterator to upload them.
Everything works; however, I noticed that it uploads all files and overwrites the existing objects in the bucket even if they haven't been modified since the last backup. I only want to upload the delta between backups.
I know the AWS CLI has the command aws s3 sync <dir> <bucket>, which does exactly what I need, but I couldn't find anything equivalent in the aws-sdk documentation.
Appreciate the help, thank you!

There is no such feature in the aws-sdk. You could implement it yourself: for each file, compare the hash of the local file with that of the object already in the bucket and skip the upload when they match. Or use a community solution such as s3-sync-client (a Node.js package): https://www.npmjs.com/package/s3-sync-client
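
To sketch that per-file hash check in Go: the snippet below is only an illustration, not an SDK feature. It uses aws-sdk-go v1 (like the linked example), assumes single-part, non-KMS-encrypted uploads so that the object's ETag equals the MD5 of its contents, and the bucket, key and file path are placeholders.

package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// localMD5 returns the hex-encoded MD5 of a local file.
func localMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsUpload reports whether the local file differs from the object in S3,
// or the object does not exist yet.
func needsUpload(svc *s3.S3, bucket, key, path string) (bool, error) {
	head, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		// HeadObject surfaces a missing object as a "NotFound" error code.
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "NotFound" {
			return true, nil
		}
		return false, err
	}
	sum, err := localMD5(path)
	if err != nil {
		return false, err
	}
	// The ETag comes back wrapped in double quotes; it only equals the MD5
	// for single-part, non-KMS-encrypted uploads.
	etag := strings.Trim(aws.StringValue(head.ETag), `"`)
	return etag != sum, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Placeholder bucket/key/path for illustration only.
	upload, err := needsUpload(svc, "my-backup-bucket", "dir/file.txt", "dir/file.txt")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("needs upload:", upload)
}

Files for which needsUpload returns false can simply be left out of the iterator you pass to UploadWithIterator; for multipart uploads the ETag is not an MD5, so you would need another strategy (for example comparing size and modification time, or storing your own checksum in object metadata).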

Related

How can I decompress ZIP files from S3, recompress them & then move them to an S3 bucket?

I have an S3 bucket with a bunch of ZIP files. I want to decompress the ZIP files and, for each decompressed item, create a $file.gz and save it to another S3 bucket. I was thinking of creating a Glue job for it but I don't know where to begin. Any leads?
Eventually, I would like to Terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?
From an architectural point of view, it depends on the size of your ZIP files: if processing a file takes less than 15 minutes, you can use a Lambda function.
If it takes longer, you will hit the current 15-minute Lambda timeout, so you'll need a different solution.
However, for your use case of triggering on new files, S3 event triggers will let you invoke a Lambda function when files are created or deleted in the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise the Lambda will be triggered for every upload to the bucket and you'll also be paying for invocations that merely check whether the uploaded file is in your specific "folder" (it'll be negligible, but still worth pointing out). If segregated, you'll know that any uploaded file is a ZIP file.
Your Lambda can then download the file from S3 using download_file (example provided in the Boto3 documentation), unzip it using zipfile, and finally gzip-compress each extracted file using gzip.
You can then upload the output files to the new bucket using upload_file (example provided in the Boto3 documentation) and delete the original file from the source bucket using delete_object.
Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.
Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.
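
For the unzip-and-recompress step itself, here is a rough sketch in Go (matching the language of the main question above rather than the Boto3 calls named in the answer); the S3 download and upload around it are omitted and the input path is just a command-line argument:

package main

import (
	"archive/zip"
	"compress/gzip"
	"io"
	"log"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: rezip <path-to-zip>")
	}
	src := os.Args[1] // in the Lambda this would be the ZIP downloaded from S3

	zr, err := zip.OpenReader(src)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	for _, f := range zr.File {
		if f.FileInfo().IsDir() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			log.Fatal(err)
		}
		// Write <entry>.gz locally; in the Lambda you would upload these
		// to the destination bucket instead.
		dst, err := os.Create(filepath.Base(f.Name) + ".gz")
		if err != nil {
			log.Fatal(err)
		}
		gw := gzip.NewWriter(dst)
		if _, err := io.Copy(gw, rc); err != nil {
			log.Fatal(err)
		}
		gw.Close()
		dst.Close()
		rc.Close()
	}
}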

How to partially upload a ZIP file to S3 bucket?

I want my users to be able to download many files from an AWS S3 bucket (potentially a few hundred GB in total) as one large ZIP file. I would download the selected files from S3 first and then upload a newly created ZIP file back to S3. This job will be invoked rarely in our service, so I decided to use Lambda for it.
But Lambda has its own limitations: 15 minutes of execution time, ~500 MB of /tmp storage, etc. I found several workarounds on Google that can beat the storage limit (streaming) but found no way around the execution time limit.
Here is what I've found so far:
https://dev.to/lineup-ninja/zip-files-on-s3-with-aws-lambda-and-node-1nm1
Create a zip file on S3 from files on S3 using Lambda Node
Note that programming language is not a concern here.
Could you please give me a suggestion?

Tagging objects read by spark on s3

I use pyspark to read objects from an S3 bucket. My bucket is composed of many JSON files, which I read and then save as Parquet files with:
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/ and I would like to write them to s3://bucket-with-parquet/. Is there a way to ensure that I do not process the same data twice? My idea is to tag every file that I read with Spark (I don't know how to do that). I could then use those tags to tell Spark not to read the file again afterwards (I don't know how to do that either). If an AWS guru could help me with that I would be very grateful.
There are a couple of things you could do. One is to write a script which reads the timestamps from the object metadata in the bucket and produces the list of files added that day; you can then process only the files in this list (a sketch follows below). (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
Second, you can enable versioning on the S3 bucket to make sure that if you overwrite any files you can still retrieve the old version. You can also set an ACL for read-only and write-once permissions, as described in Amazon S3 ACL for read-only and write-once access.
I hope this helps.
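
To illustrate the first suggestion, here is a minimal sketch using aws-sdk-go (kept in Go to match the main question above; bucket and prefix names are placeholders) that lists only the objects whose LastModified is after a cutoff, which you could then feed to the Spark job:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Cutoff would normally be the timestamp of the previous successful run;
	// here it is simply "one day ago" for illustration.
	cutoff := time.Now().AddDate(0, 0, -1)

	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String("my-bucket"),
		Prefix: aws.String("directory1/"),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			if obj.LastModified != nil && obj.LastModified.After(cutoff) {
				// Only objects added or changed since the cutoff are printed;
				// this list is what you would hand to the Spark job.
				fmt.Println(aws.StringValue(obj.Key))
			}
		}
		return true // keep paging
	})
	if err != nil {
		log.Fatal(err)
	}
}

Persisting the timestamp of the last successful run (for example in a small marker object in the bucket) and using it as the cutoff keeps the job from reading the same files twice.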

Output AWS CLI "sync" results to a txt file

I'm new to AWS and specifically to the AWS CLI tool, but so far I seem to be going OK.
I'm using the following commands to connect to AWS S3 and synchronise a local directory to my S3 bucket:
set AWS_ACCESS_KEY_ID=AKIAIMYACCESSKEY
set AWS_SECRET_ACCESS_KEY=NLnfMySecretAccessCode
set AWS_DEFAULT_REGION=ap-southeast-2
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete
This is uploading files OK and displaying the progress and result for each file.
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
I'd like to be able to output the results of each upload (or download in the case of other servers which will be getting a copy of what is being uploaded) to a .txt file on the local computer so that I can use blat.exe to email the contents to someone who will be monitoring the sync.
All of this will be put into a batch file that will be scheduled to run nightly.
Can the output to .txt be done? If so, how?
I haven't tested this myself, but I found some resources indicating that you can redirect output from command-line applications in the Windows command prompt just like you would on Linux.
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete > output.txt
The resources I found are:
https://stackoverflow.com/a/16713357/4471711
https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/redirection.mspx?mfr=true
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
That is correct: sync will upload new or modified files as compared to the destination (whether that is an S3 bucket or your local machine).
--delete will remove anything in the destination (not necessarily an S3 bucket) that is not in the source. It should be used carefully: if you download and modify one file and then sync from a local machine that doesn't have ALL of the files, the --delete flag will delete every other file at the destination.

How to find the last modified Bucket in S3?

I would like to store the information about which S3 bucket was last modified. Studying the documentation makes me wonder if s3cmd 'sync' would be of use. I'm new to Amazon S3, so please help me by suggesting the best way to get the information about the last modified bucket and also store it in a log using s3cmd. Anyone able to help?
A bucket in S3 is not itself what gets modified; the objects in it are. If you mean the last modified time of the objects in a bucket, you can make use of s3cmd sync with the --dry-run option and print the output to a log file.
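For example (the local path and bucket name below are placeholders), something along these lines lists what would be transferred, without actually transferring anything, and writes it to a log file:
s3cmd sync --dry-run /local/dir/ s3://my-bucket/ > last-sync.log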