How to find the last modified Bucket in S3? - amazon-web-services

I would like to store information about which S3 bucket was last modified. Studying the documentation makes me wonder whether s3cmd 'sync' would be of use. I'm new to Amazon S3, so please suggest the best way to find out which bucket was modified last and to store that information in a log using s3cmd. Can anyone help?

A bucket in S3 does not itself get modified; only the objects inside it do. If you mean the last modified time of the objects in a bucket, you can use s3cmd sync with the --dry-run option and redirect the output to a log file.
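If what you are really after is the most recent modification time across your buckets, you can also get it programmatically: every object carries a LastModified timestamp, so a small script can find the bucket whose newest object is the most recent and append that to a log. A minimal sketch in Python with boto3 (it assumes your credentials are already configured and that scanning every object is acceptable; the log file name is made up):

import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')

def newest_object_time(bucket):
    # Walk the bucket and return the most recent LastModified timestamp found.
    newest = None
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            if newest is None or obj['LastModified'] > newest:
                newest = obj['LastModified']
    return newest

latest_bucket, latest_time = None, None
for b in s3.list_buckets()['Buckets']:
    t = newest_object_time(b['Name'])
    if t is not None and (latest_time is None or t > latest_time):
        latest_bucket, latest_time = b['Name'], t

# Append the result to a local log file (hypothetical name).
with open('s3-last-modified.log', 'a') as log:
    log.write(f"{datetime.now(timezone.utc).isoformat()} {latest_bucket} {latest_time}\n")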

Related

AWS S3 sync command from one bucket to another

I want to use the AWS S3 sync command to sync a large bucket with another bucket.
I found this answer that says the files are synced over the AWS backbone and are not copied to the local machine, but I can't find a reference for this anywhere in the documentation. Does anyone have proof of this behavior, or any formal documentation that explains how it works?
I tried to find something in the documentation, but there is nothing there.
To learn more about the sync command, check the CLI docs. You can refer directly to the section named:
Sync from S3 bucket to another S3 bucket
The following sync command syncs objects to a specified bucket and prefix from objects in another specified bucket and prefix by copying s3 objects. An s3 object will require copying if one of the following conditions is true:
- The s3 object does not exist in the specified bucket and prefix destination.
- The sizes of the two s3 objects differ.
- The last modified time of the source is newer than the last modified time of the destination.
Alternatively, use the S3 Replication capability if all you want is for the data that lands in bucket1 to be replicated to bucket2 on an ongoing basis.
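On the original question about the backbone: when both the source and the destination are S3 buckets, the sync is performed with server-side copy requests (CopyObject, or UploadPartCopy for large objects), so the data moves within AWS and is not downloaded to the machine running the CLI. As a rough illustration of the kind of call involved, here is a server-side copy sketched with boto3; the bucket and key names are hypothetical:

import boto3

s3 = boto3.client('s3')

# Server-side copy: S3 copies the object internally, nothing is downloaded locally.
# (Objects larger than 5 GB need a multipart copy instead of a single CopyObject.)
s3.copy_object(
    Bucket='destination-bucket',
    Key='path/to/object.txt',
    CopySource={'Bucket': 'source-bucket', 'Key': 'path/to/object.txt'},
)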

Upload only newly modified files to S3 bucket using Golang aws-sdk

I'm trying to implement a backup mechanism to S3 bucket in my code.
Each time a condition is met I need to upload an entire directory contents to an S3 bucket.
I am using this code example:
https://github.com/aws/aws-sdk-go/tree/c20265cfc5e05297cb245e5c7db54eed1468beb8/example/service/s3/sync
This creates an iterator over the directory contents and then uses s3manager.Upload.UploadWithIterator to upload them.
Everything works; however, I noticed that it uploads all files and overwrites existing files in the bucket even if they weren't modified since the last backup. I only want to upload the delta between backups.
I know the AWS CLI has the command aws s3 sync <dir> <bucket>, which does exactly what I need, but I couldn't find anything equivalent in the aws-sdk documentation.
Appreciate the help, thank you!
There is no such feature in the aws-sdk. You could implement it yourself by checking the hash of the local file against the object already in the bucket before each upload, or use a community solution such as https://www.npmjs.com/package/s3-sync-client.
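The idea from that answer, sketched in Python with boto3 for brevity (the same HeadObject/ETag comparison can be done with aws-sdk-go): compare the local file's MD5 with the object's ETag and skip the upload when they match. Note that the ETag equals the plain MD5 only for single-part, non-KMS-encrypted uploads, so treat this as a heuristic; the file and bucket names below are placeholders.

import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def needs_upload(bucket, key, local_path):
    # Return True if the local file differs from (or is missing in) the bucket.
    with open(local_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # object does not exist yet
    return head['ETag'].strip('"') != local_md5

if needs_upload('backup-bucket', 'dir/file.txt', 'dir/file.txt'):
    s3.upload_file('dir/file.txt', 'backup-bucket', 'dir/file.txt')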

S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download files from a specific time period. For this I have tried different methods, but all of them fail.
My observation is that these queries start from the oldest files, while the files I'm after are the newest ones, so it takes forever to reach them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these queries start from the newest files so they take less time to complete?
I also tried using S3 Browser and CloudBerry: same problem. I also tried from an EC2 instance inside the same AWS network: same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
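If your keys start with the date (as the --include patterns above suggest), a prefix lets S3 do the narrowing on the server side instead of your client filtering 2.5m keys. A small sketch with boto3, assuming the keys really do begin with the date string:

import boto3

s3 = boto3.client('s3')

# Only keys beginning with the prefix are returned, so far fewer list calls are needed.
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='mybucket', Prefix='2021.12.2'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['LastModified'])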
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file the list of objects you want to copy (e.g. use Excel or write a program to parse the file). Then copy just those objects using aws s3 cp or from a programming language. For example, a Python program could parse the file and then use download_file() to download each of the desired objects.
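As a rough illustration of that last approach (the inventory file name and column layout below are assumptions; check your own inventory configuration, and note that inventory reports URL-encode object keys):

import csv
import urllib.parse
import boto3

s3 = boto3.client('s3')

# inventory.csv is a decompressed S3 Inventory report; in the default CSV format
# the first two columns are the bucket name and the (URL-encoded) object key.
with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        bucket, key = row[0], urllib.parse.unquote(row[1])
        if key.startswith(('2021.12.2', '2021.12.3', '2022.01.01')):
            s3.download_file(bucket, key, key.replace('/', '_'))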
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

Tagging objects read by spark on s3

I use pyspark to read objects in an Amazon S3 bucket. My bucket is composed of many JSON files, which I read and then save as Parquet files with
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/ and I would like to propagate them to s3://bucket-with-parquet/. Is there a way to ensure that I do not process the data twice? My idea is to tag every file that I read with Spark (I don't know how to do that), and then use those tags to tell Spark not to read the file again (I don't know how to do that either). If an AWS guru could help me with this, I would be very grateful.
There are a couple of things you could do. One is to write a script that reads the LastModified timestamps from the object metadata in the bucket and produces a list of the files added that day; you then process only the files in that list (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729).
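A minimal sketch of that first approach with boto3 (bucket and prefix are placeholders): keep only the keys whose LastModified falls on the current day and hand that list to Spark.

import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')
today = datetime.now(timezone.utc).date()

new_files = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket', Prefix='directory1/'):
    for obj in page.get('Contents', []):
        if obj['LastModified'].date() == today:  # uploaded (or overwritten) today
            new_files.append(f"s3://my-bucket/{obj['Key']}")

# spark.read.json also accepts a list of paths, so you can read just the new files:
# df = spark.read.json(new_files)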
Second, you can enable versioning on the S3 bucket so that if you overwrite any file you can still retrieve the old version. You can also set an ACL for read-only and write-once permissions, as mentioned in Amazon S3 ACL for read-only and write-once access.
I hope this helps.

Parse data from aws s3 bucket and save parsed data to another bucket

I'm new to AWS S3, and I was reading this tutorial from AWS on how to move data from one bucket to another:
How can I copy objects between Amazon S3 buckets?
However, it didn't mention whether you can apply a hook or any intermediate processing step before the data is saved.
Ideally, we want to take the data from a log bucket (it's very dirty and we want to clean it up a bit) and save a parsed copy of it in another S3 bucket. We also want to do this periodically, so automation will be necessary going forward.
What I want to know is: can I do this with just S3, or do I need another service to do the parsing and save the result to the other bucket?
Any insight is appreciated, thanks!
S3 by itself is simply for storage. You should be looking at using AWS Lambda with Amazon S3.
Every time a file is pushed to your Log bucket, S3 can trigger a Lambda function (that you write) that can read the file, do the clean up, and then push the cleaned data to the new S3 bucket.
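A minimal sketch of such a handler in Python (the destination bucket name and the clean-up step are placeholders for your own logic):

import urllib.parse
import boto3

s3 = boto3.client('s3')
DEST_BUCKET = 'my-parsed-logs'  # hypothetical destination bucket

def handler(event, context):
    # The S3 event notification carries the bucket and key of the uploaded object.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        raw = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        cleaned = '\n'.join(line.strip() for line in raw.splitlines() if line.strip())  # replace with your parsing

        s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=cleaned.encode('utf-8'))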
Hope this helps.