How to change file upload date in Amazon S3 using AWS CLI - amazon-web-services

I need to move some files (thousands) to Amazon S3 bucket, from where they will be displayed to the end-user by another application (instead of the current one).
Problem is, that these files have creation/upload date now (dates very between 2012 and 2017, when they were uploaded to current application), and when I move them they all start to be of the same date. That is a problem because when you look at the files in the new application, you don't understand the time hierarchy which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?

The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.

What I did was renaming the file to something else and then renaming it again to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name. (and delete the auxiliary file, of course)
It is not optimal, but that's the solution when using AWS client. I hope one day AWS will have all function the FTP used to have.

You can just copy over the same object and the timestamp will update.
This technique is also used to prolong the expire of an object in a bucket with a lifecycle rule.

Related

How to migrate data from s3 bucket to glacier?

I have a TB sized S3 bucket with pdf files. I need to migrate the old files to glacier. I know that I can create a life cycle rule to migrate files which are older than certain number of days. But in my case currently the bucket consists of both old and new pdf files and they were added at a same time. So they may have same uploaded date. In this case a life cycle rule won't be useful.
In the pdf files there is a field called capture_date. So i need to migrate those files based on the capture_date. (ie: migrate all pdf files if the capture_date < 2015-05-21 likewise).
Will a Fargate job will be useful here? if so, please give a brief idea.
Please suggest your ideas. Thanks in advance
S3 by itself will not read your pdf files. Thus you have to read them yourself, extract data that determine which ones are old and new, and using AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch along with lambda function which would do the change of the class to glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (assuming large number of them).
And the most traditional way is to just list your bucket, and iterate over each object.

Automating folder creation in S3

I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering there was a way that I could automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems but instead has a flat structure, more details can be found here.
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of the existence of files and 2020-12 directories (they are not really directories but zero-length objects).
In your case, to solve both points you are mentioning, you should leverage S3 event notifications and use them as a Lambda Trigger. When the Lambda function is triggered, it is passed the name of the object (Key) as an argument, at that point you can simply change its Key.
I.e. Object is uploaded in s3://my_bucket/uploads/file.txt, this creates an event notification that triggers a Lambda function. The functions gets the object and re-uploads it to s3://my_bucket/files/dec/file.txt (and deletes the original one).
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) in the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.

is there any way to setup s3 bucket to get append to the existing object for each run?

We have a requirement to append to the existing S3 object, when we run the spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application is creating new parquet files for every run. Hence, I am looking for a workaround to achieve this requirement.
Question is:
How we can configure the S3 bucket to get append to the existing object?
It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method where a file can be multi-part copied, with the 'source' set to the file and then set to some additional data. However, that cannot be accomplished in the method you show.
If you wish to add additional data to an External Table (eg used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.

Easy way to created dated subdirectories on AWS S3

I'm trying to create a web service that is able to store user-upload files in S3. The problem is that we want the files stored in "dated directories".
For example, if a user uploads a.txt on 12/1/2017 at 9:15am, the file should look like this in S3:
https://s3-eu-west-1.amazonaws.com/test-bucket/uploaded/2017/12/1/9/a.txt
Does S3 have any API to help us achieving this or do we need to hand-craft this solution?
There is no such API in S3. Think of Amazon S3 as a storage service, not an application or database.
It is the responsibility of your application to store the data in the desired naming format -- just like storing data on a disk.
By the way, your naming format could do with some improvement:
Always expand fields to the correct number of digits (use 01 for January rather than 1) so that they sort correctly.
Think about your use-case -- if you will be scanning documents by year, then the /2017/12/01/09/a.txt naming format makes sense since you can look in the 2017 directory (not that directories really exist in S3). If not, then simply store it as /2017-12-01-09-a.txt.
Make it very clear which one is month vs day -- the USA is the only country in the world that treats "12/1/2017" as December 1st. The rest of the world reads it as "12 January". Using the format of 2017-12-01 makes it clear that it is 1-December-2017.
What about naming conflicts? Can only one person upload a file with a given name on a given day? How are you going to differentiate between different users uploading a file with the same name?
The reality is, the filename is totally irrelevant -- your application should use a database to keep track of objects that users
upload and assign each of them a unique name. When a file is later
requested, lookup the filename in the database and then provide that
file. Do not use S3 filenames as a pseudo-database where the name
conveys particular meaning, otherwise you'll often have to rename
files to add more meaning!
Directories don't actually exist in S3 -- they are just part of the filename. So, you can create a file in a given directory just by storing it -- there is no need to pre-create directories.
AWS S3 does not provide you with such logic. But it should by fairly easy to use the time information of your application to create such a s3 object key ("path").
Good luck!

How to Upload/download to S3 without changing Last Modified date?

I want to upload and download files to S3 using boto3 without changing their "LastModified" date so I can keep tabs on the age of the contents. Whenever I upload or download a file it takes on the date of this operation and I lose the date that the contents were modified.
I'm looking at the timestamp of the files using
fileObj.get('LastModified')
where the fileObj is taken from a paginator result. I'm using the following command to upload
s3Client.upload_fileobj(data, bucket_name, destpath)
and the following to download the files:
s3Client.download_file(bucket_name, key, localPath)
How can I stop the last modified date changing?
This is not possible.
The Last Modified Date is generated by Amazon S3 and cannot be overridden.
If you wish to maintain your own timestamps, you could add some User-Define Metadata and set the value yourself.
If you replicate the content using the AWS replication tool from an existing bucket to another therefore the last modified date would also be replicated. It is not a copying action it is a cloning action.