Wrong Last Modified Date in AWS S3 - amazon-web-services

While uploading files one by one to an AWS S3 bucket using Java code, I'm observing a strange issue with the Last Modified Date column: all the files show the same last modified date. I followed a few posts on Stack Overflow, but nowhere is it properly explained how to set user-defined metadata while storing a file to S3.
I tried this in my code, but it did not work for me. Could you please suggest a fix?
ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentLength(bytes.length);
metaData.setHeader("x-amz-meta-last-modified", OffsetDateTime.parse(new Date().toString()).toLocalDate());

Related

Amazon S3 LastModified time vs Upload complete time

I am a bit confused on meaning of LastModified time in S3.
Suppose I start uploading a large file at 10:00 AM and the upload takes 4 minutes. Instead of showing LastModified as 10:04 AM, S3 shows 10:00 AM, i.e. when I initiated the upload.
In Azure Blob Storage, however, lastModified seems to be the time when the upload completed.
Am I interpreting this incorrectly for S3? How can lastModified be the time when the upload starts, given that technically the object is not created until all bytes are uploaded?
Answers like "amazon S3 upload object LastModified date keep changed?" are confusing, as they seem to say that LastModified is the time when the upload finished.
Can anyone please confirm ?
Last-Modified is defined to be:
Object creation date or the last modified date, whichever is the latest.
Last modified is more like creation date, as mentioned in the docs:
Amazon S3 maintains only the last modified date for each object. For example, the Amazon S3 console shows the Last Modified date in the object Properties pane. When you initially create a new object, this date reflects the date the object is created. If you replace the object, the date changes accordingly. So when we use the term creation date, it is synonymous with the term last modified date.
It seems that it uses the value of the Date header as demonstrated in the PutObject example here - this will be, as you've seen, when the upload request was started and not when it finished.
Why S3 uses the Date header and not the timestamp of when the file has finished uploading is something internal to AWS AFAIK.
I have not seen the answer to the question, "why?" in the docs.
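If you want to check this behaviour against your own bucket, a rough sketch along these lines may help. It assumes a boto3 S3 client is passed in, and the function name `compare_last_modified` is made up for illustration. It only records the client-side start and finish times next to the LastModified value S3 reports, so the difference is only visible for uploads that take several seconds.

```python
from datetime import datetime, timezone


def compare_last_modified(s3_client, bucket, key, body):
    """Upload `body` and return LastModified next to the client-side timings.

    `s3_client` is assumed to be a boto3 S3 client (boto3.client("s3")).
    """
    started = datetime.now(timezone.utc)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    finished = datetime.now(timezone.utc)

    # head_object returns LastModified as a timezone-aware datetime
    # with one-second resolution.
    last_modified = s3_client.head_object(Bucket=bucket, Key=key)["LastModified"]

    return {
        "started": started,
        "finished": finished,
        "last_modified": last_modified,
        # Positive when LastModified is earlier than the end of the upload.
        "seconds_before_finish": (finished - last_modified).total_seconds(),
    }
```

If `last_modified` lands close to `started` rather than `finished` for a slow upload, that matches what the question describes.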

AWS Glue - table version increases on data load even with no schema changes

I have a lambda job which infrequently dumps a parquet file into an S3 bucket/Glue table using AWS Wrangler.
This Glue table appears to be increasing the table version number every time there is new data, even though the schema is unchanged.
I do not think the problem is with the lambda job/wrangler, since it deposits the parquet files as expected. I have also tested that code separately and it works as expected.
Something is going on with the Glue data catalogue table that makes it increase versions despite no changes to the schema.
I have checked for differences in the underlying parquet files to see if there are some schema, data type etc changes between updates, and there are none.
I have checked for differences between the Glue table versions via the console and AWS CLI (aws glue get-table-versions) and found no differences there either (only the UpdateTime and VersionId changes).
I have tried to recreate my setup with the same code and do not find this issue. I have tried to delete and recreate the Glue table in the same place, but the issue reoccurs.
Question: What could be causing my Glue table version numbers to increase when there are no schema changes?
Note:
The code in question looks like this. It's part of a bigger function (this is really just generating logs of what the main lambda function is doing). It works fine on its own and doesn't use variables etc from the rest of the code. I don't see how this could be the issue but including it here anyway.
# Other functions do some things when triggered by a new file in another S3 bucket.
# This function just logs which files were processed. The Glue table built from these
# log files is the one whose version number increases every time a new log file is added.
import awswrangler as wr  # the package is imported as "awswrangler"; "aws-wrangler" is not a valid module name

def log(resource, filename):
    # Build the log DataFrame: just columns of date, time, file used, etc.
    log_df = build_log(resource, filename)
    wr.s3.to_parquet(
        df=log_df,
        path=log_path(),  # S3 location where the parquet logs are written
        dataset=True,
        catalog_versioning=False,
        database="MYDB",
        partition_cols=['date'],
        table='log',
        mode='append'
    )
I think this is due to partitioning. You are partitioning on date, so a new partition is added for each new date value. The new partitions are the reason why the table version is being incremented.
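To see what actually changes between two versions (partition-related or otherwise), you could diff the version entries programmatically rather than by eye. The sketch below assumes the entries come from the `TableVersions` list returned by the boto3 Glue client's `get_table_versions`; the helper names `flatten` and `diff_table_versions` are made up for illustration. It ignores `UpdateTime` and `VersionId`, which the question already says always differ.

```python
IGNORED_KEYS = {"UpdateTime", "VersionId"}


def flatten(value, prefix=""):
    """Turn nested dicts/lists into a flat {dotted.path: leaf_value} dict."""
    if isinstance(value, dict) and value:
        flat = {}
        for key, child in value.items():
            if key in IGNORED_KEYS:
                continue
            flat.update(flatten(child, f"{prefix}{key}."))
        return flat
    if isinstance(value, list) and value:
        flat = {}
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
        return flat
    # Leaf value (or empty container): store it under its path.
    return {prefix.rstrip("."): value}


def diff_table_versions(older, newer):
    """Return {path: (old_value, new_value)} for every field that differs."""
    old_flat = flatten(older["Table"])
    new_flat = flatten(newer["Table"])
    return {
        path: (old_flat.get(path), new_flat.get(path))
        for path in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(path) != new_flat.get(path)
    }


# Assumed usage with a boto3 Glue client (not executed here):
#   versions = glue.get_table_versions(DatabaseName="MYDB", TableName="log")["TableVersions"]
#   print(diff_table_versions(versions[1], versions[0]))
```

An empty result means the two versions are identical apart from the ignored fields, which would suggest the version bump comes from an update call that did not change the table definition.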

Regex pattern to pick latest file in nifi

I have a NiFi flow where I am getting all data from s3 and putting it in the destination folder. Now, the requirement is if there is any latest data then just transfer the latest data only. I have a data file in s3 like below:
20201130-011101493.parquet
20201129-011101493.parquet
And the regex I tried:
\d[0-9]{8}.parquet
The problem is that it does not pick the first file, which holds the latest data, i.e. 30/11/2020.
How can I modify my regex so that it picks only the latest file when the job runs once per day? I also referred to this SO post, but I have not been able to get my regex right.
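One thing to note about the pattern above: the file names are eight digits, a hyphen, then nine digits, while `\d[0-9]{8}.parquet` only describes nine digits followed by any character and `parquet`, so it cannot match a whole name. A regex on its own can also only filter names; it cannot decide which date is the newest. The sketch below illustrates the idea in Python rather than NiFi. It assumes the names follow a zero-padded `yyyyMMdd-<time>` layout, and `latest_file` is a made-up helper name.

```python
import re

# Eight digits (date), a hyphen, nine digits (assumed to be a time part),
# then a literal ".parquet". The dot is escaped so it matches only ".".
FILENAME_PATTERN = re.compile(r"^(\d{8})-(\d{9})\.parquet$")


def latest_file(filenames):
    """Return the name with the greatest date/time prefix, or None if nothing matches."""
    candidates = []
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match:
            # Zero-padded digit strings sort the same way as the dates they encode.
            candidates.append((match.group(1), match.group(2), name))
    if not candidates:
        return None
    return max(candidates)[2]


files = ["20201129-011101493.parquet", "20201130-011101493.parquet"]
print(latest_file(files))
```

If the job runs once per day, an alternative is to build the expected date prefix for the current day and match only names that start with it.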

How to Upload/download to S3 without changing Last Modified date?

I want to upload and download files to S3 using boto3 without changing their "LastModified" date so I can keep tabs on the age of the contents. Whenever I upload or download a file it takes on the date of this operation and I lose the date that the contents were modified.
I'm looking at the timestamp of the files using
fileObj.get('LastModified')
where the fileObj is taken from a paginator result. I'm using the following command to upload
s3Client.upload_fileobj(data, bucket_name, destpath)
and the following to download the files:
s3Client.download_file(bucket_name, key, localPath)
How can I stop the last modified date changing?
This is not possible.
The Last Modified Date is generated by Amazon S3 and cannot be overridden.
If you wish to maintain your own timestamps, you could add some user-defined metadata and set the value yourself.
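As a minimal sketch of that approach with boto3: `upload_fileobj` accepts an `ExtraArgs` dict whose `Metadata` entry is stored as user-defined metadata, and `head_object` returns it again. The metadata key `original-last-modified` and both function names are assumptions chosen for illustration, and `modified_at` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone

METADATA_KEY = "original-last-modified"  # hypothetical key name, pick your own


def upload_with_timestamp(s3_client, fileobj, bucket, key, modified_at):
    """Upload a file object and store its original modification time as metadata."""
    s3_client.upload_fileobj(
        fileobj,
        bucket,
        key,
        # S3 stores these entries as x-amz-meta-* headers; values must be strings.
        ExtraArgs={
            "Metadata": {METADATA_KEY: modified_at.astimezone(timezone.utc).isoformat()}
        },
    )


def read_timestamp(s3_client, bucket, key):
    """Return the stored timestamp as a datetime, or None if it was never set."""
    metadata = s3_client.head_object(Bucket=bucket, Key=key)["Metadata"]
    value = metadata.get(METADATA_KEY)
    return datetime.fromisoformat(value) if value else None
```

Note that listing results (such as the paginator output in the question) do not include user-defined metadata, so reading the stored timestamp takes one `head_object` call per object.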
If you replicate the content from an existing bucket to another one using the AWS replication tool, the last modified date is replicated as well. Replication is a cloning action, not a copying action.

How to change file upload date in Amazon S3 using AWS CLI

I need to move some files (thousands) to Amazon S3 bucket, from where they will be displayed to the end-user by another application (instead of the current one).
The problem is that these files currently have a creation/upload date (dates vary between 2012 and 2017, when they were uploaded to the current application), and when I move them they all end up with the same date. That is a problem because when you look at the files in the new application, you cannot see their chronological order, which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.
What I did was rename the file to something else and then rename it back to its original name.
As you cannot rename directly, you have to copy the file to a new name and then copy it back to its original name (and delete the auxiliary file, of course).
It is not optimal, but that's the solution when using the AWS client. I hope one day AWS will have all the functions FTP used to have.
You can just copy over the same object and the timestamp will update.
This technique is also used to postpone the expiration of an object in a bucket with a lifecycle rule.
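A rough boto3 sketch of copying an object onto itself is below. S3 rejects a self-copy that changes nothing, so this version re-sends the existing metadata with `MetadataDirective="REPLACE"`. The function name `touch_object` is made up, and the sketch only carries over the user metadata and content type; other headers (for example Cache-Control) would need to be passed along in the same way. A single `copy_object` call is also limited to objects up to 5 GB.

```python
def touch_object(s3_client, bucket, key):
    """Copy an object onto itself so that S3 assigns a new LastModified."""
    head = s3_client.head_object(Bucket=bucket, Key=key)

    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        # REPLACE makes S3 accept the self-copy; re-sending the current
        # values keeps the stored metadata unchanged.
        MetadataDirective="REPLACE",
        Metadata=head.get("Metadata", {}),
        ContentType=head.get("ContentType", "binary/octet-stream"),
    )
```

This always sets the date to "now"; it does not let you choose an arbitrary LastModified value.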