I am a bit confused about the meaning of the LastModified time in S3.
Suppose I start uploading a large file at 10:00 AM and the upload takes 4 minutes. I am seeing that, instead of showing a LastModified time of 10:04 AM, it shows 10:00 AM, i.e. the time I initiated the upload.
In Azure Blob Storage, however, the lastModified time seems to be the time when the upload completed.
Am I interpreting this incorrectly for S3? How can the lastModified time be the time the upload starts, when technically the object is not created until all bytes are uploaded?
Looking at answers like "amazon S3 upload object LastModified date keep changed?" is confusing, as they seem to say LastModified is the time when the upload finished.
Can anyone please confirm?
Last-Modified is defined to be:
Object creation date or the last modified date, whichever is the latest.
Last modified is more like creation date, as mentioned in the docs:
Amazon S3 maintains only the last modified date for each object. For example, the Amazon S3 console shows the Last Modified date in the object Properties pane. When you initially create a new object, this date reflects the date the object is created. If you replace the object, the date changes accordingly. So when we use the term creation date, it is synonymous with the term last modified date.
It seems that it uses the value of the Date header as demonstrated in the PutObject example here - this will be, as you've seen, when the upload request was started and not when it finished.
Why S3 uses the Date header and not the timestamp of when the file has finished uploading is something internal to AWS AFAIK.
I have not seen the answer to the question, "why?" in the docs.
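A quick way to observe this yourself is to compare the Date header of the PUT response with the Last-Modified reported for the object afterwards. A minimal boto3 sketch, assuming a writable bucket named my-bucket:

import boto3

s3 = boto3.client('s3')

# Upload a small object and capture the response headers
resp = s3.put_object(Bucket='my-bucket', Key='demo.txt', Body=b'hello')
put_date = resp['ResponseMetadata']['HTTPHeaders']['date']

# Compare the PUT response Date with the object's Last-Modified
head = s3.head_object(Bucket='my-bucket', Key='demo.txt')
print('PUT Date header:', put_date)
print('Last-Modified:  ', head['LastModified'])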
Related
We know that if we download a large file on Linux or Mac, the file's last modified time keeps changing. Is it the same in S3? Does the object's last modified time keep changing during the upload, or is it just a simple timestamp recording the start of the upload operation?
The doc says "After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata." I believe that in order to maintain atomicity, the time is updated only if the PUT operation is successful.
Last-Modified comes under the category of system-defined metadata.
Last-Modified -> Description -> "Object creation date or the last modified date, whichever is the latest." In other words, only a successful PUT operation updates the last modified time.
The modified date/time is updated by the S3 system itself and reflects the time when the file finished uploading fully to S3 (S3 will not show incomplete transfers).
The last modified date of an object is a direct reflection of when the object was last put into S3.
Even a similar answer, https://stackoverflow.com/a/40699793/13126651, says the same: "The Last-Modified timestamp should match the Date value returned in the response headers from the successful PUT request."
In our S3 buckets we have a folder where incoming files are placed. One of our systems then picks them up and processes them.
I want to know how many files in this folder are older than some period and then send a notification to the corresponding team.
I.e. if a file arrived in the S3 bucket today and it's still there after 3 hours, I want to be notified.
I am thinking of using the boto Python library to iterate through all the objects in the S3 bucket at a scheduled interval to check the files in the folder, and then send a notification. However, this polling solution doesn't seem good.
I am thinking of an event-based solution. I know S3 has events that I can subscribe to using either a queue or a Lambda. However, I don't want to take any action as soon as a file is available; I just want to check which files are older than some time and send an email notification.
Can we achieve this with an event-based solution?
We are expecting around 1,000 files per hour. Once a file is processed it is moved to a different folder; however, if something goes wrong it will remain there. So I am not expecting more than 10,000 files per day in one bucket. Consider that I have multiple buckets.
Iterating through S3 files to do that kind of filter is not a good idea. It can get very slow when you have more than a thousand files in there. I would suggest you use a database to store those records.
You can have a DynamoDB table with two attributes: file name and upload date. Or, if budget is a problem, you can even keep a sqlite3 file in the bucket and fetch it whenever you need to query or add data to it. I did this using Lambda, and it works just fine. Just don't forget to upload the file again when new records are inserted.
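As a rough sketch of the DynamoDB variant, a Lambda subscribed to the bucket's object-created events could record each arrival. The table name incoming-files and its attribute names here are assumptions for illustration:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('incoming-files')  # hypothetical table name

def handler(event, context):
    # Record each newly arrived object from the S3 event notification
    for record in event['Records']:
        table.put_item(Item={
            'file_name': record['s3']['object']['key'],
            'upload_date': record['eventTime'],
        })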
You could create an Amazon CloudWatch Events rule that triggers an AWS Lambda function at a desired time interval (e.g. every 5 minutes or once an hour).
The AWS Lambda function could list the desired folder looking for files older than a desired time period. It would be something like this:
import boto3
from datetime import datetime, timedelta, timezone

s3_client = boto3.client('s3')

paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-bucket',
    Prefix='to-be-processed/'
)

for page in page_iterator:
    # 'Contents' is absent when the listing is empty
    for obj in page.get('Contents', []):
        # Print the key of any object older than the given age
        if obj['LastModified'] < datetime.now(tz=timezone.utc) - timedelta(hours=3):
            print(obj['Key'])
You could then have it notify somebody. The easiest way would be to send a message to an Amazon SNS topic, and then people can subscribe to that topic via SMS or email to receive a notification.
The above code is quite simple, in that it will find the same files every time, not just files that have newly entered the notification period.
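If you go the SNS route, the notification step could look like this minimal sketch; the topic ARN is an assumption for illustration:

import boto3

sns = boto3.client('sns')

# Keys collected by the listing loop above (example values)
old_keys = ['to-be-processed/example-1.csv', 'to-be-processed/example-2.csv']

# Hypothetical topic ARN; email/SMS subscribers to the topic receive the message
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:stale-files-alert',
    Subject='Stale files detected',
    Message='Files older than 3 hours:\n' + '\n'.join(old_keys),
)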
I want to upload and download files to S3 using boto3 without changing their "LastModified" date so I can keep tabs on the age of the contents. Whenever I upload or download a file it takes on the date of this operation and I lose the date that the contents were modified.
I'm looking at the timestamp of the files using
fileObj.get('LastModified')
where the fileObj is taken from a paginator result. I'm using the following command to upload
s3Client.upload_fileobj(data, bucket_name, destpath)
and the following to download the files:
s3Client.download_file(bucket_name, key, localPath)
How can I stop the last modified date changing?
This is not possible.
The Last Modified Date is generated by Amazon S3 and cannot be overridden.
If you wish to maintain your own timestamps, you could add some User-Defined Metadata and set the value yourself.
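For example, a minimal sketch of stashing the original timestamp as user-defined metadata at upload time; the metadata key source-last-modified is just an illustrative name:

import boto3

s3Client = boto3.client('s3')

# Preserve the original timestamp as user-defined metadata
# ('source-last-modified' is an arbitrary, illustrative key name)
with open('local-file.txt', 'rb') as data:
    s3Client.upload_fileobj(
        data, 'my-bucket', 'path/to/dest.txt',
        ExtraArgs={'Metadata': {'source-last-modified': '2017-06-01T12:00:00Z'}}
    )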
If you replicate content from an existing bucket to another using the AWS replication tool, the last modified date is replicated as well. It is not a copying action; it is a cloning action.
I need to move some files (thousands) to Amazon S3 bucket, from where they will be displayed to the end-user by another application (instead of the current one).
Problem is, these files currently have a creation/upload date (dates vary between 2012 and 2017, when they were uploaded to the current application), and when I move them they all end up with the same date. That is a problem because when you look at the files in the new application, you don't understand the time hierarchy, which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.
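A minimal sketch of reading such metadata back when displaying the files, assuming the same illustrative source-last-modified key as above:

import boto3

s3 = boto3.client('s3')

# User-defined metadata comes back with lower-cased keys,
# without the 'x-amz-meta-' prefix
head = s3.head_object(Bucket='my-bucket', Key='path/to/dest.txt')
print(head['Metadata'].get('source-last-modified'))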
What I did was renaming the file to something else and then renaming it again to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name. (and delete the auxiliary file, of course)
It is not optimal, but that's the solution when using the AWS client. I hope one day AWS will have all the functions FTP used to have.
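A rough boto3 equivalent of that rename trick, with placeholder bucket and key names:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'path/file.txt'  # placeholder names

# Copy to an auxiliary name, copy back, then delete the auxiliary copy
s3.copy_object(Bucket=bucket, Key=key + '.tmp',
               CopySource={'Bucket': bucket, 'Key': key})
s3.copy_object(Bucket=bucket, Key=key,
               CopySource={'Bucket': bucket, 'Key': key + '.tmp'})
s3.delete_object(Bucket=bucket, Key=key + '.tmp')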
You can just copy over the same object and the timestamp will update.
This technique is also used to prolong the expiry of an object in a bucket with a lifecycle rule.
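Note that S3 rejects a copy of an object onto itself unless something about it changes, so an in-place copy needs, for example, MetadataDirective='REPLACE'. A minimal sketch with placeholder names:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'path/file.txt'  # placeholder names

# Copying an object onto itself requires changing something
# (here, replacing the metadata); this refreshes Last-Modified
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={'Bucket': bucket, 'Key': key},
    MetadataDirective='REPLACE'
)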
The AWS S3 docs state that:
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
The timespan until full consistency is reached can vary. During this period GET requests may return the previous object or the updated object.
My question is:
When is the last-modified timestamp updated? Is it updated immediately after the overwrite PUT succeeds but before full consistency is reached, or is it only updated after full consistency is achieved?
I suspect the former but I can't find any documentation which clearly states this.
The Last-Modified timestamp should match the Date value returned in the response headers from the successful PUT request.
To my knowledge, this is not explicitly documented, but it can be derived from what is documented.
When you overwrite an object, it's not the overwriting itself that may be delayed by the eventual consistency model -- it's the availability of the overwritten content at a given S3 node (S3 is replicated to multiple nodes within the S3 region).
The Last-Modified timestamp, like the rest of the metadata, is established at the time of object creation and immutable, thereafter.
It is, in fact, not the "modification" time of the object at all, it is the creation time of the object. The explanation may sound pedantic, but it is accurate in the strictest sense: S3 objects and their metadata cannot in fact be modified at all, they can only be overwritten. When you "overwrite" an object in S3, what you are actually doing is creating a new object, reusing the old object's key (path+file name). The availability of this new object at a given S3 node (replication) is what may be delayed by the eventual consistency model... not the actual creation of the new object that overwrites the old one... hence there would be no reason for Last-Modified to be impacted by the replication delay (assuming there is a replication delay -- eventual consistency can at times be indistinguishable from immediate consistency).
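A small experiment that illustrates the point, again assuming a writable bucket named my-bucket; each "overwrite" yields a fresh Last-Modified because a new object is created under the same key:

import time
import boto3

s3 = boto3.client('s3')

s3.put_object(Bucket='my-bucket', Key='demo.txt', Body=b'v1')
first = s3.head_object(Bucket='my-bucket', Key='demo.txt')['LastModified']

time.sleep(2)

# "Overwriting" creates a new object under the same key, so
# Last-Modified is really the creation time of the new object
s3.put_object(Bucket='my-bucket', Key='demo.txt', Body=b'v2')
second = s3.head_object(Bucket='my-bucket', Key='demo.txt')['LastModified']

print(first, second)  # second is roughly two seconds after first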
This is something S3 does that is absolutely terrible.
Basically in Linux you have the mtime which is the time the file was last modified on the filesystem. Any S3 client could gather the mtime and set the Last-Modified time on S3 so that it would maintain when things were actually last modified.
Instead, Amazon just does this based on the object creation and this is effectively a massive problem if you ever just want to use the data as data outside of the original application that put it there.
So if you download a file from S3, your client would likely set the modified time, and if the file was uploaded to S3 immediately after it was created, you would at least have a nearly correct timestamp. But the reality is that you might take a picture and it might not get from your phone, through the app and the stack, to S3 for days!
This is not even considering re-uploading the file to S3, which compounds the problem: you might re-upload it years later, and S3 will just act as if Last-Modified is years later even though the file was not actually modified.
They really need to allow you to set it, but they remain ambiguous and over-documented in other areas to make this hard to figure out.
https://github.com/s3tools/s3cmd/issues/524