AWS S3 file with same name does not get overwritten but gets characters added at the end of filename - amazon-web-services

Below is an example for my scenario,
I have a Django API which allows users to upload images to a certain directory; the images are stored in an S3 bucket. Let's say the file name is 'example.jpeg'.
The user then uploads another image with the same name 'example.jpeg' to the same directory.
Both of them correctly show up in the same directory, but the second one gets additional characters at the end of the filename, like this: 'example_785PmrM.jpeg'. I suspect the additional characters are added by S3, but my research says S3 will overwrite a file with the same name.
How can I enable the overwrite feature? I haven't seen any option for this.
Thanks

S3 itself does not change a key on its own. The only option I see that could be impacting this is Django's storage backend for S3:
AWS_S3_FILE_OVERWRITE (optional: default is True)
By default files with the same name will overwrite each other. Set this to False to have extra characters appended.
So you should set AWS_S3_FILE_OVERWRITE to True to prevent this behavior (since True is the default, check whether something in your settings is setting it to False).
Depending on your exact needs, consider enabling S3 versioning so you can still access previous versions of objects as they get overwritten in the future.
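For reference, a minimal settings.py sketch, assuming the django-storages S3 backend (S3Boto3Storage); the bucket name is a placeholder:
# settings.py
AWS_STORAGE_BUCKET_NAME = "your-bucket-name"  # placeholder
AWS_S3_FILE_OVERWRITE = True  # True is the default; make sure nothing sets it to False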

Related

Viewing s3 files in browser

I have created a bucket in s3 and successfully uploaded files to it with django storages. However, when I try to access the files in the browser, I get the following error:
IllegalLocationConstraintException
The eu-south-1 location constraint is incompatible for the region specific endpoint this request was sent to.
I have also realised I do not have the region name included in my URL (https://docs.s3.amazonaws.com/media/admin/2.pdf...).
Could that be the problem?
If so, how do I set it to append the region name?
What could be missing here?
TL;DR
Set AWS_S3_ENDPOINT_URL to https://s3.your-region-name.amazonaws.com in settings.py. If you need to specify an alternate region in an override of S3Boto3Storage for a particular field, set the endpoint_url attribute to s3.your-alternate-region-name.amazonaws.com.
Explanation
I finally figured this out after an embarrassing number of hours. According to this comment on a boto3 repo issue, if a region was launched after 20 March 2019 (which both eu-south-1 and af-south-1, the region I am using, were), then S3 requests are routed differently. Read the comment, but in order to fix this you need to specify which region the request is going to, like so:
This URL style works for all regions but the ones launched after 20 March 2019: bucket-name.s3.amazonaws.com/file_key.txt. Don't use this one.
For the regions launched after 20 March 2019, the URL needs to include the region name between the .s3 and .amazonaws.com parts, like so: bucket-name.s3.your-region-name.amazonaws.com/file_key.txt. Note that this style is backwards-compatible and works with all S3 regions. Use this one.
This means that we need to explicitly set the endpoint_url for these regions. Keep in mind that the addressing_style attribute for django-storages defaults to None, which means boto3 will use the value path for this attribute. So if we set our endpoint_url to bucket-name.s3.your-region-name.amazonaws.com on S3Boto3Storage override classes (like below), boto3 will prepend the bucket_name attribute to every S3 key. What you end up with is something like bucket-name.s3.your-region-name.amazonaws.com/bucket-name/file_key.txt when we obviously only want bucket-name.s3.your-region-name.amazonaws.com/file_key.txt. This is not documented in django-storages.
class IncorrectStorageSetup(S3Boto3Storage):
    bucket_name = "bucket-name"
    endpoint_url = "bucket-name.s3.your-region-name.amazonaws.com"
    # addressing_style = None -> defaults to path-style addressing
Here is how to fix this.
Use the path addressing style by leaving AWS_S3_ADDRESSING_STYLE at its default of None in settings.py and setting AWS_S3_ENDPOINT_URL to s3.your-region-name.amazonaws.com. This means that all URLs will take the correct form of s3.your-region-name.amazonaws.com/bucket-name/file_key.txt. Now, every time you override S3Boto3Storage you only need to set the bucket_name attribute, provided the bucket is in the your-region-name region you set in AWS_S3_ENDPOINT_URL above. If you want to use another region, explicitly set the endpoint_url attribute of the class to s3.your-other-region-name.amazonaws.com as well as the bucket_name attribute.
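A minimal sketch of that setup, keeping the placeholder region and bucket names used above (the class name is just illustrative):
# settings.py
AWS_S3_ENDPOINT_URL = "https://s3.your-region-name.amazonaws.com"
# AWS_S3_ADDRESSING_STYLE is left at its default (None), i.e. path-style URLs

# storages.py -- only needed for a bucket that lives in a different region
from storages.backends.s3boto3 import S3Boto3Storage

class OtherRegionStorage(S3Boto3Storage):
    bucket_name = "bucket-in-other-region"  # illustrative name
    endpoint_url = "https://s3.your-other-region-name.amazonaws.com"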
Note there is another way to fix this using the converse, i.e. setting AWS_S3_ADDRESSING_STYLE to virtual with everything else configured the same. It should achieve the same thing, but you need to explicitly set AWS_S3_ADDRESSING_STYLE, which is one more step than the approach above.

How to upload a file to S3 without a name

I am reading data from an S3 bucket using Athena and the data from the following file is correct.
# aws s3 ls --human s3://some_bucket/email_backup/email1/
2020-08-17 07:00:12 0 Bytes
2020-08-17 07:01:29 5.0 GiB email_logs_old1.csv.gz
When I change the path to _updated as shown below, I get an error.
# aws s3 ls --human s3://some_bucket/email_backup_updated/email1/
2020-08-22 12:01:36 5.0 GiB email_logs_old1.csv.gz
2020-08-22 11:41:18 5.0 GiB  
This is because of the extra file without a name in the same location. I have no idea how I managed to upload a file without a name. I would like to know how to reproduce it (so that I can avoid it).
All S3 files have a name (in fact, the full path is the object key, the metadata that defines your object's name).
If you see a blank named file in the path of s3://some_bucket/email_backup_updated/email1/ you have likely created a file named s3://some_bucket/email_backup_updated/email1/.
As I mentioned earlier, S3 objects are identified by keys, so the file hierarchy does not really exist; you are simply filtering by prefix.
You should be able to validate this by performing the following, without the trailing slash: aws s3 ls --human s3://some_bucket/email_backup_updated/email1.
If you add an extra non-breaking space at the end of the destination path, the file will be copied to S3 but with a blank-looking name. For example:
aws s3 cp t.txt s3://some_bucket_123/email_backup_updated/email1/ 
(Note the non-breaking space after email1/ )
\xa0 is actually a non-breaking space in Latin-1, i.e. chr(160). The non-breaking space itself is the name of the file!
Using the same logic, I can remove the "space" file by adding the non-breaking space at the end.
aws s3 rm s3://some_bucket_123/email_backup_updated/email1/ 
I can also log in to the console and remove it from the user interface.
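If reproducing the invisible character on the command line is fiddly, a small boto3 sketch (using the bucket and prefix from the example above) can find and delete the blank-named key instead:
import boto3

s3 = boto3.client("s3")
prefix = "email_backup_updated/email1/"
resp = s3.list_objects_v2(Bucket="some_bucket_123", Prefix=prefix)
for obj in resp.get("Contents", []):
    leaf = obj["Key"][len(prefix):]
    # a "blank" file is a key whose final segment is only whitespace (e.g. \xa0)
    if leaf and not leaf.strip():
        print("deleting", repr(obj["Key"]))
        s3.delete_object(Bucket="some_bucket_123", Key=obj["Key"])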

Replace content in all files inside s3 bucket

I have an S3 bucket which is mapped to a domain, say xyz.com. Whenever a user registers on xyz.com, a file is created and stored in the S3 bucket. Now I have thousands of files in S3 and I want to replace some text in those files. All files share a common name at the start, e.g. abc-{rand}.txt.
The safest way of doing this would be to regenerate them again through the same process you originally used.
Personally I would try to avoid find and replace as it could lead to modifying parts that you did not intend.
Run multiple generations in parallel and overwrite the existing files. This will ensure the files you generate match your expectations and will not need to be modified again.
As a suggestion, enable versioning before any of these interactions if you want the ability to roll back quickly should anything need to be reverted.
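If you go the versioning route, it can be enabled programmatically before you start, for example with boto3 (the bucket name is a placeholder):
import boto3

# Enable versioning on the bucket before regenerating/overwriting objects
boto3.client("s3").put_bucket_versioning(
    Bucket="your-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)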
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not a regular file system.
To simplify working with S3 files, you can use the third-party tool s3fs-fuse. The tool makes S3 appear like a filesystem on your OS.
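If you do script the download / modify / re-upload approach with boto3, a rough sketch could look like this (the bucket name, prefix and replaced text are placeholders):
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"  # placeholder
prefix = "abc-"         # the common filename prefix from the question

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        new_body = body.replace("old text", "new text")  # placeholder replacement
        if new_body != body:
            s3.put_object(Bucket=bucket, Key=key, Body=new_body.encode("utf-8"))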

uploading file to specific folder in S3 bucket using boto3

My code is working. The only issue I'm facing is that I cannot specify the folder within the S3 bucket that I would like to place my file in. Here is what I have:
with open("/hadoop/prodtest/tips/ut/s3/test_audit_log.txt", "rb") as f:
s3.upload_fileobj(f, "us-east-1-tip-s3-stage", "BlueConnect/test_audit_log.txt")
Explanation from #danimal captures pretty much everything. If you wanted to just create a folder-like object in s3, you can simply specify that folder-name and end it with "/", so that when you look at it from the console, it will look like a folder.
It's rather useless: an empty object without a body (consider it a key with a null value), just for eye candy, but if you really want to do it, you can.
1) You can create it on the console interactively, as it gives you that option.
2) You can use the AWS SDK. boto3 has a put_object method for the S3 client, where you specify the key as "your_folder_name/"; see the example below:
import boto3

session = boto3.Session()  # I assume you know how to provide credentials etc.
s3 = session.client('s3', 'us-east-1')
bucket = s3.create_bucket(Bucket='my-test-bucket')
response = s3.put_object(Bucket='my-test-bucket', Key='my_pretty_folder/')  # note the ending "/"
And there you have your bucket.
Again, when you upload a file you specify a key like "my_pretty_folder/my_file", and what you do there is create a "key" with the name "my_pretty_folder/my_file" and put the content of your file as its "value".
In this case you have 2 objects in the bucket. The first object has a null body and looks like a folder, while the second one looks like it is inside it, but as #danimal pointed out, in reality you created 2 keys in the same flat hierarchy; it just "looks like" what we are used to seeing in a file system.
If you delete the file, you still have the other object, so in the AWS console it looks like the folder is still there but with no files inside.
If you skipped creating the folder and simply uploaded the file like you did, you would still see the folder structure in the AWS console, but you would have a single object at that point.
However, when you list the objects from the command line, you will see a single object, and if you delete it in the console it looks like the folder is gone too.
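To see the flat keys for yourself, a quick boto3 sketch (using the bucket and folder names from the example above):
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-test-bucket", Prefix="my_pretty_folder/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # full flat keys, e.g. "my_pretty_folder/" and "my_pretty_folder/my_file"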
Files ('objects') in S3 are actually stored by their 'Key' (~folders+filename) in a flat structure in a bucket. If you place slashes (/) in your key then S3 represents this to the user as though it is a marker for a folder structure, but those folders don't actually exist in S3, they are just a convenience for the user and allow for the usual folder navigation familiar from most file systems.
So, as your code stands, although it appears you are putting a file called test_audit_log.txt in a folder called BlueConnect, you are actually just placing an object, representing your file, in the us-east-1-tip-s3-stage bucket with a key of BlueConnect/test_audit_log.txt. In order then to (seem to) put it in a new folder, simply make the key whatever the full path to the file should be, for example:
# upload_fileobj(file, bucket, key)
s3.upload_fileobj(f, "us-east-1-tip-s3-stage", "folder1/folder2/test_audit_log.txt")
In this example, the 'key' of the object is folder1/folder2/test_audit_log.txt, which you can think of as the file test_audit_log.txt inside the folder folder2, which is inside the folder folder1 - this is how it will appear on S3, in a folder structure, which will generally be different and separate from your local machine's folder structure.

Google Cloud Storage bucket has stopped overwriting files by default when uploading with the Python library

I have an App Engine cron job that runs every week, uploading a file called logs.json to a Google Cloud Storage bucket.
For the past few months, this file has been overwritten each time the new version was uploaded.
In the last few weeks, rather than overwriting the file, the existing copy has been retained and the new one uploaded under a different name, e.g. logs_XHjYmP3.json.
This is a simplified snippet from the Django storage class where the upload is performed. I have verified that the filename is correct at the point of upload:
# Prints 'logs.json'
print(file.name)
blob.upload_from_file(file, content_type=content_type)
blob.make_public()
Reading the documentation, it says:
The effect of uploading to an existing blob depends on the “versioning” and “lifecycle” policies defined on the blob’s bucket. In the absence of those policies, upload will overwrite any existing contents.
The versioning for the bucket is set to suspended, and I'm not aware of any other settings or any changes I have made that would affect this.
How can I make the file upload overwrite any existing file with the same name?
After further testing, although print(file.name) looked correct, the incorrect filename was actually coming from the get_available_name() method of Django's storage class. That method generates a unique filename if the file already exists. I have overridden the method in my custom storage class and, if the file meets the criteria, I just return the existing name so the upload overwrites it. I'm still not sure why it started doing this, however.
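For reference, a minimal sketch of that kind of override, assuming a django-storages GoogleCloudStorage base class (the base class and the "meets the criteria" check are assumptions about your setup):
from storages.backends.gcloud import GoogleCloudStorage  # assumption: use whatever base your custom storage uses

class OverwritingGoogleCloudStorage(GoogleCloudStorage):
    def get_available_name(self, name, max_length=None):
        # Returning the name unchanged lets the upload overwrite the existing
        # blob instead of generating e.g. "logs_XHjYmP3.json".
        if name.endswith(".json"):  # placeholder for "the file meets the criteria"
            return name
        return super().get_available_name(name, max_length=max_length)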