aws s3 replace file atomically - amazon-web-services

I copied a file, ./barname.bin, to s3, using the command aws s3 cp ./barname.bin s3://fooname/barname.bin
I have a different file, ./barname.1.bin that I want to upload in place of that file
How can I upload and replace (overwrite) the file at s3://fooname/barname.bin with ./barname.1.bin?
Goals:
Don't change the s3 url used to access the file (new file should also be available at s3://fooname/barname.bin).
zero/minimum 'downtime'/unavailability of the s3 link.

As I understand it, you've got an existing file located at s3://fooname/barname.bin and you want to replace it with a new file. To replace that, you should just upload a new one on top of the old one:
aws s3 cp ./barname.1.bin s3://fooname/barname.bin.
The old file will be replaced. According to the S3 docs, the overwrite is atomic, though due to S3's replication pattern (eventual consistency), requests for the key may still return the old file for some time.
Note (thanks @Chris Kuehl): though the replacement is technically atomic, it's possible for multipart downloads to end up with chunks from different versions of the file. 😬
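If you are doing this from code rather than the CLI, a minimal boto3 sketch of the same overwrite could look like the following (bucket and key names taken from the question; treat it as an illustration, not the only way):

import boto3

s3 = boto3.client("s3")

# Uploading under the existing key replaces the object in a single PUT;
# readers see either the old object or the new one, never a partially written mix.
with open("./barname.1.bin", "rb") as f:
    s3.put_object(Bucket="fooname", Key="barname.bin", Body=f)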

Related

dynamically create / append to zip from multiple instances

I have a situation where thousands of files are created for a user by multiple backend instances and then uploaded to AWS S3 / Azure Storage. After all the files are created, the user wants to download them as a zip. I can create the zip and then get a pre-signed URL, but the few archiving solutions I have tried all take far too long (hours).
Is there any way of creating the zip dynamically from the multiple backend instances? I want to append to the zip after each file is created, from any backend instance.
Zip itself supports the use case you want. For example, the zip command on Linux:
When given the name of an existing zip archive, zip will replace identically named entries in the zip archive (matching the relative names as stored in the archive) or add entries for new names.
You need to persist the working zip file somewhere in a file system though. The most obvious choice I can think of is EFS, so that multiple instances can mount the file system and access the zip file.
If you don't want to modify the existing instances/workloads, you can even mount EFS on a Lambda function, then set an S3 trigger for the Lambda to update the zip file every time a new file is uploaded.
I don't think you can use only S3 for this, because you cannot update S3 objects in place; you would have to download and re-upload the whole zip for every new file, which is really not ideal.
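As a rough sketch of that Lambda-plus-EFS idea (assuming the file system is mounted on the Lambda at /mnt/efs and the function is triggered by s3:ObjectCreated events; the archive path is illustrative, and concurrent invocations would additionally need file locking):

import zipfile

import boto3

s3 = boto3.client("s3")
ZIP_PATH = "/mnt/efs/user-files.zip"  # illustrative location on the EFS mount


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Mode "a" appends to (or creates) the archive; unlike the zip CLI it does
        # not replace identically named entries, so object keys should be unique.
        with zipfile.ZipFile(ZIP_PATH, mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(key, body)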

Amazon S3 notification for file change

Initially a CSV file is uploaded to an S3 bucket, and we often append to that file by scripting when a new row is added. What we want is for the script to run only when the CSV file is modified. Is there any watcher that can notify the script to run when the CSV file changes?
There is an S3 event notification for that; you would be interested in the s3:ObjectCreated event:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You should also take a look at the S3 documentation and note the difference between S3 and a file system: an "update" or "append" operation on S3 actually replaces the whole object, just for your information.
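For reference, a notification configuration can be attached with boto3 roughly like this (the bucket name and Lambda ARN are placeholders, and the Lambda also needs a resource-based policy allowing S3 to invoke it; the same setup can be done in the console or with SQS/SNS targets):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-csv-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # Placeholder ARN of the Lambda that runs your script
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:process-csv",
                "Events": ["s3:ObjectCreated:*"],
                # Only fire for the CSV file(s) you care about
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}},
            }
        ]
    },
)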

Replace content in all files inside s3 bucket

I have an S3 bucket which is mapped to a domain, say xyz.com. Whenever a user registers on xyz.com, a file is created and stored in the S3 bucket. Now I have 1000s of files in S3 and I want to replace some text in those files. All files share a common name prefix, e.g. abc-{rand}.txt.
The safest way of doing this would be to regenerate them again through the same process you originally used.
Personally I would try to avoid find and replace as it could lead to modifying parts that you did not intend.
Run multiple generations in parallel and overwrite the existing files. This will ensure the files you generate match your expectations and will not need to be modified again.
As a suggestion, enable versioning before any of these operations if you want the ability to roll back quickly should anything need to be reverted.
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not a regular file system.
To simplify working with S3 files, you can use the third-party tool s3fs-fuse, which makes S3 appear like a filesystem on your OS.
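A hedged sketch of that download / edit / re-upload loop with boto3 (bucket name, prefix and the search/replace strings are placeholders):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="xyz-bucket", Prefix="abc-"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        text = s3.get_object(Bucket="xyz-bucket", Key=key)["Body"].read().decode("utf-8")
        updated = text.replace("old-text", "new-text")
        if updated != text:
            # Re-uploading under the same key replaces the whole object
            s3.put_object(Bucket="xyz-bucket", Key=key, Body=updated.encode("utf-8"))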

AWS S3: .csv file is downloaded as .txt

I have 2 AWS accounts, each of them with one S3 bucket. I uploaded two same-size .CSV files, one to each S3 bucket.
When I try Download or Download As, the file is downloaded as a .CSV file in the first account. But when I try to download the file from the second account, it comes down as .TXT.
How can this happen? Both files are created in the same way: through a Redshift UNLOAD query, which copies selected data from Redshift to S3.
UPDATE:
Can it be because, in this account, this document has Server side encryption set to AWS-KMS?
I noticed that the file that comes down as .txt has "Server side encryption: AWS-KMS", while the .csv file that is downloaded as .csv has "Server side encryption: NONE".
UPDATE: tried in different browsers - same result.
Check the headers for each object in the AWS S3 console and compare the Content-Type values. Content-Type provides a hint to web browsers on what data the object contains.
If Content-Type does not exist or does not contain text/csv, add or modify the header in the S3 console or via your favorite S3 application such as CloudBerry.
John is right about the Content-Type not being text/csv. Sometimes S3 will get it right and sometimes it won't. If you can't correct this manually, you can run a Lambda function to do it for you every time you upload a new object. You can use a Python 2.7 template Lambda function to download the object from the bucket, use the mimetypes library's guess_type on the S3 object key, and then re-upload the file to the same bucket. You will need to trigger this function on S3 object uploads and give it the necessary permissions (s3:GetObject).
P.S. This will work for files with any extension. If you know you are only going to upload .csv files, you can skip the mimetypes lookup and directly re-upload the object with
bucket.upload_file(filename, key, ExtraArgs={'ContentType': 'text/csv'})
If mimetypes cannot guess the type, you might need to add the types yourself; see an example here: https://www.programcreek.com/python/example/5209/mimetypes.add_type
Good Luck!
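One possible shape of the Lambda described above, written here in Python 3 and using an in-place copy_object with MetadataDirective="REPLACE" instead of a full download and re-upload (the handler name and the octet-stream fallback are my own choices, not anything prescribed in the answer):

import mimetypes

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        guessed, _ = mimetypes.guess_type(key)
        content_type = guessed or "application/octet-stream"
        # Skip objects that are already correct; this also stops the copy below
        # from re-triggering the function in an endless loop.
        if s3.head_object(Bucket=bucket, Key=key).get("ContentType") == content_type:
            continue
        # Copying the object onto itself rewrites its metadata, including Content-Type
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            ContentType=content_type,
            MetadataDirective="REPLACE",
        )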
Here is a Scala solution (specifying the content type explicitly when uploading with the AWS SDK for Java):
import java.io.{ByteArrayInputStream, InputStream}
import com.amazonaws.services.s3.model.ObjectMetadata

val settingsLine: String = "csvdata1,csvdata2,csvdata3"
val settingsStream: InputStream = new ByteArrayInputStream(settingsLine.getBytes())
// Attach metadata so the object is stored with Content-Type: text/csv
val metadata: ObjectMetadata = new ObjectMetadata()
metadata.setContentType("text/csv")
// `prefix` is the object key; `s3Client` is an AmazonS3 client instance
s3Client.putObject(bucketName, prefix, settingsStream, metadata)

AWS S3 sync between buckets overwriting newer destination files

We have two s3 buckets, and we have a sync cron job that should copy bucket1 changes to bucket2.
aws s3 sync s3://bucket1/images/ s3://bucket2/images/
When a new image is added to bucket1, it correctly gets copied over to bucket2.
However, if we upload a new version of that image to bucket2, when the sync job next runs it actually copies the older version from bucket1 over to bucket2, replacing the newer version we just put there.
This is part of a migration process, and in time the only place images will be uploaded to will be bucket2, but for the time being they may sometimes be uploaded to either, and we only want changes from bucket1 to be copied up to bucket2, NOT the other way round.
Why does the aws sync job seem to think that the file on bucket1 has changed? Does it not know that the file in bucket2 is newer, so it should be left alone?
The AWS Command-Line Interface (CLI) aws s3 sync command copies content from the Source location to the Destination location. It only copies files that are new or that differ from what is in the Destination.
It is designed as a one-way sync, not a two-way sync. Your file is being overwritten because the file in the Source differs from the file in the Destination, and sync's job is to make the Destination match the Source. This is its intended behavior.
There is a limited range of options to tweak this behavior, such as (from the sync command documentation):
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
However, there does not appear to be an option that stops a file from being overwritten merely because a file with the same name already exists, or one that prefers to keep the newer file.
If you want a two-way sync with more specific rules, you will need to code it yourself.
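As an illustration of what "code it yourself" might look like, here is a boto3 sketch that only copies an object from bucket1 to bucket2 when it is missing there or strictly newer in bucket1 (bucket names and prefix taken from the question; error handling kept minimal):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
SRC, DST, PREFIX = "bucket1", "bucket2", "images/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        try:
            dst_modified = s3.head_object(Bucket=DST, Key=key)["LastModified"]
        except ClientError:
            dst_modified = None  # object does not exist in the destination yet
        # Copy only if the destination is missing the object or has an older version
        if dst_modified is None or obj["LastModified"] > dst_modified:
            s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})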