AWS S3: deleting a file while someone is downloading that object

I can't seem to find how AWS S3 handles the case where someone deletes a file while another person is downloading it.
Does it behave like a Unix system, where the file stays readable through an already-open descriptor and the download completes without problems, or does it behave in some other way?
Thanks for the help!

S3 offers eventual consistency for DELETEs.
From the S3 Data Consistency Model:
A process deletes an existing object and immediately tries to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.
Here, where the deletion and the download of the same object happen concurrently, even if the deletion succeeds before the download completes, the downloading process may still be able to read the data.

You will face a race condition, of sorts, so the outcome is unpredictable. When you download a file from S3, you might be connected to multiple S3 servers. If at any time you request part of an S3 object and the server you are connected to thinks the object has been deleted, your download will fail.
Here's a simple test: store a 2GB file in S3, then download it. While it is downloading, go into the S3 console and delete the object. You will find that your download fails (with NoSuchKey) because the specified key no longer exists.
Create a temporary 2GB file and upload it to S3:
$ mkfile -n 2g 2gb.dat
$ aws s3 cp 2gb.dat s3://mybucket
upload: ./2gb.dat to s3://mybucket/2gb.dat
Once complete, start downloading the file:
$ aws s3 cp s3://mybucket/2gb.dat fred.dat
Completed 162.2 MiB/2.0 GiB (46.9 MiB/s) with 1 file(s) remaining
Then jump over to the S3 console and delete 2gb.dat, and this will happen:
$ aws s3 cp s3://mybucket/2gb.dat fred.dat
download failed: s3://mybucket/2gb.dat to ./fred.dat An error occurred
(NoSuchKey) when calling the GetObject operation: The specified key does not exist.

Related

Deflating 7z within Google Cloud Storage Bucket

I am trying to extract a 7z multipart archive within a Google Cloud Storage Bucket. Can I do this without copying the data locally and re-uploading?
I want to make sure that I perform the extraction of the files without generating unnecessary overhead. I am not sure if there is any way this can be done directly within the Bucket.
In an ideal scenario I could decompress the archives directly into the Bucket.
I believe you might be confusing the everyday meaning of the term storage, i.e. a persistent disk accessed through a file-system abstraction, with what a Google Cloud Storage Bucket actually offers.
You can perform several operations on Objects, which are the pieces of data that reside in Buckets, including upload and download.
So, you have a compressed file in a Bucket and you want its decompressed content in a Bucket too. You have to download the compressed file to a machine that can decompress it, and then upload the decompressed content.
I'll leave you here a demonstration:
Make sure you have an archive file and nothing else on the current directory.
ARCHIVE=ar0000.7z
Create a Bucket, if you don't have one created already:
gsutil mb gs://sevenzipblobber
Upload the archive file to a Bucket:
gsutil cp -v $ARCHIVE gs://sevenzipblobber/archives/
Download the archive file from a Bucket (this could be done from any other machine, at any other time):
gsutil cp -v gs://sevenzipblobber/archives/$ARCHIVE .
Extract and remove the archive:
7z x $ARCHIVE && rm -v $ARCHIVE
Upload the contents of the current directory, which should be the decompressed contents of the archive file, to the Bucket (keep in mind that the -m flag speeds up the upload, but jumbles the output):
gsutil -m cp -vr . gs://sevenzipblobber/dearchives/$ARCHIVE
List the contents of the Bucket:
gsutil ls -r gs://sevenzipblobber/
You could also use a client-server pattern, where the server would be responsible for decompressing the archive and uploading the contents to Cloud Storage again.
The client could be a Google Cloud Function triggered by an event on a Bucket; in this case the server could be an HTTP server waiting for the upload.
Or the client could be Cloud Pub/Sub Notifications for Cloud Storage, in which case the server would have to be subscribed to the respective topic.
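As a rough sketch of that server side (assuming the google-cloud-storage Python client and a 7z binary on the machine; bucket, archive, and prefix names are placeholders), the download/extract/re-upload round trip might look like:

```python
import os
import subprocess
import tempfile

def dest_blob_name(dest_prefix, out_dir, local_path):
    """Map an extracted local file to its object name under dest_prefix."""
    rel = os.path.relpath(local_path, out_dir)
    return f"{dest_prefix}/{rel.replace(os.sep, '/')}"

def extract_archive_to_bucket(bucket_name, archive_object, dest_prefix):
    """Download a 7z archive from GCS, extract it, upload the contents back."""
    from google.cloud import storage  # needs credentials when actually run
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    with tempfile.TemporaryDirectory() as workdir:
        local_archive = os.path.join(workdir, os.path.basename(archive_object))
        bucket.blob(archive_object).download_to_filename(local_archive)
        out_dir = os.path.join(workdir, "out")
        # Extract with the 7z CLI, same as in the gsutil walkthrough above.
        subprocess.run(["7z", "x", local_archive, "-o" + out_dir], check=True)
        for root, _, files in os.walk(out_dir):
            for name in files:
                local_path = os.path.join(root, name)
                blob_name = dest_blob_name(dest_prefix, out_dir, local_path)
                bucket.blob(blob_name).upload_from_filename(local_path)
```

The same function body would work as a Cloud Function handler if you pull the bucket and object names out of the trigger event instead of the arguments.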

Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Using the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
Those .crc files are created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read while kicking off a re-replication from one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get from S3 is the ETag of a file, which is the MD5 sum on a small upload; on a multipart upload it is some other string, which again changes when you re-upload the file.
You can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default because distcp gets very confused when uploading from HDFS. For earlier versions you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 library (it's just a HEAD request, after all).
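As an illustration, for a single-part upload you can reproduce the ETag yourself, since it is just the hex MD5 of the object body; multipart ETags carry a -N part-count suffix and use a different scheme. A minimal sketch (fetching the ETag itself, e.g. via a HEAD request, is left out):

```python
import hashlib

def is_multipart_etag(etag):
    """Multipart-upload ETags look like '<hex>-<part count>'."""
    return "-" in etag.strip('"')

def matches_etag(body, etag):
    """Check raw object bytes against a single-part-upload ETag."""
    etag = etag.strip('"')  # S3 returns the ETag wrapped in double quotes
    if is_multipart_etag(etag):
        raise ValueError("multipart ETag is not a plain MD5 of the body")
    return hashlib.md5(body).hexdigest() == etag
```

With boto3, for example, the header comes back as s3.head_object(Bucket=b, Key=k)["ETag"].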

How to clean up the S3 files used by AWS Firehose after it has loaded them?

AWS Firehose uses S3 as intermediate storage before the data is copied to Redshift. Once the data has been transferred to Redshift, how can the files be cleaned up automatically on success?
When I deleted those files manually, Firehose went into a failed state complaining that files had been deleted, and I had to delete and recreate the Firehose stream to resume.
Will deleting those files after 7 days with S3 lifecycle rules work? Or is there an automated way for Firehose to delete the files that were successfully copied to Redshift?
After discussing this with AWS Support:
They confirmed it is safe to delete those intermediate files after a 24-hour period, or after the maximum retry duration has passed.
A lifecycle rule with automatic deletion on the S3 bucket should fix the issue.
Hope it helps.
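For the lifecycle approach, a rule like the following should work (the bucket name and prefix here are placeholders; use whatever S3 prefix you configured on the delivery stream):

```json
{
  "Rules": [
    {
      "ID": "expire-firehose-staging",
      "Status": "Enabled",
      "Filter": { "Prefix": "firehose/" },
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Saved as lifecycle.json, it can be applied with: aws s3api put-bucket-lifecycle-configuration --bucket mybucket --lifecycle-configuration file://lifecycle.json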
Once you're done loading your destination table, execute something similar to the following (the snippet below is typical of a shell script):
aws s3 ls "$aws_bucket/$table_name.txt.gz"
if [ "$?" = "0" ]
then
    aws s3 rm "$aws_bucket/$table_name.txt.gz"
fi
This checks whether the file for the table you've just loaded exists on S3 and, if so, deletes it. Execute it as part of a cron job.
If your ETL/ELT is not recursive, you can put this snippet toward the end of the script. It will delete the file on S3 after populating your table. However, before executing this part, make sure that your target table has been populated.
If your ETL/ELT is recursive, you can put this near the beginning of the script instead, to check for and remove the files created in the previous run. This retains the files until the next run, and is preferable because the file acts as a backup in case the last load fails (or in case you need a flat file of the last load for any other purpose).

aws s3 mv/sync command

I have about 2 million files nested in subfolders in a bucket and want to move all of them to another bucket. After spending a lot of time searching, I found a solution using the AWS CLI mv/sync commands: either use the move command, or use the sync command and then delete all the source files after a successful sync.
aws s3 mv s3://mybucket/ s3://mybucket2/ --recursive
or it can be as
aws s3 sync s3://mybucket/ s3://mybucket2/
But the problem is: how would I know how many files/folders have been moved or synced, and how much time it will take?
And what if some exception occurs (the machine/server stops, or the internet disconnects for any reason)? Do I have to execute the command again, or is it guaranteed to complete and move/sync all files? How can I be sure about the number of files moved/synced and the files not moved/synced?
Or can I do something like this instead:
move a limited number of files, e.g. 100 thousand, and repeat until all files are moved?
Or move files on the basis of upload time, e.g. files uploaded between a start date and an end date?
If yes, how?
To sync them, use:
aws s3 sync s3://mybucket/ s3://mybucket2/
You can repeat the command after it finishes (or fails) without issue. It will check whether anything is missing from, or different in, the target S3 bucket and process it again.
The time depends on the size of the files and on how many objects you have. Amazon counts directories as objects, so they matter too.
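On counting and batching: aws s3 ls s3://mybucket/ --recursive --summarize prints a total object count and total size you can compare between the two buckets, and for date-based batches you can list keys with their timestamps (for example via a boto3 list_objects_v2 paginator) and filter them yourself. A sketch of the filtering step, using made-up listing data:

```python
from datetime import datetime, timezone

def keys_uploaded_between(objects, start, end):
    """Select keys whose LastModified falls in [start, end).

    `objects` holds (key, last_modified) pairs, e.g. collected from a
    boto3 list_objects_v2 paginator.
    """
    return [key for key, ts in objects if start <= ts < end]

# Made-up listing data for illustration:
listing = [
    ("photos/a.jpg", datetime(2018, 1, 5, tzinfo=timezone.utc)),
    ("photos/b.jpg", datetime(2018, 3, 5, tzinfo=timezone.utc)),
]
batch = keys_uploaded_between(
    listing,
    datetime(2018, 1, 1, tzinfo=timezone.utc),
    datetime(2018, 2, 1, tzinfo=timezone.utc),
)
# batch now holds only "photos/a.jpg"; each batch can then be moved with
# aws s3 mv (or boto3 copy_object + delete_object) and re-checked.
```

Moving in bounded batches this way also gives you a natural restart point if the machine or connection dies mid-run.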

Sync command for OpenStack Object Storage (like S3 Sync)?

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command is a complete sync. It uploads new files, updates changed files, and deletes removed files. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?
The link you mentioned has this:
objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation
I'm thinking that if you pass the directory and the --changed option, it should work.
I don't have a Swift deployment to test with. Can you try it and report back?
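An untested sketch, in case it helps: a one-way local-to-container sync using only the swift CLI subcommands (upload with --changed, list, delete); the container and directory names are placeholders, and remote objects with no local counterpart are deleted to emulate the delete half of s3 sync:

```shell
# Upload new/changed files; objects are named after their local paths.
swift upload mycontainer local_dir --changed

# Delete remote objects that no longer exist locally.
swift list mycontainer --prefix local_dir/ | while read -r obj; do
    [ -e "$obj" ] || swift delete mycontainer "$obj"
done
```

Note this covers the upload direction only; a download-side sync with local deletes would need the mirror of the same list-and-compare step.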