I am having a service that uploads files during the day. The same file gets updated multiple time on different events (no determined way to know when it gets updated). At the same time there is a client that downloads the file. What happens if the file gets updated during the download? Does s3 still preserve an old version until all active processes with it are done (kind of like filesystem)? Can the file be corrupted (part from old version, part from new)? Can the connection be closed abruptly in this case?
An object will only be created in Amazon S3 if the upload process completed fully. Partial files will not appear in Amazon S3.
Similarly, when overwriting an object in Amazon S3, the object will only be replaced if the new object was fully uploaded. The new object completely replaces the old object.
There might be a small delay between the upload completing and the new object appearing because objects in Amazon S3 are replicated between multiple servers for durability.
Related
I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering there was a way that I could automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems but instead has a flat structure, more details can be found here.
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of the existence of files and 2020-12 directories (they are not really directories but zero-length objects).
In your case, to solve both points you are mentioning, you should leverage S3 event notifications and use them as a Lambda Trigger. When the Lambda function is triggered, it is passed the name of the object (Key) as an argument, at that point you can simply change its Key.
I.e. Object is uploaded in s3://my_bucket/uploads/file.txt, this creates an event notification that triggers a Lambda function. The functions gets the object and re-uploads it to s3://my_bucket/files/dec/file.txt (and deletes the original one).
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) in the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.
I'm an absolute beginner in AWS and have been practising for 3 months from now.
Recently I was working on S3 and playing a bit with S3 object lock. So I enabled S3 object lock for a specific object with governance mode along with legal hold. Now when I tried to overwrite the object with the same file using the following CLI command:
aws s3 cp /Users/John/Desktop/112133.jpg s3://my-buck/112133.jpg
It succeeded interestingly and I checked in the console that the new file is uploaded with Latest Version on it. Now I read this in AWS docs that:
Bypassing governance mode doesn't affect an object version's legal
hold status. If an object version has a legal hold enabled, the legal
hold remains in force and prevents requests to overwrite or delete the
object version.
Now my question is how it get overwritten if this CLI command is used to overwrite a file? I tried also in the console to re uplaod the same file but it also worked.
Moreover I uploaded another file and enabled ojbect lock with compliance mode and it also get overwritten. But deletion doesn't work for both cases as expected.
Did I understand something wrong about the whole S3 ojbect lock thing? Any help will be appreciated.
To quote the Object Lock documentation:
Object Lock works only in versioned buckets, and retention periods and
legal holds apply to individual object versions. When you lock an
object version, Amazon S3 stores the lock information in the metadata
for that object version. Placing a retention period or legal hold on
an object protects only the version specified in the request. It
doesn't prevent new versions of the object from being created.
I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
Have Spark Streaming watch an S3 prefix and pick up new objects as they
come in
Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application.
While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points:
The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files.
I posted this to Spark mailing list and got a good answer from Steve Loughran.
Theres a slightly-more-optimised streaming source for cloud streams
here
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
Even so, the cost of scanning S3 is one LIST request per 5000 objects;
I'll leave it to you to work out how many there will be in your
application —and how much it will cost. And of course, the more LIST
calls tehre are, the longer things take, the bigger your window needs
to be.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is
opened, even before data has been completely written, it may be
included in the DStream - after which updates to the file within the
same window will be ignored. That is: changes may be missed, and data
omitted from the stream.
Objects written to S3 are't visible until the upload completes, in an
atomic operation. You can write in place and not worry.
The timestamp on S3 artifacts comes from the PUT tim. On multipart
uploads of many MB/many GB uploads, thats when the first post to
initiate the MPU is kicked off. So if the upload starts in time window
t1 and completed in window t2, the object won't be visible until t2,
but the timestamp will be of t1. Bear that in mind.
The lambda callback probably does have better scalability and
resilience; not tried it myself.
Since the number of objects in my scenario is going to be much larger than 5000 and will continue to grow very quickly, S3 to Spark doesn't seem to be a feasible option. I did consider moving/renaming processed objects in Spark Streaming, but the Spark Streaming application code seems to only receive DStreams and no information about which S3 object the data is coming from. So I'm going to go with the Lambda and Kinesis option.
I need to move some files (thousands) to Amazon S3 bucket, from where they will be displayed to the end-user by another application (instead of the current one).
Problem is, that these files have creation/upload date now (dates very between 2012 and 2017, when they were uploaded to current application), and when I move them they all start to be of the same date. That is a problem because when you look at the files in the new application, you don't understand the time hierarchy which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.
What I did was renaming the file to something else and then renaming it again to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name. (and delete the auxiliary file, of course)
It is not optimal, but that's the solution when using AWS client. I hope one day AWS will have all function the FTP used to have.
You can just copy over the same object and the timestamp will update.
This technique is also used to prolong the expire of an object in a bucket with a lifecycle rule.
We need to move our video file storage to AWS S3. The old location is a cdn, so I only have url for each file (1000+ files, > 1TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads file to S3 bucket and updates the DB records with new HTTP url and works perfectly except it takes forever.
Downloading the file takes some time (considering each file close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from cdn to S3, so I could reduce processing time into half? Something like reading chunk of file and then putting it to S3 while reading next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed, I run the app on a server with 1GBit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream," you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or asynch I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
This has been answered by me in this question, here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is no 'direct' pull-from-S3, though. At least this doesn't download each file and then uploads in serial, but streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
if a proxy ( node express ) is suitable for you then the portions of code at these 2 routes could be combined to do a GET POST fetch chain, retreiving then re-posting the response body to your dest. S3 bucket.
step one creates response.body
step two
set the stream in 2nd link to response from the GET op in link 1 and you will upload to dest.bucket the stream ( arrayBuffer ) from the first fetch