Amazon S3 multipart upload - amazon-web-services

I am trying to upload a .bak file (24 GB) to Amazon S3 using the low-level multipart upload API in Java. I was able to write the file successfully, but it took around 7-8 hours. I want to know the average/ideal time to upload a file of this size: is the time it took expected, or can it be improved? If there is scope for improvement, what could be the approach?

If you are using the default settings of the Transfer Manager, then for multipart uploads the DEFAULT_MINIMUM_UPLOAD_PART_SIZE is 5 MB, which is too low for a 24 GB file. This essentially means you'll end up with thousands of small parts uploaded to S3. Since each part is uploaded by a different worker thread, your application will spend too much time on network communication, which will not give you optimal upload speed.
You must increase the minimum upload part size to somewhere between 100 MB and 500 MB, using the setMinimumUploadPartSize setting.
Official documentation for setMinimumUploadPartSize:
Decreasing the minimum part size will cause multipart uploads to be split into a larger number of smaller parts. Setting this value too low can have a negative effect on transfer speeds since it will cause extra latency and network communication for each part.
I am certain you'll see improvement in upload throughput by tuning this setting if you are currently using default settings. Let me know if this improves the throughput.
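For reference, here is a minimal sketch of that tuning with the AWS SDK for Java 1.x Transfer Manager; the bucket name, key, and file path are placeholders, and 100 MB is simply the low end of the range suggested above.

```java
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

import java.io.File;

public class LargeBackupUpload {
    public static void main(String[] args) throws InterruptedException {
        // Raise the minimum part size from the 5 MB default to 100 MB, so a
        // 24 GB file is split into roughly 240 parts instead of ~4,800.
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(AmazonS3ClientBuilder.defaultClient())
                .withMinimumUploadPartSize(100L * 1024 * 1024)
                .withMultipartUploadThreshold(100L * 1024 * 1024)
                .build();

        // Bucket, key and local path are placeholders.
        Upload upload = tm.upload("my-backup-bucket", "backups/db.bak",
                new File("/path/to/db.bak"));
        upload.waitForCompletion();   // blocks until all parts have finished
        tm.shutdownNow();
    }
}
```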
Happy Uploading !!!

Related

Efficient way to upload huge number of small files in S3

I'm encoding DASH streams locally that I intend to stream through CloudFront afterwards, but when it comes to uploading the whole folder it gets counted as 4,000+ PUT requests. So I thought I would instead compress it and upload the zip, which would count as only 1 PUT request, and then unzip it using Lambda.
My question is: is Lambda still going to use PUT requests to unzip the file? And if so, what would be a better/more cost-effective way to achieve this?
No, there is no way around having to pay for the individual PUT/POST requests per-file.
S3 is expensive, and so is anything related to video streaming. The bandwidth and storage costs will eclipse your HTTP request costs. You might consider a more affordable provider; AWS is the most expensive of the providers that offer S3-compatible hosting.
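To see why the charges come back, here is a rough sketch (assuming the AWS SDK for Java 1.x) of what an unzip routine running in Lambda has to do: every extracted entry becomes its own PutObject call. The bucket, keys, and whole-entry in-memory buffering are illustrative only, and the Lambda handler boilerplate is omitted.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipToS3 {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void unzip(String bucket, String zipKey, String destPrefix) throws Exception {
        try (ZipInputStream zis = new ZipInputStream(
                s3.getObject(bucket, zipKey).getObjectContent())) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Buffer the entry in memory (fine for small files only).
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zis.read(chunk)) > 0) buf.write(chunk, 0, n);
                ObjectMetadata meta = new ObjectMetadata();
                meta.setContentLength(buf.size());
                // One PutObject request (and one PUT charge) per extracted file.
                s3.putObject(bucket, destPrefix + entry.getName(),
                        new ByteArrayInputStream(buf.toByteArray()), meta);
            }
        }
    }
}
```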

AWS-S3: How does multipart upload increase throughput?

If I have a 1 GB file and an upload speed of 1 MBps, then whether I upload the whole file at once or in parts, the total upload speed will not increase, since I have an upper bound on the upload speed. Am I right, or is there some other scenario?
You are right. In your scenario there is little to gain from parallel multipart uploads.
The recommendation to use parallel multipart uploads is when you have more bandwidth available on your end than will be used per S3 connection.
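As a back-of-the-envelope illustration of that point, the sketch below compares the two situations; the per-connection throughput figure is an assumed example, not a measured S3 limit.

```java
// Rough arithmetic on when parallel part uploads actually pay off.
public class ParallelUploadMath {
    static double uploadSeconds(double fileMB, double linkMBps,
                                double perConnMBps, int parallelParts) {
        // Aggregate throughput is capped both by the link and by what the
        // parallel connections can collectively push.
        double aggregate = Math.min(linkMBps, parallelParts * perConnMBps);
        return fileMB / aggregate;
    }

    public static void main(String[] args) {
        // The asker's case: 1 GB file, 1 MB/s link. One connection already saturates it.
        System.out.printf("1 MB/s link:   1 part %.0f s, 8 parts %.0f s%n",
                uploadSeconds(1024, 1, 1, 1), uploadSeconds(1024, 1, 1, 8));
        // Fat pipe, but each connection only sustains ~10 MB/s: parallelism helps.
        System.out.printf("100 MB/s link: 1 part %.0f s, 8 parts %.0f s%n",
                uploadSeconds(1024, 100, 10, 1), uploadSeconds(1024, 100, 10, 8));
    }
}
```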

Is Fineuploader with File Chunking more expensive on Amazon S3?

Fine Uploader (http://fineuploader.com/) has the possibility to use File Chunking:
File Chunking / Partitioning: splitting a file into smaller pieces allows for a more efficient overall upload, and powers some Fine Uploader features such as pausing and resuming uploads. Fine Uploader can also upload multiple chunks for the same file concurrently.
Is Fine Uploader with File Chunking more expensive on Amazon S3? I'm thinking of the fact that Amazon charges you for each request to Amazon S3. If Fine Uploader splits a file into smaller pieces, that means more requests to Amazon = more expensive. Is that correct?
Yes, there are more requests, so chunking may result in increased fees. If you expect to allow your users to upload large files, the benefit of chunking is significant in terms of user experience, especially when the new concurrent chunking feature is enabled. However, if the increased cost bothers you, you can always turn chunking off.

Determine available upload/download bandwidth

I have an application which does file upload and download. I am also able to limit the upload/download speed to a desired (configurable) level, so that my application does not consume the whole available bandwidth. I achieve this using the libcurl (HTTP) library.
But my question is: if I have to limit my upload speed to, say, 75% of the available upload bandwidth, how do I find out my available upload bandwidth programmatically, preferably in C/C++? If it is pre-configured, I have no issue; but if it has to be learnt and adapted each time (like I said, to 75% of the available upload limit), I do not know how to figure it out. The same applies to download. Any pointers would be of great help.
There's no way to determine the absolute network capacity between two points on a regular network.
The reason is that traffic can be rerouted along the way, other data streams can appear or disappear, and links can be severed.
What you can do is figure out the bandwidth that is available right now. One way to do it is to upload/download a chunk of data (say 1 MB) as fast as possible (no artificial caps) and measure how long it takes. From there you can work out what bandwidth is available now and go from there.
You could periodically measure the bandwidth again to make sure you're not too far off.
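The question asks for C/C++ with libcurl, but the timing approach is language-agnostic; here is a minimal sketch of the idea in Java (the same pattern applies to a timed curl_easy_perform). The endpoint URL and the 1 MB probe size are assumptions.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BandwidthProbe {
    /** Uploads `bytes` of dummy data as fast as possible and returns the
     *  observed throughput in bytes per second. */
    public static double measureUploadBps(String endpoint, int bytes) throws Exception {
        byte[] payload = new byte[bytes];
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setFixedLengthStreamingMode(bytes);
        long start = System.nanoTime();
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);               // no throttling for the probe
        }
        conn.getResponseCode();               // wait until the server has read it all
        double seconds = (System.nanoTime() - start) / 1e9;
        conn.disconnect();
        return bytes / seconds;
    }

    public static void main(String[] args) throws Exception {
        double bps = measureUploadBps("https://example.com/upload", 1024 * 1024); // 1 MB probe
        System.out.printf("Available upload ~%.0f KB/s, 75%% cap = %.0f KB/s%n",
                bps / 1024, 0.75 * bps / 1024);
    }
}
```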

Backup: Amazon S3 or Glacier - lots of little files?

I'm trying to understand the complicated Amazon Glacier pricing model. I don't want to store a huge amount of data, just a few GB, say 10. I hope never to download the files, and if I did need to, I don't care how long it takes.
Is there a cost per file I upload? Is it cheaper to zip lots of tiny files and upload them in a few chunks, or does having, say, 10,000 images not matter? (I cannot get a straight answer to this from searching.)
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
I know this is a bit old, but you may still find my answer helpful (I hope). The other answer is based on S3 which wasn't your question I believe.
Glacier is intended for rare file access. With that in mind, they sort of punish you if you need to retrieve many files at once. In your particular case I would suggest uploading 10,000 separate files instead of, say, 100 ZIP files with 100 files each. The reason is very simple: Glacier only lets you download 5% of your total archive for free each month, prorated daily. So if, for example, you need to download 10 photos you took on a weekend, you would be able to get those 10 photos for free if they are spread across the vault. On the other hand, if you have a ZIP file with 100 photos inside, you'll be forced to download that zip, which will probably be more than 5% of the total archive, meaning you'll pay some fees for the retrieval.
The only reason it makes sense to upload fewer files is to avoid a high number of upload requests (10,000 files usually means 10,000 requests). Requests are charged at $0.05 per 1,000. These fees are much lower than the retrieval fees (taking the imposed limits into account), which is why I would always recommend uploading separate files. Of course, you may zip files that make sense to keep together.
Retrieval costs are very complex in Amazon Glacier. They have a good explanation here:
http://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free
But even there you'll need to pay attention to the calculations to get a clear idea of how the costs are billed.
Regarding this question:
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
Requests are by file-by-file, although you can select many files at once and download them altogether.
Deciding whether to use S3 or Glacier really depends on your file-access needs. If you will rarely need access to your files, then Glacier is your answer. Otherwise, for 10 GB, S3 can still be cheap and is more flexible than Glacier.
In my case I find family photos to be a very precious thing. That's why I have a 100 GB backup on Glacier with all my family photos. I don't intend to access it unless there is some kind of disaster at home. In that case, I think I would not mind the retrieval cost if it saved something I really care about. But that's just me.
Detailed pricing information for S3 is available here. Specifics of the API functions available are here.
For S3, you are mostly charged for upload bandwidth (bytes sent TO S3), download bandwidth (bytes received FROM S3), and storage (bytes IN S3). You are also charged for the number and type of API calls.
So, if you upload your 10GB of data to S3 in 10,000 1MB files, store it for a month, and then download each of the files once, you'll be charged:
$0.00 for upload bandwidth (this is free)
$0.10 for the 10,000 PUT requests to upload the files
$0.95 for storing the 10GB for a month
$1.08 for 10GB of download bandwidth (the first GB is free, then $0.12/GB)
$0.01 for the 10,000 GET requests to download the files
That's $2.14. If you uploaded and downloaded once each, but kept the data for a year, only the storage cost would go up to 12 * $0.95, or $11.40. If your files averaged only 100KB, so you had 100,000 of them, you'd pay 10 times as much for the PUT and GET requests, or $1.10 instead of $0.11.
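To make the arithmetic easy to check, here is a small sketch that reproduces the figures above using the prices quoted in this answer (these are older rates; current S3 pricing differs, so treat the constants as examples).

```java
// Reproduces the cost breakdown above: 10,000 files, 10 GB, one month, one download each.
public class S3CostEstimate {
    public static void main(String[] args) {
        int files = 10_000;
        double gb = 10.0;

        double putCost = files / 1_000.0 * 0.01;           // $0.01 per 1,000 PUT requests
        double storageCost = gb * 0.095;                    // $0.095 per GB-month
        double downloadCost = Math.max(0, gb - 1) * 0.12;   // first GB free, then $0.12/GB
        double getCost = files / 10_000.0 * 0.01;           // $0.01 per 10,000 GET requests

        double total = putCost + storageCost + downloadCost + getCost;
        System.out.printf("PUT $%.2f, storage $%.2f, download $%.2f, GET $%.2f, total $%.2f%n",
                putCost, storageCost, downloadCost, getCost, total);
        // Prints: PUT $0.10, storage $0.95, download $1.08, GET $0.01, total $2.14
    }
}
```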
You can only upload and download a single file per operation. If you combined your files into one using Zip, you'd only save by using fewer operations, which, as you can see, are pretty cheap to start with.
There is one quirk here, though. I'm pretty sure you are charged for all bandwidth usage when uploading and downloading, including request headers, not just the bodies containing your data. So if your files were really tiny the request headers might become significant, perhaps as much as the files themselves. In that case your bandwidth costs would double.
Glacier pricing is more complicated, and I've never used it myself. Basically, it reduces storage cost by almost ten-fold, leaving the other costs the same, and adding costs to archive and restore per object. Those costs seem to be significant if you have a lot of small objects, need to get a lot of your files at a time, or get files frequently. Glacier seems to be best when you have a lot of data (terabytes or more, not just gigabytes), but few operations. Given that you only have 10GB of data, S3 is so inexpensive it doesn't seem worth it to consider Glacier.
Finally, AWS has a free usage tier for the first year, which looks like it would cover all your costs except for half the storage charges.
Better to use a few larger files than a lot of small ones
There are two approaches to putting files into Amazon Glacier. You either interact with vaults directly, or use S3 as frontend.
I am using S3 (and Amazon Management Console) so that I am able to see content of the archive and at the same time have it stored cheaply in Glacier.
This approach has one drawback: since storing any piece of information in Glacier carries some data overhead (which you also pay for), there is logically a break-even point. Before the April 2014 price reduction I did the calculation, and the critical size was about 16 kB; storing smaller files in Glacier (using AWS S3 as the frontend) was more expensive than keeping them on S3 only. With the price reduction for S3 storage (Glacier did not change), the break-even point moved even higher.
I guess that even without S3 as the frontend the situation will be similar, although a bit more friendly to smaller files.
On November 21, 2016, Amazon updated the free tier policy for Glacier retrievals, replacing the "5% of your average monthly storage" allowance with a flat 10 GB free per month. However, if your retrieval policy was set before that date, then you're still on the "5%" policy and the other answers here still apply to you.
If your retrieval policy was set after Nov 21, 2016, and you're in the OP's shoes:
You're only storing 10 GB, so you could retrieve all of your data for free once per month using Standard retrievals. For retrieval costs, it makes no difference whether all 10,000 photos are zipped into one zip file or not.
The only variable in this scenario is the number of upload requests. 10,000 requests at $0.05 per 1,000 is only $0.50, and that's a one-time fee in your specific case.
More pricing info at AWS Glacier FAQ
UPDATE:
The Glacier docs recommend using multipart upload for files larger than 100 MB.
I came to this conclusion independently after a couple of timeouts when trying to upload an 8 GB file.
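For completeness, below is a minimal sketch of a Glacier archive upload with the AWS SDK for Java 1.x high-level API, which handles splitting large archives into parts so you don't have to drive the part-by-part calls yourself; the vault name, description, and file path are placeholders.

```java
import com.amazonaws.services.glacier.AmazonGlacierClientBuilder;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManager;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManagerBuilder;
import com.amazonaws.services.glacier.transfer.UploadResult;

import java.io.File;

public class GlacierArchiveUpload {
    public static void main(String[] args) throws Exception {
        // The high-level ArchiveTransferManager uploads large archives in parts,
        // avoiding the single-request timeouts mentioned above.
        ArchiveTransferManager atm = new ArchiveTransferManagerBuilder()
                .withGlacierClient(AmazonGlacierClientBuilder.defaultClient())
                .build();

        UploadResult result = atm.upload("my-backup-vault",
                "family photos archive", new File("/path/to/photos.tar"));
        // Keep the archive ID: you need it later to request a retrieval job.
        System.out.println("Archive ID: " + result.getArchiveId());
    }
}
```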