Looking at the breakdown of charges from AWS SageMaker, I noticed that only about 30% of the total cost comes from actually running the instances; surprisingly, ~50% comes from S3 (shown as ListBucket) and ~20% from other overhead. I wonder if there is a way to reduce this massive extra charge from S3.
To give more background: I run hundreds of training jobs, each roughly 3 hours long, and the data is hundreds of pickle files zipped into a ~10 GB tar.gz file (which gets unzipped on the instance).
So if I run 1000 jobs on instances priced at $0.1/hr, I expect to see a charge of around $300 (1000 jobs * 3 hours * $0.1/hr); however, it ends up being close to $1000, with around $500 coming from "ListBucket"!
I wonder where this comes from. Since the S3 folder with the training data holds just a single zipped file, why would ListBucket cost so much?
Related
I am using DigitalOcean Spaces as cloud storage for my users' data, and it charges me both for hosting the data and for data transfer. So I want to migrate to Amazon S3 (frequent access). I went through the official AWS S3 docs and understood that it charges only for the data hosted in their storage, regardless of the number of retrievals. I am new to the AWS ecosystem and not sure about AWS's pricing model. Please let me know the pricing estimate for the following scenario:
=> any user can upload data in my mobile application
=> if I store around 100 GB of data with AWS S3,
=> if I retrieve that 100 GB around 50 to 100 times a day in my mobile app,
=> how much do I need to pay per month?
=> current pricing to store 1 GB is around $0.02 ($0.02/GB).
Not sure what documentation you were reading, but the official S3 pricing page is pretty clear that you are charged for:
Data storage, which depends on region but is somewhere between 2 and 5 US cents per gigabyte, per month.
Number of requests, which again depends on region, but is on the order of a few US cents per 1,000 requests (retrieving a file is a GET request; uploading a file is a PUT request).
Data transfer, which again depends on region, but ranges from a low of $0.09/GB in the US regions to a high (I think) of $0.154/GB in the Cape Town region.
So, if you're retrieving 100 GB of data 100 times a day, you will be paying data transfer costs of anywhere from $900 to $1540 per day.
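As a quick sanity check of that arithmetic, here is a small sketch in Python using the price ranges quoted above (actual rates depend on your region):

```python
# Back-of-the-envelope check of the data-transfer figures above.
# Assumes 100 GB retrieved 100 times per day; the per-GB rates are the
# low/high regional prices quoted in this answer.
GB_PER_RETRIEVAL = 100
RETRIEVALS_PER_DAY = 100
DAYS_PER_MONTH = 30

for region, rate_per_gb in [("cheapest US region", 0.09), ("most expensive region", 0.154)]:
    per_day = GB_PER_RETRIEVAL * RETRIEVALS_PER_DAY * rate_per_gb
    print(f"{region}: ${per_day:,.0f}/day, ${per_day * DAYS_PER_MONTH:,.0f} per 30-day month")

# Storage, by comparison, is tiny: 100 GB at $0.02-$0.05 per GB-month.
print(f"storage: ${100 * 0.02:.2f} to ${100 * 0.05:.2f} per month")
```

In other words, data transfer dominates this workload; the storage line item is noise next to it.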
In my experience, Digital Ocean tends to be cheaper than AWS for most things (but you get fewer features). However, if you're really transferring 10 TB of data per day (I think that's unlikely, but it's what you asked), you should look for some hosting service that offers unlimited bandwidth.
Suppose I have a script which uploads a 100GB object every day to my S3 bucket. This same script will delete any file older than 1 week from the bucket. How much will I be charged at the end of the month?
Let's use pricing from the us-west-2 region. Suppose this is a 30-day month and I start with no data in the bucket at the beginning of the month.
If charged for maximum bucket volume per month, I would have 700GB at the end of the month and be charged $0.023 * 7 * 100 = $16.10. Also some money for my PUT requests ($0.005 per 1,000 requests so effectively 0).
If charged for total amount of data that had transited through the bucket over the course of that month, I would be charged $0.023 * 30 * 100 = $69. (again +effectively $0 for PUT requests)
I'm not clear on which of these two cases Amazon bills. This becomes very important for me, since I expect to have a high amount of churn in my bucket.
Both of your calculations are incorrect, although the first one comes close to the right answer, for the wrong reason. It is neither peak nor end-of-month that matters.
The charge for storage is calculated hourly. For all practical purposes, this is the same as saying that you are billed for your average storage over the course of a month -- not your maximum, and not the amount you uploaded.
Storing 30 GB for 30 days or storing 900 GB for 1 day would cost the same amount, $0.69.
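To make the averaging concrete, here is a tiny sketch assuming the $0.023/GB-month us-west-2 STANDARD rate used in the question:

```python
# Storage is billed on average usage over the month (TimedStorage-ByteHrs),
# so these two patterns cost the same. Assumed rate: $0.023/GB-month
# (us-west-2 STANDARD, as in the question).
RATE_PER_GB_MONTH = 0.023
DAYS_IN_MONTH = 30

def monthly_storage_cost(gb, days_stored):
    gb_months = gb * days_stored / DAYS_IN_MONTH  # average GB held over the month
    return gb_months * RATE_PER_GB_MONTH

print(f"30 GB for 30 days: ${monthly_storage_cost(30, 30):.2f}")  # -> $0.69
print(f"900 GB for 1 day:  ${monthly_storage_cost(900, 1):.2f}")  # -> $0.69
```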
The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account. We measure your storage usage in “TimedStorage-ByteHrs,” which are added up at the end of the month to generate your monthly charges.
https://aws.amazon.com/s3/faqs/#billing
This is true for STANDARD storage.
STANDARD_IA and GLACIER are also billed hourly, but there is a notable penalty for early deletion: Each object stored in these classes has a minimum billable lifetime of 30 days in IA or 90 days in Glacier, no matter when you delete it. Both of these alternate storage classes are only appropriate for data you do not intend to delete soon or retrieve often, by design.
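To make the early-deletion penalty concrete, here is a small sketch; the $0.0125/GB-month STANDARD_IA rate is my assumption for a US region, not a figure from this answer:

```python
# STANDARD_IA bills every object for a minimum of 30 days, even if you
# delete it earlier. The rate below is an assumed US-region price.
IA_RATE_PER_GB_MONTH = 0.0125
MIN_BILLABLE_DAYS = 30

def ia_storage_cost(gb, days_stored):
    billable_days = max(days_stored, MIN_BILLABLE_DAYS)
    return gb * IA_RATE_PER_GB_MONTH * billable_days / 30

print(f"100 GB kept 10 days in IA: ${ia_storage_cost(100, 10):.2f}")  # billed as 30 days
print(f"100 GB kept 45 days in IA: ${ia_storage_cost(100, 45):.2f}")
```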
REDUCED_REDUNDANCY storage follows the same rules as STANDARD (hourly billing, no early delete penalty) but after the most recent round of price decreases, it is now only less expensive than STANDARD in regions with higher costs. It is an older offering that is no longer competitively priced in regions where STANDARD pricing is lowest.
Your bill for storage will be closer to your #1 example, perhaps a bit higher because, for brief periods while the 8th day's object is uploading, you still have 7 days' worth of storage accruing charges; but you won't be charged anywhere near your #2 example.
Firstly, you don't need a script to delete files older than 1 week. You can set a lifecycle rule on the bucket that will do that automatically, or perhaps transition the contents to Glacier (at roughly 10% of the cost) if you might need them later.
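For example, a minimal sketch of such a lifecycle rule with boto3 (the bucket name and prefix are placeholders, and you could add a Glacier transition instead of, or before, the expiration):

```python
import boto3

s3 = boto3.client("s3")

# Expire (delete) objects under the given prefix 7 days after creation.
# "my-bucket" and "daily-dumps/" are placeholders for your own bucket/prefix.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-7-days",
                "Filter": {"Prefix": "daily-dumps/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
                # Or move to Glacier first, then expire later:
                # "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```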
Secondly, the storage cost might not be huge. A better approach would be for the script to delete the old data from S3 first (if you want the script to do that) and then add the new data, so that your bucket never holds more data overall and you are always charged on a consistent storage basis.
Thirdly, your main charge could be bandwidth (if not handled well), which can be really huge since you are transferring so much data. If all this data is generated internally from your grid, make sure you create a VPC endpoint to S3 so that you don't pay bandwidth charges, as the data transfer will then be treated as traveling over the internal network.
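Here is a sketch of creating such a gateway endpoint with boto3; the VPC ID, route table ID, and region are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# A gateway endpoint for S3 keeps traffic from instances in this VPC to S3
# on the AWS network, so it isn't billed as internet data transfer.
# The vpc-/rtb- IDs below are placeholders.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```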
We have n files with a total size of around 100 GiB. We need to upload all the files to an EC2 Linux instance hosted in AWS (US region).
My office (in India) internet connection is a 4 Mbps dedicated leased line. It's taking more than 45 minutes to upload a 500 MB file to the EC2 instance, which is too slow.
How do we transfer this kind of bulk upload in the minimum amount of time?
If it were hundreds of TB we could go with Snowball import/export, but this is only 100 GiB.
It should be about 3x faster than what you're experiencing.
If there are many small files, you can "zip" them so you send fewer, larger files.
And make sure you don't bottleneck the Linux server by encrypting the data (ssh/sftp). FTP may be your fastest option.
But 100 GiB will always take at least 57 hours at your maximum speed.
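A quick back-of-the-envelope check of those figures, assuming the full 4 Mbps line rate and no protocol overhead:

```python
# Theoretical minimum transfer times on a dedicated 4 Mbps line,
# ignoring protocol overhead and any other traffic on the link.
LINK_BPS = 4 * 10**6           # 4 Mbps
BITS_PER_MB = 8 * 10**6        # decimal megabyte
BITS_PER_GIB = 8 * 2**30       # binary gibibyte

secs_500mb = 500 * BITS_PER_MB / LINK_BPS
print(f"500 MB: ~{secs_500mb / 60:.0f} minutes minimum (observed: 45+ minutes)")

secs_100gib = 100 * BITS_PER_GIB / LINK_BPS
print(f"100 GiB: ~{secs_100gib / 3600:.0f} hours minimum")
```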
I have an S3 bucket in account A with millions of files that take up many GBs.
I want to migrate all of this data into a new bucket in account B.
So far, I've given account B permissions to run s3 commands on the bucket in account A.
I am able to get some results with the aws s3 sync command, using the setting aws configure set default.s3.max_concurrent_requests 100.
It's fast, but it only achieves a rate of about 20,000 parts per minute.
Is there an approach to sync/move data across AWS buckets in different accounts REALLY fast?
I tried S3 Transfer Acceleration, but it seems that it is intended for uploading to and downloading from buckets, and I think it works within a single AWS account.
20,000 parts per minute.
That's > 300/sec, so, um... that's pretty fast. It's also 1.2 million per hour, which is also pretty respectable.
S3 Request Rate and Performance Considerations implies that 300 PUT req/sec is something of a default performance threshold.
At some point, make too many requests too quickly and you'll overwhelm your index partition and you'll start encountering 503 Slow Down errors -- though hopefully aws-cli will handle that gracefully.
The idea, though, seems to be that S3 will scale up to accommodate the offered workload, so if you leave this process running, you may find that it actually does get faster with time.
Or...
If you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Note, also, that it says "temporary limits." This is where I come to the conclusion that, all on its own, S3 will -- at some point -- provision more index capacity (presumably this means a partition split) to accommodate the increased workload.
You might also find that you can get away with a much higher aggregate transaction rate if you run multiple separate jobs, each handling a different object prefix (e.g. asset/1, asset/2, asset/3, etc., depending on how the keys are designed in your bucket), because you're not creating such a hot spot in the object index.
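Here's a rough sketch of that approach: driving several aws s3 sync processes in parallel, one per prefix. The bucket names and the asset/1, asset/2, ... prefixes are illustrative; adjust them to however your keys are actually laid out:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

SRC = "s3://source-bucket-account-a"   # placeholder bucket names
DST = "s3://dest-bucket-account-b"
PREFIXES = ["asset/1/", "asset/2/", "asset/3/", "asset/4/"]  # illustrative prefixes

def sync_prefix(prefix):
    # Each job copies one slice of the key space, spreading load across
    # the object index instead of hammering a single hot spot.
    return subprocess.run(
        ["aws", "s3", "sync", f"{SRC}/{prefix}", f"{DST}/{prefix}"],
        check=True,
    )

with ThreadPoolExecutor(max_workers=len(PREFIXES)) as pool:
    list(pool.map(sync_prefix, PREFIXES))
```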
The copy operation going on here is an internal S3-to-S3 copy. It isn't download + upload. Transfer Acceleration only helps with actual uploads and downloads over the Internet, so it doesn't apply to this copy.
I'm using Django and Elastic Beanstalk. I just made a new post and saw I was charged $0.01 by AWS, which kind of worries me. Does this mean that every time I make a post this amount will be charged? What if I make one and then delete it, will I still be charged? Can someone with experience of Elastic Beanstalk help me out?
Why not delete it and see what happens to the cost? Deleting doesn't count as data transfer, so my guess is you won't pay a thing. Putting items on the queue does count as data transfer and you will pay for it. Keeping items on the queue (data storage) will also cost you, as you can see here: https://aws.amazon.com/elasticbeanstalk/pricing/
Amazon EC2 Pricing (includes pricing for instances, load balancing, elastic block storage, and data transfer)
Amazon S3 Pricing (includes pricing for storage and data transfer)
The actual issue here seems to be a misunderstanding of the terminology used in pricing.
S3 charges $0.005 per 1,000 PUT/POST/LIST requests (some regions are somewhat higher, but this pricing is used through the rest of the answer).
This terminology does not mean that each request will actually be billed as $0.005 ÷ 1000 = $0.000005, even though this is a close approximation of what they will ultimately cost.
It actually means you are billed CEIL(TOTAL_REQUESTS / 1000) * $0.005...
...where TOTAL_REQUESTS is the number of that type of request you made during a monthly billing interval within one S3 region.
So making 1, 2, 500, 999, or 1000 requests results in a total monthly charge of $0.005, rounded up to $0.01. Not $0.01 each.
Making 1001 through 2000 total requests is a total of $0.005 + $0.005 = $0.01.
Making 2001 through 3000 total requests is a total of $0.015, which rounds up to $0.02.
...ad infinitum...
You wouldn't be billed more than $0.01 total until after the first 2000 requests.
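That arithmetic, as a tiny sketch reproducing the figures above:

```python
import math

# CEIL(TOTAL_REQUESTS / 1000) * $0.005, with the monthly bill then rounded
# up to whole cents, as described above.
def monthly_request_charge(total_requests):
    units_of_half_a_cent = math.ceil(total_requests / 1000)  # $0.005 per 1,000 requests
    cents = math.ceil(units_of_half_a_cent * 0.5)            # round up to a whole cent
    return cents / 100

for n in (1, 999, 1000, 1001, 2000, 2001, 3000):
    print(f"{n:>5} requests -> ${monthly_request_charge(n):.2f}")
```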