When unloading data from Redshift to S3 with PARALLEL ON, the results of the SELECT statement are split across a set of files, as described in the Redshift documentation. How does this translate into the number of requests sent to the S3 endpoint?
For example, if the UNLOAD query (with PARALLEL ON) generated 6 files in the S3 bucket, does that correspond to 6 PUT requests on S3? Or is the number of requests received by S3 from this UNLOAD query execution higher?
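For reference, the kind of statement in question might be issued like this (the cluster, database, IAM role, and S3 prefix below are placeholders, not actual values from my setup):

```python
import boto3

# Hypothetical example: run an UNLOAD with PARALLEL ON through the Redshift Data API.
# All identifiers below are placeholders.
client = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-unload-bucket/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
PARALLEL ON;
"""

client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=unload_sql,
)
```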
I would appreciate it if you could briefly clarify.
If you're worried about cost then you're looking in the wrong place.
PUT requests in Amazon S3 (in a US region) are a half-cent per 1000 requests. Compare that to a Redshift cluster that is a minimum of 25c per hour (which is still rather good value).
Take a look at your monthly bill: S3 request pricing is unlikely to feature among your top costs.
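To put rough numbers on it (the file count and UNLOAD frequency here are illustrative assumptions, not your workload):

```python
# Illustrative only: even a fairly heavy UNLOAD schedule barely registers on the bill.
put_price_per_1000 = 0.005      # USD per 1,000 PUT requests (S3 standard, US region)
files_per_unload = 6            # assumed, matching the example in the question
unloads_per_day = 1000          # assumed workload

monthly_puts = files_per_unload * unloads_per_day * 30
monthly_cost = monthly_puts / 1000 * put_price_per_1000
print(f"{monthly_puts:,} PUTs -> ${monthly_cost:.2f}/month")   # 180,000 PUTs -> $0.90/month
```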
My AWS S3 costs have been going up pretty quickly for the usage type "DataTransfer-Out-Bytes". I have thousands of images in this one bucket, and I can't seem to find a way to drill down into the bucket to see which individual items might be causing the increase. Is there a way to see which individual files are contributing to the higher data transfer cost?
Use CloudFront if you can - it's cheaper than hosting directly from S3 (if you set your cache headers properly!), and CloudFront includes a popular-objects report, which would answer your question.
If you're using S3 alone, you need to enable logging on the bucket (more storage cost) and then crunch the data in the logs (more data transfer cost) to get your answer. You can use AWS Athena to process the S3 access logs, or use Unix command-line tools like grep/wc/uniq/cut to operate on the log files locally or from a server to find the culprits.
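If you do enable bucket logging, a rough sketch of crunching the downloaded log files locally might look like the following (the file name and the regex are assumptions based on the documented S3 access log format, so treat it as a starting point rather than a full parser):

```python
import re
from collections import Counter

# Rough sketch: tally bytes sent per object key from S3 server access logs that have
# been downloaded locally (e.g. with `aws s3 sync`). The regex keys off the quoted
# request line, HTTP status, error code, and bytes-sent fields of the access log format.
line_re = re.compile(
    r'"(?:GET|HEAD) /?(?P<key>\S*) HTTP/[^"]*" (?P<status>\d{3}) \S+ (?P<bytes>\d+|-)'
)

bytes_by_key = Counter()
with open("access-logs.txt") as fh:          # concatenated log files, assumed name
    for line in fh:
        m = line_re.search(line)
        if m and m.group("bytes") != "-":
            bytes_by_key[m.group("key")] += int(m.group("bytes"))

# Top 20 objects by bytes served -- the likely DataTransfer-Out culprits.
for key, total in bytes_by_key.most_common(20):
    print(f"{total:>15,}  {key}")
```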
I am confused about the pricing criteria in AWS S3.
Let's assume
SELECT * FROM TABLE => returns 1000 rows
Is this command counted as 1000 requests or as 1 request?
If you are using S3 Select to query, it is a single request. S3 Select requests are billed per 1,000 requests, at the same rate as object GET requests.
If you are using Athena to query S3, the charge is based on the amount of data scanned, which depends on how the files are stored, e.g. compressed or in Parquet format.
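For illustration, a single S3 Select call with boto3 looks like this (the bucket, key, and CSV settings are placeholders); however many rows it returns, it is still one request for billing purposes:

```python
import boto3

s3 = boto3.client("s3")

# One S3 Select call = one billable request, regardless of how many rows match.
resp = s3.select_object_content(
    Bucket="my-bucket",                          # placeholder
    Key="data/table.csv",                        # placeholder
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matched rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```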
AWS S3 Select charges for the bytes scanned ($0.00200 per GB) and the bytes returned ($0.0007 per GB).
The charges are broken down in the Billing Management Console under Billing > Bills > Details > Simple Storage Service > Region > "Amazon Simple Storage Service USE2-Select-Returned-Bytes" and "Amazon Simple Storage Service USE2-Select-Scanned-Bytes".
Using LIMIT, WHERE clauses, or a compressed file reduces the bytes scanned. The service supports gzipped Parquet files, which is fantastic. I was able to reduce file sizes by 90% and take advantage of the columnar data format rather than CSV.
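A quick back-of-the-envelope calculation using those rates (the data volumes are made-up examples):

```python
# Hypothetical volumes: 2 GB scanned (after compression savings), 0.2 GB returned.
scanned_gb = 2.0
returned_gb = 0.2

cost = scanned_gb * 0.00200 + returned_gb * 0.0007
print(f"${cost:.5f} per query")   # $0.00414 per query
```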
I want to query AWS load balancer logs and automatically send a report on a schedule.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created a data table based on the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter the following issues:
The logs bucket increases in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda function timeout to wait for Athena to return a result (for example, in case Athena needs more than 15 minutes).
Can you suggest a better solution to my problem? I am thinking of using the ELK stack, but I have no experience working with ELK. Can you explain the advantages and disadvantages of ELK compared to the combination of AWS Lambda + AWS Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
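A minimal sketch of that start-then-poll pattern with boto3 (the database, query, and output location are placeholders):

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off the query; this returns immediately with an execution id.
start = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM alb_logs WHERE elb_status_code = '500'",  # placeholder query
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/reports/"},  # placeholder bucket
)
query_id = start["QueryExecutionId"]

# Poll until Athena finishes -- this could run in a later, scheduled Lambda
# invocation rather than keeping one function alive for the whole query.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```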
With that out of the way, the problem with reading access logs from Athena is that they aren't easy to partition. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have segments of the form /column=value/. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table that uses this simpler scheme. I haven't done that, so I can't give you information other than what's in the docs.
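As a hedged illustration of the idea: if you define a table with year/month/day partition columns, you can register each day's S3 path explicitly. The table name, bucket, and path below are placeholders, not the layout from the linked guide:

```python
import boto3

athena = boto3.client("athena")

# Assumes a table declared with PARTITIONED BY (year string, month string, day string).
add_partition_sql = """
ALTER TABLE alb_logs_partitioned ADD IF NOT EXISTS
PARTITION (year = '2020', month = '01', day = '15')
LOCATION 's3://my-alb-log-bucket/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2020/01/15/'
"""

# Run the DDL through the same StartQueryExecution API shown above.
athena.start_query_execution(
    QueryString=add_partition_sql,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/ddl/"},
)
```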
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
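For the Bucket_A side of that setup, a minimal lifecycle rule might look like this with boto3 (the bucket name and retention period are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Expire access-log objects in Bucket_A after 30 days; replication to Bucket_B
# (configured separately) keeps the long-term copies.
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket-a-access-logs",                 # placeholder name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Filter": {"Prefix": ""},          # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```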
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.
I'm looking to understand exactly what charges are going to be incurred if I were, for example, to create a private API Gateway REST API (and perhaps an ASP.NET Core web API) which streams images/documents into an S3 bucket.
The reason I am considering this is to utilize the existing authentication mechanism which is in place for the private REST API, and to avoid any complexity around trying to allow S3 uploads using things like Direct Connect.
I was told by someone that doing something such as this would cause the bill to rise, and there were concerns about costs.
I'm just looking to understand all the costs involved. Again, all I am looking for here is an API endpoint which clients can upload images to, avoiding all the complexity involved in trying to create some private connection between on-prem clients and S3 (which looks complex).
Is anyone doing something similar to this?
As per the AWS documentation, the maximum single payload size API Gateway can handle is 10 MB. On the assumption that all POST requests will be under that limit, the costs will take into consideration charges from API Gateway, S3, and (assuming you want to handle the files first) Lambda. Without knowing your region, pricing is based on US-East-2 (Ohio), and the free tier is taken as non-existent (maximum charges).
Breaking the pricing into those 3 sections, you can expect the following:
Total - $6.72 USD/month
API Gateway ~ $0.88 USD
An HTTP API used for uploading data. The API is called 100k times a month to upload documents which are on average 5 MB in size: [100,000 * (5 MB / 512 KB)] * [1/1,000,000] = $0.8789
S3 ~ $1.65 USD
50 GB of files for a month with standard storage and no GET requests: [50 * $0.023] + [100,000 * $0.000005] = $1.65
Lambda ~ $4.19 USD
512 MB of memory for the function, executed 100k times in one month, running for 5 seconds each time: [(100,000 * 5) * (512/1024) * $0.00001667] + [$0.20 * 0.1] = $4.1875
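The same estimate written out as a small script, so you can swap in your own volumes (the API Gateway figure is taken as given from the line above; everything else uses the stated assumptions):

```python
# Sanity-check of the estimate above; adjust the inputs for your own workload.
requests_per_month = 100_000
storage_gb = 50
lambda_seconds = 5
lambda_memory_gb = 512 / 1024

api_gw = 0.8789                                                        # as computed in the API Gateway line above
s3_cost = storage_gb * 0.023 + requests_per_month * 0.000005           # storage + PUT requests  -> 1.65
lambda_cost = (requests_per_month * lambda_seconds * lambda_memory_gb * 0.00001667
               + 0.20 * requests_per_month / 1_000_000)                # duration + request charge -> 4.1875

print(f"S3 ~ ${s3_cost:.2f}, Lambda ~ ${lambda_cost:.2f}")
print(f"Total ~ ${api_gw + s3_cost + lambda_cost:.2f}/month")          # ~ $6.72
```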
If you want more specific information on a service, I suggest you look at the AWS Doc pricing links I included for each service which have a very extensive breakdown of costs.
I am running a simple query against my S3 bucket with CloudTrail logs. The bucket is big, and after around 1 minute and 45 seconds I get the error:
HIVE_CURSOR_ERROR: Please reduce your request rate.
Is there a way to limit request rate against my s3 bucket within Athena?
SELECT *
FROM default.cloudtrail_logs_cloudtraillog
WHERE eventname = 'DeleteUser' AND awsregion = 'us-east-1'
So I will summarize the solutions suggested by AWS. None of them are great, and I wonder why AWS would not throttle on their end rather than throw the error.
By default, S3 will scale automatically to support very high request rates. When your request rate is scaling, S3 automatically partitions your S3 bucket as needed to support higher request rates. However, sometimes it still errors out. So they suggest waiting (they do not suggest a time frame) to give S3 enough time to auto-partition your bucket based on the request rate it is receiving.
They also suggest:
1) Using the S3DistCp utility to combine small files into larger objects. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
2) Partitioning https://docs.aws.amazon.com/athena/latest/ug/partitions.html
I got the same answer from AWS support. Since I was doing a one-off analysis, I ended up writing a script to copy a small date range worth of logs to a separate bucket and using Athena to analyze the smaller dataset.
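In case it helps anyone else, a rough version of that kind of one-off copy script might look like this (bucket names, account id, region, and date prefix are all placeholders):

```python
import boto3

s3 = boto3.client("s3")

SRC = "my-cloudtrail-bucket"          # placeholder: bucket with the full log history
DST = "my-cloudtrail-subset"          # placeholder: smaller bucket for Athena to query
# CloudTrail lays keys out as AWSLogs/<account>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/...,
# so a prefix per month (or day) selects the date range of interest.
PREFIX = "AWSLogs/123456789012/CloudTrail/us-east-1/2020/01/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=DST,
            Key=obj["Key"],
            CopySource={"Bucket": SRC, "Key": obj["Key"]},
        )
```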