AWS Athena concurrency limits: Number of submitted queries VS number of running queries - concurrency

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes about 2 minutes to finish. In a AWS account, it is only me who is using Athena service. However, when I look at the state of queries through console I see that only a few of queries (5 on average) are actually being executed despite all of them being in state Running. Here is what would normally see in Athena hisotry tab:
I understand that, after I submit queries to Athena, it processes the queries by assigning resources based on the overall service load and the amount of incoming requests. But I tried to run them at different days and hours, still would get about 5 queries being executed at the same time.
So my question is this how it supposed to be? If it is then what is the point of being able to submit up to 20 queries if roughly 15 of them would be idling and waiting for available slots.
Update 2019-09-26
Just stumbled across HIVE CONNECTOR in presto documentation, which has a section AWS Glue Catalog Configuration Properties. There we can see
hive.metastore.glue.max-connections: Max number of concurrent connections to Glue (defaults to 5).
This got me wonder if it has something to do with my issue. As I understand, Athena is simply a Presto that runs on EMR cluster which is configured to use AWS Glue Data Catalog as the Metastore.
So what if my issue comes from the fact that EMR cluster for Athena simply uses default value for concurrent connections to Glue, which is 5 which and is exactly of how many concurrent queries are actually getting executed (on average) in my case.
Update 2019-11-27
The Athena team recently deployed a host of new functionality for Athena. although QUEUED has been in the state enum for some time is hasn't been used until now. So now I get, correct info about query state in a history tab, but everything else remains the same.
Also, another post was published with similar problem.

Your account's limits for the Athena service is not an SLA, it's more of a priority in the query scheduler.
Depending on available capacity your queries may be queued even though you're not running any other queries. Exactly what a higher concurrency limit means is internal and could change, but in my experience it's best to think of it as the priority by which he query scheduler will deal with your query. Queries for all accounts run in the same server pool(s) and if everyone is running queries there will not be any capacity left for you.
You can see this in action by running the same query over and over again and then plot the query execution metrics over time, you will notice that they vary a lot, and you will notice spikes in the time your queries are queued on the top of every hour – when everyone else is running their scheduled queries.

Related

Joining many large files on AWS

I am looking for advice which service should I use. I am new to big data and confused with differences between them on AWS.
Use case:
I receive 60-100 csv files daily (each one can be from few MB to few GB). There are six corresponding schemas, and each file can be treated as part of only one table.
I need to load those files to the six database tables and execute joins between them and generate daily output. After generation of the output, the data present in database is no longer need, so we can truncate that tables and await on the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I heard that spark could be used to do the joins, but what is the proper, optimal and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, ​execute SQL joins with Athena
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, trigger spark job (AWS Glue in serverless form) to do the SQL joins and setup output to the S3.
Edit 2:
But according to the: https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (aka Presto) is not scalable as much as I want. I've seen "Query exhausted resources at this scale factor" for my case.
Any possibility of changing the file type to a columnar format like parquet? Then you can use AWS EMR and spark should be able to handle the joins easily. Obviously, you need to optimize the query depending on the data/cluster size etc.

How to query AWS load balancer log if there are terabytes of logs?

I want to query AWS load balancer log to automatically and on schedule send report for me.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created data table based on guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter following issues:
Logs bucket increases in size day by day. And I notice if Athena query need more than 5 minutes to return result, sometimes, it produce "unknown error"
Because the maximum timeout for AWS Lambda function is 15 minutes only. Therefore, I can not continue to increase Lambda function timeout to wait for Athena to return result (if in the case that Athena needs >15 minutes to return result, for example)
Can you guys suggest for me some better solution to solve my problem? I am thinking of using ELK stack but I have no experience in working with ELK, can you show me the advantages and disadvantages of ELK compared to the combo: AWS Lambda + AWS Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have segments /column=value/. However, ALB access logs use a simpler yyyy/mm/dd partitioning Scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload it. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

Lambda function to query AWS Athena gives timeout

I have written a Lambda function using athena-express that queries AWS Athena with S3 Parquet files as destination. Also using AWS Glue for ETL processes to S3.
I am receiving: "Task timed out after 6.01 seconds". Increasing the timeout gives me the same message but with higher amount of timeout seconds. The strange thing is that it works locally, but not when it's deployed to AWS. Code should work. I think something is missing in AWS configurations.
I have followed the instructions and using lambda settings: https://www.npmjs.com/package/athena-express.
I have added AmazonAthenaFullAccess and AmazonS3FullAccess policies to the Execution role for the Lambda function.
When you say it works locally I assume that when you run it locally you have no timeout on the code. Athena is not a low latency database, and queries very rarely run in less than a few seconds, and often in minutes – it all depends on how much data they scan and other factors.
Your Lambda function must have a timeout that is longer than the longest run time of the queries you run with it. If those queries run for more than a minute, the timeout must be more than a minute, if they run for more than ten minutes, your timeout must be more than ten minutes, and so on.
In general, long running queries is not a great fit for Lambda and API Gateway that are made for more low latency situations. Since your queries seem to run for minutes I suggest that you look other ways of solving your problem. If you provide more context about what you are trying to do, we may be able to help you with that.

Lambda architecture on AWS: choose database for batch layer

We're building Lambda architecture on AWS stack. A lack of devops knowledge forces us to prefer AWS managed solution over custom deployments.
Our workflow:
[Batch layer]
Kinesys Firehouse -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
|===> Serving layer (ECS) => Users
Kinesys -> EMR (Spark Streaming) -> DynamoDB/ElasticCache views --+
[Speed layer]
We have already using 3 datastores: ElasticCache, DynamoDB and S3 (queried with Athena). Bach layer produce from 500,000 up to 6,000,000 row each hour. Only last hour results should be queried by serving layer with low latency random reads.
Neither of our databases fits batch-insert & random-read requirements. DynamoDB not fit batch-insert - it's too expensive because of throughput required for batch inserts. Athena is MPP and moreover has limitation of 20 concurrent queries. ElasticCache is used by streaming layer, not sure if it's good idea to perform batch inserts there.
Should we introduce the fourth storage solution or stay with existing?
Considered options:
Persist batch output to DynamoDB and ElasticCache (part of data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data ~8GB/day goes to elasticCache).
Introduce another database (HBase on EMR over S3/ Amazon redshift?) as a solution
Use S3 Select over parquet to overcome Athena concurrent query limits. That will also reduce query latency. But have S3 Select any concurrent query limits? I can't find any related info.
The first option is bad because of batch insert to ElasticCache used by streaming. Also does it follow Lambda architecture - keeping batch and speed layer views in the same data stores?
The second solution is bad because of the fourth database storage, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.

How to get sub 10ms response times from AWS DynamoDB?

In the DynamoDB documentation and in many places around the internet I've seen that single digit ms response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro ec2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the aws cli on the ec2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console it says the average get latency is 12 ms though. If anyone could tell me what I'm doing wrong or point me in the direction of information where I can solve this on my own I would really appreciate it. Thanks in advance.
The response times you are seeing are largely do to the cold start times of the aws cli. When running your get-item command the cli has to get loaded into memory, fetch temporary credentials (if using an ec2 iam role when running on your t2.micro instance), and establish a secure connection to the DynamoDB service. After all that is completed then it executes the get-item request and finally prints the results to stdout. Your command is also introducing a need to read the key.json file off the filesystem, which adds additional overhead.
My experience running on a t2.micro instance is the aws cli has around 200ms of overhead when it starts, which seems inline with what you are seeing.
This will not be an issue with long running programs, as they only pay a similar overhead price at start time. I run a number of web services on t2.micro instances which work with DynamoDB and the DynamoDB response times are consistently sub 20ms.
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first thing to consider is distance and speed of light. Expect to get the best latency when accessing DynamoDB when you are using an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of the latency will be the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using a Keep Alive for the connection you will only pay this overheard for the first query. Successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency that DynamoDB reports in the web console is the average. While DynamoDB does provide reliable average low double digit latency, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).