I am firing thousands of load jobs concurrently at different BigQuery tables. Some of them execute instantly while others are queued. I was wondering how many load jobs can run concurrently, and whether there is a way to run more of them immediately.
As seen in the documentation, the limit is 100 concurrent queries; to raise it you need to contact support or sales:
The following limits apply to query jobs created automatically by running interactive queries and to jobs submitted programmatically using jobs.query and query-type jobs.insert method calls.
Concurrent rate limit for on-demand, interactive queries — 100 concurrent queries
Queries with results that are returned from the query cache, and dry run queries do not count against this limit. You can specify a dry run query using the --dry_run flag or by setting the dryRun property in a query job.
This limit is applied at the project level. To raise the limit, contact support or contact sales.
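Until the limit is raised, one client-side workaround is to pace submission so no more than the allowed number of jobs is in flight at once. A minimal sketch, assuming a limit of 100; `submit` and `wait_all` are placeholders for your actual BigQuery client calls (e.g. starting a job and then calling its `result()`):

```python
from typing import Any, Callable, List


def paced_batches(jobs: List[Any], max_concurrent: int = 100) -> List[List[Any]]:
    """Split pending jobs into batches that never exceed the concurrency limit."""
    return [jobs[i:i + max_concurrent] for i in range(0, len(jobs), max_concurrent)]


def run_all(jobs: List[Any], submit: Callable, wait_all: Callable,
            max_concurrent: int = 100) -> None:
    """Submit one batch at a time, waiting for it to drain before starting the next."""
    for batch in paced_batches(jobs, max_concurrent):
        handles = [submit(job) for job in batch]
        wait_all(handles)
```

Draining each batch fully is conservative; a tighter variant would top up the pool as individual jobs finish.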
I have around 300 GB of data on S3. Let's say the data looks like:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, their resources are wasted, while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
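For illustration, the one-job-per-country-per-day fan-out can be derived mechanically from the key layout above (assuming the `Countries/<country>/<month>/<day>/<n>.csv` structure shown in the example keys):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_jobs(uris: List[str]) -> Dict[Tuple[str, str, str], List[str]]:
    """Group S3 URIs of the form S3://Countries/<country>/<month>/<day>/<n>.csv
    into one job per (country, month, day)."""
    jobs = defaultdict(list)
    for uri in uris:
        # strip the "S3://" scheme, then split the key into its path parts
        _, country, month, day, _ = uri.split("://", 1)[1].split("/")
        jobs[(country, month, day)].append(uri)
    return dict(jobs)
```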
Now roughly 1,000 jobs start together, and when they query Athena to read the data, the containers fail because they hit Athena's query limits.
Therefore, I would like to know: what are the other possible ways to tackle this problem? Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have query limitations? But that is expensive and takes a lot of time to ramp up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again it will hit the query limitation.
It would be great if someone can give better options to tackle this problem.
As I understand it, you simply send a query to the AWS Athena service, and after all aggregation steps finish you retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1,000 files (one for each job). The problem is the number of concurrent Athena queries, not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retrying logic. It even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending a query is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:
    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for similar daily/weekly tasks (processing data with CTAS statements) which exceed the limit on the number of concurrent queries.
There are plenty of blog posts and documentation pages that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack for sending a notification when your queries terminate, in either a success or failure state.
However, the main drawback I am facing is that only 4-5 queries actually get executed at the same time, whereas all the others just idle.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
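For illustration, the gluing can be mechanical: generate one SELECT per (country, day) pair and join them with UNION ALL, sorting so the combined result is easy to split back apart afterwards. The table and column names below are invented, not from the question:

```python
from typing import List, Tuple


def build_union_query(partitions: List[Tuple[str, str]], table: str = "events") -> str:
    """One SELECT per (country, day), glued with UNION ALL and sorted so the
    combined result can be split back into per-partition pieces."""
    selects = [
        f"SELECT country, day, agg_value FROM {table} "
        f"WHERE country = '{country}' AND day = '{day}'"
        for country, day in partitions
    ]
    return "\nUNION ALL\n".join(selects) + "\nORDER BY country, day"
```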
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
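The poll loop described above reduces to a small piece of control logic. In this sketch `start_query` and `get_state` are injected stand-ins for Athena's StartQueryExecution and GetQueryExecution calls (inside a Lambda they would wrap boto3); the helper names are made up:

```python
import time
from typing import Callable, List


def run_with_limit(queries: list, start_query: Callable, get_state: Callable,
                   max_running: int = 20, poll_interval: float = 0) -> List:
    """Keep at most max_running queries in flight until all complete.
    start_query(q) returns an execution id; get_state(id) returns
    'RUNNING', 'SUCCEEDED', or 'FAILED'."""
    pending = list(queries)
    running, done = [], []
    while pending or running:
        # top up the pool to the concurrency limit
        while pending and len(running) < max_running:
            running.append(start_query(pending.pop(0)))
        # move finished executions out of the running pool
        still_running = []
        for qid in running:
            state = get_state(qid)
            (still_running if state == "RUNNING" else done).append(qid)
        running = still_running
        if running and poll_interval:
            time.sleep(poll_interval)
    return done
```

In a Step Functions workflow, a Wait state would replace the `time.sleep` between iterations.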
As far as I know, Redshift Spectrum and Athena cost the same. You should not compare Redshift to Athena; they have different purposes. But first of all, I would think about addressing your data skew issue. Since you mentioned AWS EMR, I assume you use Spark. To deal with large and small partitions, you need to repartition your dataset by month, or some other equally distributed value. Or you can use month and country for grouping. You get the idea.
You can use Redshift Spectrum for this purpose. Yes, it is a bit costly, but it is scalable and very good for performing complex aggregations.
I'm launching several concurrent queries to Athena via a Python application.
Given Athena's history of queries, it seems that multiple queries are all indeed received at the same time by Athena, and processed concurrently.
However, it turns out that the overall query running time is not that different from sending queries one after the other.
Example: sending three queries sequentially vs concurrently:
# sequentially
           received at   took   finished at
query_1    22:01:14      6s     22:01:20
query_2    22:01:20      6s     22:01:27
query_3    22:01:27      5s     22:01:32

# concurrently
           received at   took   finished at
query_1    22:02:25      17s    22:02:42
query_2    22:02:25      17s    22:02:42
query_3    22:02:25      17s    22:02:42
According to these results, in the second case it seems that Athena, although appearing to treat the queries concurrently, effectively processed them sequentially.
Is there some configuration I wouldn't be aware of, to make Athena effectively process multiple queries concurrently? Ideally, in this example, the three queries processed concurrently would take a global running time of 6s (the longest time of the three individual queries).
Note: these are three queries targeting the same database/table, backed by the same (single) Parquet file in S3. This Parquet file is approx. 70 MB and has 2.5M rows with a half-dozen columns.
In general the way you run concurrent queries in Athena is to run as many StartQueryExecution calls as you need, collect the query execution IDs, and then poll using GetQueryExecution for each one to be completed. Athena runs each query independently, concurrently, and asynchronously.
Depending on how long you wait between polling each query execution ID it may look like queries take different amounts of time. You can use the Statistics.EngineExecutionTimeInMillis property of the response from GetQueryExecution to see how long the query executed in Athena, and the difference between the Status.SubmissionDateTime and Status.CompletionDateTime properties to see the total time between when Athena received the query and when the response was available. Usually these two numbers are very close, and if there is a difference your query got queued internally in Athena.
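That internal queueing time can be derived from a response shaped like GetQueryExecution's output. The field names below match the API; the example values anywhere you run this would be your own:

```python
from datetime import datetime  # the Status fields come back as datetime objects


def queue_time_ms(execution: dict) -> float:
    """Wall-clock time minus engine execution time, i.e. roughly how long the
    query sat queued inside Athena. `execution` mirrors the shape of a
    GetQueryExecution response."""
    status = execution["Status"]
    wall_ms = (status["CompletionDateTime"]
               - status["SubmissionDateTime"]).total_seconds() * 1000
    return wall_ms - execution["Statistics"]["EngineExecutionTimeInMillis"]
```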
The numbers in your question look unlikely. That the queries ended at the exact same second after running for 17 seconds looks suspicious. How many times did you run the experiment? If you look at Statistics.EngineExecutionTimeInMillis, do the values differ in the number of milliseconds, or are they all identical? Did you set ClientRequestToken, and if so, was it the same value for all three queries? (In that case you actually ran only one query.)
What do you mean by "concurrently", do you start and poll from different threads, or poll in a single loop? How long did you wait between each poll call?
I am trying to understand the difference between concurrent connections and concurrent queries in Redshift. As per the documentation, we can make 500 concurrent connections to a Redshift cluster, but it says a maximum of 15 queries can be run at the same time in a cluster. Now what is the exact value?
How many queries can be in a running state in a cluster at the same time? If it is 15, does that include queries in the RETURNING state as well?
How many concurrent COPY statements can run in a cluster?
We are evaluating Redshift as our primary reporting data store. If we cannot run a large number of queries simultaneously it may be difficult for us to go with this model.
I think you have misread something; the maximum is 50 concurrent queries per WLM queue. Refer to the thread below for the Amazon support response and more detail.
How many queries can be in a running state in a cluster at the same time? If it is 15, does that include queries in the RETURNING state as well?
At most 50 queries can run concurrently at a time. Yes, that includes INSERT/UPDATE/DELETE, etc.
How many concurrent COPY statements can run in a cluster?
Ideally, you could go up to 50 concurrently, but COPY works a bit differently.
Amazon Redshift automatically loads in parallel from multiple data files.
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading Data from Amazon S3.
Meaning, you can run concurrent COPY commands, but make sure it is one COPY command at a time per table.
So practically, it doesn't depend only on the nodes in the cluster, but on the number of tables as well.
So if you have only one table and you would like to execute 50 loads concurrently, it will result in only one COPY running at a time.
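Following that advice, a loader can collapse its pending files into one COPY per table before submitting anything. A minimal sketch, assuming loads arrive as `(table, s3_path)` pairs:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_loads(loads: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collapse pending (table, s3_path) pairs into one load per table, so
    concurrent COPYs never target the same table."""
    per_table = defaultdict(list)
    for table, path in loads:
        per_table[table].append(path)
    return dict(per_table)
```

Each group would then become a single COPY, e.g. via a shared key prefix or a MANIFEST file, letting Redshift parallelize across the files instead of serializing COPYs against the same table.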
I have failures in my system due to an Athena limit: Athena allows only 5 concurrent queries.
My solution is for my system to check the current number of queries running concurrently in Athena; if it is less than 5, execute my query, otherwise wait for some time and check again.
Is there an Athena API which returns the current number of queries running concurrently in Athena?
Note
I can increase the limit with a request, but I want to keep that as my last option; plus, it is not a scalable solution.
The only way to achieve this is through multiple AWS API calls.
First, you have to call the aws athena list-query-executions command to get the recent execution IDs.
Afterwards you can query the status of those IDs through the following call:
aws athena batch-get-query-execution --query-execution-ids {ids}
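If you drive this from Python rather than the CLI, it helps to keep the counting logic separate from the API calls. In this sketch `get_status` is an injected lookup that would wrap boto3's `batch_get_query_execution` (which accepts at most 50 IDs per call); the helper names are made up:

```python
from typing import Callable, Iterable


def count_running(execution_ids: Iterable[str], get_status: Callable[[str], str]) -> int:
    """Number of executions still queued or running; get_status(id) returns
    Athena's state string for that execution."""
    return sum(1 for qid in execution_ids
               if get_status(qid) in ("QUEUED", "RUNNING"))


def can_submit(execution_ids: Iterable[str], get_status: Callable[[str], str],
               limit: int = 5) -> bool:
    """True when there is headroom under the concurrency limit."""
    return count_running(execution_ids, get_status) < limit
```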
You can also retry on throttling exceptions. Using exponential backoff with jitter gives a good chance of queries succeeding on subsequent attempts.
See the related AWS blog post.
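The exponential backoff with jitter mentioned above is often implemented as "full jitter": sleep a random amount between zero and an exponentially growing cap. A sketch (the parameter values are illustrative, not prescribed by any AWS API):

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full-jitter backoff: attempt i sleeps a random amount between
    0 and min(cap, base * 2**i)."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]


def with_retries(call: Callable, is_throttle: Callable,
                 delays: Optional[List[float]] = None):
    """Run `call`, retrying on throttling errors with the given delays;
    the final attempt lets any error propagate."""
    for delay in (delays if delays is not None else backoff_delays()):
        try:
            return call()
        except Exception as exc:
            if not is_throttle(exc):
                raise
            time.sleep(delay)
    return call()
```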
I am running a functional test of a 3rd party application on an Azure SQL Data Warehouse database set at DWU 1000. In reviewing the current activity via:
sys.dm_pdw_exec_requests
I see:
prepare statements taking 30+ seconds,
NULL statements taking up to 25 seconds,
compilation of statements takes up to 60 seconds,
explain statements taking 60+ seconds, and
select count(1) from empty tables take 60+ seconds.
How does one identify the bottleneck involved?
The test has been running for a few hours and the Azure portal shows little DWU consumed on average, so I doubt that modifying the DWU will make any difference.
The third-party application has a workload management feature, so I've specified a limit of 30 connections to the ADW database (understanding that only 32 sessions can be active on the database itself).
There are approximately 1,800 tables and 350 views in the database across 29 schemas (per information_schema.tables).
I am in a functional testing mode, so many of the tables involved in the queries have not yet been loaded, but statistics have been created on every column on every table in the scope of the test.
One userID is being used in the test. It is in smallrc.
Have a look at the tables in your query - make sure all columns used in joins, GROUP BY, and ORDER BY have up-to-date statistics.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics