I am looking for a detailed explanation of how to understand the execution plan of my query on BigQuery. Specifically:
How do I read the execution plan to understand which part of the query is taking the most time and needs to change?
How do I get slot information, and how do I use it to improve my query?
What is the difference between the actual time taken by the query and the total slot time (which is much larger)?
Please help.
I tried to go through https://cloud.google.com/bigquery/docs/query-overview but it did not help much.
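For example, is querying INFORMATION_SCHEMA the right way to get slot usage and per-stage timings? A rough sketch of what I mean, assuming the job ran in the US multi-region (the job ID below is a placeholder):
SELECT
  job_id,
  total_slot_ms,                                          -- total slot-milliseconds consumed by the job
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
  stage.name AS stage_name,
  stage.slot_ms AS stage_slot_ms,
  stage.wait_ms_avg,
  stage.read_ms_avg,
  stage.compute_ms_avg,
  stage.write_ms_avg
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(job_stages) AS stage
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_id = 'bquxjob_example_123'                      -- placeholder job ID
ORDER BY stage.slot_ms DESC;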
I have tried searching for suggestions and solutions, but I am unable to find any.
After reading some blog posts, I was able to build a time series anomaly detection model using BigQuery ML (ARIMA Plus).
My question is: how do I put such a model in production?
Probably I need to:
program the re-training of the model every X days
check whether there are new anomalies on the object table every X hours
record those anomalies in another table
But I am also open to other suggestions on how to proceed.
Is there anyone out there who can give me a hint?
Thank you!
The best way I found is to create "scheduled queries":
schedule a query for re-training of the model every X days:
CREATE OR REPLACE MODEL `mymodel`
OPTIONS(
  model_type = 'arima_plus',
  time_series_data_col = 'events',
  time_series_timestamp_col = 'approx_hour',
  holiday_region = 'GLOBAL',
  clean_spikes_and_dips = FALSE,
  decompose_time_series = TRUE
)
AS (
  SELECT
    TIMESTAMP_TRUNC(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time), HOUR) AS approx_hour,
    COUNT(1) AS events
  FROM `mytable`
  GROUP BY approx_hour
);
schedule a query to perform anomaly detection on the latest events and write any anomalies to a table (a quick review query is sketched after it):
INSERT INTO `events_anomalies_table`
SELECT
  approx_hour AS hour,
  CAST(events AS INT64) AS actual_events,
  CAST(lower_bound AS INT64) AS expected_min_events,
  CAST(upper_bound AS INT64) AS expected_max_events,
  CURRENT_TIMESTAMP() AS execution_timestamp
FROM ML.DETECT_ANOMALIES(
  MODEL `mymodel`,
  STRUCT(0.98 AS anomaly_prob_threshold),
  (
    SELECT
      TIMESTAMP_TRUNC(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time), HOUR) AS approx_hour,
      COUNT(1) AS events
    FROM `mytable`
    WHERE PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    GROUP BY approx_hour
    LIMIT 1
  ))
WHERE is_anomaly = TRUE
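Once anomalies are being written, a quick review query can be scheduled (or run ad hoc) against the same table; this is just a sketch using the column names from the insert above:
SELECT hour,
       actual_events,
       expected_min_events,
       expected_max_events,
       execution_timestamp
FROM `events_anomalies_table`
WHERE execution_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)  -- anomalies flagged in the last day
ORDER BY hour DESC;
After each re-training you can also sanity-check the model with ML.ARIMA_EVALUATE before relying on its anomaly output.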
We’re experiencing slow query performance on AWS Redshift. Frequently we see that queries can take ±12 seconds to run, but only very little time (<500ms) is spent actually executing the query (according to the AWS Redshift console for an individual query).
Querying from svl_compile we can confirm that the query compilation plan is already compiled.
In svl_query_report we see a long delay between the start times of two segments, accounting for the majority of the run time, although the segments themselves all execute very quickly (milliseconds).
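This is roughly the query we are running against svl_query_report to see the per-segment timings (the query ID is a placeholder):
SELECT segment,
       MIN(start_time) AS segment_start,
       MAX(end_time)   AS segment_end
FROM svl_query_report
WHERE query = 123456   -- placeholder query ID
GROUP BY segment
ORDER BY segment;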
There are a number of things that could be going on, but I suspect network distribution is involved. Check STL_DIST (a rough sketch of that check is at the end of this answer).
Another possibility is that Redshift broke the query up and a subquery is running during that window. This can happen with very complex queries. Review the plan and see if there are any references to computer-generated table names (I think they begin with 't', but this is just from memory).
Spilling to disk could be happening, but this seems unlikely given what you have said so far. Queuing delays also don't seem like a match. Both are possible but not likely.
If you post more info about how the query is running, things will narrow down. The actual execution report, explain plan, and/or logging table info would help home in on what is happening during this time window.
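For example, a rough sketch of checking STL_DIST for one query (the query ID is a placeholder, and I'm going from memory on the column names):
SELECT query,
       segment,
       step,
       SUM(rows)    AS rows_distributed,
       SUM(bytes)   AS bytes_distributed,
       SUM(packets) AS total_packets
FROM stl_dist
WHERE query = 123456   -- placeholder query ID
GROUP BY query, segment, step
ORDER BY bytes_distributed DESC;
A large amount of bytes moved around the time of the gap would point at distribution as the culprit.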
I have seen that the first execution of a query takes longer, but the second execution takes less time. It seems like the query compile time is what takes longer on the first run. Can we do anything here to improve compile-time performance?
Scenario:
enable_result_cache_for_session is off
We have an SLA of 15 seconds for a specific query, but when it is run for the first time it takes 33 seconds to compile and run, which misses the SLA; subsequent runs take 10 seconds, which meets it.
Q: How do I tune this part? How do I make sure this does not happen?
Is there a database configuration parameter for this?
The title of the question says compile time, but I understand that you are interested in improving the execution time, right?
As John Rotenstein's comment says, to improve Redshift query execution time you need to understand the Redshift architecture and how to distribute your data in the best way you can to improve query times.
You will need to understand DISTKEY and SORTKEY (a small illustrative example follows the links below).
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/@dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribution Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/@dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
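Just to illustrate what declaring these keys looks like, here is a minimal sketch (the table and columns are made up for the example, not taken from your workload):
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id to reduce network redistribution
SORTKEY (sale_date);    -- allows range-restricted scans when filtering on date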
UPDATE 1:
In order to tune your query and know how and when to use DISTKEY and SORTKEY, you can start by using the EXPLAIN command on the query you run and, based on that, act more precisely (a sketch follows the links below).
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d
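And a minimal sketch of using EXPLAIN against the made-up table above (plus a hypothetical customers table); in the output, look for redistribution steps such as DS_BCAST_INNER or DS_DIST_BOTH, since those are usually the expensive parts that a good DISTKEY choice removes:
EXPLAIN
SELECT c.customer_id,
       SUM(s.amount) AS total_amount
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id   -- join on the distribution key
WHERE s.sale_date >= '2023-01-01'
GROUP BY c.customer_id;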
This UPDATE takes 194 seconds for 220 million rows.
Is there a way to improve that?
#standardSQL
UPDATE dataset.people SET CBSA_CODE = '54620' where substr(zip,1,5) = '99047'
When asking for performance help, it is useful to include a screenshot of the Execution Plan from the BigQuery UI to see which stages are the most intensive, and where the time was spent. Without that, though, I suspect that this small optimization should help:
UPDATE dataset.people SET CBSA_CODE = '54620' WHERE zip LIKE '99047%'
BigQuery should be able to push this filter down to its storage system, since it's a more natural way to express this prefix match, so if you see a high "Read" time in the Execution Plan for the original query, this might reduce it.
I understand SimpleDB doesn't have an auto-increment, but I am working on a script where I need to query the database by sending the ID of the last record I've already pulled and then pull all subsequent records. In normal SQL fashion, if there were 6200 records and I already had 6100 of them, when I run the script I would query for records with an ID greater than 6100. Looking at the response object, I don't see anything I can use; it just seems like there should be a sequential index there. The other option I was thinking of would be a real timestamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way: http://aws.amazon.com/articles/1232 I would still welcome input if anyone knows of a way to get an incremental index number.
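For anyone finding this later, the Select expression I ended up with looked roughly like this (the domain and attribute names are made up, the timestamp literal stands in for the value of the last record already pulled, and the timestamp has to be stored in a lexicographically sortable format such as ISO 8601 for the comparison to work):
select * from mydomain
where created_at > '2013-05-01T00:00:00Z'
order by created_at asc
limit 100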