How to Decrease Query Compile Time in Redshift - amazon-web-services

I have seen that the first time query execution taking longer time to execute but second execution takes less time, seems like query compile time is taking longer time at first, can we do anything here which will increase the performance of compile time ?
Scenario:
enable_result_cache_for_session is off
We have SLA defined to execute specific query is 15 seconds but when run for the first time it is taking 33 seconds to compile and run the query that time SLA is miss but subsequent run took 10 seconds which is SLA hit.
Q: How do I tune this part ? How do I make sure this does not happens ?
Do we have any database configuration parameter for the same?

The title of the question says compile time but I understand that you are interested in improving the execution time, right?
For sure the John Rotenstein comment makes total sense, to improve the Redshift execution query time you need to understand the RS architecture and how to distribute your data in the best way you can to improve the queries time.
You will need to understand the DISTKEY and SORTKEY
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/#dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribuition Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/#dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
UPDATE 1:
In order you tune query and know how/when use DISTKEY and SORTKEY, we can start using the EXPLAIN command in the query you run and based on that act more precisely.
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d

Related

Redshift: experiencing slow query performance between 2 segments

We’re experiencing slow query performance on AWS Redshift. Frequently we see that queries can take ±12 seconds to run, but only very little time (<500ms) is spent actually executing the query (according to the AWS Redshift console for an individual query).
Querying from svl_compile we can confirm that the query compilation plan is already compiled.
In svl_query_report we see a long time delay between the start times of 2 segments accounting for the majority of the run time, although the segments themselves all execute very quickly (milliseconds)
There are a number of things that could be going on but I suspect network distribution is involved. Check STL_DIST.
Another possibility is that Redshift broke the query up and a subquery is running during that window. This can happen with very complex queries. Review the plan and see if there are any references to computer generated table names (I think they begin with't' but this is just from memory).
Spilling to disk could be happening but this seems unlikely given what you have said so far. Also queuing delays doesn't seem like a match. Both are possible but not likely.
If you post more info about how the query is running things will narrow down. Actual execution report, explain plan, and/or logging table info would help hone in on what is happening during this time window.

Google Bigquery UPDATE row takes too long - need a improve performance solution

this update takes 194 seconds for 220mln rows.
Is there a way to improve that?
#standardSQL
UPDATE dataset.people SET CBSA_CODE = '54620' where substr(zip,1,5) = '99047'
When asking for performance help, it is useful to include a screenshot of the Execution Plan from the BigQuery UI to see which stages are the most intensive, and where the time was spent. Without that, though, I suspect that this small optimization should help:
UPDATE dataset.people SET CBSA_CODE = '54620' WHERE zip LIKE '99047%'
BigQuery should be able to push this filter down to its storage system, since it's a more natural way to express string containment, so if you see a high "Read" time in the Execution Plan for the original query, this might reduce it.

First-run of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating this problem, the query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).
One of the things that we tried was increasing the number of nodes we had, however we didn't expect it to solve anything seeing as how query compilation is a single-node operation anyway. It did not solve anything but it was a fun diversion for a bit.
As noted, this is a known issue, however, anywhere it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".
Is there anything we can do to speed up the compilation process or otherwise deal with this? So far about the best solution that's been found is "pre-run every query you might expect to run in a given day on a schedule" which is....not great, especially given how little we know about how the query cache works.
there are 3 things to consider
The first run of any query causes the query to be "compiled" by
redshift . this can take 2-20 seconds depending on how big it is.
subsequent executions of the same query use the same compiled code,
even if the where clause parameters change there is no re-compile.
Data is measured as marked as "hot" when a query has been run
against it, and is cached in redshift memory. you cannot (reliably) manually
clear this in any way EXCEPT a restart of the cluster.
Redshift will "results cache", depending on your redshift parameters
(enabled by default) redshift will quickly return the same result
for the exact same query, if the underlying data has not changed. if
your query includes current_timestamp or similar, then this will
stop if from caching. This can be turned off with SET enable_result_cache_for_session TO OFF;.
Considering your issue, you may need to run some example queries to pre compile or redesign your queries ( i guess you have some dynamic query building going on that changes the shape of the query a lot).
In my experience, more nodes will increase the compile time. this process happens on the master node not the data nodes, and is made more complex by having more data nodes to consider.
The query is probably not actually running a second time -- rather, Redshift is just returning the same result for the same query.
This can be tested by turning off the cache. Run this command:
SET enable_result_cache_for_session TO OFF;
Then, run the query twice. It should take the same time for each execution.
The result cache is great for repeated queries. Rather than being disappointed that the first execution is 'slow', be happy that subsequent cached queries are 'fast'!

SAS PROC SQL: How to clear cache between testing

I am reading this paper: "Need for Speed - Boost Performance in Data Processing with SAS/Access® Interface to Oracle". And I would like to know how to clear the cache / buffer in SAS, so my repeated query / test will be reflective of the changes accurately?
I noticed the same query running the first time takes 10 seconds, and (without) changes running it immediately after will take shorter time (say 1-2 seconds). Is there a command / instruction to clear the cache / buffer. So I can have a clean test for my new changes.
I am using SAS Enterprise Guide with data hosted on an Oracle server. Thanks!
In order to flush caches on the Oracle side, you need both DBA privileges (to run alter system flush buffer_cache; in Oracle) and OS-level access (to flush the OS' buffer cache - echo 3 > /proc/sys/vm/drop_caches on common filesystems under Linux).
If you're running against a production database, you probably don't have those permissions -- you wouldn't want to run those commands on a production database anyways, since it would degrade the performance for all users of the database, and other queries would affect the time it takes to run yours.
Instead of trying to accurately measure the time it takes to run your query, I would suggest paying attention to how the query is executed:
what part of it is 'pushed down' to the DB and how much data flows between SAS and Oracle
what is Oracle's explain plan for the query -- does it have obvious inefficiencies
When a query is executed in a clearly suboptimal way, you will find (more often than not) that the fixed version will run faster both with cold and hot caches.
To apply this to the case you mention (10 seconds vs 2 seconds) - before thinking how to measure this accurately, start by looking
if your query gets correctly pushed down to Oracle (it probably does),
and whether it requires a full table (partition) scan of a sufficiently large table (depending on how slow the IO in your DB is - on the order of 1-10 GB).
If you find that the query needs to read 1 GB of data and your typical (in-database) read speed is 100MB/s, then 10s with cold cache is the expected time to run it.
I'm no Oracle expert but I doubt there's any way you can 'clear' the oracle cache (and if there were you would probably need to be a DBA to do so).
Typically what I do is I change the parameters of the query slightly so that the exact query no longer matches anything in the cache. For example, you could change the date range you are querying against.
It won't give you an exact performance comparison (because you're pulling different results) but it will give you a pretty good idea if one query performs significantly better than the other.

How to find resource intensive and time consuming queries in WX2?

Is there a way to find the resource intensive and time consuming queries in WX2?
I tried to check SYS.IPE_COMMAND and SYS.IPE_TRANSACTION tables but of no help.
The best way to identify such queries when they are still running is to connect as SYS with Kognitio Console and use Tools | Identify Problem Queries. This runs a number of queries against Kognitio virtual tables to understand how long current queries have been running, how much RAM they are using, etc. The most intensive queries are at the top of the list, ranked by the final column, "Relative Severity".
For queries which ran in the past, you can look in IPE_COMMAND to see duration but only for non-SELECT queries - this is because SELECT queries default to only logging the DECLARE CURSOR statement, which basically just measures compile time rather than run time. To see details for SELECT queries you should join to IPE_TRANSACTION to find the start and end time for the transaction.
For non-SELECT queries, IPE_COMMAND contains a breakdown of the time taken in a number of columns (all times in ms):
SM_TIME shows the compile time
TM_TIME shows the interpreter time
QUEUE_TIME shows the time the query was queued
TOTAL_TIME aggregates the above information
If it is for historic view image commands as mentioned in the comments, you can query
... SYS.IPE_COMMAND WHERE COMMAND IMATCHING 'create view image' AND TOTAL_TIME > 300000"
If it is for currently running commands you can look in SYS.IPE_CURTRANS and join to IPE_TRANSACTION to find the start time of the transaction (assuming your CVI runs in its own transaction - if not, you will need to look in IPE_COMMAND to find when the last statement in this TNO completed and use that as the start time)