Google BigQuery UPDATE takes too long - need a way to improve performance

This update takes 194 seconds for 220 million rows.
Is there a way to improve that?
#standardSQL
UPDATE dataset.people SET CBSA_CODE = '54620' WHERE SUBSTR(zip, 1, 5) = '99047'

When asking for performance help, it is useful to include a screenshot of the Execution Plan from the BigQuery UI to see which stages are the most intensive, and where the time was spent. Without that, though, I suspect that this small optimization should help:
UPDATE dataset.people SET CBSA_CODE = '54620' WHERE zip LIKE '99047%'
BigQuery should be able to push this filter down to its storage system, since LIKE with a trailing wildcard is a more natural way to express a prefix match, so if you see a high "Read" time in the Execution Plan for the original query, this change might reduce it.
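If you want numbers to go with the plan, one option is to look at recent jobs in INFORMATION_SCHEMA (a sketch; the region qualifier and the filters are assumptions to adapt to your project):

SELECT job_id,
       total_slot_ms,          -- rough proxy for compute spent
       total_bytes_processed   -- how much the job actually scanned
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = 'UPDATE'
ORDER BY creation_time DESC
LIMIT 10;

Comparing these figures across the two query versions gives a rough read on where the time and scan went.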

Related

How to Decrease Query Compile Time in Redshift

I have seen that the first execution of a query takes longer, while the second execution takes less time. It seems the query compile time is the bottleneck on the first run. Can we do anything here to improve compile-time performance?
Scenario:
enable_result_cache_for_session is off
We have an SLA of 15 seconds for a specific query, but on the first run it takes 33 seconds to compile and execute, which misses the SLA; subsequent runs take 10 seconds, which meets it.
Q: How do I tune this? How do I make sure this does not happen?
Is there a database configuration parameter for this?
The title of the question says compile time, but I understand that you are interested in improving the execution time, right?
John Rotenstein's comment makes total sense: to improve Redshift query execution time, you need to understand the Redshift architecture and how to distribute your data in the best way you can to improve query times.
You will need to understand DISTKEY and SORTKEY.
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/@dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribuition Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/@dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
UPDATE 1:
In order to tune your queries and know how/when to use DISTKEY and SORTKEY, start by running the EXPLAIN command on the query and, based on its output, act more precisely.
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d
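To make those links concrete, here is a minimal sketch (table and column names are hypothetical) of declaring a distribution key and sort key, then checking the plan with EXPLAIN:

-- Distribute on the usual join key, sort on the column queries filter on
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

-- Look for DS_BCAST_INNER / DS_DIST_BOTH steps in the output:
-- they indicate expensive data redistribution between nodes
EXPLAIN
SELECT customer_id, SUM(amount)
FROM fact_sales
WHERE sale_date >= '2020-01-01'
GROUP BY customer_id;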

Performant way to handle arrays in Athena/Quicksight

I currently have a large set of JSON data that I'd like to import into Amazon Athena for visualization in Amazon QuickSight. In each JSON document there are two fields: one is a comma-separated string of ids (orderlist), and the other is an array of strings (locations). Because QuickSight doesn't support array searching, I'm currently resorting to creating a view where I generate cross joins across the two arrays:
SELECT id,
       TRY_CAST(orderid AS bigint) AS orderid_targeting,
       location
FROM advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) AS x (orderid)
CROSS JOIN UNNEST(locations) AS t (location)
With two cross joins, this can explode out the data to 20x-30x the original size.
If I were working on individual queries on Athena, I could use Presto array functions to search through the arrays. Is there a better way to make these fields accessible for filtering on Quicksight?
You have two options: keep doing what you're doing or implement an ETL workflow where you periodically materialise the view, for example using CTAS. The latter has the added benefit that you can produce Parquet files, which could help speed up your queries.
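For example, a CTAS along these lines materialises the view as Parquet (a sketch; the output table name and S3 location are assumptions):

CREATE TABLE advertising_materialised
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/advertising_materialised/'  -- assumed
) AS
SELECT id,
       TRY_CAST(orderid AS bigint) AS orderid_targeting,
       location
FROM advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) AS x (orderid)
CROSS JOIN UNNEST(locations) AS t (location);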
On the other hand, it's not as simple as it sounds. If you're lucky, you can use INSERT INTO to transform partitions from your current table into an optimised table after the point in time when they will no longer change – but in my experience, your most recent data usually keeps being updated during some window of time, and you still want to be able to query it during that window. In that situation the ETL process becomes much more complicated, since you need to remove data from the optimised table to avoid ending up with duplicates. It's not hard, it's just a lot of code and juggling of S3 and Glue Data Catalog operations so that you never have tables with duplicate data or too little data.
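If your raw table is partitioned by date, the happy-path version of that looks roughly like this (the dt partition column and its value are hypothetical):

-- Append one partition that is past its update window; re-running this
-- for the same partition would duplicate data, which is the juggling
-- mentioned above
INSERT INTO advertising_materialised
SELECT id,
       TRY_CAST(orderid AS bigint) AS orderid_targeting,
       location
FROM advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) AS x (orderid)
CROSS JOIN UNNEST(locations) AS t (location)
WHERE dt = '2020-01-01';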
Unless you feel like your current setup with the view is too slow, don't go implementing something big and complicated. Remember that you pay for bytes scanned in Athena, not for the amount of time Athena spends crunching your query. You get quite a lot of compute power running your queries, and in my experience there's rarely any point in micro-optimising queries: the gains you make are orders of magnitude lower than what you get from minimising the amount of data you process, either through clever partitioning or by moving to columnar file formats. Most of the time the gains from small optimisations are not measurable because of the error bars caused by Athena's query queue and waiting for S3 operations. You may get your query to run 50ms faster, but sometimes it gets queued for 500ms and spends another 2000ms doing list operations on S3 – so how can you tell?
If you decide to go down the materialisation route, first do it once using CTAS and run your QuickSight visualisation against the results. Don't implement the whole ETL workflow before you've checked that you get something that is significantly more performant.
If all you are worried about is that it's less performant to apply filters after the unnesting of your arrays than using array functions, write the two versions of the query and benchmark them against each other. I suspect array functions are going to be slightly faster – but for the same reasons I mentioned above, the gains may drown in the error bars caused by Athena's queuing and other operations.
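As a starting point, the two versions could look like this (the view name and filter value are hypothetical):

-- Version A: filter after unnesting, via the existing view
SELECT COUNT(*)
FROM advertising_unnested_view
WHERE location = 'us-east-1';

-- Version B: Presto array function on the raw table, no unnesting
SELECT COUNT(*)
FROM advertising_json
WHERE contains(locations, 'us-east-1');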
Make sure to benchmark at different points during the day, and be especially conscious of the fact that top-of-the-hour behaviour in Athena is extremely different from other times (run queries at 10:00 and then at 10:10 – your total execution times will be very different because everyone's cron jobs run at the top of the hour).

First-run of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating this problem, query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight-lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).
One of the things we tried was increasing the number of nodes; however, we didn't expect it to solve anything, seeing as query compilation is a single-node operation anyway. It did not solve anything, but it was a fun diversion for a bit.
As noted, this is a known issue, however, anywhere it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".
Is there anything we can do to speed up the compilation process or otherwise deal with this? So far about the best solution that's been found is "pre-run every query you might expect to run in a given day on a schedule" which is....not great, especially given how little we know about how the query cache works.
There are 3 things to consider:

1. The first run of any query causes the query to be "compiled" by Redshift. This can take 2-20 seconds depending on how big it is. Subsequent executions of the same query use the same compiled code; even if the WHERE clause parameters change, there is no re-compile.

2. Data is marked as "hot" when a query has been run against it, and is cached in Redshift memory. You cannot (reliably) manually clear this in any way EXCEPT a restart of the cluster.

3. Redshift will cache results. Depending on your Redshift parameters (it is enabled by default), Redshift will quickly return the same result for the exact same query if the underlying data has not changed. If your query includes current_timestamp or similar, this will stop it from caching. This can be turned off with SET enable_result_cache_for_session TO OFF;.
Considering your issue, you may need to run some example queries to pre-compile, or redesign your queries (I guess you have some dynamic query building going on that changes the shape of the query a lot).
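A scheduled warm-up along these lines is the usual workaround (the query body below is a placeholder for your real SLA query; only its shape matters, not the literal values):

-- Run off-peak to compile the query shape before users hit it.
-- Compiled code is reused even when the WHERE-clause literals differ,
-- so dummy values are fine here.
SET enable_result_cache_for_session TO OFF;

SELECT customer_id, SUM(amount)    -- hypothetical query shape
FROM fact_sales
WHERE sale_date >= '1900-01-01'    -- dummy literal
GROUP BY customer_id;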
In my experience, more nodes will increase the compile time. This process happens on the leader node, not the data nodes, and is made more complex by having more data nodes to consider.
The query is probably not actually running a second time -- rather, Redshift is just returning the same result for the same query.
This can be tested by turning off the cache. Run this command:
SET enable_result_cache_for_session TO OFF;
Then, run the query twice. It should take the same time for each execution.
The result cache is great for repeated queries. Rather than being disappointed that the first execution is 'slow', be happy that subsequent cached queries are 'fast'!

SAS PROC SQL: How to clear cache between testing

I am reading this paper: "Need for Speed - Boost Performance in Data Processing with SAS/Access® Interface to Oracle", and I would like to know how to clear the cache/buffer in SAS so that my repeated query tests accurately reflect my changes.
I noticed that the same query takes 10 seconds the first time it runs, and running it again immediately (without changes) takes less time (say 1-2 seconds). Is there a command/instruction to clear the cache/buffer, so I can have a clean test for my new changes?
I am using SAS Enterprise Guide with data hosted on an Oracle server. Thanks!
In order to flush caches on the Oracle side, you need both DBA privileges (to run alter system flush buffer_cache; in Oracle) and OS-level access (to flush the OS' buffer cache - echo 3 > /proc/sys/vm/drop_caches on common filesystems under Linux).
If you're running against a production database, you probably don't have those permissions -- you wouldn't want to run those commands on a production database anyways, since it would degrade the performance for all users of the database, and other queries would affect the time it takes to run yours.
Instead of trying to accurately measure the time it takes to run your query, I would suggest paying attention to how the query is executed:
what part of it is 'pushed down' to the DB and how much data flows between SAS and Oracle
what is Oracle's explain plan for the query -- does it have obvious inefficiencies
When a query is executed in a clearly suboptimal way, you will find (more often than not) that the fixed version will run faster both with cold and hot caches.
To apply this to the case you mention (10 seconds vs 2 seconds) - before thinking how to measure this accurately, start by looking
if your query gets correctly pushed down to Oracle (it probably does),
and whether it requires a full table (partition) scan of a sufficiently large table (depending on how slow the IO in your DB is - on the order of 1-10 GB).
If you find that the query needs to read 1 GB of data and your typical (in-database) read speed is 100MB/s, then 10s with cold cache is the expected time to run it.
I'm no Oracle expert, but I doubt there's any way you can 'clear' the Oracle cache (and if there were, you would probably need to be a DBA to do so).
Typically what I do is I change the parameters of the query slightly so that the exact query no longer matches anything in the cache. For example, you could change the date range you are querying against.
It won't give you an exact performance comparison (because you're pulling different results) but it will give you a pretty good idea if one query performs significantly better than the other.
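In SAS terms, that could look like the sketch below (the libref, table, and date range are hypothetical):

/* Shift the date range between test runs so the exact query no longer
   matches Oracle's caches or a previously fetched result */
proc sql;
    select count(*)
    from ora.sales                                          /* assumed Oracle libref */
    where sale_date between '01JAN2020'd and '31JAN2020'd;  /* vary per run */
quit;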

Benchmarking SQL Data Warehouse DWU

I'm putting together some simple analysis to benchmark DWU impact on read and write based on a CTAS statement.
The query aggregates a 1.7b-row table into a table of 993k rows. Source and destination tables use round-robin distribution (the source won't be round-robin long-term; it will move to HASH). The query is roughly as follows:
CREATE TABLE CTAS_My_DWU_Test
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT TableKey1, TableKey2,
       SumCcolumn  = SUM(SalesAmt),
       MaxQuantity = MAX(SalesQty),
       MinQuantity = MIN(SalesQty)
FROM FactSales
GROUP BY TableKey1, TableKey2
OPTION (LABEL = 'MyDWUTest');
I am analysing the performance via the sys.dm_pdw_dms_workers DMV, getting an average bytes_per_second over each distribution for both type=DIRECT_READER and type=WRITER.
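For reference, that analysis is along these lines (a sketch; it assumes the request is found via the CTAS label):

-- Average per-distribution throughput for the labelled request,
-- split by reader vs writer workers
SELECT w.[type],
       AVG(w.bytes_per_sec) AS avg_bytes_per_sec
FROM sys.dm_pdw_dms_workers AS w
JOIN sys.dm_pdw_exec_requests AS r
    ON w.request_id = r.request_id
WHERE r.[label] = 'MyDWUTest'
  AND w.[type] IN ('DIRECT_READER', 'WRITER')
GROUP BY w.[type];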
My process is to change the DWU, drop the CTAS, re-create it and analyse the data in the DMV.
I'm not seeing a consistent improvement in performance as I increase the DWU. My goal is to find clear proof of increased compute; however, sometimes a higher DWU is slower and returns fewer bytes_per_sec than a smaller DWU.
If I happen to run the CTAS statement twice on the same DWU, without going through the scale process, the second & subsequent executions run nearly 10x faster.
Looking for help on this process based on one table; I want to keep data movement/joins out of the equation for the moment.
Good question! The architecture of Azure SQL Data Warehouse is more performant when there is less data movement. I recommend following the steps in this article to determine which step is slowing the process down: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-manage-monitor/
It's possible that your query is analyzing each of the aggregations over the 1.7b rows in serial, which doesn't maximize the parallel nature of our product, but the best way to find out what is going on is to take a look at the query plan, etc. in the link above.
As for the 10x performance on a repeat run, that's coming from internal caching in our system.
Let us know what you find in the query plan, execution plan, etc.