Long running prepare statements on Azure SQL Data Warehouse - azure-sqldw

I am running a functional test of a 3rd party application on an Azure SQL Data Warehouse database set at DWU 1000. In reviewing the current activity via:
sys.dm_pdw_exec_requests
I see:
prepare statements taking 30+ seconds,
NULL statements taking up to 25 seconds,
statement compilation taking up to 60 seconds,
explain statements taking 60+ seconds, and
select count(1) against empty tables taking 60+ seconds.
How does one identify the bottleneck involved?
The test has been running for a few hours and the Azure portal shows little DWU consumed on average, so I doubt that modifying the DWU will make any difference.
The third-party application has a workload management feature, so I've specified a limit of 30 connections to the ADW database (understanding that only 32 concurrent sessions are allowed on the database itself).
There are approximately ~1,800 tables and ~350 views in the database across 29 schemas (per information_schema.tables).
I am in a functional testing mode, so many of the tables involved in the queries have not yet been loaded, but statistics have been created on every column on every table in the scope of the test.
One userID is being used in the test. It is in smallrc.

Have a look at the tables in the query. Make sure all columns used in joins, GROUP BY, and ORDER BY have up-to-date statistics.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics
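A quick illustration of the kind of statistics maintenance meant above; the table and column names are hypothetical placeholders, not from the question:

```sql
-- Create a statistics object on a column used in joins
-- (adjust the table/column names to your schema):
CREATE STATISTICS stat_FactSales_CustomerId
ON dbo.FactSales (CustomerId);

-- After (re)loading data, refresh statistics so the optimizer
-- sees accurate row counts:
UPDATE STATISTICS dbo.FactSales;
```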

Related

Billions of rows table as input to Quicksight dataset

There are two Redshift tables named A and B, and a QuickSight dashboard where a visual runs A MINUS B to display its content. If we use the direct query option, the query times out because it does not complete within 2 minutes (QuickSight has a hard limit of 2 minutes per query). Is there a way to use such large datasets as input to a QuickSight dashboard visual?
We can't use the SPICE engine because it has a 1B-row / 1TB size limit. Also, it has a 15-minute delay to refresh data.
You will likely need to provide more information to fully resolve. MINUS can be a very expensive operation especially if you haven't optimized the tables for this operation. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query, one way to work around a poorly performing query behind QuickSight is to move the query into a materialized view. The result of the query is then stored for later retrieval, but needs to be refreshed when the source data changes. It sounds like your data only changes every 15 minutes (did I get this right?); if so, this may be an option.
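As a sketch of that materialized-view approach (the table and column names are assumptions, not taken from the question):

```sql
-- Precompute the expensive MINUS once; QuickSight then reads the view.
CREATE MATERIALIZED VIEW mv_a_minus_b AS
SELECT id, col1, col2 FROM table_a
MINUS
SELECT id, col1, col2 FROM table_b;

-- Re-run on your ~15-minute cadence (e.g. via a scheduled query).
-- Set operations are not eligible for incremental refresh, so this
-- is a full recompute each time.
REFRESH MATERIALIZED VIEW mv_a_minus_b;
```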

Run update on multiple tables in BigQuery

I have a lake dataset which takes data from an OLTP system. Given the nature of the transactions, we get a lot of updates the next day, so to keep track of the latest record we use active_flag = '1'.
We also created an update script which retires old records and sets active_flag = '0'.
Now the main question: how can I execute an update statement while changing the table name automatically (programmatically)?
I know we have the option of using Cloud Functions, but they time out after 9 minutes and I have at least 350 tables to update.
Has anyone faced this situation before?
You can easily do this with Cloud Workflows.
There you set up the template calls to BigQuery as substeps, then pass in a list of tables, loop through the items, and invoke the BigQuery step for each item/table.
I wrote an article with samples that you can adapt: Automate the execution of BigQuery queries with Cloud Workflows
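If you would rather keep the loop inside BigQuery itself, BigQuery scripting can iterate over the tables in a single job. This is only a sketch: the project/dataset names and the business_key/ingest_time columns are assumptions standing in for your actual schema.

```sql
FOR tbl IN (
  SELECT table_name
  FROM `my_project.lake.INFORMATION_SCHEMA.TABLES`
  WHERE table_type = 'BASE TABLE'
)
DO
  -- Retire rows that have been superseded by a newer version.
  EXECUTE IMMEDIATE FORMAT("""
    UPDATE `my_project.lake.%s` t
    SET active_flag = '0'
    WHERE t.active_flag = '1'
      AND EXISTS (
        SELECT 1 FROM `my_project.lake.%s` n
        WHERE n.business_key = t.business_key
          AND n.ingest_time > t.ingest_time)
  """, tbl.table_name, tbl.table_name);
END FOR;
```

Note that a single script runs as one job and is subject to BigQuery's 6-hour job limit, so a very large batch may still be better split up via Workflows.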

PowerBI Auto Refresh

Good Day
A client I am working with wants to use a PowerBI dashboard to display in their call centre with stats pulled from an Azure SQL Database.
Their specific requirement is that the dashboard automatically refreshes every minute during their operating hours (8am - 5pm).
I have been researching this a bit but can't find a definitive answer.
Is it possible for Power BI to automatically refresh every 1 minute?
Is it dependent on the type of license and/or the type of connection (DirectQuery vs Import)?
You can set a report to refresh on a direct query source, using the automatic report refresh feature.
https://learn.microsoft.com/en-us/power-bi/create-reports/desktop-automatic-page-refresh
This will allow you to refresh the report every 1 minute or at another defined interval. Note this applies to reports only, not dashboards, as it is configured in Desktop.
When publishing to the service you will be limited to a minimum refresh of 30 mins, unless you have a dedicated capacity. You could add an A1 Power BI Embedded SKU and turn it on then off during business hours to reduce the cost. Which would work out around £200 per month.
Other options for importing data would be to set a Logic App or Power Automate task to refresh the dataset using an API call, for a lower level of frequency, say 5 mins. It would be best to optimise your query to return a small amount of pre aggregated data to the dataset.
You can use Power Automate to trigger more than 48 dataset refreshes a day; it appears you can refresh as often as every minute this way, and other tools may allow even more frequent refreshes.
Refreshing the data at a 1-minute frequency is not possible with scheduled refresh in Power BI. If you are not using Power BI Premium, you can schedule up to 8 refreshes a day, with a minimum gap of 15 minutes. If you are using Power BI Premium, you are allowed 48 slots.
If you can't work within those restrictions, it might be worth looking into Power BI reports over streaming datasets. But then again there are some cons to that as well, such as working only with DirectQuery.

Timezone related issues in BigQuery (for partitioning and query)

We have a campaign management system. We create and run campaigns on various channels. When a user clicks/accesses any of the ads (as part of a campaign), the system generates a log. Our system is hosted in GCP, and logs are exported to BigQuery using the 'Exports' feature.
In BigQuery, the log table is partitioned on the 'timestamp' field (the time the log is generated). We understand that BigQuery stores timestamps in UTC, so partitions are also based on UTC time.
Using this log table, we need to generate per-day reports, such as the number of impressions per day per campaign, and we need to show these reports in EST.
Because the BigQuery table is partitioned by UTC time, a query for an EST day would potentially need to scan multiple partitions. Has anyone addressed this issue, or does anyone have suggestions to optimise the storage and queries so they take full advantage of BigQuery's partitioning?
We are planning to use GCP Data Studio for the reports.
BigQuery should be smart enough to filter for the correct timezones when dealing with partitions.
For example:
SELECT MIN(datehour) time_start, MAX(datehour) time_end, ANY_VALUE(title) title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE DATE(datehour) = '2018-01-03'
5.0s elapsed, 4.56 GB processed
For this query we processed the 4.56GB in the 2018-01-03 partition. What if we want to adjust for a day in the US? Let's add this in the WHERE clause:
WHERE DATE(datehour, "America/Los_Angeles") = '2018-01-03'
4.4s elapsed, 9.04 GB processed
Now this query is automatically scanning 2 partitions, as it needs to go across days. For me this is good enough, as BigQuery is able to automatically figure this out.
But what if you wanted to permanently optimize for one timezone? You could create a shifted DATE column, generated at load time, and partition by that instead.
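For example (the project, dataset, table, and column names here are placeholders):

```sql
-- Materialize a Los Angeles-local date at load time and partition on it,
-- so a one-day local-time report scans exactly one partition:
CREATE TABLE `my_project.logs.events_la`
PARTITION BY event_date_la AS
SELECT
  *,
  DATE(`timestamp`, "America/Los_Angeles") AS event_date_la
FROM `my_project.logs.events`;
```

Queries would then filter on event_date_la directly rather than on DATE(timestamp, "America/Los_Angeles").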

Azure Datawarehouse performance issues

I have a very basic Azure SQL Data Warehouse set up for test purposes at DWU100. It has one table with 60 million rows. I run a query of the form:
SELECT
SUM(TheValue), GroupId
FROM
[dbo].[Fact_TestTable]
GROUP BY
GroupId
Running this query takes 5 seconds.
Running the same query on a DTU 250 SQL database (equivalent by price), I get an execution time of 1 second.
I'm assuming there must be things I can do to speed this up, can anyone suggest what I can do to improve this?
The group by GroupId above is just an example, I can't assume people will always group by any one particular column.
Based on your question, it's not clear how your table is designed: are you using a ROUND_ROBIN or HASH distributed table? If you did not choose a distribution type during table creation, the default is round robin. Given your query, choosing a HASH distributed table design (for example on GroupId) would likely improve query execution time, as this query would be converted to a local-global aggregation. It's hard to say exactly what is happening since you did not share the query plan.
Below is a link to SQL DW documentation that talks about various table design options.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse?view=aps-pdw-2016-au7
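A minimal sketch of a hash-distributed rebuild via CTAS (the distribution column is illustrative; pick your most common join/aggregation key):

```sql
CREATE TABLE dbo.Fact_TestTable_Hash
WITH (
    DISTRIBUTION = HASH(GroupId),   -- co-locates each group on one node
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.Fact_TestTable;
```

Since you can't assume people will always group by one particular column, hash on the one used most often; other groupings will still incur data movement.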
hope this helps, Igor