Azure SQL Data Warehouse performance issues - azure-sqldw

I have a very basic Azure SQL Data Warehouse setup for test purposes (DWU100). It has one table in it with 60 million rows. I run a query of the form:
SELECT
SUM(TheValue), GroupId
FROM
[dbo].[Fact_TestTable]
GROUP BY
GroupId
Running this query takes 5 seconds.
Running the same query on a DTU 250 SQL database (equivalent by price), I get an execution time of 1 second.
I'm assuming there must be things I can do to speed this up; can anyone suggest what I can do to improve this?
The GROUP BY GroupId above is just an example; I can't assume people will always group by any one particular column.

Based on your question, it's not clear how your table is designed - are you using a ROUND-ROBIN or HASH distributed table design? If you did not choose a distribution type during table creation, the default is round-robin. Given your query, choosing a HASH distributed table design would likely improve query execution time, as the query would be converted to a local-global aggregation. It's hard to comment on exactly what is happening given that you did not share the query plan.
Below is a link to SQL DW documentation that talks about various table design options.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse?view=aps-pdw-2016-au7
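For example (Fact_TestTable_Hash is just a placeholder name for the new table; GroupId comes from your query), a CTAS along these lines would redistribute the fact table on GroupId so the aggregation runs locally on each distribution before the final merge:

CREATE TABLE [dbo].[Fact_TestTable_Hash]
WITH
(
    DISTRIBUTION = HASH(GroupId),   -- co-locate rows with the same GroupId on one distribution
    CLUSTERED COLUMNSTORE INDEX     -- typical storage choice for large fact tables
)
AS
SELECT * FROM [dbo].[Fact_TestTable];

Keep in mind this only helps queries that group or join on GroupId; since you say users may group by arbitrary columns, a hash distribution on a single column won't speed up all of them.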
hope this helps, Igor

Related

Billions of rows table as input to Quicksight dataset

There are two Redshift tables, named A and B, and a QuickSight dashboard where one visual uses A MINUS B as its query to display content. If we use the DIRECT query option, it times out because the query does not complete within 2 minutes (QuickSight has a hard limit of 2 minutes per query). Is there a way to use such large datasets as input to a QuickSight dashboard visual?
We can't use the SPICE engine because it has a 1-billion-row / 1 TB size limit. It also has a 15-minute delay when refreshing data.
You will likely need to provide more information to fully resolve this. MINUS can be a very expensive operation, especially if you haven't optimized the tables for it. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query itself, one way to work around a poorly performing query behind QuickSight is to move the query into a materialized view. That way the result of the query is stored for later retrieval, but it needs to be refreshed when the source data changes. It sounds like your data only changes every 15 minutes (did I get that right?), in which case this may be an option.
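As a rough sketch of that approach (mv_a_minus_b, the join key id, and the listed columns are placeholders for your schema, and the anti-join below is only an approximation of A MINUS B; check Redshift's materialized view restrictions for your exact query):

-- Precompute the expensive difference once, outside of QuickSight
CREATE MATERIALIZED VIEW mv_a_minus_b AS
SELECT a.id, a.col1, a.col2
FROM a
LEFT JOIN b ON a.id = b.id   -- anti-join: keep rows of A with no match in B
WHERE b.id IS NULL;

-- Re-run on your 15-minute schedule (or whenever the source tables change)
REFRESH MATERIALIZED VIEW mv_a_minus_b;

The QuickSight visual can then read from mv_a_minus_b directly, so it only retrieves precomputed rows instead of re-running the MINUS on every load.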

Concatenation of data in Superset

There are two tables: one collects facts on a daily basis, the other on a monthly basis, with the same set of attributes (for example, region, city, technology).
I need to calculate the following formula in Superset:
SUM(t1.count_exp) / SUM(t2.count_base)
which should be visualized correctly when calculating by region, or by city, or by region + city + technology per month.
In other BI systems, the GROUP BY is performed first, then the join is executed and the formula above is calculated, which gives the desired result. How can I achieve a similar result in Superset?
Assuming both tables are in the same database, you can write your own query joining the two tables in SQL Lab and then visualize the query results using the 'Explore' option available there.
Once you click on 'Explore' from SQL Lab, Superset will create a Virtual Dataset (table) inside Superset from the results of the SQL query. Any filters, group-bys, or limits applied to this virtual table from a visualization will run as a query on top of this query.
https://superset.apache.org/docs/frequently-asked-questions
A view is a simple logical layer that abstracts an arbitrary SQL query as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. The limitation there is your database performance, as Superset effectively will run a query on top of your query (view). A good practice may be to limit yourself to joining your main large table to one or many small tables only, and avoid using GROUP BY where possible, as Superset will do its own GROUP BY and doing the work twice might slow down performance.
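Putting the two pieces together - pre-aggregating each table in SQL Lab and letting Superset add its own GROUP BY on top - a sketch of the query could look like this (the table names t1_daily and t2_monthly, the columns fact_date and month, and DATE_TRUNC are assumptions about your schema and database):

-- Aggregate each table to the same grain (region, city, technology, month)
-- before joining, then expose the two raw measures to Superset.
SELECT
    d.region,
    d.city,
    d.technology,
    d.month,
    d.count_exp,
    m.count_base
FROM (
    SELECT region, city, technology,
           DATE_TRUNC('month', fact_date) AS month,
           SUM(count_exp) AS count_exp
    FROM t1_daily
    GROUP BY region, city, technology, DATE_TRUNC('month', fact_date)
) d
JOIN (
    SELECT region, city, technology, month,
           SUM(count_base) AS count_base
    FROM t2_monthly
    GROUP BY region, city, technology, month
) m
  ON  d.region = m.region
  AND d.city = m.city
  AND d.technology = m.technology
  AND d.month = m.month;

In the Explore view you can then define the metric as SUM(count_exp) / SUM(count_base), so the ratio of sums is recomputed correctly at whatever grouping level (region, city, technology, month) the chart uses.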

How to do "Select COUNT(*)" in DynamoDB from the AWS management console or any other 3rd party GUI?

Hi, I'm trying to query a table in DynamoDB. However, from what I've read, I can only do it using some code or from the CLI. Is there a way to do complex queries from the GUI? I tried playing with it but can't seem to figure out how to do a simple COUNT(*). Please help.
Go to the DynamoDB console;
Select the table that you want to count;
Go to the "Overview" page/tab;
In the table properties, click on "Manage Live Count";
Click "Start Scan".
This will give you the count of items in the table at that moment. Just be warned that this count is eventually consistent, which means that if someone is making changes to the table at that exact moment, your end result will not be exact (but probably very close to reality).
Digressing a little bit (only in case you're new to DynamoDB):
DynamoDB is a NoSQL database. It doesn't support the same commands that are common in SQL databases, mainly because it doesn't support the same consistency model provided by SQL databases.
In SQL databases, when you send a COUNT(*) query, your RDBMS makes some very educated guesses and takes some shortcuts to discover the number of rows in the table. It does this because reading your entire table to give you that answer would take too much time.
DynamoDB has no means of making these educated guesses. When you want to know how many items a table has, its only option is to read all of them, counting one by one. That is exactly what the command mentioned at the beginning of this answer does: it scans the entire table, counting all the items one by one.
Because of that, when you perform this task you will be billed for reading the entire table (DynamoDB bills you per read and write). And if someone puts another item in the table after you started the scan, while you are still counting, the count will not restart, because by design DynamoDB is eventually consistent.

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that there are customers with 5M rows and others with 200K, and since BQ does not have indexes, they are always processing each other's data (and the costs are rising).
I intend to create a timestamp field where each customer gets a different timestamp value, repeated on every insert into each customer-sensitive table, so that I can filter by that timestamp the same way I would with a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but since it always does a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which is the equivalent of an index in other databases.
This link provides the basics on how to use a clustering field:
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns
Note: When using a clustering field, BigQuery's dryRun doesn't show the cost improvement, which can only be seen post-execution.
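For illustration (mydataset, the table names, and event_ts are assumptions; this sketch keeps a date partition and adds clustering on idClient instead of a fake timestamp):

-- Partition by date, cluster by the customer selector, so queries that
-- filter on idClient only read the blocks containing that customer's rows.
CREATE TABLE mydataset.fact_clustered
PARTITION BY DATE(event_ts)
CLUSTER BY idClient AS
SELECT *
FROM mydataset.fact_original;

Your per-customer views with the idClient = code predicate can then point at the clustered table, and each customer's queries should scan far fewer bytes than a full table scan.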

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that because of this key-value format, and BigQuery being a columnar database, each query scans the whole day's data, even though each query actually uses at most 4 keys. What is the best way to optimize this?
I am thinking the best way I can go about it right now is to create separate temp tables for each key as a daily batch process, run my queries on them, and then delete them.
The ideal way I would want to go about it is partitioning by key, but I am not sure there is any such provision?
You can try using the recently introduced clustering on partitioned tables.
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Update (moved from comments)
Also keep in mind the comparison below:
Feature             Partitioning     Clustering
-----------------   --------------   --------------
Cardinality         Less than 10k    Unlimited
Dry Run Pricing     Available        Not available
Query Pricing       Exact            Best Effort
Pay special attention to Dry Run Pricing: unfortunately, clustered tables do not support dry run (validation) based on clustering keys, and only show validation based on partitions. But if you set up your clustering properly, the actual run will end up with a lower cost. You should try this with smaller data first to get comfortable with it.
See more at Clustering partitioned tables
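As a concrete sketch (mydataset and the table names are assumptions; the column names come from your schema), a clustered version of the table and a typical daily query could look like:

-- Cluster the date-partitioned table by key, so a query filtering on a
-- handful of keys only reads the storage blocks containing those keys.
CREATE TABLE mydataset.metrics_clustered
PARTITION BY DATE(timestamp)
CLUSTER BY key AS
SELECT timestamp, key, value
FROM mydataset.metrics;

-- Daily query touching only 4 of the ~200 keys: block pruning limits bytes scanned.
SELECT key, AVG(value) AS avg_value
FROM mydataset.metrics_clustered
WHERE DATE(timestamp) = CURRENT_DATE()
  AND key IN ('key_a', 'key_b', 'key_c', 'key_d')
GROUP BY key;

This keeps a single daily-partitioned table instead of the per-key temp tables you are considering, and as noted above the savings show up in the actual bytes billed rather than in the dry run estimate.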