BigQuery - Inequality vs Equality Joins - maximumBillingTier - google-cloud-platform

Here is the inequality condition that I have in my join (simple overlap conditions):
ON
(A.start <= B.End) AND (B.Start <= A.END)
It gives me the following error:
java.lang.RuntimeException:
BigQueryError{reason=billingTierLimitExceeded, location=null,
message=Query exceeded resource limits. 700920.3330645757 CPU seconds
were used, and this query must use less than 529900.0 CPU seconds.
Surprisingly, this operation takes more than running the sequential algorithm (w/o any join) on a single instance (n1-highmem-16).
I have a couple of questions:
1) How can I calculate maximumBillingTier for my query?
2) Can someone explain how inequality joins work in BigQuery?
3) Why inequality joins are so expensive?
Is it because of number of operations, or is it because of large number of outputs?
For the same query and input tables, inequality joins takes more than 13000 seconds and eventually gets canceled due to time-out, but if I change the condition to only cover equality, it would take only 70 secs.
Thanks!

1) How can I calculate maximumBillingTier for my query?
I think this goes down to the notion of Slots
A BigQuery slot is a unit of computational capacity required to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity.
The default number of slots for on-demand queries is shared among all queries in a single project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery Using Stackdriver.
See more details at Query Jobs Quotas
2) Can someone explain how inequality joins work in BigQuery?
This can really depends on data size and distribution
I would recommend Query Plan Explanation - it can help not only in understanding what is going on under-hood but also will help you to optimize your query

Related

How does Cloud Bigtable read rows that are non-contiguous?

Given a large number of known row keys. How does bigtable read(not a scan operation) those rows? Does it read the rows one after the other or all at once? If I have a large number of non-contiguous rows that I want to read, is it better to make separate concurrent or parallel hits to read each or to give all rows to bigtable i.e. a "batch read"?
There are three options for a non-contiguous batch read which depend on your latency and CPU requirements. You can do all the reads as get requests in parallel, you can issue a read rows request/scan with multiple ranges that include only one row, or you can do a hybrid.
Reading with multiple parallel get requests
This option can be great if you have a lot of processing power or don't need to read a huge number of rows. This will issue multiple requests to Bigtable, so it's going to have an impact on your CPU utilization. One Bigtable node supports around 10K reads per second, but if you have 1000 rows you need to read individually that might make a dent in your capacity.
Also, if you need all of the requests to resolve before you can process the data, you may run into performance issues if one request is slow, it slows down the entire result.
Scan with multiple rows
Bigtable supports scanning with multiple filters. One filter is a row range based on the row key. You can create a row range filter that includes exactly one row and do a scan with a filter for each row.
The Bigtable client libraries support queries like this, so you can just pass the row keys and don't need to create all of those row range filters. However, it's important to know what is happening under the hood for performance. This one query will be performed sequentially on the Bigtable server, so it could take a lot more time than multiple gets.
In Java, to do this kind of query, you just pass multiple row keys to the Query builder like so:
Query query = Query.create(tableId).rowKey("phone#4c410523#20190501").rowKey("phone#4c410523#20190502");
ServerStream<Row> rows = dataClient.readRows(query);
for (Row row : rows) {
printRow(row);
}
Hybrid approach
Depending on the scale of rows you're working with, it may make sense to take your set of row keys, divide them up and issue multiple scans in parallel. You can get the benefit of fewer requests while still potentially getting better latency since the requests are parallelized.
I would recommend experimenting to see which scenario works best for your use case, or leave a comment with more information on your use case and I can see if there is more information I can offer you.

Redshift Query Performance to reduce CPU utilisation

I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.

DynamoDB Scan/Query Return x Number of Items

If I scan or query in DynamoDB it is possible to set the Limit property. The DynamoDB documentation says the following:
The maximum number of items to evaluate (not necessarily the number of
matching items).
So the problem with this is if you set filters and such it won't return all the items.
My goal that I'm trying to figure out how to achieve is to have a filter in a scan or query, but have it return x number of items. No matter what. I'm ok with having to use LastEvaluatedKey and make multiple requests, but I would like to try to make it as seamless and easy as possible (so not doing that would be best.
The only way I have thought to do this is to set the Limit property to say 1 or something. Then just keep scanning or querying using the LastEvaluatedKey until I reach that x number of items I'm looking for. Problem is, this seems VERY wasteful and inefficient. I mean if you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem like it scales very well. Of course I'm sure it's no different than what DynamoDB would be doing behind the scenes.
But is there a way to do this more efficiently where I can reduce the number of requests I have to make? Or is that the only way to achieve this?
How would you achieve this goal?
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
You're 100% right that Limit is applied before FilterExpression. Meaning Dynamo might return some number or documents less than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
Its sounds like it would be unacceptable for your api to behave in the same manner. That is going to mean that in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me, this is a rather heavy tradeoff but, if it is a requirement that you satisfy the Limit whenever possible, you have options.
First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end. You may still need to make multiple requests to ensure that your've satisfied the Limit but this approach will result in the fewest number of requests to Dynamo. The trade off here is the total data being read and transferred. Chances are your Limit will not happen to line up perfectly with the 1 MB limit which means the excess data being read, filtered, and transferred is wasted.
You already mentioned the other extreme of sending a Limit of 1 and pointed out that will result in the maximum number of requests to Dynamo
Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, your FilterExpression filters out about half of the documents in the table. That means you can multiply the client Limit by 2 and that would be a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency however, it also has the highest potential for complexity. For example, you might find that using a simple linear function is not good enough and instead you need to use machine learning to find a multi-variate non-linear function to calculate the new Limit. This approach also heavily depends on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
In any of the cases where you are implementing the Limit on your end, if you plan on sending back the LastEvaluatedKey to the client for subsequent calls to your service, you will also need to take care to keep track of the LastEvaluatedKey that you evaluated. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
The final approach would be to reorganize/regroup your data either with a GSI, a separate table that you keep in sync using Dynamo Streams or a different schema altogether with the goal of not requiring a FilterExpression.

Benchmarking SQL Data Warehouse DWU

I'm putting together some simple analysis to benchmark DWU impact on read and write based on a CTAS statement.
The query is aggregating 1.7b row table to a table of 993k rows. Source and destination tables are round-robin distribution (source won't be RR long-term, will move to HASH) the query is roughly as follows:
create table CTAS_My_DWU_Test
with (distribution = round_robin)
as
select TableKey1, TableKey2,
SumCcolumn=SUM(SalesAmt),
MaxQuantity=MAX(SalesQty),
MinQuantity=MIN(SalesQty)
from FactSales
group by TableKey1, TableKey2
option (label='MyDWUTest');
I am analysing the performance via the sys.dm_pdw_dms_workers DMV, getting an average bytes_per_second over each distribution for both type=DIRECT_READER and type=WRITER.
My process is to change the DWU, drop the CTAS, re-create it and analyse the data in the DMV.
I'm not seeing a consistent improvement in performance as I increase the DWU. My goal is to look for clear proof of increase compute, however sometimes a higher DWU is slower and returning less bytes_per_sec than a smaller DWU.
If I happen to run the CTAS statement twice on the same DWU, without going through the scale process, the second & subsequent executions run nearly 10x faster.
Looking for help to on the process based on one table, want to keep data movement/join out of the equation for the moment.
Good question! The architecture of Azure SQL Data Warehouse is more performant when there is less data movement. I recommend following the steps in this article to determine which step is slowing the process down: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-manage-monitor/
It's possible that your query is analyzing each of the aggregations over the 1.7b rows in serial, which doesn't maximize the parallel nature of our product, but the best way to find out what is going on is to take a look at the query plan, etc. in the link above.
As for the 10x performance on a repeat run, that's coming from internal caching in our system.
Let us know what you find in the query plan, execution plan, etc.

concurrent query performance in amazon redshift

On Amazon Redshift, do concurrent queries affect each others performance?
For example, lets say there are two queries: one on a relatively small table (~5m rows) retrieving all rows, and another on a large table (~500m) rows. Both tables have the same fields, both have no compression. Both queries retrieve all data in their respective tables to compute their results. There are no joins or filters. Both queries retrieve about 2-4 fields for their computations.
Running by itself, the small query returns in about 700ms. However, while the large query is running (which by itself takes a few minutes), the small query returns in 4-6 seconds.
This is the observed behavior on a cluster with a single XL node.
Is this the expected behavior? Is there a configuration setting that will promise performance consistency of the small query, even if the large query is running?
Copy-pasted from: https://forums.aws.amazon.com/thread.jspa?threadID=137540#
I've performed some concurrent query benchmarking.
I created a straightforward query which by itself took about a minute
to run. I then ran one of those queries at once, then two, them three,
etc, and timed each query.
Each query basically halved database performance - e.g. what you'd
expect; double the load, halve the performance.
Actually, it's a bit better than halving - you get about an extra 10%
performance.
This performance behaviour held true up to 5 concurrent queries, which
is the max number of concurrent queries configured on the database I
was working with. If I ran six queries, the final query could not
execute until one of the first queries had finished and freed up a
slot.
Finally, vacuum acts much like a normal query - it halves performance.
It's not special.
Actually, vacuum is something more than halving - it's equivelent to a
pretty heavy query.
There are no guarantees because all of this is running on a fixed number of CPUs. With a fixed capacity of work when you increase the work it lowers the throughput. The short answer is get a bigger machine (ie more nodes).
Here is the specifics of your answer:
https://forums.aws.amazon.com/message.jspa?messageID=437015#
http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html