Azure SQL DW table not using all the distributions on the compute nodes to store data

One of the fact tables in our Azure SQL DW (it stores train telemetry data) is created as a HASH-distributed table (the HASH key is VehicleDimId – an integer field referencing the Vehicle dimension table). The total number of records in the table is approximately 1.3 billion.
There are 60 unique VehicleDimId values in the table (i.e. we have data for 60 unique vehicles), which means there are 60 unique hash keys as well. Based on my understanding, I expect the records corresponding to these 60 unique VehicleDimId hash keys to be distributed across the 60 available distributions (1 hash key per distribution).
However, currently all the data is distributed across just 36 distributions, leaving the other 24 distributions with no records. In effect, that is just 60% usage of the available compute nodes. Changing the Data Warehouse scale does not have any effect, as the number of distributions remains fixed at 60. We are currently running our SQL DW at the DW400 level. Below are the compute-node-level record counts for the table.
You can see that the data is not evenly distributed across compute nodes (which is due to the data not being distributed evenly across the underlying distributions).
I am struggling to understand what I need to do to get the SQL DW to use all the distributions rather than just 60% of them.

Hash distribution takes a hash of the binary representation of your distribution key and then deterministically sends the row to the assigned distribution. In other words, an int value of 999 ends up on the same distribution on every Azure SQL DW, predictably. It doesn't look at your specific 60 unique vehicle IDs and divide them evenly across distributions.
The best practice is to choose a field (ideally one that is used in joins, group bys, or distinct counts) which has at least 600 (10x the number of distributions) fairly evenly used values. Are there other fields that meet that criterion?
To quote from this article, adding some emphasis:
Has many unique values. The column can have some duplicate values. However, all rows with the same value are assigned to the same distribution. Since there are 60 distributions, the column should have at least 60 unique values. Usually the number of unique values is much greater.
If you only have 60 distinct values your likelihood of ending up with even distribution is very small. With 10x more distinct values your likelihood of achieving even distribution is much higher.
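To see why, here is a rough back-of-the-envelope simulation (plain Python, not SQL DW's actual hash function, which is internal and deterministic) that assigns distinct keys uniformly at random to 60 buckets and counts how many buckets end up used:

import random

def expected_used_distributions(distinct_values, distributions=60, trials=2000):
    # Assign each distinct key to a distribution uniformly at random and
    # count how many distributions receive at least one key, averaged
    # over many trials.
    used = 0
    for _ in range(trials):
        used += len({random.randrange(distributions) for _ in range(distinct_values)})
    return used / trials

print(expected_used_distributions(60))    # ~38 of 60 distributions (~63%)
print(expected_used_distributions(600))   # ~60 of 60 distributions

Under that assumption of a roughly uniform hash, 60 distinct values land on only about 38 of the 60 distributions on average (about 63%), which is in line with the 36 observed in the question, while 600 distinct values cover essentially all of them.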
The fallback is to use round robin distribution. Only do this if there are no other good distribution keys which produce even distribution and which are used in queries. Round robin should achieve optimal loading performance but query performance will suffer because the first step of every query will be a shuffle.
In my opinion, concatenating two columns together (as Ellis' answer suggests) to use as the distribution key is usually a worse option than round robin distribution unless you actually use the concatenated column in group bys or joins or distinct counts.
It is possible that keeping the current Vehicle ID distribution is the best choice for query performance since it will eliminate a shuffle step in many queries that join or group on Vehicle ID. However the load performance may be much worse because of the heavy skew (uneven distribution).

Another option is to create a concatenated join key, i.e. a concatenation of two different keys, which will produce a higher cardinality than the 60 distinct values you have now (60 x the cardinality of the second key should generally land in the thousands or greater). The caveat here is that this key then needs to be referenced in every join so that the work is done on each node. Then, when you hash on this key, you'll get a more even spread.
The only downside is that you have to propagate this concatenated key to the dimension table as well and ensure that your join conditions include it all the way up to the final query. For example, keep the surrogate key in the subqueries and remove it only in the top-level query, to force co-located joins.

Related

Determine read capacity unit for an Amazon DynamoDB table

How do I determine the read capacity units for a table when a get query returns a different number of items in each API call (e.g. one query returns 50 items, another returns 500 items from the same table)?
It's all about averages.
If your average fluctuates significantly over some time period e.g. over the course of a day, you can use autoscaling.
If your table doesn't see enough requests to have a stable average throughput, you probably don't need to worry too much. Give yourself some breathing room but also keep in mind that DynamoDB allows bursting so you don't need to be too exact over time.
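To make the averaging concrete, here is a rough sketch of the read-unit arithmetic for a Query (the item sizes and rates below are illustrative, not from the question): DynamoDB sums the sizes of the returned items, rounds the total up to the next 4 KB, and charges one read capacity unit per 4 KB for a strongly consistent read, or half that for an eventually consistent read.

import math

def query_read_units(result_size_kb, strongly_consistent=False):
    # Total result size rounded up to the next 4 KB boundary; eventually
    # consistent reads cost half as much.
    units = math.ceil(result_size_kb / 4)
    return units if strongly_consistent else units / 2

# A call returning 50 items of ~1 KB vs. one returning 500 items of ~1 KB:
print(query_read_units(50))    # 6.5 read units
print(query_read_units(500))   # 62.5 read units

What you provision against is the average read units consumed per second over time, not the item count of any single call.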
Also consider how your data is distributed and the relative temperatures of the data in your table. Read and write throughput gets spread across all partitions equally, meaning cold partitions get the same read throughput as hot partitions. The goal is always to structure your data so that it is evenly distributed and of roughly equal temperature.

Show top 10 scores for a game using DynamoDB

I created a table "scores" in DynamoDB to store the scores of a game.
The table has the following attributes:
uuid
username
score
datetime
Now I want to query the "top 10" scores for the game (leaderboard).
In DynamoDB, it is not possible to create an index without a partition key.
So, how do I perform this query in a scalable manner?
No. You will always need the partition key. DynamoDB is able to provide the required speed at scale because it stores records with the same partition key physically close to each other instead of on 100 different disks (or SSDs, or even servers).
If you have a mature querying use case (which is what DynamoDB was designed for), e.g. "Top 10 monthly scores", then you can bucket datetime into a derived month attribute like 12/01/2017, 01/01/2018, and so on, and then use this as your partition key so that all the scores generated in the same month get "bucketized" into the same partition for the best lookup performance. You can then keep score as the sort key.
You can of course have other tables (or GSIs) for weekly and daily scores. Or you can even choose the most granular bucket and sum up the results yourself in your application code. You'll probably need to pull a lot more than 10 records to get close enough to 100% accuracy at the aggregate level... I don't know. Test it. I wouldn't rely on it if I were dealing with money/commissions, but for game scores, who cares :)
Note: If you decide to go this route, then instead of using "12/01/2017" etc. as the partition values I'd use integer offsets, e.g. a UNIX epoch timestamp (rounded down to the first midnight of the month), to make it easier to compute/code against. I used the friendly dates above to better illustrate the approach.
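As a minimal boto3 sketch of that layout (the monthly_scores table/GSI name, the numeric month_bucket partition key, and the score sort key are all illustrative assumptions):

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table (or GSI) keyed on month_bucket (partition) + score (sort).
table = boto3.resource("dynamodb").Table("monthly_scores")

resp = table.query(
    KeyConditionExpression=Key("month_bucket").eq(1514764800),  # epoch for 2018-01-01 UTC
    ScanIndexForward=False,  # descending by sort key, i.e. highest scores first
    Limit=10,                # stop after the top 10
)
top10 = resp["Items"]

Because items within a partition are stored in sort-key order, reading the partition in descending order with Limit=10 returns the leaderboard without scanning the whole bucket.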

Dynamo DB: global secondary index, sparse index

I am considering taking advantage of sparse indexes as described in the AWS guidelines. In the example described --
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of champs becomes very large? I suppose that the "Champ" partition will become very large and you would start to experience uneven load distribution. In order to get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding over n shards, e.g. Champ.0, Champ.1 ... Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities with a specific attribute that may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1 ... Champ.N).
N should be the number of partitions you expect for this index plus some room for growth (if you expect high load, or many 'champs', you could choose N=200) so that you get a good hash distribution over the partitions. I recommend deriving the shard number as userId modulo N (this also lets you do some manipulations by userId).
We also use this solution when the hash key is a Boolean (in DynamoDB you can represent a Boolean as a string), in which case the hash values become "true.0", "true.1" ... "true.N", and the same for "false".
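A rough boto3 sketch of that write-sharding pattern (the table name, GSI name, and attribute names below are hypothetical, and the table is assumed to be keyed on UserId alone):

import boto3
from boto3.dynamodb.conditions import Key

N_SHARDS = 100  # expected partitions for the index plus some growth headroom
table = boto3.resource("dynamodb").Table("GameScores")

def champ_shard(user_id: int) -> str:
    # Deterministic shard suffix: userId modulo N, as suggested above.
    return f"Champ.{user_id % N_SHARDS}"

def mark_champ(user_id: int) -> None:
    # Setting the sparse-index attribute only for champs keeps the GSI small.
    table.update_item(
        Key={"UserId": user_id},
        UpdateExpression="SET AchievementShard = :s",
        ExpressionAttributeValues={":s": champ_shard(user_id)},
    )

def all_champs() -> list:
    # Reads fan out across every shard of the GSI and merge the results.
    items = []
    for shard in range(N_SHARDS):
        resp = table.query(
            IndexName="AchievementShard-UserId-index",  # hypothetical GSI name
            KeyConditionExpression=Key("AchievementShard").eq(f"Champ.{shard}"),
        )
        items.extend(resp["Items"])
    return items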

Redshift -- Query Performance Issues

SELECT
a.id,
b.url as codingurl
FROM fact_A a
INNER JOIN dim_B b
ON strpos(a.url,b.url)> 0
Record count in Fact_A: 2 million
Record count in Dim_B: 1,500
Time taken to execute: 10 minutes
Number of nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared the distribution key on Fact_A so that the records are distributed evenly across both nodes, and a sort key has been created on URL in Fact_A.
The Dim_B table is created with DISTSTYLE ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the join condition) results in a full scan with a nested-loop join, executing strpos roughly 3 billion times (2 million x 1,500 rows).
Depending on which URLs are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3, then you can extract "part1/part2/part3" as a separate column in both fact_A and dim_B, and make it the distribution and sort key on both tables.
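As a sketch of the kind of prefix extraction meant here, done on the ETL side before loading (the depth of 3 path segments and the example URL are illustrative assumptions):

from urllib.parse import urlparse

def path_prefix(url: str, depth: int = 3) -> str:
    # "https://host/part1/part2/part3/page?x=1" -> "part1/part2/part3"
    segments = [p for p in urlparse(url).path.split("/") if p]
    return "/".join(segments[:depth])

print(path_prefix("https://example.com/part1/part2/part3/page?x=1"))  # part1/part2/part3

With the prefix stored as a plain column in both fact_A and dim_B, the join becomes an equality join on that column (which can also serve as the distribution and sort key) instead of a strpos comparison of every fact row against every dimension row.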
You can also rely on the parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see an immediate performance improvement of 8-10x, as this kind of query can (for the most part) be executed by each node in parallel.

What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through the AWS DynamoDB docs and, for the life of me, cannot figure out what the core difference is between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved, but that doesn't seem like a ground-breaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItem requires the full primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItem is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and any sort (range) key. It’s literally like calling GetItem multiple times in a single operation. You don’t have the partial-searching and filtering options of Query, but you’re not limited to a single partition either.
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
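The different rounding rules are easy to check with a couple of lines of arithmetic, using the 1.5 KB and 6.5 KB items from the quote (strongly consistent reads assumed):

import math

item_sizes_kb = [1.5, 6.5]

# BatchGetItem: each item is rounded up to the next 4 KB, then summed
# -> 4 KB + 8 KB = 12 KB -> 3 read units.
batch_units = sum(math.ceil(size / 4) for size in item_sizes_kb)

# Query: the raw sizes are summed first, then the total is rounded up
# to the next 4 KB -> 8 KB -> 2 read units.
query_units = math.ceil(sum(item_sizes_kb) / 4)

print(batch_units, query_units)  # 3 2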
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
In a nutshell:
BatchGetItem works on tables and uses the full primary key to identify the items you want to retrieve. You can get up to 16 MB or 100 items in a response.
Query works on tables, local secondary indexes and global secondary indexes. You can get at most 1 MB of data in a response. The biggest difference is that Query supports filter expressions, which means that you can request data and DDB will filter it server-side for you.
You can probably achieve the same thing with either if you really want to, but the rule of thumb is: do a BatchGet when you need to bulk-fetch items from DDB by known keys, and do a Query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).
DynamoDB stores items under two kinds of keys: a simple partition key, like "jupiter"; or a composite partition-and-range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune","planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data items read by the query, irrespective of how many there are.
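For reference, a boto3 sketch of those two calls (the SolarSystem table name and the planet/body key names are illustrative assumptions, mirroring the pseudocode above):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SolarSystem")  # hypothetical table: planet (hash) + body (range)

# Query: one partition, a sort-key condition, and a server-side count.
moon_count = table.query(
    KeyConditionExpression=Key("planet").eq("jupiter") & Key("body").begins_with("moon"),
    Select="COUNT",
)["Count"]

# BatchGetItem: exact full keys, possibly spanning many partitions (or tables).
planet_infos = dynamodb.batch_get_item(
    RequestItems={
        "SolarSystem": {
            "Keys": [
                {"planet": "jupiter", "body": "planetInfo"},
                {"planet": "saturn", "body": "planetInfo"},
                {"planet": "neptune", "body": "planetInfo"},
            ]
        }
    }
)["Responses"]["SolarSystem"]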
Adding an important difference: Query supports consistent reads, while BatchGetItem does not.
Update: BatchGetItem can use consistent reads through TableKeysAndAttributes. Thanks @colmlg for the information.