I am considering taking advantage of sparse indexes as described in the AWS guidelines. In the example described --
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of champs becomes very large? I suppose that the "Champ" partition will become very large and you would start to experience uneven load distribution. In order to get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding over n shards, e.g. Champ.0, Champ.1 ... Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities with a specific attribute that may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1 ... Champ.N).
N should be the number of partitions you expect for this index plus some headroom for growth, so that the keys hash well across partitions (if you expect high load or many champs, you could choose N=200). I recommend deriving the suffix from the userId modulo N (this also helps when you need to operate on items by userId).
We also use this solution when the hash key is a Boolean (in DynamoDB you can represent a Boolean as a string). In that case the hash keys are "true.0", "true.1" ... "true.N", and likewise for "false".
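A minimal sketch of the key handling on the write and read side, assuming the shard suffix is derived from a stable hash of the user id taken modulo N. The function names and the choice of MD5 are illustrative assumptions, not part of any DynamoDB API.

```python
import hashlib

def shard_key(base: str, user_id: str, n_shards: int) -> str:
    """Return a sharded partition key such as 'Champ.17' for this user."""
    # Use a stable hash rather than Python's built-in hash(), which is
    # salted per process, so a user always maps to the same shard.
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % n_shards
    return f"{base}.{shard}"

def all_shard_keys(base: str, n_shards: int) -> list[str]:
    """Every partition key a reader has to query to list all items."""
    return [f"{base}.{i}" for i in range(n_shards)]
```

A writer would store shard_key("Champ", user_id, 100) in the index's partition key attribute. A reader that wants every champ issues one query per key from all_shard_keys("Champ", 100) and merges the results, which is the price paid for the even spread.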
For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting the performance?
Like select 100 rows randomly.
How random do you need? The classic way to do this is with "WHERE RANDOM() < .001". If you need it to give you a repeatable "random" set then you can add a seed. The issue is that your tables are huge and this means reading (scanning) every row from disk just to throw most of them away and since table scan can take a significant time this isn't what you want to do.
So you may want to take advantage of Redshift "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table sort keys and ordering which will push the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly and each block of data will represent about 250K rows (depending on sort key data type, compression etc., and it COULD range anywhere from <100K rows to 2M rows). Doing this will take a little inspection of STV_BLOCKLIST. The storage quantum for Redshift is the 1MB block, and the metadata for every block in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key column of the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data, make sure that this sampling picks an equal number from each slice to avoid execution skew).
Now the trick is to translate these min and max metadata values into a WHERE clause that performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data type dependent. If the data type is BIGINT then the hash is quite simple - if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved as I've done this for just about every data type at this point.
You can even do a random sampling of rows on top of this random sampling of blocks. Or, if you want, you can just pick some narrow ranges of the sort key value and then randomly sample rows within them, avoiding all this reverse engineering business. The idea is to use Redshift's "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata aware in your choice of sampling windows, which often means a WHERE clause on the sort key. This is all about understanding how the database engine works and using its capabilities to your advantage.
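Here is a rough sketch of the block-picking step, assuming you have already pulled (slice, min, max) for the sort key column out of STV_BLOCKLIST and translated the min/max values back into real sort key values. The function names and the integer sort key are assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_block_ranges(blocks, per_slice, seed=None):
    """Pick up to `per_slice` random blocks from each slice.

    `blocks` is an iterable of (slice_id, min_value, max_value) tuples,
    with min/max already decoded into sort key values.
    """
    rng = random.Random(seed)  # a seed makes the sample repeatable
    by_slice = defaultdict(list)
    for slice_id, lo, hi in blocks:
        by_slice[slice_id].append((lo, hi))
    ranges = []
    for slice_id in sorted(by_slice):
        candidates = by_slice[slice_id]
        # Same number of blocks per slice, to avoid execution skew.
        ranges.extend(rng.sample(candidates, min(per_slice, len(candidates))))
    return sorted(ranges)

def where_clause(column, ranges):
    """Turn the sampled (min, max) ranges into a sort key predicate."""
    return " OR ".join(f"({column} BETWEEN {lo} AND {hi})" for lo, hi in ranges)
```

The resulting predicate goes into the WHERE clause of the sampling query, optionally combined with a row-level RANDOM() filter on top.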
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.
Is it an issue in Datastore to index a property that can only have 4-5 possible values? Would this lead to tablet hotspots?
I am thinking of a property with an enum of string values like "done", "working", "complete". The reason for indexing such a property would be so you can create a composite index that lets you query for all entities that are "done", for example.
Yes, it would be an issue if/when you have high rates of queries using these composite indexes you mentioned, listed in Indexes:
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates. For further guidance on dealing with monotonic properties, see High read/write rates for a narrow key range below.
You would also have a tablet hotspot problem if/when you hit high rates of datastore writes for entities with the same property value (for example 100s of entities becoming done per second) - another facet of the same problem. It's this case mentioned in High read/write rates to a narrow key range:
You will also see this problem if you create new entities at a high rate with a monotonically increasing indexed property like a timestamp, because these properties are the keys for rows in the index tables in Bigtable.
TLDR: It scales so long as entity keys are scattered.
DR:
Let's first consider the index entries being written.
We have something like:
SomeKind\E1 -> FullEntityKey1
SomeKind\E2 -> FullEntityKey2
SomeKind\E2 -> FullEntityKey3
SomeKind\E3 -> FullEntityKey4
We note that each individual index entry points to some entity.
As far as the load sharding is concerned the value being sharded is like the following:
SomeKind\E1\FullEntityKey1
SomeKind\E2\FullEntityKey2
SomeKind\E2\FullEntityKey3
SomeKind\E3\FullEntityKey4
Now let's imagine we were using randomly allocated ids for the entity keys (range [0,2] to be simple) -- we assume even distribution of writes across the random entity ids.
SomeKind\E1\0\RestOfKey1
SomeKind\E2\0\RestOfKey2
SomeKind\E2\1\RestOfKey3
SomeKind\E3\2\RestOfKey4
And then we can note that there are clear split points for the load to shard across -- that is, each of the [0,2] possible random ids is a shard, and the system can scale indefinitely so long as the writes are evenly distributed across the entities written in SomeKind (make the random id longer for more split points/scaling).
So index enum value scaling/hotspotting is highly associated with the entity keys being indexed, which are generally constructed in ways that are shardable, which means that the associated index entries are too.
This is not to say that it is impossible to create situations in which hotspots may occur (for example, if the entity keys had a monotonically increasing value (like a timestamp)), or by targeting a small section of keys for a very high write rate -- but that shouldn't happen by default with typical traffic patterns and entity keys.
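To make the split-point argument concrete, here is a toy sketch that builds row keys in the notation used above and counts writes per candidate split range. It is only a model of the idea, with made-up helper names, and says nothing about how Datastore or Bigtable actually pick their splits.

```python
from collections import Counter

SEP = "\\"  # separator used in the notation above

def index_row_key(kind, value, leading_id, rest):
    """Row key as the sharding sees it: kind, indexed value, entity key."""
    return SEP.join([kind, value, str(leading_id), rest])

def load_per_split(row_keys):
    """Count writes per (kind, value, leading id) range."""
    return Counter(SEP.join(key.split(SEP)[:3]) for key in row_keys)

# Same enum value everywhere, but scattered leading ids: three even ranges.
scattered = [index_row_key("SomeKind", "done", i % 3, f"Rest{i}") for i in range(9)]

# Same enum value and the same leading id: everything lands in one range.
hot = [index_row_key("SomeKind", "done", 0, f"Rest{i}") for i in range(9)]
```

With the scattered keys, load_per_split reports three ranges of equal size even though every entry has the value "done". With the hot keys it reports a single range, which is the hotspot case described above.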
One of the Fact tables in our Azure SQL DW (it stores the train telemetry data) is created as a HASH distributed table (the HASH key is VehicleDimId – an integer field referencing the Vehicle Dimension table). The total number of records in the table is approx. 1.3 billion.
There are 60 unique VehicleDimId (i.e. we have data for 60 unique vehicles) values in the table which means that they have 60 unique hash keys as well. Based on my understanding, I expect the records corresponding to these 60 unique hash key VehicleDimId should be distributed across 60 distributions available (1 hash key for 1 distribution).
However, currently all the data is distributed across just 36 distributions leaving other 24 distributions with no records. In effect, that is just 60% usage of the compute nodes available. Changing the Data Warehouse scale does not have any effect as the number of distributions remain the same to 60. We are currently running our SQL DW at DW400 level. Below is the compute node level record counts of the table.
You can see that the data is not evenly distributed across compute nodes (which is due to the data not being distributed evenly across the underlying distributions).
I am struggling to understand what I need to do to get the SQL DW to use all the distributions rather than just 60% of them.
Hash distribution takes a hash of the binary representation of your distribution key then deterministically sends the row to the assigned distribution. Basically an int value of 999 ends up on the same distribution on every Azure SQL DW predictably. It doesn't look at your specific 60 unique vehicle IDs and evenly divide them.
The best practice is to choose a field (ideally one used in joins, group bys or distinct counts) which has at least 600 (10x the number of distributions) fairly evenly used values. Are there other fields that meet those criteria?
To quote from this article adding some emphasis:
Has many unique values. The column can have some duplicate values. However, all rows with the same value are assigned to the same distribution. Since there are 60 distributions, the column should have at least 60 unique values. Usually the number of unique values is much greater.
If you only have 60 distinct values your likelihood of ending up with even distribution is very small. With 10x more distinct values your likelihood of achieving even distribution is much higher.
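A back-of-the-envelope way to see this, assuming an idealized hash that sends each distinct value to one of the distributions uniformly and independently (the real hash is deterministic, so treat this as an approximation):

```python
def expected_used_distributions(distinct_values: int, distributions: int = 60) -> float:
    """Expected number of non-empty distributions under a uniform hash."""
    # Probability that a given distribution receives none of the values.
    p_empty = (1 - 1 / distributions) ** distinct_values
    return distributions * (1 - p_empty)

print(expected_used_distributions(60))   # about 38 of the 60
print(expected_used_distributions(600))  # very close to 60
```

Under that assumption, 60 distinct values are expected to occupy only about 38 distributions, which is in the same ballpark as the 36 observed in the question. Even with 600 values all the distributions being used does not guarantee an even row count, since that also depends on how many rows each value has.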
The fallback is to use round robin distribution. Only do this if there are no other good distribution keys which produce even distribution and which are used in queries. Round robin should achieve optimal loading performance but query performance will suffer because the first step of every query will be a shuffle.
In my opinion, concatenating two columns together (as Ellis' answer suggests) to use as the distribution key is usually a worse option than round robin distribution unless you actually use the concatenated column in group bys or joins or distinct counts.
It is possible that keeping the current Vehicle ID distribution is the best choice for query performance since it will eliminate a shuffle step in many queries that join or group on Vehicle ID. However the load performance may be much worse because of the heavy skew (uneven distribution).
Another option is to create a concatenated join key, built from two different keys, which gives a higher cardinality than the 60 values you have now (up to 60 x the cardinality of the second column). Cardinality should generally be in the thousands or greater. The caveat here is that the key then needs to be referenced in every join so that the work is done on each node. Then, when you hash on this key, you'll get a more even spread.
The only downside is that you have to propagate this concatenated key to the dimension table as well and ensure that your join conditions include it all the way through to the last query. For example, you keep the surrogate key in the subqueries and remove it only in the top-level query to force collocated joins.
The Amazon DynamoDB documentation stresses that uniform distribution across partition keys is the most important point in designing a correct database architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow beyond one partition. That is, according to the doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1000 writes per second (for 1 KB items) to grow beyond one partition. But according to my calculation, most small applications don't even need the default 5 writes per second - 1 is enough. (To be precise, you can also grow beyond one partition if your data exceeds 10 GB, but that is a big number too.)
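The formula quoted above is easy to play with in a few lines. This is just the arithmetic from the quoted doc plus the 10 GB figure mentioned here; the function names are mine and none of this calls any AWS API.

```python
import math

def initial_partitions(read_capacity_units: float, write_capacity_units: float) -> int:
    """Partitions implied by provisioned throughput, per the formula above."""
    raw = read_capacity_units / 3000 + write_capacity_units / 1000
    return max(1, math.ceil(raw))

def partitions_for_size(table_size_gb: float) -> int:
    """Partitions implied by data volume, using the 10 GB figure above."""
    return max(1, math.ceil(table_size_gb / 10))
```

For example, initial_partitions(5, 5) is 1, and you only reach 2 once the combined ratio goes above 1, e.g. with initial_partitions(1000, 1000).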
The question becomes more important when you realize that creating any additional index requires allocating additional writes per second.
Say I have some data related to a particular user, for example, "posts".
I create a "posts" table and then, following the Amazon guidelines, I choose this key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation you have is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of this decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could have used this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? Maybe a partition key is much more efficient than a sort key even inside one partition?
Addition: you may say "OK, having userId as the partition key for posts is fine for now, but when you have 100000 users in your app you'll run into trouble with scaling". But in reality the trouble can arise only in some "transition" case - when you have only a few partitions, with the posts of a group of active users all in one partition and those of inactive users in another. If you have thousands of users, it's natural that many of them are active, the impact of any one user is negligible, and statistically their posts are evenly distributed across many partitions thanks to the large numbers.
I think it's absolutely fine as long as you make sure you won't exceed the partition limits, either by increasing RCU/WCU or through growth of your data. Moreover, the best practices say:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.
I am moving from an RDBMS and am not sure how best to design for the scenario below.
I have a table with around 200,000 questions with question id as partition key.
Users view questions and I do not wish to show a viewed question to the user again. So which one is the better option?
1. Have a table with question id as partition key and a set of user IDs as an attribute
2. Have a table with user id as partition key and a set of the question IDs they have viewed as an attribute
3. Have a table with question id as partition key and user id as sort key. Once a user has viewed a question, add a row to this table
1 and 2 might have a problem with the 400 KB size limit for an item. The third seems the better option, though I would end up with 100 million items as there will be one row per user per question viewed. But I assume this is not a problem for DynamoDB?
Another problem is how to get 10 random questions not viewed by the user. Do I generate 10 random numbers between 1 and 200,000 (the number of questions) and then check that they are not in the table mentioned in point 3 above?
I definitely would not go with option 1 or 2 for the reason you mentioned: you would already be limiting your scalability by the 400 KB limit. With a 128-bit (16-byte) UUID per user, you would be limited to roughly 25,000 users per question at best, and fewer once attribute names and string encoding overhead are counted.
Option 3 is the way to go with DynamoDB, but what you need to consider is what is the partition key and what is the range key. You could have user_id as the partition key and question_id as the range key. The answer to that decision depends on how your data is going to be accessed. DynamoDB divides the total table throughput by each partition key: each one of your n partition keys gets 1/nth of the table throughput. For example, if you have a subset of partition keys that are accessed more than the others, then you won't be efficiently utilizing your table throughput, because the partition keys that actually use less than 1/nth of the throughput are still provisioned for 1/nth of it. The general idea is that you want each of your partition keys to be utilized equally. I think that you have it correct. I'm assuming that each question is given randomly and is no more popular than another, while some users might be more active than others.
The other part of your question is a little bit more difficult to answer / determine. You could do it your way where you have tables that contain question and user pairs for the questions that those users have read or you could have tables that contain the pairs for the questions those users haven't read. The tradeoff here is between initial write cost and subsequent read cost, and the answer depends on the amount of questions that you have compared to the consumption rate.
When you have a large number of questions compared to the rate at which users progress through them, the chances of randomly selecting an already chosen one are small, so you're going to want to store have-read question-user pairs. With this setup you don't pay a lot to initialize a user (you don't have to write a question-user pair for each question) and you won't have a lot of miss-read costs (i.e. where you select a question-user pair and it turns out they already read it, which still consumes read-write units).
If you have a small number of questions compared to the rate at which users consume them, then you're going to want to store the haven't-read question-user pairs. You pay something to initialize each user (writing one question-user pair for each question), but then you don't have any accidental miss-reads. If you stored them as have-read pairs when there is a small number of questions, then you would encounter a lot of miss-reads as the percentage of read questions approaches 100% (to the point where you would have been better off just setting them up as haven't-read pairs).
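To put rough numbers on the miss-read cost with have-read pairs, here is a small sketch of the "pick a random id, then check whether it was viewed" approach. An in-memory set stands in for the lookup against the have-read table, and the function names are made up for illustration.

```python
import random

def expected_reads_per_hit(viewed: int, total: int) -> float:
    """Average lookups needed to find one unviewed question at random."""
    if viewed >= total:
        raise ValueError("no unviewed questions left")
    return total / (total - viewed)

def pick_unviewed(total, viewed, k, rng=random):
    """Draw random question ids in 1..total, skipping ones already viewed.

    Returns the picks and how many lookups were spent, misses included.
    """
    if total - len(viewed) < k:
        raise ValueError("not enough unviewed questions")
    picks, lookups = set(), 0
    while len(picks) < k:
        candidate = rng.randint(1, total)
        if candidate in picks:
            continue  # duplicate draw, no table lookup needed
        lookups += 1  # stands in for one read against the have-read table
        if candidate not in viewed:
            picks.add(candidate)
    return sorted(picks), lookups
```

With 200,000 questions and a user who has viewed half of them, expected_reads_per_hit(100000, 200000) is 2.0, so on average every second lookup is a miss. The figure climbs steeply as the viewed fraction approaches 100%, which is the point where haven't-read pairs start to pay off.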
I hope this helps with your design considerations. Drop a comment if you need clarification!