I have been reading the Amazon DynamoDB documentation to compare Global Secondary Index (GSI) and Local Secondary Index (LSI). I am still unclear whether, in the use case below, it matters which one I use. I am familiar with the basics, such as the fact that an LSI must use the same partition key as the table.
Here is the use case:
I already know the sort key for my index.
My partition key is the same in both cases.
I want to project ALL the attributes from the original table onto my index.
I already know, prior to creating the table, what index I need for my use case.
In the above use case, as far as I can tell there is no difference apart from a minor latency gain with the LSI versus the GSI, because the LSI data lives in the same partition as the base table. I want to understand the pros and cons for my use case.
Here are some questions I am trying to find answers to; I have not come across a blog post that is explicit about them:
Should I use a GSI only when the partition key is different?
Should I use a GSI when the partition key is the same but I did not know at table-creation time that I would need such an index?
Are there any other major reasons why one is superior to the other (beyond the basics, such as the per-table limit of 5 LSIs versus 20 GSIs)?
There are two more key differences that are not mentioned. You can see a full comparison between the two index types in the official documentation.
If you use an LSI, you can have a maximum of 10 GB of data per partition key value (the table plus all of its LSIs). For some use cases, this is a deal breaker. Before you use an LSI, make sure this isn't the case for you.
LSIs allow you to perform strongly consistent queries. This is the only real benefit of using an LSI.
The AWS general guidelines for indexes say
In general, you should use global secondary indexes rather than local secondary indexes. The exception is when you need strong consistency in your query results, which a local secondary index can provide but a global secondary index cannot (global secondary index queries only support eventual consistency).
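As a rough illustration of that consistency difference, here is a minimal boto3 sketch; the table, key, and index names are assumptions for the example, not anything from the question:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table whose partition key is CustomerId

# A strongly consistent query against a local secondary index is allowed.
table.query(
    IndexName="ByCreatedAt-LSI",   # hypothetical LSI: same partition key, different sort key
    KeyConditionExpression=Key("CustomerId").eq("c-123"),
    ConsistentRead=True,
)

# The same call against a global secondary index is rejected with a ValidationException,
# because GSI queries only support eventually consistent reads.
table.query(
    IndexName="ByCreatedAt-GSI",   # hypothetical GSI with the same key schema
    KeyConditionExpression=Key("CustomerId").eq("c-123"),
    ConsistentRead=True,
)
```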
You may also find this SO answer a helpful discussion of why you should prefer a GSI over an LSI.
Related
I am creating a DDB table in which multiple values make up the partition key and the sort key. The primary key is a composite of the partition key and the sort key.
The partition key would be something like region+date+location and the sort key would be zone+update timestamp millis.
What's the norm for naming these attributes? Is it just spelling out the values, like region+date+location? Or some other kind of delimiting? I've also read that it might be better to be generic and just name them something like partitionKey and rangeKey, or <typeofthing>id, etc., but I've gotten a little pushback on this from my team because the names aren't helpful in that case.
I can't seem to find best practices for this specific question anywhere. Is there a preferred approach for this written down somewhere that I could point to?
There is no "standard" way of naming attributes. But there are two things to consider against mandating a naming standard like region+date+location:
A very long attribute name is wasteful: you need to send it over the network when writing and reading items, and it is included in the item length, for which you pay per operation and for storage. I'm not saying you should name your attributes "a" and "b", but try not to go overboard in the other direction either.
An attribute name like region+date+location implies that this attribute contains only this combination, and will forever contain only this combination. But often in DynamoDB the same attribute name is reused for multiple different types; this reuse is the hallmark of "single-table design". That being said, these counterexamples aren't too relevant to your use case, because the attributes overloaded in this way are usually not the key attributes, as they are in your case.
In your case I think that whatever you decide will be fine. There is no compelling reason to stay away from one of the options you mentioned.
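For concreteness, here is a small illustrative sketch of the two naming styles mentioned in the question; every attribute name and value below is invented for the example:

```python
# Descriptive composite names that spell out what each key contains:
item_descriptive = {
    "regionDateLocation": "us-east#2023-09-01#warehouse-7",   # partition key
    "zoneUpdatedAtMillis": "zone-3#1693526400000",            # sort key
    "status": "ACTIVE",
}

# Generic names, with the composition documented in application code instead:
item_generic = {
    "pk": "us-east#2023-09-01#warehouse-7",
    "sk": "zone-3#1693526400000",
    "status": "ACTIVE",
}
```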
The limits on DynamoDB partition and sort keys are such that if I want to create a table with lots of users (e.g. the entire world population), then I can't just use a unique partition key to represent the personId; I need to use both the partition key and the sort key to represent a personId.
$util.autoId() in AppSync returns a 128-bit UUID as a String. If I want to use this as the primary key in the DynamoDB table, then I need to split it into two Strings, one being the partition key and the other being the sort key.
What is the best way to perform this split? Or if this is not the best way to approach the design, how should I design it instead?
Also, do the limits on partition and sort keys apply to secondary indexes as well?
Regarding $util.autoId(): since it's generated randomly, if I call it many times, is there a chance that it will generate two IDs that are exactly the same?
I think I'm misunderstanding something in your question's premise, because to my brain, AppSync's $util.autoId() gives you back a 128-bit UUID. The point of UUIDs is that they're unique, so you can absolutely have one UUID per person in the world. And the UUID string easily fits within DynamoDB's maximum partition key length.
You also asked:
if I call it many times, is there a chance that it will generate two IDs that are exactly the same?
It's extremely unlikely. A version-4 UUID has 122 random bits, so by the birthday bound you would need to generate on the order of 2^61 IDs before a collision becomes at all likely.
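To make the premise concrete: a UUID string is only 36 characters, far below the partition key size limit, so it can be used as the key directly with no splitting. A minimal boto3 sketch, where the table and attribute names are assumptions:

```python
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
people = dynamodb.Table("People")  # hypothetical table with a single partition key, personId

person_id = str(uuid.uuid4())      # 36-character UUID string, analogous to $util.autoId()

people.put_item(Item={"personId": person_id, "name": "Ada"})
```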
My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) that stores UUIDs. Or you could store the UUID as a pair of INT64s or as a BYTES(16). There are some pitfalls to using UUIDs, so read the details of this answer.
If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds that you get at least one collision reach about 50% once you have roughly 4B users, and they increase very quickly from there. If you assign a UserId that has already been assigned to a previous user, your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction after generating a new random number); a sketch of this retry loop follows after the list.
If there's some column, MyColumn, in the Users table that you would like to have as the primary key (possibly because you know you'll want to look up entries by this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use that as your primary key. In mathematical terms, you could use an automorphism on the key values, which has the effect of chaotically scrambling them while still never assigning the same value twice. In this case, you wouldn't need to store MyColumn separately at all; you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure; it just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example, if your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key (a sketch of this follows after the list). If you have a more interesting use case, you could use an encryption algorithm like XTEA.
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function ensures that you don't create a hotspot by allocating all of your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions; SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend on the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine; if you have 100 nodes, then 32 shards is a reasonable value to try. A sketch of the shard computation also follows after the list.
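For option 2, here is a minimal sketch of the random-assignment-with-retry idea; insert_user is a hypothetical callable standing in for your datastore's insert, assumed to return False when the chosen id already exists:

```python
import secrets

MAX_INT64 = 2**63 - 1

def assign_user_id(insert_user, profile, max_attempts=5):
    """Pick a random 63-bit id; retry on the rare collision reported by insert_user."""
    for _ in range(max_attempts):
        candidate = secrets.randbelow(MAX_INT64)   # uniformly random candidate id
        if insert_user(candidate, profile):        # hypothetical: False means the id is taken
            return candidate
    raise RuntimeError("could not assign a unique user id")
```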
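For 3a, a minimal sketch of the bit-reversal scramble; applying the function twice restores the original value, which is what makes it reversible in application code:

```python
def reverse_bits64(value: int) -> int:
    """Reverse the 64 bits of an unsigned integer; sequential inputs map to widely spread outputs."""
    result = 0
    for _ in range(64):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

assert reverse_bits64(reverse_bits64(123456)) == 123456
print([reverse_bits64(i) for i in (1, 2, 3)])  # ids 1, 2, 3 land far apart in the key space
```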
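And for 3b, a short sketch of the shard-prefix computation; the shard count is arbitrary here, and md5 is just a convenient stand-in for whatever hash function you pick:

```python
import hashlib

NUM_SHARDS = 8  # tune for your cluster size, as discussed above

def compound_key(my_column: str) -> tuple[int, str]:
    """Return (shard_id, my_column) so that writes spread across NUM_SHARDS key ranges."""
    digest = hashlib.md5(my_column.encode("utf-8")).digest()
    shard_id = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return shard_id, my_column

print(compound_key("2024-01-01T00:00:00Z"))  # timestamp-like keys no longer cluster together
```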
My users can create entries, each of which I want to automatically assign an ID number to. The entries are stored in a DynamoDB table and are created via Lambda functions and API Gateway.
The ID number must be unique, and the process of assigning it must be robust and guarantee uniqueness.
My thinking now is to use a "global" variable that starts at 1; every time an entry is created, it assigns the global variable's current value as that entry's ID and then increments the global variable.
Would this approach work, and if so what would the best approach be to implement it? Can you think of a better approach?
Your solution will not scale.
A global variable would need to be incremented in a concurrency-safe manner (to avoid race conditions) and would also need to be persisted so that increments survive application restarts.
To avoid exactly this problem, a common pattern is to use a UUID as your key. DynamoDB's Java SDK supports this pattern with the @DynamoDBAutoGeneratedKey annotation, which ensures each entry gets a randomly generated identifier.
You should be able to find a library to generate UUIDs if your preferred language is not Java.
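For example, if your Lambda functions are in Python, the standard library already covers this; a minimal sketch, where the table and attribute names are assumptions:

```python
import uuid
import boto3

table = boto3.resource("dynamodb").Table("Entries")  # hypothetical table name

def create_entry(payload: dict) -> str:
    entry_id = str(uuid.uuid4())                      # random identifier, no shared counter needed
    table.put_item(Item={"entryId": entry_id, **payload})
    return entry_id
```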
From the latest source code of MySQL (I'm not certain whether it's C or C++), how does it implement autoincrement? I mean, is it efficient in that it stores something like table metadata recording where it last left off, or does it have to do a table scan to find the greatest ID in use? Also, do you see any negative aspects of using autoincrement when you look at how it's implemented versus, say, PostgreSQL?
That will depend on which engine the database is using. InnoDB stores the largest value in memory, not on disk, which is very efficient. I would guess most engines do something similar, but I cannot guarantee it.
InnoDB's auto-increment runs the query below once, when the table is loaded, and keeps the counter in memory:
SELECT MAX(ai_col) FROM t FOR UPDATE;
Comparing that to PostgreSQL, which completely lacks an auto_increment of its own (at least it did the last time I used it; that may have changed), depends on how you would implement the field yourself. Most people would create a SEQUENCE, which appears to be stored as an in-memory pseudo-table. I'd take InnoDB's approach to be the simpler and better way, and I'd guess InnoDB would be more efficient if the two are not otherwise equal.