I have the following fields in RDS:
id auto increment
creationDate datetime
status string
reason string
description string
customerID string
There are multiple records for the same customerID + creationDate, so I can't use creationDate as the sort key. A customerID + status combination won't work either, since a customer can have duplicate records with the same status. And if I use the id field, I can't auto-increment it, since DynamoDB doesn't support that. What are my options here? What should my DynamoDB table look like?
The key to DynamoDB is knowing your access patterns. You haven't stated how you plan to query the data, so I can't advise on the overall design, but here is what you can do in order to have a unique primary key.
Do you really need auto-incrementing IDs? If not, consider using a UUID for all new data. Then you could use the ID field as the partition key; you could also use customerId as the partition key and id as the sort key.
If you must have auto-incrementing IDs, then you should store your creationDate in DynamoDB as an ISO 8601 string. Then, you can append a random UUID to the end of creationDate to avoid key collisions. This will allow you to use customerId and creationDate as the composite primary key, and you are still able to query using the date (but instead of checking for equality, you should use the begins_with function).
Finally, you can introduce a new field specifically to ensure uniqueness. You could call it simply rangeKey, and it would be a randomly generated UUID that you could use with any other field as the partition key. You can still have your sequential ID field (and create a GSI for querying it, if you want).
I've presented 3 solutions, but they are really all the same: find a way to add a UUID to your primary key.
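For example, here is a minimal boto3 sketch of the second option; the table name records is an assumption made for illustration:

    import uuid
    from datetime import datetime, timezone

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table: partition key customerId, sort key creationDate.
    table = boto3.resource("dynamodb").Table("records")

    def put_record(customer_id, status, reason, description):
        # ISO 8601 timestamp plus a UUID suffix keeps the sort key unique
        # even if the same customer gets two records at the same instant.
        sort_key = f"{datetime.now(timezone.utc).isoformat()}#{uuid.uuid4()}"
        table.put_item(Item={
            "customerId": customer_id,
            "creationDate": sort_key,
            "status": status,
            "reason": reason,
            "description": description,
        })

    def records_on_date(customer_id, date_prefix):
        # begins_with matches the date prefix, e.g. "2021-03-15",
        # and ignores the UUID suffix.
        return table.query(
            KeyConditionExpression=Key("customerId").eq(customer_id)
            & Key("creationDate").begins_with(date_prefix)
        )["Items"]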
I have a table user:
user_id -> unique, partition key
user_city -> primary sort key
Would the query perform a full scan, or would it benefit from the sort key?
Also, what would the results be if I used a GSI on user_city?
Pseudocode: fetch all user_id that have user_city="abc"
If your partition key is unique, you don't need a sort key, nor does having one provide any benefit. In fact, having one is a bad idea, because then your user_id no longer has to be unique. You'd also have to use Query() to return a user's information with just the user_id; GetItem() would need both user_id and user_city.
Simply define the table with user_id as the primary key.
Then create a GSI with a partition key of user_city.
You don't even need a sort key on the GSI unless you want the data returned in a particular order. Perhaps user_id or perhaps user_name.
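A minimal boto3 sketch of that layout; the GSI name user_city-index is an assumption:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("user")

    # This is a Query, not a Scan: only items whose user_city is "abc"
    # are read from the index.
    resp = table.query(
        IndexName="user_city-index",
        KeyConditionExpression=Key("user_city").eq("abc"),
        ProjectionExpression="#uid",  # fetch just the ids
        ExpressionAttributeNames={"#uid": "user_id"},
    )
    user_ids = [item["user_id"] for item in resp["Items"]]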
Say if I had a DynamoDB table:
UserId (hash): S
BookName (range): S
BorrowedTime: S
ReturnedTime: S
UserId is the primary key (hash), and I needed to set BookName as the sort key (range) because another item being added to the database was overwriting the previous one with the same UserId.
How would I go about creating a query using DynamoDBMapper when the fields being queried are the time fields (which are non-key attributes)? For instance, say I wanted to return the UserId and BookName of any book borrowed over 2 weeks ago that hasn't been returned yet.
Do I need to set up a GSI on both the BorrowedTime and ReturnedTime fields?
Yes, you can make a GSI using BorrowedTime and ReturnedTime, or you can use a Scan instead of a Query. If you use Scan you don't need to make a GSI, but Scan operations read the whole table, so they are not recommended on a large database or for frequent use.
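Here is a sketch of the Scan approach in boto3 rather than DynamoDBMapper (the table name books is an assumption); the same filter can be expressed with DynamoDBScanExpression in Java:

    from datetime import datetime, timedelta, timezone

    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table("books")

    # ISO 8601 strings compare in chronological order, so a plain string
    # comparison finds items borrowed before the cutoff.
    cutoff = (datetime.now(timezone.utc) - timedelta(weeks=2)).isoformat()

    resp = table.scan(
        FilterExpression=Attr("BorrowedTime").lt(cutoff)
        & Attr("ReturnedTime").not_exists(),
        ProjectionExpression="UserId, BookName",
    )
    overdue = resp["Items"]  # pagination via LastEvaluatedKey omitted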
I need a way to make an email and username unique across all partitions of a table (or tables if needed). I can't seem to find a way other than making only one of them unique (the primary key), with the other unique within its partition only.
I want to have an email address check so that each user has both a UNIQUE email AND a UNIQUE username.
So the database CANNOT have:
email     username
a@a.com   aa
b@b.com   aa
OR:
email     username
a@a.com   a
a@a.com   b
I need both to be independently unique across the entire system/database.
How is this done? I am using Lambda and DynamoDB.
And I also NEED to know, independently, which one is NOT UNIQUE.
My understanding is that you want to ensure that user_names are unique AND email_addresses are unique, and that a user_name maps to 1 and only 1 email_address and an email_address maps to 1 and only 1 user_name.
One way to do this would be to use two DynamoDB tables. The first (table A) would use the user_name as the HASH and the record associated with it would contain all information about that user. The second table (table B) would use email_address as the HASH and would contain a single additional attribute, the user_name.
When creating a new user, you could do a conditional put on table A with a condition of attribute_not_exists(user_name). If this fails, the user_name already exists and so the new record would not be created. If it succeeds, the user_name was unique. You could then do a conditional put to table B with a condition of attribute_not_exists(email_address). If this fails, the email_address is already in use, and you would either have to delete the record from table A or otherwise resolve the email address conflict with the user. If the conditional put succeeds, then you know the email_address is unique and you have successfully created a new, unique user record.
This is a bit more complicated but it does allow you to rely on DynamoDB to guarantee uniqueness and consistency rather than try to achieve that at the application level.
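A minimal boto3 sketch of those two conditional puts; the table names users_by_name and users_by_email are assumptions:

    import boto3
    from botocore.exceptions import ClientError

    ddb = boto3.resource("dynamodb")
    table_a = ddb.Table("users_by_name")    # HASH: user_name
    table_b = ddb.Table("users_by_email")   # HASH: email_address

    def create_user(user_name, email_address):
        try:
            table_a.put_item(
                Item={"user_name": user_name, "email_address": email_address},
                ConditionExpression="attribute_not_exists(user_name)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return "user_name is not unique"  # we know which one failed
            raise
        try:
            table_b.put_item(
                Item={"email_address": email_address, "user_name": user_name},
                ConditionExpression="attribute_not_exists(email_address)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                # Roll back the first write so the tables stay consistent.
                table_a.delete_item(Key={"user_name": user_name})
                return "email_address is not unique"
            raise
        return "ok"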
DynamoDB uniqueness is on the hash key (or the composite key: hash + range) only.
I think the best option in this case is to ensure uniqueness at the application level (add a GSI on username and query for the new username before writing). Checking uniqueness on email is easy, since it is the table's hash key.
I am trying to create a table to store invoice line items in DynamoDB. Let's say the item is defined by CompanyCode, InvoiceNumber, LineItemId, amount, and other line item details.
A unique item is defined by the combination of the first 3 attributes. Any 2 of those attributes can be the same for different items. What should I select as the Hash Attribute and the Range Attribute?
Some Intro
For efficiency I would propose a totally different design. With NoSQL databases (and DynamoDB is no different) we always need to consider the access patterns first. Also, if possible, we should strive to fit all our data within the same table and a few indexes. From what we have from the OP and his comments, these are the two access patterns:
For a company X, get complete invoice Y (including all items or a range of items) [based on this comment]
Get all invoices for company X [based on this comment]
We now wonder what a good Primary Key is. This translates to: what is a good Partition Key (PK), what is a good Sort Key (SK), and which secondary indexes do we need to create, and of what kind (local or global)? Some reminders:
A Primary Key can be a single column or composite
Composite primary key consists of Partition Key and Sort Key
The Partition Key is used as input to the hashing function that determines the partition where the items are stored
Sort key can also be composite, which allows us to model one-to-many relationships in DynamoDB as given in one of the comments links: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
When creating a query on the table or an index, you always need to use the '=' operator on the Partition Key
When querying ranges on the Sort Key you have the KeyConditionExpression option, which provides you with a set of operators for sorting and everything in between (one of them being the function begins_with(a, substr))
You are also allowed to use FilterExpression if you need to further refine the Query results (filter on the projected attributes)
Local Secondary Indexes (LSI) have the same Partition Key but a different Sort Key than your original table and give you a different view of your data, organized according to an alternative Sort Key
Global Secondary Indexes (GSI) have a different Partition Key and a different Sort Key than your original table and give you a completely different view of the data
All items with the same partition key are stored together, and for composite Primary keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.
Back To Modeling
It is obvious that we are dealing with multiple entities that need to be modeled and fit into the same table. To satisfy the condition of the Partition Key being unique on the table, CompanyCode comes as a natural Partition Key, so I would ensure that it is unique. If it is not, you need to ask yourself how you can model the second access pattern.
Assuming we have established uniqueness on CompanyCode, let's simplify and say that it comes in the form of an e-mail (it could be a domain or just a code, but I will use email for demonstration).
Relationship between Company and Invoices is always 1:many.
Relationship between Invoice and Items is always 1:many.
I propose the following design:
With PK being CompanyCode and SK being InvoiceNumber, we can store all attributes about that invoice for that company.
Nothing prevents me from also adding a record where the SK is Customer, which allows me to store all attributes about the company.
With GSI1, we create a reverse lookup where GSI1PK is my table's SK (InvoiceNumber) and GSI1SK is my table's PK (CompanyCode).
I am using the same table to store line items, with PK being LineItemId and SK being CompanyCode (still unique).
For Item entity items, GSI1PK is still InvoiceNumber, and GSI1SK is LineItemId, which is the table's PK, so it works the same as for Invoice entity items.
Now the access patterns supported with this:
If I want to get invoice Y for company X and all its items (access pattern 1): Query the table where CompanyCode=X and use KeyConditionExpression with the = operator on the Sort Key InvoiceNumber. If I want to get all the items tied to that invoice, I project the Items attribute using ProjectionExpression.
Having retrieved all the items with the previous query for company X and invoice Y, I can now run a BatchGetItem API call (using my unique composite key LineItemId+CompanyCode) on the table to get all items belonging to that particular invoice of that particular customer. (This comes with some constraints of the BatchGetItem API.)
To support access pattern 2, I do a query with CompanyCode=X on the PK and use KeyConditionExpression on the SK with the begins_with(a, substr) function/operator to get only the invoices for company X, and not the metadata about that company. That gives me all invoices for the given company/customer.
Additionally, with the GSI1 above, for any given InvoiceNumber I can easily select all the line items that belong to that particular invoice. REMEMBER: the key values in a global secondary index do not need to be unique, so in my GSI1 I could easily have invoice_1 -> (item_1, item_2) and another invoice_1 -> (item_1, item_2); the difference between the two items in the GSI would be in the SK (each would be associated with a different CompanyCode). For demonstration purposes I used invoice_1 and invoice_2.
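A boto3 sketch of these access patterns; the table name invoices, the GSI name GSI1, and the literal key values are assumptions for illustration:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("invoices")

    # Access pattern 1: one invoice for one company ('=' on the SK).
    invoice = table.query(
        KeyConditionExpression=Key("PK").eq("company_x@example.com")
        & Key("SK").eq("invoice_1")
    )["Items"]

    # Access pattern 2: all invoices for the company; begins_with on the
    # SK skips the Customer metadata record.
    invoices = table.query(
        KeyConditionExpression=Key("PK").eq("company_x@example.com")
        & Key("SK").begins_with("invoice_")
    )["Items"]

    # Reverse lookup via GSI1: all line items of a given invoice.
    items = table.query(
        IndexName="GSI1",
        KeyConditionExpression=Key("GSI1PK").eq("invoice_1"),
    )["Items"]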
I believe the first option offered by @georgeaf99 won't work, because if you do it that way, then CompanyCode has to be unique in the table. Therefore, there would only be one item allowed per company. I think the second solution is the only real way to do it.
You can use CompanyCode as the Hash Key, and then all other fields that combine to make the item unique (in this case InvoiceNumber and LineItemId) need to be somehow combined into one value (such as concatenation with a field delimiter), which would be your Range Key. Unfortunately that is kind of ugly, but that's the nature of a NoSQL database like DynamoDB. However, it will allow you to successfully store records with the correct uniqueness. When reading the records back, if you don't want to parse the combined field back out to its individual parts, then you'll have to add additional separate fields for InvoiceNumber and LineItemID.
If you don't have a large number of invoices per company, you can query by only the Hash Key and do the filtering on the client side. If you have a large number of invoices per company and need to be able to query only the items for a single invoice, then I would create a secondary index on CompanyCode and InvoiceNumber.
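A boto3 sketch of the concatenation approach; the attribute name InvoiceLineItem, the '#' delimiter, and the table name line_items are assumptions:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("line_items")

    def put_line_item(company_code, invoice_number, line_item_id, amount):
        table.put_item(Item={
            "CompanyCode": company_code,                            # hash key
            "InvoiceLineItem": f"{invoice_number}#{line_item_id}",  # range key
            # Duplicate the parts as plain attributes so readers don't
            # have to parse the combined field.
            "InvoiceNumber": invoice_number,
            "LineItemId": line_item_id,
            "Amount": amount,
        })

    # With this layout, a single invoice's items can also be fetched
    # without a secondary index, via begins_with on the range key:
    resp = table.query(
        KeyConditionExpression=Key("CompanyCode").eq("ACME")
        & Key("InvoiceLineItem").begins_with("INV-42#")
    )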
As I'm sure you have figured out, you cannot have more than two attributes forming your primary key (hash + range). Thus, depending on the type of queries you will be performing and the size of your data, you can structure your table in different ways.
(Optimized for the query types you mentioned above: by CompanyCode alone, and by all 3 attributes)
Best solution for small/medium-size data sets:
Hash Key: CompanyCode
Perform the query using only CompanyCode, and then filter your results on the other two attributes
Optimal solution for large data sets:
Hash Key: CompanyCode
Range Key: InvoiceNumber+LineItemId
This allows you to query using only the table's keys, but the table structure is pretty ugly
Our schema has a USER table...
USER(
userId,
firstname,
lastname,
email)
and we want to ensure all users have unique email addresses. Is it possible to create a unique index in VoltDB to enforce this constraint?
VoltDB supports primary key indexes (which are always unique) as well as secondary indexes that can be defined as unique.
For your particular table you have two choices to enforce uniqueness on the email column:
Define the USER table as replicated.
Partition the USER table on the email column.
If you create a unique index on email and partition the table on userId then the uniqueness enforcement of the email column will be within individual partitions.
VoltDB provides implicit indexes for primary keys. For example, if you assign userID as the primary key, then userID will be unique (because of VoltDB's implicit index on the primary key), but to make the email column unique you have to explicitly add the UNIQUE constraint on the email column.
Similarly, suppose the table is partitioned on the userID column; then, to enforce email uniqueness within every partition, you should explicitly add the ASSUMEUNIQUE constraint on the email column.
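A sketch of the corresponding DDL; the column sizes and index name are assumptions (and note that USER may be a reserved word in your SQL parser, in which case rename the table):

    -- Replicated table (choice 1): a plain UNIQUE index on email works.
    CREATE TABLE user (
      userId    INTEGER      NOT NULL,
      firstname VARCHAR(64),
      lastname  VARCHAR(64),
      email     VARCHAR(128) NOT NULL,
      PRIMARY KEY (userId)          -- implicit unique index on userId
    );
    CREATE UNIQUE INDEX idx_user_email ON user (email);

    -- If the table is partitioned on userId instead, use ASSUMEUNIQUE:
    -- PARTITION TABLE user ON COLUMN userId;
    -- CREATE ASSUMEUNIQUE INDEX idx_user_email ON user (email);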