DynamoDB - how to query by something that is not the primary key - amazon-web-services

So, I have a table on DynamoDB with this structure:
- userId as the primary key (it's a UUID)
- email
- hashedPassword
I want to, as someone is signing up, find out if there's already someone using that email.
This should be easy but, as far as I know, you can't query DynamoDB unless you use the primary key or the sort key as parameters (and I'm not sure it would make sense to make email a sort key).
The other way I found was using a Global Secondary Index, which is pretty much an index table you create using another field as the primary key, sort of, but this is billable and since I'm still developing and testing I didn't want to incur expenses.
Does anyone have another option? Or am I wrong and there's another way to do it?

Like other answers, I also think that GSI is the best option here.
But I would like to also add that since search capabilities of DynamoDB are very limited, it is not uncommon to use DynamoDB with something else for that very purpose. One such use case is described in the AWS blog:
Indexing Amazon DynamoDB Content with Amazon Elasticsearch Service Using AWS Lambda
The main querying capabilities of DynamoDB are centered around lookups using a primary key. However, there are certain times where richer querying capabilities are required. Indexing the content of your DynamoDB tables with a search engine such as Elasticsearch would allow for full-text search.
Obviously, I don't recommend using ES over GSI in your scenario. But it is worth knowing that DynamoDB can be, and is often, used with other services to extend its search capabilities.

Even if you put email as a sort key alongside userId as the partition key, you can't query using only email (unless it's a Scan operation). You don't want to use Scan to check whether an email exists in your table; that's like iterating over every value by scanning the whole table.
I think your best option is a global secondary index. Another option would be creating a new table that only includes email values, but in that case you would have to write to and maintain multiple tables, which is unnecessary.
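As a rough illustration of the GSI approach (Python/boto3; the "Users" table name and "email-index" GSI name are assumptions, not from the question), a sign-up check could look something like this:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # assumed table name

def email_exists(email):
    # Query the assumed GSI whose partition key is "email"
    resp = users.query(
        IndexName="email-index",  # assumed GSI name
        KeyConditionExpression=Key("email").eq(email),
        Select="COUNT",
        Limit=1,
    )
    return resp["Count"] > 0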

The other way I found was using a Global Secondary Index, which is pretty much an index table you create using another field as the primary key, sort of, but this is billable and since I'm still developing and testing I didn't want to incur expenses.
As @Ersoy has said, a GSI is the legit solution, even if it will increase the consumed write units.
DynamoDB is cheap for a low-traffic app and/or a test environment, but to keep these expenses flat, you can:
Use DynamoDB Local during local dev/tests and CI builds
Choose the provisioned capacity mode for your table (you may find its free tier interesting)
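For example (a sketch, assuming DynamoDB Local is running on its default port 8000), boto3 can be pointed at the local endpoint so dev and test runs cost nothing:

import boto3

# Assumes DynamoDB Local is already running, e.g.:
#   docker run -p 8000:8000 amazon/dynamodb-local
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # local endpoint instead of AWS
    region_name="us-east-1",               # any region string works locally
    aws_access_key_id="local",             # dummy credentials for local use
    aws_secret_access_key="local",
)

print([t.name for t in dynamodb.tables.all()])  # lists tables in the local instance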

Related

Best practice of using Dynamo table when it needs to be periodically updated

In my use case, I need to periodically update a Dynamo table (like once per day). And considering lots of entries need to be inserted, deleted or modified, I plan to drop the old table and create a new one in this case.
How could I make the table queryable while I recreate it? Which API should I use? It's fine if queries keep hitting the old table during the rebuild, so that customers won't experience any outage.
Is it possible to have something like a version number for the table so that I could perform a rollback quickly?
I would suggest using table names with a common suffix (some people use a date, others use a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster or a third party solution such as Consul).
Automate the creation and insertion of data into a new DynamoDB table. Then update the config store with the name of the newly created DynamoDB table. Allow enough time to switchover, then remove the previous DynamoDB table.
You could do the final part by using Step Functions to automate the workflow, with a Wait of a few hours to ensure that nothing is still using the old table. In fact, you could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
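As a rough sketch of the config-store idea (the parameter name "/myapp/active-table-name" is made up for illustration), the application resolves the active table name from SSM Parameter Store on each use, so switching tables is just a parameter update:

import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")

def get_active_table():
    # Hypothetical parameter holding the name of the table currently in use
    param = ssm.get_parameter(Name="/myapp/active-table-name")
    return dynamodb.Table(param["Parameter"]["Value"])

def switch_to(new_table_name):
    # Called by the rebuild job once the new table is fully loaded
    ssm.put_parameter(
        Name="/myapp/active-table-name",
        Value=new_table_name,
        Type="String",
        Overwrite=True,
    )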

How do I avoid hot partitions when using DynamoDB row level access control?

I’m looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control I was using deviceId as the partition key, since it is effectively random and so partitions well, but now I think I have to move that to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device Ids for a single provider, then you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10GB of data or you end up with a hot partition... then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in the Multi-Tenant SaaS Storage Strategies whitepaper.
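A rough sketch of that write-sharding idea (Python/boto3; the "Devices" table, attribute names, and shard count are assumptions): each provider is spread across N partition-key values, and listing a provider's devices fans out across the shards.

import hashlib
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
devices = dynamodb.Table("Devices")  # assumed schema: PK = providerShard, SK = deviceId
NUM_SHARDS = 4                       # pick based on expected provider size

def shard_key(provider_id, device_id):
    # Stable hash so a given device always maps to the same shard
    shard = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{provider_id}#{shard}"

def list_device_ids(provider_id):
    ids = []
    for shard in range(NUM_SHARDS):
        # Pagination (LastEvaluatedKey) omitted for brevity
        resp = devices.query(
            KeyConditionExpression=Key("providerShard").eq(f"{provider_id}#{shard}")
        )
        ids.extend(item["deviceId"] for item in resp["Items"])
    return ids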

DynamoDB table/index schema design for querying multi-valued attributes

I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
email: "foo#foo.com",
... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "main", // sort key
email: "foo#foo.com",
... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "Email-2", // sort key
email: "bar#bar.com"
// no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
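For instance, the swap could be done atomically with a TransactWriteItems call (a sketch; the "Users" table name and sort-key values mirror the examples above, and the condition expressions are illustrative):

import boto3

client = boto3.client("dynamodb")

def swap_primary_email(user_id, old_primary, new_primary):
    # Both updates succeed or both fail, so the two items never disagree
    client.transact_write_items(
        TransactItems=[
            {
                "Update": {
                    "TableName": "Users",  # assumed table name
                    "Key": {"userId": {"S": user_id}, "itemTypeAndIndex": {"S": "main"}},
                    "UpdateExpression": "SET email = :new",
                    "ConditionExpression": "email = :old",
                    "ExpressionAttributeValues": {
                        ":new": {"S": new_primary},
                        ":old": {"S": old_primary},
                    },
                }
            },
            {
                "Update": {
                    "TableName": "Users",
                    "Key": {"userId": {"S": user_id}, "itemTypeAndIndex": {"S": "Email-2"}},
                    "UpdateExpression": "SET email = :old",
                    "ConditionExpression": "email = :new",
                    "ExpressionAttributeValues": {
                        ":new": {"S": new_primary},
                        ":old": {"S": old_primary},
                    },
                }
            },
        ]
    )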
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with the same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able.
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise not to use DynamoDB. I am not saying you can't achieve this. However, I see a few problems that will eventually come your way if you go down this route.
Using a sort key on email-id will result in duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except for the email-id attribute.
What if a new use case comes along in the future, where you also want to search for a user based on some other attribute (for example cell phone number, assuming a user may have more than one cell phone number)?
DynamoDB has a hard limit on the number of secondary indexes you can create per table, i.e. 5.
Thus, as search criteria keep growing, this solution will easily become a bottleneck for your system. As a result, your system may not scale well.
To the best of my knowledge, I can suggest a few options that you may choose, based on your requirements/budget, to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when a user registers.
Enable DynamoDB Streams on the UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch (a minimal sketch follows after these steps).
Now in your application, use DynamoDB for fetching user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location, etc.) fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field with millisecond latency.
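Such a Lambda handler might look roughly like this (a sketch using 7.x-style calls from the elasticsearch Python client; the endpoint, index name, and key field are assumptions, and batching/error handling are omitted):

from boto3.dynamodb.types import TypeDeserializer
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://my-es-endpoint:443"])  # assumed Elasticsearch endpoint
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["userId"]["S"]
        if record["eventName"] == "REMOVE":
            es.delete(index="users", id=doc_id, ignore=[404])
        else:
            # Convert the DynamoDB-typed image into a plain dict
            # (note: numbers deserialize to Decimal and may need conversion)
            image = record["dynamodb"]["NewImage"]
            doc = {k: deserializer.deserialize(v) for k, v in image.items()}
            es.index(index="users", id=doc_id, body=doc)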
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use-case where data are related, you may consider this option. Just to call out, Aurora is a SQL database.
Since this is relational storage, you can opt to organize the records in multiple tables and join them based on the primary keys of those tables.
I would suggest the 1st option because:
DynamoDB will provide you durable, highly available, low latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable and low latency storage.
With AWS Elasticsearch, you can run any search query on your data. You can also do analytics on it. A Kibana UI is provided out of the box, which you may use to plot the analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, you will be syncing these two databases in near real time [within a few milliseconds]
Your application will be scalable and the search feature can further be enhanced to do filtering on multi-level attributes. [One such example: search all users who belong to a given city]
Having said that, now I will leave this up to you to decide. 😊

Should I use a secondary index or separate ID lookup table in DynamoDB?

I'm migrating a database from MongoDB to DynamoDB and trying to understand best practices, especially with using local secondary indexes and sort keys.
My application pulls in html data from the web, and loads the data into several tables/collections. At the time of extraction it gives each item an extracted_id, unique to the website it's pulled from. Before loading the items, it gives each item a UUID as its primary/partition key.
Problem: In order to avoid assigning different uuids to the same extracted_id I query the db to check if the entity has a preexisting entity_uuid.
Current Solution: Currently in MongoDB, I have two sets of tables/collections: one for storing all items, and one for storing an entity's extracted_id (as key) / entity_uuid (as value) as a lookup table.
Better Solution?: As I move to DynamoDB, would it be better to create only one table, with extracted_id as a local secondary index, so as not to store duplicate data? I'm unsure, as the docs say to use indexes sparingly. I don't use the extracted_id for anything other than providing items with their uuid for a given site.
Hopefully this makes sense, I'm new to AWS / DynamoDB and would appreciate any tips / better solutions to the ones mentioned.
Why not just make extracted_id the partition key of your new DynamoDB table and use a ConditionExpression attribute_not_exists(extracted_id) to prevent your application from writing duplicate entries?
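In boto3 that could look roughly like this (a sketch; the "Entities" table name is an assumption, and the exception handling shows how a duplicate write surfaces):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Entities")  # assumed table name

def put_if_new(extracted_id, entity_uuid, attrs):
    try:
        table.put_item(
            Item={"extracted_id": extracted_id, "entity_uuid": entity_uuid, **attrs},
            # Rejects the write if an item with this partition key already exists
            ConditionExpression="attribute_not_exists(extracted_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # an item for this extracted_id already exists
        raise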

DynamoDb table design: Single table or multiple tables

I’m quite new to NoSQL and DynamoDB, and I’m used to RDBMS. I’m designing the database for a game and we're using DynamoDB and AWS Lambda for our backend. I created a table named “Users” for player profiles that contains the user information and resources. Because the game has an inventory system, I also created a table named “UserItems”.
It’s all good until I realized DynamoDB doesn’t have transactions, and any operation that is executed on both tables (for example using an item that increases a resource) has a chance of failing on one table while succeeding on the other, which will cause missing data that affects our customers.
So I was thinking maybe my multiple-table design is not good, since it’s a habit of mine to design multiple tables when I’m working with an RDBMS. That led me to think of storing the entire “UserItems” as a hash (map) in “Users”, but I’m not sure this is good practice because the size of a single row in the Users table will be really big (we may have 500 unique items per user), and each time I pull or put data from/to “Users” (most of the time I don’t need the “UserItems” data) the read/write throughput will also be really large.
What should I do, keep the multiple tables design and handle transaction manually or switch to single table design? Or maybe there is a 3rd option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold will be deducted, and a new UserItem will be added to the UserItems table.
User sells an item: Users.Gold will be increased, and the item will be deleted from the UserItems table.
In both scenarios above, I will have to do 2 update operations on 2 tables, and without a transaction there is a chance one of them fails.
To solve that, I'm considering the single-table solution, which is a single Users table with 4 columns: UserId (key), Username, Gold, UserItems. However, there are two things I'm worried about:
Data in UserItems might become too big for a single attribute, because one user could have up to 500 items.
To add/delete an item I have to pull UserItems from DynamoDB, add/delete the item, and then put it back into Users. So I have to do 1 read and 1 write operation for 1 action, and because of issue (1) the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests using a single table:
As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.
Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.
NoSQL databases are best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, then you are defeating the whole purpose of it. If performance is what matters most, then you should consider having only a single table for your use case. DynamoDB supports range keys and also supports secondary indexes. For your use case, it would be better to redesign your table to use range keys.
If you can share more details about your current table, maybe I can help you with more input.
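As a rough illustration of that single-table idea (the key and attribute names here are assumptions, not the asker's actual schema): keep the profile and the inventory items under the same UserId partition key and distinguish them with the sort key, so one Query can return a whole user, or just their items.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Game")  # assumed single table: PK = UserId, SK = SK

def load_user(user_id):
    # One query fetches the profile item plus every item row for the user
    resp = table.query(KeyConditionExpression=Key("UserId").eq(user_id))
    profile, items = None, []
    for it in resp["Items"]:
        if it["SK"] == "PROFILE":
            profile = it
        elif it["SK"].startswith("ITEM#"):
            items.append(it)
    return profile, items

def load_items_only(user_id):
    resp = table.query(
        KeyConditionExpression=Key("UserId").eq(user_id) & Key("SK").begins_with("ITEM#")
    )
    return resp["Items"]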