DynamoDB query by secondary index only

I store user accounts in DynamoDB:
{
email: 'user1@xxx.com',
expires: 1548807053247,
}
My hash key is the email field.
I want to add a daily cron job which will send a reminder email for all accounts about to expire (in the next 14 days).
For that, I need to query on expires field alone - without using the hash key.
I assume I need to define a secondary index on this field (probably global and not local?), but I'm not sure on how to write the proper query for it.
I'm using AWS.DynamoDB.DocumentClient for accessing the table, thanks in advance!

Just specify the IndexName in addition to the TableName when you call the Query API. The rest is the same as if you were querying the table.
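For example, with the DocumentClient (names here are hypothetical; note that a Query against a GSI still needs an equality condition on the index's partition key):

// Minimal sketch, assuming a GSI named "expires-index" whose partition
// key is an attribute you can match by equality. See the next answer
// for how to get such a key when all you have is "expires".
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.query({
  TableName: 'accounts',      // hypothetical table name
  IndexName: 'expires-index', // hypothetical index name
  KeyConditionExpression: 'somePartitionKey = :v', // placeholder key attribute
  ExpressionAttributeValues: { ':v': 'some-value' },
}).promise()
  .then(data => console.log(data.Items))
  .catch(console.error);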

I'm having the same problem very often.
What I do is add another attribute named "all" with a value of 1 (you can use any constant key/value), then create a GSI with:
partition key: all
sort key: expires
To optimise a bit, you could add this attribute only to active users, or you could add a Number attribute named active and use that as the partition key for the GSI.
This is a very inefficient partition key distribution, since all items belong to one partition, but I have found no other way around it. I'd be happy to hear another solution.
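With that index in place, the daily job's query might look like this (a sketch; the table and index names are assumptions, and expires is taken to be epoch milliseconds as in the question):

// Find accounts expiring in the next 14 days via the "all"/expires GSI.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const now = Date.now();
const inFourteenDays = now + 14 * 24 * 60 * 60 * 1000;

docClient.query({
  TableName: 'accounts',          // hypothetical table name
  IndexName: 'all-expires-index', // hypothetical index name
  KeyConditionExpression: '#all = :one AND #expires BETWEEN :now AND :soon',
  ExpressionAttributeNames: { '#all': 'all', '#expires': 'expires' }, // aliases avoid reserved-word collisions
  ExpressionAttributeValues: { ':one': 1, ':now': now, ':soon': inFourteenDays },
}).promise()
  .then(data => console.log(data.Items))
  .catch(console.error);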

Related

Ensure GSI isn't duplicated in DynamoDB

My DynamoDB table currently has the PK, SK and a GSI called "EmailIndex".
The idea behind this is that when users create their account, the username will be set as the PK and the email will be the partition key of the EmailIndex GSI.
This should let me allow the user to login with either the username or the email.
Is it possible to reliably ensure there is no duplicated EmailIndex record? If for example, two different people enter the exact same email at the exact same time with different usernames, how can I ensure both users don't end up creating a record with the same GSI?
Right now I'm under the assumption that this isn't actually possible with DynamoDB. In which case, what would be the acceptable and recommended approach of allowing username or email logins instead of mandating one or the other?
There isn't a way to guarantee unique values on a GSI.
What I've done in the past to solve this is to keep two records for the user: one with the data you have today, and one that is keyed on the email address. When doing a Put operation you can use a transaction to be sure both records succeed.
You can still put the email in the GSI on the main record and use that for querying, or use the second record instead and duplicate the data (which is what the GSI would do if you have the projection set to ALL).
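A sketch of that pattern with the DocumentClient (table, key, and attribute names are assumptions):

// Create a user atomically together with a uniqueness record for the
// email. Both puts use attribute_not_exists, so the whole transaction
// fails if either the username or the email is already taken.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.transactWrite({
  TransactItems: [
    {
      Put: {
        TableName: 'users', // hypothetical table name
        Item: { PK: 'USER#alice', SK: 'USER#alice', email: 'alice@example.com' },
        ConditionExpression: 'attribute_not_exists(PK)',
      },
    },
    {
      Put: {
        TableName: 'users',
        Item: { PK: 'EMAIL#alice@example.com', SK: 'EMAIL#alice@example.com', username: 'alice' },
        ConditionExpression: 'attribute_not_exists(PK)',
      },
    },
  ],
}).promise()
  .then(() => console.log('user created'))
  .catch(err => console.error('username or email already taken', err));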

How to use dynamodb:LeadingKeys when the partition key has more than one kind of value

My DynamoDB tables use tenant_id as the partition key in my multi-tenant application, but the partition key also holds other types of entities in addition to tenant_id.
For example (this is a small example; we use this pattern throughout):
PK                     SK        Att
Customer-4312a674-54a  user-abc  672453782
user-abc               user-abc  672453782
I would like to use dynamodb:LeadingKeys to ensure the data of one tenant can never be accessed by another tenant. How can I go about that in this case, when the PK is overloaded and contains other entities as well?
In a multi-tenant system my recommendation would be to add the tenant-id as a prefix to the partition key of all items belonging to the tenant. That way you can use the dynamodb:LeadingKeys condition for access control.
The tenant-id should be known at query time for every query anyway, my guess is that it's probably stored in the session information. This means you can add the tenant-id to every Key and still do partition key overloading.
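The policy would look something like this (a sketch; the principal tag, the "-" separator, the action list, and the table ARN are all assumptions): it allows access only when every key the request touches begins with the caller's tenant-id prefix.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/app-table",
      "Condition": {
        "ForAllValues:StringLike": {
          "dynamodb:LeadingKeys": ["${aws:PrincipalTag/tenant_id}-*"]
        }
      }
    }
  ]
}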

DynamoDB table/index schema design for querying multi-valued attributes

I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
email: "foo#foo.com",
... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "main", // sort key
email: "foo#foo.com",
... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "Email-2", // sort key
email: "bar#bar.com"
// no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
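For instance, the lookup stays a single GSI query no matter which item holds the address (a sketch; the table and index names are assumptions):

// Find a user by any of their email addresses via the GSI.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.query({
  TableName: 'users',       // hypothetical table name
  IndexName: 'email-index', // hypothetical GSI with email as partition key
  KeyConditionExpression: 'email = :email',
  ExpressionAttributeValues: { ':email': 'bar@bar.com' },
}).promise().then(data => {
  // Each matching item carries the userId, whether it is the "main"
  // item or a secondary "Email-N" item.
  console.log(data.Items.map(item => item.userId));
});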
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise against using DynamoDB. I am not saying you can't achieve this; however, I see a few problems you will eventually run into if you go this route.
Using the sort key for email IDs results in duplicate records for the same user, i.e. if a user has registered 5 emails, that means 5 records in your table that are identical except for the email attribute.
What if a new use case comes along in the future where you also want to search for a user by some other attribute (for example a cell phone number, assuming a user may have more than one)?
DynamoDB has a hard limit on the number of secondary indexes you can create per table (five, at the time of writing).
Thus, as the search criteria grow, this solution easily becomes a bottleneck and your system may not scale well.
To the best of my knowledge, I can suggest a few options, to be chosen based on your requirements and budget, that address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when users register.
Enable DynamoDB Streams on the UserTable.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch (see the sketch after these steps).
Now in your application, use DynamoDB to fetch user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location, etc.) fetch the records from AWS Elasticsearch. AWS Elasticsearch indexes all the attributes of your records by default, so you can search on any field with milliseconds of latency.
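Such a Lambda function might look roughly like this (a sketch under stated assumptions: the ES_ENDPOINT environment variable, the "users" index name, and plain HTTPS are placeholders; a real function would sign requests with SigV4 or use an authorized client, and the stream must be configured to include new images):

// Read DynamoDB stream records and index them into Elasticsearch.
const https = require('https');

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === 'REMOVE') continue; // deletions handled separately
    const image = record.dynamodb.NewImage;      // DynamoDB-typed JSON
    const doc = {
      userId: image.userId.S,
      email: image.email && image.email.S,
    };
    // Index under the userId so later MODIFY events overwrite the document.
    await putDocument(`/users/_doc/${doc.userId}`, doc);
  }
};

function putDocument(path, doc) {
  const body = JSON.stringify(doc);
  return new Promise((resolve, reject) => {
    const req = https.request({
      host: process.env.ES_ENDPOINT, // hypothetical environment variable
      path,
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
    }, (res) => res.on('data', () => {}).on('end', resolve));
    req.on('error', reject);
    req.end(body);
  });
}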
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use case where the data are interrelated, you may consider this option. Just to call out: Aurora is a SQL database.
Since this is relational storage, you can organize the records in multiple tables and join them on those tables' primary keys.
I suggest the 1st option because:
DynamoDB will provide durable, highly available, low-latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable, and low latency.
With AWS Elasticsearch, you can run any search query on your data. You can also do analytics on it; a Kibana UI is provided out of the box, which you can use to plot analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, the two databases stay in sync in near real time [within a few milliseconds].
Your application will be scalable, and the search feature can be further enhanced to filter on multi-level attributes. [One such example: search all users who belong to a given city.]
Having said that, I will leave the decision up to you. 😊

How to query data in AWS AppSync in a specific range then sort its result by another key?

I created a table named BlogAuthor in AWS DynamoDB with the following structure:
authorId | orgId | age | name
Later I need to make a query like this: get all authors from organization id = orgId123 with age between 30 and 50, then sort their names in alphabetical order.
I'm not sure it's possible to perform such query in DynamoDB (later I'll apply it in AppSync), hence the first solution is to create an index (GSI) with partitionKey=orgId, sortKey=age (final name is orgId-age-index).
But next, when I query in DynamoDB with partitionKey orgId=orgId123, sortKey age=[30;50] and no filter, I get a list of authors. However, there is no way to sort that list by name from the above query.
I tried another solution by creating a new index with partitionKey=orgId and sortKey=name. Then I query (not scan) in DynamoDB with partitionKey orgId=orgId123, leave the sortKey empty (because we only want to sort by name rather than match a specific name), and filter age in the range [30;50]. This solution seems to work; however, I notice the filter is applied to the result list after the query: for example, with a result list of 100 items, applying the age filter may leave 70 items, or none. But I always want it to return 100 items.
Could you please tell me whether there is anything wrong with my approaches? Or is it possible to make such a query in DynamoDB?
Another (small) question, about connecting that table to an AppSync API: if such a query isn't possible in DynamoDB, then it isn't possible in AppSync either?
You are not going to be able to do everything you want in a single DynamoDB query.
Option 1:
You can do what you want as long as you are ok with sorting objects on the client (see the sketch after this list). This would work for organizations with a relatively small number of people.
Pros:
Allows you to efficiently query users in a particular organization within an age range.
Cons:
Results are not sorted by name on the server.
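A sketch of option 1 with the DocumentClient (the index name comes from the question; everything else is illustrative):

// Query the orgId-age-index for the age range, then sort client-side.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.query({
  TableName: 'BlogAuthor',
  IndexName: 'orgId-age-index',
  KeyConditionExpression: 'orgId = :org AND #age BETWEEN :lo AND :hi',
  ExpressionAttributeNames: { '#age': 'age' }, // alias to avoid reserved-word collisions
  ExpressionAttributeValues: { ':org': 'orgId123', ':lo': 30, ':hi': 50 },
}).promise().then((data) => {
  // DynamoDB returns the items sorted by age; sort them by name here.
  const authors = data.Items.sort((a, b) => a.name.localeCompare(b.name));
  console.log(authors);
});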
Option 2:
Use your second index (partition key orgId, sort key name) and filter on age, as you described.
Pros:
Allows you to paginate through users at an organization that are ordered by the name.
Cons:
You cannot efficiently get all users in an organization within an age range. You would effectively be scanning the index and would need multiple round trip calls.
Option 3:
A third option would be to stream information from DynamoDB into Elasticsearch using DynamoDB Streams and AWS Lambda. Once the data is in Elasticsearch, you can run much more advanced queries. You can see more information on the Elasticsearch search APIs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html
Pros:
Much more powerful query engine.
Cons:
More overhead with the DynamoDB stream and AWS Lambda function.

How to select a partition key for a DynamoDB query?

I have created a DynamoDB table named "sample". It has the columns below; CreatedDate holds the creation time of any record inserted into this table.
Itemid,
ItemName,
ItemDescription,
CreatedDate,
UpdatedDate
I am creating a python-flask based REST API which always fetches the last 100 records inserted into this table. This API (a python-flask function) does not take any input parameters; it should just return the most recently inserted records.
Question 1
What should be the partition key for this table? I am using the boto3 library to fetch records from DynamoDB. I prefer not to use a Scan operation because it may cause performance issues, but the query function requires a partition key, and since this REST API does not accept any input I am not sure what to use.
Question 2
Has anyone faced a similar situation? And what was done to fix it?
Note: I am pretty much a newbie to DynamoDB, NoSQL and Boto.
To query your table using CreatedDate without knowing the ItemId, you can use global secondary index write sharding: add an attribute (e.g., ShardId) containing a (0-N) value to every item, and use it as the global secondary index partition key.
Depending on how your items are distributed over CreatedDate, you can choose the ShardId so that access patterns are likely to be evenly distributed. For example: YYYY, YYYYMM or YYYYMMDD. Then create a global secondary index with ShardId as the index partition key and CreatedDate as the index sort key.
Knowing the primary key for your GSI (since the ShardId value is derived from CreatedDate), you can query the table for the 100 most recent items with the query's Limit parameter (paging via LastEvaluatedKey if your item set is larger than 1 MB).
See Using Global Secondary Index Write Sharding for Selective Table Queries.
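Putting it together, the query for the newest items might look like this (a sketch, shown with the JavaScript DocumentClient used elsewhere on this page; boto3's Table.query accepts the same parameter names, and the index name and YYYYMMDD shard scheme are assumptions):

// Fetch the most recent items from today's shard of the GSI.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const today = new Date().toISOString().slice(0, 10).replace(/-/g, ''); // e.g. "20190130"

docClient.query({
  TableName: 'sample',
  IndexName: 'ShardId-CreatedDate-index', // hypothetical GSI name
  KeyConditionExpression: 'ShardId = :shard',
  ExpressionAttributeValues: { ':shard': today },
  ScanIndexForward: false, // descending by CreatedDate, i.e. newest first
  Limit: 100,
}).promise().then((data) => console.log(data.Items));
// If today's shard holds fewer than 100 items, repeat the query against
// the previous day's shard and merge the results.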