Searching items in Amazon DynamoDB

Is it possible to get items from a DynamoDB table by querying against any attribute other than the primary key? In my table I have the product ID as the hash key and have not specified a range key. I want to add support for filters based on various attributes such as product price, product brand, available units in stock, etc., and while filtering I do not want to provide the product ID, since in most cases I won't know it. Coming from a SQL background, I assumed DynamoDB would also have some sort of 'where' clause for listing records that match certain criteria/attribute values. However, so far I haven't had any success.
After going through the Query and Scan documentation as well, I couldn't figure out how to optimally use these operations to suit my needs, or how to perform search/filter in my application without burning through my provisioned throughput capacity.
Any ideas as to how this can be done?

Create a Global Secondary Index on the attributes you want to query on. It will have its own capacity in both read and write units, as well as other considerations. If you need to add indexes to an existing table, AWS preannounced Online Indexing a couple of months ago, so look forward to more news on when that is released. If you need more than simple comparisons against these indexes (EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN), you may want to consider a search solution such as Amazon CloudSearch.
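For example, a query against such an index might look like the following boto3 sketch. The table name, index name, and attributes (Products, brand-price-index, brand, price) are illustrative assumptions, not from the question:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical "Products" table with a GSI "brand-price-index"
# (partition key: brand, sort key: price).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")

# Query the index instead of the base table; no product ID required.
response = table.query(
    IndexName="brand-price-index",
    KeyConditionExpression=Key("brand").eq("Acme") & Key("price").between(10, 50),
)
for item in response["Items"]:
    print(item)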

Related

Is there a way to query multiple partition keys in a DynamoDB table using the AWS dashboard?

I would like to know if there's an option to query a DynamoDB table with multiple partition keys in the AWS dashboard. I've been unable to find any article or similar request for the dashboard on the web. I'll keep you posted if I find an answer myself.
Thanks in advance.
The Console doesn't support this directly, because there is no support for it in the underlying API. What you're looking for is the equivalent of the following SQL query:
select *
from table
where PK in ('value_1', 'value_2') /*equivalent to: PK = 'value_1' or PK = 'value_2' */
The console supports the Query and Scan operations. Query always operates on a single item collection, i.e. all items that share the same partition key, which means it can't be used for your use case.
Scan, on the other hand, is a full table scan, which allows you to optionally filter the results. The console's filter options have no support for this kind of 'or' condition across partition key values, so that won't really help you. It will, however, allow you to view all items, which includes the ones you're looking for; but as I said, the exact query isn't really possible in the console.
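Outside the console, the usual workaround is to batch one lookup per key yourself. A minimal boto3 sketch, reusing the table and key names from the SQL above and assuming the table's primary key is just the partition key PK (with a composite key you would also need each item's sort key):

import boto3

dynamodb = boto3.resource("dynamodb")

# One request fetches both items by their partition keys.
response = dynamodb.batch_get_item(
    RequestItems={
        "table": {
            "Keys": [{"PK": "value_1"}, {"PK": "value_2"}],
        }
    }
)
items = response["Responses"]["table"]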

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that some customers have 5M rows and others 200K, and since BQ does not have indexes, queries are always processing everyone's data (and the costs are rising).
I intend to create a timestamp field where each customer gets a different fixed timestamp, repeated on every insert into each customer-sensitive table, so that I can filter by that timestamp as if it were a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but since it always does a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which is the closest equivalent to an index in other databases.
The BigQuery documentation on clustered tables covers the basics of how to use clustering fields.
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, BigQuery's dryRun doesn't show the cost improvement, which can only be seen post-execution.
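As a rough sketch with the Python client, a clustered table could be created like this. The project, dataset, table name, and schema are assumptions; a timestamp partition is included as well, since clustering was originally only available on partitioned tables:

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("idClient", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField("payload", "STRING"),
]
table = bigquery.Table("my-project.my_dataset.customer_data", schema=schema)
# Clustering on idClient means per-customer queries scan far less data.
table.time_partitioning = bigquery.TimePartitioning(field="created_at")
table.clustering_fields = ["idClient"]
client.create_table(table)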

How do I avoid hot partitions when using DynamoDB row-level access control?

I'm looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control I was using deviceId as the partition key, since its values are effectively random and therefore partition well, but now I think I have to move it to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
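A minimal sketch of that concatenated-key layout with boto3; the table and attribute names are illustrative assumptions:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Devices")

provider_id = "provider123"
device_id = "device456"

# Write and read a single record via one composite hash key; no range key.
table.put_item(Item={"pk": f"{provider_id}.{device_id}", "status": "active"})
item = table.get_item(Key={"pk": f"{provider_id}.{device_id}"}).get("Item")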
Edit
Since you want to be able to list device IDs for a single provider, you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in the Multi-Tenant SaaS Storage Strategies whitepaper.
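Here's a rough boto3 sketch of that sharded layout; the table name, attribute names, and shard count are assumptions, and pagination via LastEvaluatedKey is omitted for brevity:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Devices")

NUM_SHARDS = 4  # assumed; pick based on expected data volume per provider

def list_device_ids(provider_id):
    # Fan out one Query per shard of the provider's partition key and merge.
    device_ids = []
    for shard in range(1, NUM_SHARDS + 1):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{provider_id}.{shard}")
        )
        device_ids.extend(item["deviceId"] for item in resp["Items"])
    return device_ids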

How to query data in AWS AppSync in a specific range, then sort the result by another key?

I created a table named BlogAuthor in AWS DynamoDB with the following structure:
authorId | orgId | age | name
Later I need to make a query like this: get all authors from organization id = orgId123 with age between 30 and 50, then sort their names in alphabetical order.
I'm not sure it's possible to perform such a query in DynamoDB (later I'll apply it in AppSync), hence my first solution was to create an index (GSI) with partitionKey=orgId and sortKey=age (final name: orgId-age-index).
But then, when I query in DynamoDB with partitionKey orgId=orgId123, sortKey age=[30;50] and no filter, I get a list of authors; however, there is no way to sort that list by name.
I tried another solution: create a new index with partitionKey=orgId and sortKey=name, then query (not scan) in DynamoDB with partitionKey orgId=orgId123, an empty sortKey value (because we only want to sort by name rather than get a specific name), and a filter on age in the range [30;50]. This solution seems to work; however, I notice the filter is applied to the result list after the query - for example, with a result list of 100 items, applying the age filter may leave 70 items remaining, or none, whereas I always expect it to return 100 items.
Could you please tell me if there is anything wrong with my approaches? Or is it possible to make such a query in DynamoDB at all?
Another (small) question: when connecting that table to an AppSync API, if it's not possible to perform such a query in DynamoDB, then it's not possible in AppSync either, right?
You are not going to be able to do everything you want in a single DynamoDB query.
Option 1:
You can do what you want (querying the orgId-age-index) as long as you are OK with sorting objects on the client. This would work for organizations with a relatively small number of people.
Pros:
Allows you to efficiently query users in a particular organization within a range of ages.
Cons:
Results are not sorted by name on the server.
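A minimal boto3 sketch of Option 1, using the table and index names from the question and sorting on the client:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BlogAuthor")

# Server-side: partition key + age range. Client-side: sort by name.
resp = table.query(
    IndexName="orgId-age-index",
    KeyConditionExpression=Key("orgId").eq("orgId123") & Key("age").between(30, 50),
)
authors = sorted(resp["Items"], key=lambda item: item["name"])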
Option 2:
Use your second index instead (partition key orgId, sort key name) and filter on age.
Pros:
Allows you to paginate through users at an organization, ordered by name.
Cons:
You cannot efficiently get all users in an organization within an age range. You would effectively be scanning the index and would need multiple round trip calls.
Option 3:
A third option would be to stream information from DynamoDB into Elasticsearch using DynamoDB Streams and AWS Lambda. Once the data is in Elasticsearch, you can do much more advanced queries. You can see more information on the Elasticsearch search APIs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html.
Pros:
Much more powerful query engine.
Cons:
More overhead w/ the DynamoDB stream and AWS Lambda function.
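As a rough illustration of Option 3, a Lambda handler subscribed to the table's stream might mirror changes into Elasticsearch like this; the endpoint, index name, and key attribute are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://my-es-endpoint:9200"])  # assumed endpoint

def handler(event, context):
    # Each stream record carries the change type and the item images.
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["authorId"]["S"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # NewImage is in DynamoDB's typed format ({"S": ...}); flatten it.
            doc = {k: list(v.values())[0] for k, v in image.items()}
            es.index(index="blog-authors", id=doc_id, body=doc)
        elif record["eventName"] == "REMOVE":
            es.delete(index="blog-authors", id=doc_id)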

DynamoDb table design: Single table or multiple tables

I'm quite new to NoSQL and DynamoDB; I'm used to RDBMSs. I'm designing the database for a game and we're using DynamoDB and AWS Lambda for our backend. I created a table named "Users" for player profiles that contains the user information and resources. Because the game has an inventory system, I also created a table named "UserItems".
It was all good until I realized DynamoDB doesn't have transactions, and any operation that is executed on both tables (for example, using an item that increases a resource) has a chance of failing on one table while succeeding on the other, causing missing data that affects our customers.
So I was thinking maybe my multiple-table design is not good, since it's a habit of mine to design multiple tables when working with an RDBMS. That led me to think of storing the entire "UserItems" as a hash in "Users", but I'm not sure this is good practice, because the size of a single row in the Users table would be really big (we may have 500 unique items per user), and each time I pull or put data from/to "Users" (most of the time I don't need the "UserItems" data) the read/write throughput would also be really large.
What should I do, keep the multiple tables design and handle transaction manually or switch to single table design? Or maybe there is a 3rd option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold is deducted, and a new UserItem is added to the UserItems table.
User sells an item: Users.Gold is increased, and the item is deleted from the UserItems table.
In both scenarios above I have to do two update operations on two tables, and without transactions there is a chance one of them fails.
To solve that, I'm considering the single-table solution: a single Users table with 4 columns, UserId (key), Username, Gold, and UserItems. However, there are two things I'm worried about:
Data in UserItems might become too big for a single cell, because one user could have up to 500 items.
To add/delete an item I have to pull UserItems from DynamoDB, add/delete the item, and then put it back into Users. So I have to do 1 read and 1 write operation for a single action, and because of issue (1) the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests using a single table:
As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.
Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.
NoSQL databases are best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, you are defeating the whole purpose of it. If performance is what matters most, then you should consider having only a single table for your use case. DynamoDB supports range keys, and also supports secondary indexes. For your use case, it would be better to redesign your table to use range keys, along the lines of the sketch below.
If you can share more details about your current table, maybe I can help you with more input.
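For illustration, a single-table layout with a range key might look like this boto3 sketch; the sort-key convention ("PROFILE", "ITEM#<ItemId>") is an assumption, not something from the question:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

# The profile row and each item row share the partition key UserId.
table.put_item(Item={"UserId": "u1", "Sort": "PROFILE", "Username": "alice", "Gold": 100})
table.put_item(Item={"UserId": "u1", "Sort": "ITEM#sword1", "Name": "Sword", "GoldValue": 10})

# One Query returns the profile together with all of the user's items.
resp = table.query(KeyConditionExpression=Key("UserId").eq("u1"))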