I am trying to build out a scalable smart home infrastructure on AWS using IoT Core, Lambda, and DynamoDB, along with the Serverless Framework and a companion Android/iOS app.
I am implementing locations and rooms in DynamoDB. A user can have many locations, and locations can have many rooms. I am used to Firebase Firestore, so partition keys and sort keys (a.k.a. hash and range keys) and how they combine in a query are a little confusing. I implemented my own hash to use as a primary (partition/hash) id. Here is the structure I am thinking of:
Location
id
name
username
I also added a secondary index on username, so that a user could query all of their locations.
Room
id
name
locationId
I also added a secondary index on locationId, so that a user could query all rooms for a given location.
Here is the code in which I create the ids:
const crypto = require('crypto');

// need a unique hash for the id
let hash = event.name + event.username + new Date().getTime();
let id = crypto.createHash('md5').update(hash).digest('hex');
let location = {
    id: id,
    name: event.name,
    username: event.username
};
And for rooms:
// need a unique hash for the id
let hash = event.name + event.locationId + new Date().getTime();
let id = crypto.createHash('md5').update(hash).digest('hex');
Since I'm fairly new to DynamoDB/AWS, I'm wondering if this is an acceptable solution. Obviously I would expand on this by adding multiple devices under rooms, associated via roomId. I would also like to be able to share devices, and I'm not quite sure how that would work, since the association for a user is on the location; I assume I would have to share the location, room(s), and device(s) (which I think is how Google Home does it).
Any suggestions would be greatly appreciated!
EDIT
The queries that I can think of would be:
Get Location by Id
Get all Locations by User
Get Room by Id
Get all Rooms by Location
However as the app expands in the future, I would want these queries to be flexible (share location, get shared locations, etc)
"I would want these queries to be flexible"
Then noSQL in general and Dynamo specifically may not be the right choice.
As @varnit alludes to, noSQL DBs are very flexible in what you store, but very inflexible in how you can query that data.
Dynamo for instance can only return a list (Query) if you use a sort key (SK) or if you do a full table scan (not recommended). Otherwise, it can only return a single record.
I don't understand what a "shared location" would entail.
But with multiple tenants in Dynamo (each user only looking at their own data), the easy solution would be to use userID as the partition key (PK).
I'd use a composite sort key of location#room
Get Location by Id --> GetItem(PK = User, SK = location)
Get all Locations by User --> Query (PK = User)
Get all rooms by Location --> Query (PK = User, SK starts with Location)
This one is a little trickier...
Get Room by Id -->
If you really need to get a room without having the location, then you'd want to have room as a standalone attribute in addition to having it as part of the sort key. Then you can create a local secondary index over it and query (PK = User, Index SK = Room).
I suspect that finding a room via GetItem(PK = User, SK = location#room) might work for you instead.
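Putting those access patterns into code, here's a minimal sketch using the Node.js AWS SDK v2 DocumentClient (the table name and the key attribute names pk/sk are my assumptions, not from the post):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();
const TABLE = 'SmartHome'; // hypothetical table name

// Get Location by Id --> GetItem(PK = user, SK = location)
const getLocation = (username, locationId) =>
  ddb.get({ TableName: TABLE, Key: { pk: username, sk: locationId } }).promise();

// Get all Locations by User --> Query(PK = user)
// Note: this also returns room items; filter on the SK or client-side if needed.
const getLocations = (username) =>
  ddb.query({
    TableName: TABLE,
    KeyConditionExpression: 'pk = :u',
    ExpressionAttributeValues: { ':u': username },
  }).promise();

// Get all Rooms by Location --> Query(PK = user, SK begins_with 'location#')
const getRooms = (username, locationId) =>
  ddb.query({
    TableName: TABLE,
    KeyConditionExpression: 'pk = :u AND begins_with(sk, :loc)',
    ExpressionAttributeValues: { ':u': username, ':loc': `${locationId}#` },
  }).promise();

// Get Room by Id --> GetItem(PK = user, SK = location#room)
const getRoom = (username, locationId, roomId) =>
  ddb.get({ TableName: TABLE, Key: { pk: username, sk: `${locationId}#${roomId}` } }).promise();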
Key point: the partition key comparison is always equality. There's no begins-with, ends-with, or contains for the partition key comparison.
If you haven't seen them, take a look at the following videos:
AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301)
AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
Also be sure to read the SaaS Storage Strategies - Building a Multitenant Storage Model on AWS whitepaper.
EDIT
"location" and "room" can be whatever makes the most sense to your application. GUID or a natural key such as "Home". In a noSQL db, GUIDs are useful when multiple nodes are adding records. But a natural key is good when that what the application user will have handy. Since you don't want to have to look up a guid by the natural key. RDBMS practices don't apply to noSQL DBs.
So yes, I'd use "Home" as the location, meaning the user won't be able to have multiple "Home"s. But I don't see that as a big deal; I'd use "Home" and "Vacation House" in real life.
EDIT2
Dynamo doesn't care if it's a GUID or a natural key. It internally hashes whatever value you use for the partition key. All that matters is the number of distinct values. Distinct is distinct; it doesn't matter if the value is '0ae4ad25-5551-46a7-8e39-64619645bd58' or 'charles.wilt@mydomain.com'. If your authorization process returns a GUID, use that. Otherwise use the username.
Related
I have a simple single-table design that I want to keep flexible for the future, I currently have 2 entity types: users and videos. Users have a 1:n relationship to videos.
The table's partition key is pk and sort key is sk.
Users: pk=u#<id> and sk=u#<id>, entityType: user
Videos: pk=u#<id> and sk=v#<id>, entityType: video
If I want to fetch all users, does it make sense to create a GSI with PK=entityType and SK=sk?
No, because then all user writes would go to the same partition, which isn't ideal. Instead, set up a GSI with a GSI1PK attribute holding your user ID and you can do a scan against it. Project in the essential attributes. Only set GSI1PK for user entity types so it's a sparse GSI, as sketched below.
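A minimal sketch of the sparse-GSI idea with the Node.js AWS SDK v2 DocumentClient (table, index, and attribute names are assumptions):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// GSI1PK is set only on user items, so videos never appear in the GSI.
const putUser = (id, attrs) =>
  ddb.put({
    TableName: 'my-table', // hypothetical
    Item: { pk: `u#${id}`, sk: `u#${id}`, entityType: 'user', GSI1PK: `u#${id}`, ...attrs },
  }).promise();

// Scanning the sparse GSI returns only user items.
const listUsers = () =>
  ddb.scan({ TableName: 'my-table', IndexName: 'GSI1' }).promise();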
A GSI with PK=entityType is one approach you could take, and it would get the job done, but it has a few drawbacks/side effects:
You would also replicate all videos into that GSI, which increases its storage and throughput cost.
You would create a potentially huge item collection that contains all users, which could lead to a hot partition and may not scale well.
Instead, consider splitting up the huge user partition in the GSI into multiple ones with predictable keys.
If you plan to list your users by username later, you could take the first letter of their username as the partition key and thereby create around 26 (depending on capitalization and character set) different partitions, which would spread out the load a lot better. To list all users, you'd have to issue queries on all the partitions, which is annoying at small sizes, but will be more scalable.
Another option would be to define that you want to spread the users out among n partitions and then use something like hash(user_id) mod n to get a partition key for the GSI. That way you'd have to do n queries to get the values of all partitions.
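A hedged sketch of the hash-mod-n sharding in Node.js (the shard count, table, and index names are illustrative, not prescribed):

const crypto = require('crypto');
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

const SHARDS = 8; // illustrative shard count

// Used at write time: derive a stable GSI partition key from a hash of the user id.
const shardFor = (userId) => {
  const hash = crypto.createHash('md5').update(userId).digest();
  return `users#${hash.readUInt32BE(0) % SHARDS}`;
};

// Listing all users means querying every shard and merging the results.
async function listAllUsers() {
  const pages = await Promise.all(
    Array.from({ length: SHARDS }, (_, n) =>
      ddb.query({
        TableName: 'my-table', // hypothetical
        IndexName: 'GSI1',
        KeyConditionExpression: 'GSI1PK = :p',
        ExpressionAttributeValues: { ':p': `users#${n}` },
      }).promise()
    )
  );
  return pages.flatMap((page) => page.Items);
}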
Example:
Let's say I'm trying to create an eCommerce platform with multiple sellers and I create a Table called Orders. The partition key will be storeID and the sort key will be orderNumber.
Stores can call API.get('Orders', {storeID}), which will return all the items with the partition key of storeID.
My application uses Amazon Cognito and each user is assigned a username which is a uuid. My question is can I use the uuid as the storeID in my DynamoDB table? The key assumption is that attackers won't be able to guess the uuid.
You should always validate access to resources server-side.
Not being able to guess a uuid isn't a safe assumption. Since you are using Amazon Cognito, there should be a way in your server code to get the logged-in user (the uuid). When you are making a query to DynamoDB, you shouldn't rely on a uuid passed by an HTTP query (client-side), but instead use the uuid of the logged-in user.
The uuid could be used in a Global Secondary Index, so that you can quickly query the orders of a user.
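For example, in a Lambda behind API Gateway with a Cognito User Pools authorizer, the uuid can be taken from the validated token instead of from client input (a sketch; the table name and key schema are assumptions):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  // Use the uuid from the validated Cognito token, never one passed by the client.
  const storeID = event.requestContext.authorizer.claims.sub;

  const { Items } = await ddb.query({
    TableName: 'Orders',
    KeyConditionExpression: 'storeID = :s',
    ExpressionAttributeValues: { ':s': storeID },
  }).promise();

  return { statusCode: 200, body: JSON.stringify(Items) };
};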
You can use the UUID to store the same user's data in Dynamo as well. But make sure each API fetches only the details it actually needs; for example, a store API should request only orders, not users. To differentiate them, add an attribute called "entity_type", which will hold one of:
order
user
store
other resources
Then, while fetching, add this entity_type as one of the conditions to filter out the unwanted items.
I am new to GCP and NoSQL.
Is it possible to have a primary key and foreign key in GCP Firestore?
Example: I have two tables, STUDENT and DEPARTMENT.
The tables look like below:
Department table
dept-id (primary key)
deptname
Student table
dept-id (foreign key)
student-id
student-name
Can anybody please help design this in GCP Firestore?
To a database, a key is the same as any UUID/random ID and can be shared and used between users, teams, admins, and businesses of all kinds. What matters is how that data is associated. Since Firestore is a noSQL database, there are no direct relational references, so one key cannot reference another without secondary lookups.
In the same way you would define a user profile by an ID, you can create an empty document with a random ID to act as the ID of a team, or in this case the department. You can also utilize string combinations if you have a team and a sub-team; as long as you have access to the team/department ID at the point of the database request, you can use a regex to match a string comparison.
Example: request.resource.data.name.matches('^' + departmentID)
To make a foreign key work with Security Rules or within the client, you must get the key that contains the data; the key should be the name of the document in question to streamline the request, since you cannot perform queries or loop through data within Security Rules.
A great read on this subject; I highly suggest this article:
https://medium.com/firebase-developers/a-list-of-firebase-firestore-security-rules-for-your-project-fe46cfaf8b2a
But my suggestion is to use a key that represents the department directly, rather than spending additional resources on creating and managing a foreign key.
Firestore doesn't support referential integrity.
That means you can use any names for fields (subject to rules and conventions), but the semantics and any additional functionality must be maintained by you rather than by the system.
I'm looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, and those sizes will be very unbalanced.
If I use Provider ID as my partition key, it seems like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding row-level access control I was using deviceId as the partition key, since it's a fairly random value and therefore partitions well, but now I think I have to move that to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
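For example (a sketch; the table and attribute names are illustrative):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Single-record lookups need only the concatenated partition key.
const getDevice = (providerId, deviceId) =>
  ddb.get({
    TableName: 'Devices', // hypothetical
    Key: { pk: `${providerId}.${deviceId}` },
  }).promise();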
Edit
Since you want to be able to list device ids for a single provider, you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in the Multi Tenant SaaS Storage Strategies whitepaper.
I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
email: "foo#foo.com",
... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "main", // sort key
email: "foo#foo.com",
... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "Email-2", // sort key
email: "bar#bar.com"
// no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
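A sketch of that swap as a single transaction, assuming the schema above (names like the Users table are mine, not settled):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Atomically swap the primary and secondary email values.
const swapPrimaryEmail = (userId, oldPrimary, newPrimary) =>
  ddb.transactWrite({
    TransactItems: [
      {
        Update: {
          TableName: 'Users', // hypothetical
          Key: { userId, itemTypeAndIndex: 'main' },
          UpdateExpression: 'SET email = :e',
          ExpressionAttributeValues: { ':e': newPrimary },
        },
      },
      {
        Update: {
          TableName: 'Users',
          Key: { userId, itemTypeAndIndex: 'Email-2' },
          UpdateExpression: 'SET email = :e',
          ExpressionAttributeValues: { ':e': oldPrimary },
        },
      },
    ],
  }).promise();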
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able
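For concreteness, the lookup by email with this schema would look something like the following (assuming the GSI is named email-index; the names are mine):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Finds the item that owns an email, whether primary or non-primary.
const findUserByEmail = (email) =>
  ddb.query({
    TableName: 'Users',       // hypothetical
    IndexName: 'email-index', // GSI with email as its partition key
    KeyConditionExpression: 'email = :e',
    ExpressionAttributeValues: { ':e': email },
  }).promise();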
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise against DynamoDB. I am not saying you can't achieve this; however, I see a few problems that will eventually come your way if you go this route.
Using a sort key on email id will result in duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except for the email attribute.
What if a new use case comes along in the future where you also want to search for a user based on some other attribute (for example cell phone number, assuming a user may have more than one cell phone number)?
DynamoDB has a hard limit on the number of secondary indexes you can create for a table (5 at the time this was written; the default quota for global secondary indexes has since been raised to 20).
Thus, as search use cases grow, this solution will easily become a bottleneck for your system. As a result, your system may not scale well.
To the best of my knowledge, I can suggest a few options that you may choose from, based on your requirements/budget, to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when a user registers.
Enable DynamoDB Streams on the UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch (a sketch follows below).
Now in your application, use DynamoDB for fetching user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location etc.), fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field within milliseconds of latency.
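A rough sketch of the stream-processing Lambda (indexDocument is a hypothetical helper you'd implement with your Elasticsearch client of choice):

const AWS = require('aws-sdk');

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === 'REMOVE') continue; // handle deletes separately
    // Convert the DynamoDB-typed stream image into a plain JS object.
    const user = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    // indexDocument is a hypothetical helper wrapping your Elasticsearch client.
    await indexDocument('users', user.userId, user);
  }
};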
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use case where data are related, you may consider this option. Just to call out: Aurora is a SQL database.
Since this is relational storage, you can organize the records in multiple tables and join them based on the primary keys of those tables.
I suggest the first option because:
DynamoDB will provide durable, highly available, low-latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable, and low latency.
With AWS Elasticsearch, you can run any search query on your data. You can also do analytics; a Kibana UI is provided out of the box, which you may use to plot analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, you will be syncing these two databases in near real time (within a few milliseconds).
Your application will be scalable, and the search feature can be further enhanced to filter on multiple attributes (one such example: search all users who belong to a given city).
Having said that, I'll now leave this up to you to decide. 😊