I have an entity that has many related entities that look like this:
Authorizable(one Authorization(many Role, many Acl))
User < Authorizable(
many Profile < Authorizable
)
Loading the User entity and resolving the roles and acls takes about 15 queries and I would like to reduce that to a minimum (one ideally).
The fetch="EAGER" option does not seem to use joined queries.
Related
I have a simple single-table design that I want to keep flexible for the future, I currently have 2 entity types: users and videos. Users have a 1:n relationship to videos.
The table's partition key is pk and sort key is sk.
Users: pk=u#<id> and sk=u#<id>, entityType: user
Videos: pk=u#<id> and sk=v#<id>, entityType: video
If I want to fetch all users, does it make sense to create a GSI with PK=entityType and SK=sk?
No, because then all user writes will go to the same PK which isn’t ideal. Instead, setup a GSI with a GSI1PK holding your user ID and you can do a scan against it. Project in the essential attributes. Only set the GSI1PK for user entity types so it’s a sparse GSI.
That is one approach you could take and it would get the job done, but it has a few drawbacks/side effects:
You would also replicate all videos in that GSI, which increases the storage and throughput cost of it
You would create a potentially huge item collection that contains all users, which could lead to a hot partition and may not scale well.
Instead, consider splitting up the huge user partition in the GSI into multiple ones with predictable keys.
If you plan to list your users by username later, you could take the first letter of their username as the partition key and thereby create around 26 (depending on capitalization and character set) different partitions, which would spread out the load a lot better. To list all users, you'd have to issue queries on all the partitions, which is annoying at small sizes, but will be more scalable.
Another option would be to define that you want to spread the users out among n partitions and then use something like hash(user_id) mod n to get a partition key for the GSI. That way you'd have to do n queries to get the values of all partitions.
I am building a project using AppSync and GraphQL to enable Restaurants to track orders. There are four DynamoDB tables (one for each of the following entities): Restaurants, Staff, Tables and Orders. Each Restaurant can have many members of Staff, who are each allocated to one or more Tables. Each Table can have many orders, but an order can only belong to one table (see the System Design diagram for a visualisation of these relationships).
Problem
My issue is that I need very fine-grained hierarchical access control, with 3 main concerns:
Staff belonging to one Restaurant must not be able to Create, Read, Update or Delete any entities belonging to other Restaurants.
All staff in a Restaurant can view all tables in the Restaurant. However, they can only view orders belonging to a table if they are allocated to that table (e.g. a StaffTableJoin object which connects that particular Staff member to that table exists.) OR they are a Restaurant admin (see part 3)
A member of Staff who is a Restaurant Admin can view all orders belonging to any table in the restaurant.
A cognito user is created for each member of staff, and their permissions should be assigned based on the relationships between entities in my DynamoDB table.
Solutions Considered
I have visited the Authorization and Authentication page in the AWS docs to explore options for restricting permissions. So far, I have considered using COGNITO_USER_POOLS and AWS_LAMDBA authorization.
For the approach using COGNITO_USER_POOLS, I would create a Cognito User Group for each Restaurant. When new members of staff register, they are assigned to their restaurant's user group. I would then add an groupsCanAccess field to each entity in each database. My resolvers would check that the requesting user belongs to a group which is allowed to access each resource. However this would only address concern 1, as all staff in a restaurant would then have the same permissions to access their restaurant's resources.
For the approach using AWS_LAMBDA, I am not too sure how this would work, but I considered creating an Authorization lambda which checks which restaurant the requesting user belongs to. For instance, if the User was requesting an Order, I would need to check which table the order belongs to, then check if a StaffUserJoin exists (connecting the requesting User to the table). This approach seems very difficult (maybe impossible).
Any advice that could be offered is much appreciated, as I have been struggling with this for a long time. It seems like a common use case, where permissions are needed based on an object hierachy. Thanks in advance :)
I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
email: "foo#foo.com",
... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "main", // sort key
email: "foo#foo.com",
... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "Email-2", // sort key
email: "bar#bar.com"
// no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise not to use DynamoDB. I am not saying, you can't achieve this. However, I see a few problems that will eventually come in your path if you will go this root.
Using sort-key on email-id will result in creating duplicate records for the same user i.e. if a user has registered 5 email, that implies 5 records in your table with the same schema and attribute except email-id attribute.
What if a new use-case comes in the future, where now you also want to search for a user based on some other attribute(for example cell phone number, assuming a user may have more then one cell phone number)
DynamoDB has a hard limit of the number of secondary indexes you can create for a table i.e. 5.
Thus with increasing use-case on search criteria, this solution will easily become a bottle-neck for your system. As a result, your system may not scale well.
To best of my knowledge, I can suggest a few options that you may choose based on your requirement/budget to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in DynamoDB table(let's call it UserTable)as and when a user registers.
Enable DynamoDB table streams on UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch.
Now in your application, use DynamoDB for fetching user records from id. For all other search criteria(like searching on emailId, phone number, zip code, location etc) fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field within millisecond of latency.
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use-case where data are related, you may consider this option. Just to call out, Aurora is a SQL database.
Since this is a relational storage, you can opt for organizing the records in multiple tables and join them based on the primary key of those tables.
I will suggest for 1st option as:
DynamoDB will provide you durable, highly available, low latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable and low latency storage.
With AWS Elasticsearch, you can run any search query on your table. You can also do analytics on data. Kibana UI is provided out of the box, that you may use to plot the analytical data on a dashboard like (how user growth is trending, how many users belong to a specific location, user distribution based on city/state/country etc)
With DynamoDB streams and AWS Lambda, you will be syncing these two databases in near real-time [within few milliseconds]
Your application will be scalable and the search feature can further be enhanced to do filtering on multi-level attributes. [One such example: search all users who belong to a given city]
Having said that, now I will leave this up to you to decide. 😊
This looks like it should be easy but I just can't find it.
I'm creating an application where I want to give admin site access to people from different departments. Those people will read and write the same tables, BUT they must only access rows belonging to their department! I.e. they must not see any records produced by the other departments and should be able to modify only the records from their own department. If they create a record, it should automatically "belong" to the department of the user which created it (they will create records only from the admin site).
I've found django-guardian, but it looks like an overkill - I don't really want to have arbitrary per-record permissions.
Also, the number of records will potentially be large, so any kind of front-end permission checking on a per-record basis is not suitable - it must be done by DB-side filtering. Other than that, I'm not really particular how it will be done. E.g. I'm perfectly fine with mapping departments to auth groups.
I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
How can this be achieved as simply as possible in Django ?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title = "King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters, so it does not know what the user ID is, it just knows that the code is trying to read the User model.
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Where as the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
In addition, they released some code as part of a sample application demonstrating how they shard.
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful - read up on PL/Proxy. It's how Skype shard their databases.
it is better to use professional sharding middleware, for example: Apache ShardingSphere.
The project contains 2 productions, ShardingSphere-JDBC for java driver, and ShardingSphere-Proxy for all programing languages. It can support python and Django as well.