Parallel insertion in DynamoDB

I have 3 clients. All of them want to insert items into the same table.
Whenever a client sends a request, I need to:
Read the last entered record in DynamoDB.
Increase its id by 1.
Push the new request into DynamoDB with the increased id.
What's the best AWS-based architecture to implement this?
What if there were 100 clients?

What use is it to have an increasing Id as a partition key, assuming that's the use-case?
Unlike relational databases, where this would be a good pattern, in a key-value store it typically is not, as it leads to difficulty reading the data back.
My suggestion would be to use a meaningful id that is known to your application, so you can read the items back efficiently. If those known values are not unique, then you can add a sort key; the partition key and sort key together form the primary key that defines your uniqueness.
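For illustration, here's a minimal sketch of that pattern, assuming a hypothetical Requests table with clientId as the partition key and a timestamp sort key (all names are illustrative):

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Write a request under a key the application already knows
// (assumes an async context).
await dynamodb.put({
  TableName: 'Requests',
  Item: {
    clientId: 'client-42', // partition key: known to the application
    createdAt: new Date().toISOString(), // sort key: defines uniqueness per client
    payload: 'request body here',
  },
}).promise();

// Read that client's items back efficiently with Query rather than Scan.
const { Items } = await dynamodb.query({
  TableName: 'Requests',
  KeyConditionExpression: 'clientId = :c',
  ExpressionAttributeValues: { ':c': 'client-42' },
}).promise();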

Related

AWS DynamoDB and Lambda: Scan optimizations / performance

To store API Gateway websocket connections, I use a DynamoDB table.
When posting to stored connections, I retrieve the connections in a Lambda function via:
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();
const { Items, Count } = await dynamodb.scan({ TableName: 'Websocket' }).promise();
// post to connections
This is not really fast; the query takes around 400-800 ms, which could be better in my opinion. Can I change something in my implementation, or is there maybe another AWS service that is better for storing these tiny bits of info about the websocket connections (it's really just a small connection id and a user id)?
This is not specific to DynamoDB: if you do a scan on any database that reads from disk, it will take time and money out of your pocket.
You can use any of the solutions below to achieve what you are doing.
Instead of storing all the websocket ids as separate rows, consider having a single record in which the ids are stored, so that you can do a single query (not a scan) and proceed.
Cons:
a. Multiple writes to the same row will result in race conditions, and some writes might get lost. You can use a conditional write to solve this problem: keep an always-increasing version, and update the record only if the version in the DB equals the version you read from the DB (see the sketch after this list).
b. There is a limit on the size of a single item in DynamoDB. As of now it is 400 KB.
Store each websocket id as a separate row, but group them under different keys, and create a secondary index on those keys. Store the keys themselves in a single row. When doing a fetch, first get all the relevant groups, then query (not scan) the items of each group. This won't exactly solve your problem, but it lets you do interesting things: say there are 10 groups, and every second you send the messages for one group; this ensures the load on your message-sending infrastructure is balanced as well, and you can keep increasing the number of groups as users increase.
Keep the ids in a cache such as AWS ElastiCache, and add/remove ids as entries are made in DynamoDB, using AWS Lambda and DynamoDB Streams. This keeps your reads fast. At the same time, if the cache goes down, you can repopulate it from DynamoDB by doing a scan.
Cons:
a. Extra component to maintain.
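To illustrate the conditional write from (a), here's a minimal sketch, assuming a hypothetical Websocket table holding a single item (id = 'connections') with a connectionIds list and a numeric version attribute:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Read the current record and remember its version (assumes an async context).
const { Item } = await dynamodb.get({
  TableName: 'Websocket',
  Key: { id: 'connections' },
}).promise();

// Write back only if nobody bumped the version in the meantime; otherwise
// the call throws ConditionalCheckFailedException and we re-read and retry.
await dynamodb.update({
  TableName: 'Websocket',
  Key: { id: 'connections' },
  UpdateExpression: 'SET connectionIds = :ids, version = :next',
  ConditionExpression: 'version = :expected',
  ExpressionAttributeValues: {
    ':ids': [...Item.connectionIds, 'new-connection-id'],
    ':next': Item.version + 1,
    ':expected': Item.version,
  },
}).promise();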

UUID as primary key in DynamoDB -- good or bad idea?

In a new DynamoDB table, my use cases are already fulfilled by the following key schema design:
partition key: user_id
sort key: entity_id
Basically, access patterns are:
Get a specific post by a specific user.
Get a specific comment by a specific user.
List all posts by a specific user.
List all comments by a specific user.
List all entities (posts or comments) by a specific user.
What benefits do I get if I use a more random ID as partition key instead and simply use GSIs for my access patterns above?
partition key: pseudo_random_id (this will be a UUID in reality)
GSI:
partition key: user_id
sort key: entity_id
You don’t need UUIDs or any pseudo-random ID.
It was once possible that you could have a hot partition if one user is particularly active, but hot partitions are basically a non-issue now because of DynamoDB’s adaptive capacity. Furthermore, you should probably be limiting how fast users can create comments/posts, which would prevent hot partitions even if adaptive capacity didn’t exist.
(Why should you limit the rate a user can post? You don’t want a malicious actor to be able to create a new post every few milliseconds—you should have some sort of rate limit as a protection against denial of service attacks.)
Using a UUID doesn't do anything for you...
It doesn't matter how random the partition key is. All that matters is how many distinct partition keys you have and the volume/velocity of entries for that partition key.
In other words, a unique value is a unique value. Dynamo doesn't care if it's 16 bytes, 36 bytes or 128 bytes.
Dynamo applies its own hash to the partition key to determine which partition the data will be placed into.
If you are looking for unique + sequential numbers in DynamoDB, it's worth reading about Atomic Counters as an option. That approach maintains a counter in a table, but it could become a problem for a high-load application that requests ids, because UpdateItem is synchronized per item.
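A minimal sketch of such an atomic counter, assuming a hypothetical Counters table with counterName as its partition key. ADD is atomic, so concurrent callers each receive a distinct value, but they all serialize on this one item:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Atomically increment and read back the new value (assumes an async context).
const { Attributes } = await dynamodb.update({
  TableName: 'Counters',
  Key: { counterName: 'entityId' },
  UpdateExpression: 'ADD currentValue :one',
  ExpressionAttributeValues: { ':one': 1 },
  ReturnValues: 'UPDATED_NEW',
}).promise();

const nextId = Attributes.currentValue;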

How do I avoid hot partitions when using DynamoDB row-level access control?

I'm looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control, I was using deviceId as the partition key; since it is a more random value, it partitions well, but now I think I have to move it to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
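A minimal sketch of that lookup, assuming a hypothetical Devices table whose partition key pk holds the concatenated value:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

const providerId = 'prov-1';
const deviceId = 'dev-123';

// A single GetItem; no range key needed (assumes an async context).
const { Item } = await dynamodb.get({
  TableName: 'Devices',
  Key: { pk: `${providerId}.${deviceId}` },
}).promise();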
Edit
Since you want to be able to list device ids for a single provider, you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10 GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10 GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs (a sketch follows below).
This approach is detailed in Multi Tenant SaaS Storage Strategies.
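A minimal sketch of that sharding scheme, assuming hypothetical partition keys of the form providerId.shardNumber; listing all devices for a provider means querying each shard and merging the results:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

const NUM_SHARDS = 4; // the n in (1..n)

async function listDevices(providerId) {
  const shardQueries = [];
  for (let shard = 1; shard <= NUM_SHARDS; shard++) {
    shardQueries.push(
      dynamodb.query({
        TableName: 'Devices',
        KeyConditionExpression: 'pk = :pk',
        ExpressionAttributeValues: { ':pk': `${providerId}.${shard}` },
      }).promise()
    );
  }
  // One query per shard, run in parallel, then merged.
  const results = await Promise.all(shardQueries);
  return results.flatMap((r) => r.Items);
}

Writes would pick a shard at random, e.g. Math.ceil(Math.random() * NUM_SHARDS), to spread the load.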

DynamoDB table/index schema design for querying multi-valued attributes

I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  email: "foo@foo.com",
  ... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
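For reference, that lookup is a simple Query against the index; a sketch, assuming a hypothetical table Users and GSI email-index:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Query the GSI by email (assumes an async context).
const { Items } = await dynamodb.query({
  TableName: 'Users',
  IndexName: 'email-index',
  KeyConditionExpression: 'email = :e',
  ExpressionAttributeValues: { ':e': 'foo@foo.com' },
}).promise();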
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  itemTypeAndIndex: "main", // sort key
  email: "foo@foo.com",
  ... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  itemTypeAndIndex: "Email-2", // sort key
  email: "bar@bar.com"
  // no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this is safer than before; see the sketch after this list.)
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users, then we'd have to merge all items for that userId.
The same approach (new items with the same userId but different sort keys) could be used for other one-user-has-many-values data that needs to be Query-able.
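A minimal sketch of the transactional swap mentioned above, assuming the item layout shown earlier and a hypothetical table name Users; both updates succeed or fail together:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

const userId = '08074c7e0c0a4453b3c723685021d0b6';

// Swap the primary and secondary email values atomically
// (assumes an async context).
await dynamodb.transactWrite({
  TransactItems: [
    {
      Update: {
        TableName: 'Users',
        Key: { userId, itemTypeAndIndex: 'main' },
        UpdateExpression: 'SET email = :e',
        ExpressionAttributeValues: { ':e': 'bar@bar.com' },
      },
    },
    {
      Update: {
        TableName: 'Users',
        Key: { userId, itemTypeAndIndex: 'Email-2' },
        UpdateExpression: 'SET email = :e',
        ExpressionAttributeValues: { ':e': 'foo@foo.com' },
      },
    },
  ],
}).promise();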
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise against using DynamoDB. I am not saying you can't achieve this; however, I see a few problems that will eventually come your way if you go this route.
Using the sort key for email ids will result in duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except for the email-id attribute.
What if a new use-case comes along in the future where you also want to search for a user by some other attribute (for example a cell phone number, assuming a user may have more than one)?
DynamoDB limits the number of secondary indexes you can create for a table (5 local secondary indexes; global secondary indexes have a default quota of 20).
Thus, as search use-cases grow, this solution will easily become a bottleneck for your system. As a result, your system may not scale well.
To the best of my knowledge, I can suggest a few options, which you may choose based on your requirements/budget, to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when users register.
Enable DynamoDB Streams on the UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch (a sketch follows below).
Now, in your application, use DynamoDB to fetch user records by id. For all other search criteria (searching on emailId, phone number, zip code, location, etc.), fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field with millisecond latency.
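A minimal sketch of such a Lambda handler, assuming the stream is configured with NEW_IMAGE, an Elasticsearch index named users, and an illustrative endpoint (error handling and full unmarshalling omitted):

const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'https://my-es-domain.example.com' });

exports.handler = async (event) => {
  for (const record of event.Records) {
    const userId = record.dynamodb.Keys.userId.S;
    if (record.eventName === 'REMOVE') {
      await es.delete({ index: 'users', id: userId });
    } else {
      // NewImage is in DynamoDB attribute-value format; a real handler would
      // unmarshall it fully (e.g. with AWS.DynamoDB.Converter.unmarshall).
      const image = record.dynamodb.NewImage;
      await es.index({
        index: 'users',
        id: userId,
        body: { email: image.email && image.email.S },
      });
    }
  }
};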
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use-case where the data are related, you may consider this option. Just to call out: Aurora is a SQL database.
Since this is relational storage, you can organize the records in multiple tables and join them on the primary keys of those tables.
I suggest the 1st option because:
DynamoDB will provide you with durable, highly available, low-latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable, and low latency.
With AWS Elasticsearch, you can run any search query over your data. You can also do analytics on it; a Kibana UI is provided out of the box, which you can use to plot analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, you will be syncing the two databases in near real time (within a few milliseconds).
Your application will be scalable, and the search feature can be further enhanced to filter on multi-level attributes (one such example: search for all users who belong to a given city).
Having said that, I will now leave this up to you to decide. 😊

Should I use a secondary index or separate ID lookup table in DynamoDB?

I'm migrating a database from MongoDB to DynamoDB and trying to understand best practices, especially with using local secondary indexes and sort keys.
My application pulls in HTML data from the web and loads the data into several tables/collections. At the time of extraction it gives each item an extracted_id, unique to the website it's pulled from. Before loading the items, it gives each item a UUID as its primary/partition key.
Problem: In order to avoid assigning different UUIDs to the same extracted_id, I query the DB to check whether the entity has a preexisting entity_uuid.
Current Solution: Currently in MongoDB, I have two sets of tables/collections: one for storing all items, and one serving as a lookup table from an entity's extracted_id (key) to its entity_uuid (value).
Better Solution? As I move to DynamoDB, would it be better to create only one table, with extracted_id as a local secondary index, so as not to store duplicate data? I'm unsure, as the docs say to use indexes sparingly. I don't use the extracted_id for anything other than assigning items their UUID for a given site.
Hopefully this makes sense, I'm new to AWS / DynamoDB and would appreciate any tips / better solutions to the ones mentioned.
Why not just make extracted_id the partition key of your new DynamoDB table and use a ConditionExpression attribute_not_exists(extracted_id) to prevent your application from writing duplicate entries?
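A minimal sketch of that suggestion, assuming a hypothetical Items table with extracted_id as its partition key:

const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

try {
  // The put succeeds only if no item with this extracted_id exists yet
  // (assumes an async context).
  await dynamodb.put({
    TableName: 'Items',
    Item: { extracted_id: 'site-abc-123', entity_uuid: 'freshly-generated-uuid' },
    ConditionExpression: 'attribute_not_exists(extracted_id)',
  }).promise();
} catch (err) {
  if (err.code === 'ConditionalCheckFailedException') {
    // An item with this extracted_id already exists; reuse its entity_uuid.
  } else {
    throw err;
  }
}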