AWS DynamoDB and Lambda: Scan optimizations / performance

To store API Gateway WebSocket connections, I use a DynamoDB table.
When posting to stored connections, I retrieve the connections in a Lambda function via:
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();
// Scan reads the entire table to collect every stored connection
const { Items, Count } = await dynamodb.scan({ TableName: 'Websocket' }).promise();
// post to connections
This is not really fast; the scan takes around 400-800 ms, which could be better in my opinion. Can I change something in my implementation, or is there maybe another AWS service better suited for storing these tiny bits of info about the WebSocket connection (it's really just a small connection ID and a user ID)?

This has nothing to do with DynamoDB: a scan on any database that reads from disk will take time and money out of your pocket.
You can use any of the solutions below to achieve what you are doing.
Instead of storing each WebSocket ID as a separate row, consider keeping a single record in which all the IDs are stored, so that you can do a single query (not a scan) and proceed.
Cons:
a. Multiple writes to the same row will result in race conditions, and a few updates might get lost. You can use a conditional write to solve this problem (keep an always-increasing version attribute and update the record only if the version in the DB equals the version you read from the DB).
b. There is a limit on the size of a single item in DynamoDB; as of now it is 400 KB.
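A minimal sketch of that conditional write for the single-record approach, assuming a table named WebsocketConnections holding one item keyed by pk = 'connections' with an ids list and a numeric version attribute (all of these names are illustrative):
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Hypothetical helper: append a connection ID using optimistic locking.
async function addConnection(connectionId) {
  // Read the single record that holds every connection ID.
  const { Item } = await dynamodb.get({
    TableName: 'WebsocketConnections',
    Key: { pk: 'connections' },
  }).promise();

  const ids = Item ? Item.ids : [];
  const version = Item ? Item.version : 0;

  // Write back only if nobody else updated the record in the meantime.
  await dynamodb.put({
    TableName: 'WebsocketConnections',
    Item: { pk: 'connections', ids: [...ids, connectionId], version: version + 1 },
    ConditionExpression: 'attribute_not_exists(pk) OR version = :v',
    ExpressionAttributeValues: { ':v': version },
  }).promise(); // throws ConditionalCheckFailedException on a lost race; catch and retry
}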
Store each WebSocket ID as a separate row, but group them under a set of group keys and create a secondary index on the group key. Keep the list of group keys in a single row. On fetch, first read the relevant groups, then query (not scan) all the items of each group. This doesn't solve your problem directly, but it enables interesting patterns: say there are 10 groups and every second you send the messages for one group; that keeps the load on your message-sending infrastructure balanced, and you can keep increasing the number of groups as users grow.
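For this grouped approach, a query against an assumed global secondary index on a groupKey attribute might look like this (index and attribute names are assumptions):
// dynamodb is the DocumentClient instance from the question above.
const { Items } = await dynamodb.query({
  TableName: 'Websocket',
  IndexName: 'groupKey-index',
  KeyConditionExpression: 'groupKey = :g',
  ExpressionAttributeValues: { ':g': 'group-3' },
}).promise();
// Items now contains only the connections of that group, read via the index, not a scan.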
Keep the IDs in a cache such as AWS ElastiCache and add/remove IDs as entries change in DynamoDB, using an AWS Lambda function triggered by DynamoDB Streams. This keeps your reads fast. If the cache goes down, you can repopulate it by doing a one-off scan on DynamoDB.
Cons:
a. Extra component to maintain.
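A sketch of that cache-sync Lambda, assuming a DynamoDB stream configured with NEW_AND_OLD_IMAGES, the ioredis client, and a Redis set named ws:connections (all illustrative):
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === 'INSERT') {
      // New connection row: add its ID to the cached set.
      await redis.sadd('ws:connections', record.dynamodb.NewImage.connectionId.S);
    } else if (record.eventName === 'REMOVE') {
      // Deleted connection row: drop its ID from the cached set.
      await redis.srem('ws:connections', record.dynamodb.OldImage.connectionId.S);
    }
  }
};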

Related

Will a click counter slow down my DynamoDB API?

I want to create a DynamoDB WebAPI. It allows the creation and reading of Posts. Now I would like to implement a click counter that updates the popularity of a post each time a user requests it. For this reason, every time a GET request for a post comes in, I would change the Post object itself.
But I know that DynamoDB is optimized for reads, not for writes. So updating the object that is being fetched every time would probably be a problem.
So how can I measure the popularity of posts without slowing down the API itself? I was thinking of generating a random number for every fetch and only updating the counter if the number is below 0.05 or something similar.
But is there a better solution for this?
DynamoDB isn't "optimized for reads"; it's optimized to provide "consistent, single-digit millisecond response times at any scale."
To optimize DDB for reads, you'd want to stick an Amazon DynamoDB Accelerator (DAX) instance in front of it for "faster access with microsecond latency".
In actuality, the DDB read/write performance isn't going to be the issue. In your case, the network latency between your app and DDB will be orders of magnitude higher. By making two calls synchronously, one after the other, you'd be doubling your response time, regardless of what cloud DB you're writing to.
Assuming the data and counter are in the same record, the simple DDB solution in this case would be not to make one call to GetItem() and another to UpdateItem(). Instead, make a single UpdateItem() call with an UpdateExpression that uses ADD to add 1 to your counter, and set the ReturnValues parameter to either ALL_OLD or ALL_NEW so the same call returns the item.
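A minimal sketch of that single call, assuming a Posts table keyed by postId with a views counter (table and attribute names are assumptions, not your schema):
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

// Increment the counter and read the post back in one round trip.
const { Attributes: post } = await dynamodb.update({
  TableName: 'Posts',
  Key: { postId: 'post-123' },
  UpdateExpression: 'ADD #views :one',
  ExpressionAttributeNames: { '#views': 'views' },
  ExpressionAttributeValues: { ':one': 1 },
  ReturnValues: 'ALL_NEW', // the returned item already contains the incremented counter
}).promise();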
Other more complex solutions
Assuming you've already got the data for display, make an asynchronous (fire-and-forget) call to UpdateItem().
At scale, you might consider disconnecting the counter update from your app: your app posts an SQS message that's processed by a Lambda, which could use batch updates to DDB.
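A sketch of that decoupled variant, assuming the queue URL lives in an environment variable (the queue and message shape are illustrative):
const { SQS } = require('aws-sdk');
const sqs = new SQS();

// In the request path: record the click without touching DDB.
await sqs.sendMessage({
  QueueUrl: process.env.CLICKS_QUEUE_URL,
  MessageBody: JSON.stringify({ postId: 'post-123', clicks: 1 }),
}).promise();
// A separate Lambda consumes the queue in batches and issues one ADD update per post.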

Fetching large amount of data from dynamo DB table using primary key

I am quite new to DynamoDB. I have a requirement in which I need to fetch around 120 million rows from a DynamoDB table. The criteria to fetch is based on the PK (basically I need to fetch all the rows matching the CAR_********* primary key pattern). The only way I can figure out is to perform get operations, but that consumes a lot of time. I also looked at the option of a bulk get, but that too has a limit of 100 rows or 16 MB of data.
So, can someone suggest a better and faster approach to extract this data?
First off, DynamoDB is optimized for storing and retrieving single data objects by primary key. If you need to regularly retrieve or update millions of rows, you should look at an alternative datastore.
With that out of the way, if this is a one-time task I recommend spinning up a Redshift database and using the COPY command to retrieve the data from Dynamo. You can then download that data using a single SQL statement.
If you don't want to do this, or are expecting to retrieve the data more than once, you need to use the Scan API. This will return at most 1 MB per call, so you'll need to call it in a loop.
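A sketch of that loop with the DocumentClient, following LastEvaluatedKey until the table is exhausted (the table name and the CAR_ prefix filter are assumptions based on the question):
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

let lastKey;
const rows = [];
do {
  const page = await dynamodb.scan({
    TableName: 'Cars',
    ExclusiveStartKey: lastKey, // undefined on the first call
    // The filter is applied after the read, so you still pay for every item scanned.
    FilterExpression: 'begins_with(pk, :prefix)',
    ExpressionAttributeValues: { ':prefix': 'CAR_' },
  }).promise();
  rows.push(...page.Items);
  lastKey = page.LastEvaluatedKey; // present while more pages remain
} while (lastKey);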
Regardless, you will almost certainly need to increase your read capacity to handle this task.

Dynamodb one bulk scan vs many single gets

Suppose I have a Lambda function, and as the event param I get about 50 primary IDs that I have to look up in a DynamoDB table. What would be the better way to do it: 50 get queries, each by a different primary ID, OR one scan and then comparing the scanned primary IDs to the primary IDs received as the param?
I think 50 get queries would be better on the performance side, because if tomorrow I have one million records it would be a waste of time and memory to scan them all and then filter out only 50 of them. But on the other side, couldn't making 50 requests to DynamoDB cause performance issues and require more provisioning?
You're right that a Scan operation, assuming you will only need to read 50 records out of a million, is the worst possible solution. It will be very slow, and will cost you a pretty penny because when you scan, you pay Amazon to read all your data - even if you filter most of it out.
Making 50 separate GetItem requests isn't so bad - it's certainly better than a scan. You only pay Amazon for the actual retrieved items - you don't pay more because it's 50 separate requests. Of course, if you don't want huge latency, don't just start these requests one after another - start them all in parallel.
But for this use case, DynamoDB provides an even better operation: BatchGetItem. With this operation you give DynamoDB the list of 50 required keys in just one HTTP request, and it will fetch all of them (in parallel) and return all the responses to you. It seems that BatchGetItem is the best fit for your use case.
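A sketch with the DocumentClient's batchGet, assuming a table named MyTable with a pk partition key (the names and the sample ids array are illustrative):
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();
const ids = ['id-1', 'id-2' /* ... the ~50 primary IDs from the event */];

const { Responses } = await dynamodb.batchGet({
  RequestItems: {
    MyTable: {
      Keys: ids.map((id) => ({ pk: id })), // up to 100 keys / 16 MB per request
    },
  },
}).promise();
const items = Responses.MyTable;
// Check UnprocessedKeys in the response and retry those keys with backoff if it's non-empty.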

Is Redis atomic when multiple clients attempt to read/write an item at the same time?

Let's say that I have several AWS Lambda functions that make up my API. One of the functions reads a specific value from a specific key on a single Redis node. The business logic goes as follows:
if the key exists:
serve the value of that key to the client
if the key does not exist:
get the most recent item from DynamoDB
insert that item as the value for that key, and set an expiration time
delete that item from DynamoDB, so that it only gets read into memory once
serve the value of that key to the client
The idea is that every time a client makes a request, they get the value they need. If the key has expired, then lambda needs to first get the item from the database and put it back into Redis.
But what happens if 2 clients make an API call to the Lambda simultaneously? Will both Lambda processes read that there is no key, and will both take an item from the database?
My goal is to implement a queue where a certain item lives in memory for only X amount of time, and as soon as that item expires, the next item should be pulled from the database, and when it is pulled, it should also be deleted so that it won't be pulled again.
I'm trying to see if there's a way to do this without having a separate EC2 process that's just keeping track of timing.
Is redis+lambda+dynamoDB a good setup for what I'm trying to accomplish, or are there better ways?
A Redis server will execute commands (or transactions, or scripts) atomically. But a sequence of operations involving separate services (e.g. Redis and DynamoDB) will not be atomic.
One approach is to make them atomic by adding some kind of lock around your business logic. This can be done with Redis, for example.
However, that's a costly and rather cumbersome solution, so if possible it's better to simply design your business logic to be resilient in the face of concurrent operations. To do that you have to look at the steps and imagine what can happen if multiple clients are running at the same time.
In your case, the flaw I can see is that two values can be read and deleted from DynamoDB, one writing over the other in Redis. That can be avoided by using Redis's SETNX (SET if Not eXists) command. Something like this:
GET the key from Redis
If the value exists:
Serve the value to the client
If the value does not exist:
Get the most recent item from DynamoDB
Insert that item into Redis with SETNX
If the key already exists, go back to step 1
Set an expiration time with EXPIRE
Delete that item from DynamoDB
Serve the value to the client
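A sketch of those steps with ioredis and the DocumentClient; it uses SET with the NX and EX flags, which combines SETNX and EXPIRE into one atomic command (the table layout, key names, and the 60-second TTL are all assumptions):
const Redis = require('ioredis');
const { DynamoDB } = require('aws-sdk');
const redis = new Redis(process.env.REDIS_URL);
const dynamodb = new DynamoDB.DocumentClient();

async function getCurrent() {
  while (true) {
    const cached = await redis.get('current-item');
    if (cached) return JSON.parse(cached); // key exists: serve it

    // Key missing: pull the most recent item from DynamoDB (table layout assumed).
    const { Items } = await dynamodb.query({
      TableName: 'Queue',
      KeyConditionExpression: 'pk = :p',
      ExpressionAttributeValues: { ':p': 'queue' },
      ScanIndexForward: false, // newest first
      Limit: 1,
    }).promise();
    if (!Items.length) return null;
    const item = Items[0];

    // SET ... NX EX only succeeds for the first concurrent writer.
    const ok = await redis.set('current-item', JSON.stringify(item), 'EX', 60, 'NX');
    if (ok === 'OK') {
      await dynamodb.delete({
        TableName: 'Queue',
        Key: { pk: 'queue', sk: item.sk },
      }).promise();
      return item;
    }
    // Another Lambda won the race: loop and read its value from Redis.
  }
}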

Storing Chat Log on AWS DynamoDB?

I am thinking of building a chat app with AWS DynamoDB. The app will support 1:1 and group chats.
I want to create one table for each of the two chat types, with a record for each sent chat text line. Is DynamoDB suitable for this kind of job?
I am also thinking of merging both tables. But is this a good idea if there are – let's assume – 100k or 1000k users?
I think you may run into problems with the read capacity on your table. The write capacity should be ok, as there are not so many messages coming in per second (e.g. 10 or so), but you'll need to constantly read from it for all users, so that'll be expensive.
If you want to use DynamoDB just as storage and distribute the chat messages over the network like any normal chat, then it may make sense, depending on your use cases. Assuming you have a hash key of UserId and a range key of Timestamp, you could query all messages from a specific user during a specific period. If, however, you want to search within the chat text (a much more useful feature, probably), then DynamoDB won't work per se. It's not like SQL, where you could do a LIKE '%abc%' query (which isn't a good idea in SQL either).
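For that "all messages from a user in a period" case, a query might look like this (table and attribute names are assumptions; Timestamp is a DynamoDB reserved word, hence the placeholder):
const { DynamoDB } = require('aws-sdk');
const dynamodb = new DynamoDB.DocumentClient();

const { Items } = await dynamodb.query({
  TableName: 'ChatMessages',
  KeyConditionExpression: 'UserId = :u AND #ts BETWEEN :from AND :to',
  ExpressionAttributeNames: { '#ts': 'Timestamp' },
  ExpressionAttributeValues: { ':u': 'user-123', ':from': 1700000000, ':to': 1700086400 },
}).promise();
// Items holds that user's messages between the two epoch timestamps.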
Probably you're better off using S3 as the data storage and ElasticSearch as the search engine. If you require the aforementioned use case "get all messages from user X in timespan S" (as a simple example), you could additionally use DynamoDB to store metadata such as UserId, Timestamp, PositionInFile, or something like that.