I have the following DynamoDB table structure:
item_type (string) --> Partition Key
item_id (number) --> Secondary Index
The table has around 5 million records, and auto scaling is enabled with a default read capacity of 5. I need to fetch the item_ids for a given item_type. We have around 500,000 item_types, and each item_type is associated with multiple item_ids. I see a response time of around 4 seconds for popular item_types. I am testing this on AWS Lambda: I start the timer when the query is made and stop it once the response is received. Both Lambda and DynamoDB are in the same region.
This is the query I am using:
from boto3.dynamodb.conditions import Key

# items_table is a boto3 DynamoDB Table resource
response = items_table.query(
    KeyConditionExpression=Key('item_type').eq(search_term),
    ProjectionExpression='item_id'
)
Following are some of my observations:
It takes more time to fetch popular item_types.
As the number of records increases, the response time increases.
I have tried the DynamoDB cache (DAX), but the Python SDK is not up to the mark and it has certain limitations.
Given these details, here are my questions:
Why is the response time so high? Is it because I am querying on a string rather than a number?
Increasing the read capacity also did not help; why not?
Is there any other AWS service that is faster than DynamoDB for this type of query?
I have seen seminars where they claim sub-millisecond response times on billions of records with multiple users accessing the table. Any pointers towards achieving sub-second response times will be helpful. Thanks.
I have a DynamoDB table for Connections. The idea is that when a user logs into a website, a connection is made via WebSockets and this connection information is stored in the table.
Now we have a feature we want to release that shows the total number of users online. My thought was that I could add a new API endpoint that scans DynamoDB and returns the count of connections, but this would involve a DynamoDB scan every time the UI refreshes, which I am guessing would be very expensive.
Another option I thought of was creating an API and a scheduled Lambda that calls this API once every 10 minutes and uploads the count to an S3 file; the API for the UI could then be pointed at the S3 file, which would be cheaper, but it would not be real time as it could be up to 10 minutes out of date.
Alternatively, I tried to use the @connections endpoint to see if it returned the total connections via the WebSocket API, but I am getting a CORS error when doing so, and there is no way in AWS to set CORS on the provided HTTP @connections route.
I would be interested in some ideas on how to achieve this in the most efficient way :) My connections table is estimated to have anywhere between 5k and 10k items.
The best thing here would be to use a single item in the table to hold the live connection count.
Add connection:
Add connection to DDB -> Stream -> Lambda -> Increment count item
Remove connection:
Remove connection from DDB -> Stream -> Lambda -> Decrement count item
This will allow you to efficiently obtain the number of live users on the system with a simple GetItem; a minimal sketch of such a stream handler is shown after the list below.
You just need to be mindful that a single item can consume only 1,000 WCU per second, so if you are trying to update the item more than 1,000 times per second you will have to either:
Aggregate the events in Lambda using a sliding window, or
Artificially shard the count item n ways: count-item0, count-item1, etc.
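Here is a minimal sketch of what that stream-triggered Lambda could look like in Python (boto3). The separate stats table, its key and the counter attribute names are assumptions for illustration, not part of the original setup; keeping the counter in its own table also keeps the counter updates out of the connection table's stream.

import boto3

dynamodb = boto3.resource('dynamodb')
# Hypothetical table holding a single counter item
stats_table = dynamodb.Table('ConnectionStats')

def handler(event, context):
    # Net change for this batch of stream records
    delta = 0
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            delta += 1   # a connection row was added
        elif record['eventName'] == 'REMOVE':
            delta -= 1   # a connection row was deleted

    if delta != 0:
        # Atomically apply the net change to the counter item
        stats_table.update_item(
            Key={'pk': 'live_count'},
            UpdateExpression='ADD live_users :d',
            ExpressionAttributeValues={':d': delta}
        )

The UI endpoint then only needs a single GetItem on that counter item instead of a Scan, and because the Lambda sums each stream batch before writing, it already does a coarse form of the aggregation mentioned above.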
This is my use case:
I have a JSON API with 200k objects. The dataset looks a little something like this: date, bike model, production time in minutes. I use Lambda to read from the JSON API via an HTTP request and write to DynamoDB. The Lambda function runs every day and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? If so, which AWS database is more suitable?
Yes, a UUID or any other unique identifier (e.g. date + bike model + creation time) as the partition key is fine.
It seems your daily average-value job is some sort of data analytics job, not really a transactional job. I would suggest going with a service that supports data analytics, such as Amazon Redshift. You should be able to feed data into such a service using DynamoDB Streams. Alternatively, you can stream the data into S3 and use a service like Athena to compute the daily average.
There is a simple database model that you could use for this task:
PartitionKey: a UUID, or any combination of fields that provides uniqueness.
SortKey: production date, as a string, e.g. 2020-07-28
If you then create a secondary index that uses the production date as its partition key and includes the production time, you can query (not scan) the secondary index for a specific date and perform any calculations you need on the production times. You can also provision the required read/write capacity on the secondary index and the table independently.
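As a rough sketch in Python (boto3), assuming a global secondary index named date-index whose partition key is production_date and which projects production_time (the table, index and attribute names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table name
table = boto3.resource('dynamodb').Table('BikeProduction')

response = table.query(
    IndexName='date-index',
    KeyConditionExpression=Key('production_date').eq('2020-07-28'),
    ProjectionExpression='production_time'
)
times = [item['production_time'] for item in response['Items']]
average = sum(times) / len(times) if times else 0

If a single day returns more than 1 MB of data you would also need to follow LastEvaluatedKey, but the idea stays the same.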
Regarding your third question, I don't see any real benefit in using DynamoDB for this task. Any RDS engine (e.g. MySQL), Redshift, or even S3 + Athena can easily handle such a use case. If you require real-time analytics, you could even consider AWS Kinesis.
To store API Gateway WebSocket connections, I use a DynamoDB table.
When posting to stored connections, I retrieve the connections in a Lambda function via:
const { DynamoDB } = require('aws-sdk');

const dynamodb = new DynamoDB.DocumentClient();
const { Items, Count } = await dynamodb.scan({ TableName: 'Websocket' }).promise();
// post to connections
This is not really fast; the scan takes around 400-800 ms, which could be better in my opinion. Can I change something in my implementation, or is there another AWS service that is better suited for storing this tiny bit of information about each WebSocket connection (it is really just a small connection id and a user id)?
It has nothing to do with DynamoDB specifically: a scan on any database that reads from disk will take time and cost money.
You can use any of the solutions below to achieve what you are doing.
Instead of storing each websocket id as a separate row, consider keeping a single record in which all ids are stored, so that you can do a single query (not a scan) and proceed.
Cons:
a. Multiple writes to the same row can result in race conditions, and some writes might get lost. You can use a conditional write to solve this: keep an always-increasing version attribute and update the record only if the version in the database equals the version you read (a minimal sketch of such a conditional write appears after this list).
b. There is a limit on the size of a single item in DynamoDB; as of now it is 400 KB.
Store each websocket id as a separate row but group the rows under different keys, and create a secondary index on these keys. Store the list of keys in a single row. When fetching, first get all relevant groups and then query (not scan) all the items of each group. This will not exactly solve your problem, but it lets you do interesting things: say there are 10 groups and every second you send messages for one group; this ensures the load on your message-sending infrastructure stays balanced. You can keep increasing the number of groups as the number of users grows.
Keep the ids in a cache such as Amazon ElastiCache, and add/remove ids as entries are made in DynamoDB by using AWS Lambda and DynamoDB Streams. This will make sure your reads are fast. At the same time, if the cache goes down you can repopulate it from DynamoDB by doing a scan.
Cons:
a. Extra component to maintain.
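For option 1, here is a minimal sketch of such a version-checked (optimistic locking) write in Python (boto3); the table name, the key connection_list and the attribute names ids and ver are hypothetical:

import boto3
from botocore.exceptions import ClientError

# Hypothetical table holding one record with all connection ids
table = boto3.resource('dynamodb').Table('Websocket')

def add_connection_id(new_id):
    """Append an id to the single list record, retrying on version conflicts."""
    while True:
        record = table.get_item(Key={'pk': 'connection_list'}).get('Item', {})
        current_version = record.get('ver', 0)
        ids = record.get('ids', [])
        try:
            table.put_item(
                Item={'pk': 'connection_list', 'ids': ids + [new_id], 'ver': current_version + 1},
                # Only write if nobody else bumped the version in the meantime
                ConditionExpression='attribute_not_exists(pk) OR ver = :v',
                ExpressionAttributeValues={':v': current_version},
            )
            return
        except ClientError as err:
            if err.response['Error']['Code'] != 'ConditionalCheckFailedException':
                raise
            # Lost the race: re-read and retry

Removing a connection id would follow the same pattern, filtering the id out of the list instead of appending it.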
I am using DynamoDB as the back-end database in my project. I store items in the table, each of size 80 KB or more (they contain nested JSON), and my partition key is a unique-valued column (unique for each item). Now I want to perform pagination on this table, i.e. my UI will provide start (integer), limit (integer) and type (two string constants), and my API should retrieve the items from DynamoDB based on the query parameters provided by the UI. I am using the scan method from boto3 (the Python SDK), but this scan reads all the items from my table before applying my filters and is causing a provisioned throughput error, and I cannot afford to either increase my table's throughput or opt for table auto scaling. Is there any way my problem can be solved? Please give your suggestions.
Do you have a limit set on your scan call? If not, DynamoDB will return up to 1 MB of data per call by default. You could try using Limit together with some kind of sleep or delay in your code, so that you process your table at a slower rate and stay within your provisioned read capacity. If you do this, you'll have to use the LastEvaluatedKey returned to you to page through your table.
Keep in mind that just reading a single one of your 80 KB items consumes 10 read capacity units (with eventually consistent reads), and perhaps more if your items are larger.
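A rough sketch of that approach in Python (boto3); the table name, page size and delay are placeholder values you would tune to your provisioned capacity:

import time
import boto3

# Hypothetical table name
table = boto3.resource('dynamodb').Table('Items')

def slow_scan(page_size=10, delay_seconds=1.0):
    """Scan the table a small page at a time, sleeping between calls
    to stay under the provisioned read capacity."""
    kwargs = {'Limit': page_size}
    while True:
        response = table.scan(**kwargs)
        for item in response['Items']:
            yield item
        last_key = response.get('LastEvaluatedKey')
        if last_key is None:
            return  # reached the end of the table
        kwargs['ExclusiveStartKey'] = last_key
        time.sleep(delay_seconds)  # throttle ourselves between pages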
I have a simple table structure in DynamoDB with two fields. The partition key is a string (say Key). The second field is a Number Set (say Values).
I run a batch_get_item query on the partition key. The response time when I execute the command from the AWS CLI is different from when I execute it from an AWS Lambda function.
The query response includes the Number Set, which can contain 30,000 elements.
How can I test the true response time? Are there any tools or best practices for measuring response times? Also, the response time is currently 0.3 seconds; how can I bring it down to milliseconds? By increasing the read capacity? Or, given that I have sets of 30,000 elements, can the response time not be reduced further?
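One simple way to measure just the call itself is to time the SDK call inside the Lambda handler, for example with time.perf_counter; the table name and key below are hypothetical placeholders:

import time
import boto3

dynamodb = boto3.resource('dynamodb')

def handler(event, context):
    keys = [{'Key': 'some-partition-key'}]  # hypothetical key to fetch
    start = time.perf_counter()
    response = dynamodb.batch_get_item(
        RequestItems={'MyTable': {'Keys': keys}}
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f'batch_get_item took {elapsed_ms:.1f} ms')
    return response['Responses']['MyTable']

One caveat: the first call on a cold Lambda includes connection setup, so averaging several warm invocations usually gives a more representative number than a single measurement.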