DynamoDB query performance is low - amazon-web-services

First of all some specs:
number of entries: 110k
total size of table: 700MB
number of columns per data set: can be up to 450
data set size: can be up to 25kB
read/write capacity: "on demand"
Problem: when trying to query for some rows by a column it takes easily up to 10 seconds or more.
The column we query by is a UUID column (an index exists). It is not unique and is used somewhat like an external ID: we ask for all records with a given UUID and expect up to about 1000 rows.
Even if I take our application out of the equation entirely and test directly in the AWS Management Console, it makes no difference: the query still takes about 10 seconds or more.
So my question: do you have any ideas or concrete tips that I should check/test/adjust to improve the performance?
As requested, here's the example code in PHP (reduced to the relevant parts):
We use the official aws/aws-sdk-php package.
// No previous page exists before the first request.
$queryResult = null;
do {
    // Marshal a native PHP array of data to a DynamoDB item.
    $transformedValue = $this->getDynamoArrayFromNativeArray(
        [':uuid' => $uuid]
    );
    $params = [
        'TableName' => 'our-table-name',
        'IndexName' => 'our-uuid-index',
        'KeyConditionExpression' => 'our-uuid-column = :uuid',
        'ExpressionAttributeValues' => $transformedValue,
    ];
    // Continue from where the previous page stopped.
    if ($queryResult !== null) {
        $lastEvaluatedKey = $queryResult['LastEvaluatedKey'];
        $params['ExclusiveStartKey'] = $lastEvaluatedKey;
    }
    $queryResult = $this->client->query($params);
    // (push results to some array)
} while ($queryResult['LastEvaluatedKey'] !== null);
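As a rough sanity check on the loop above, the number of sequential requests can be estimated from the specs. This is a back-of-the-envelope sketch, not a measurement: it relies on the documented limit of 1 MB of data per Query call, and it assumes the index projects the full items and that about 1000 rows match. The function name is made up for illustration.

```python
import math

# Documented DynamoDB limit: one Query call returns at most 1 MB of data.
QUERY_PAGE_LIMIT_BYTES = 1024 * 1024


def estimate_query_pages(item_count: int, avg_item_bytes: int) -> int:
    """Estimate how many sequential Query calls a result set needs."""
    total_bytes = item_count * avg_item_bytes
    return max(1, math.ceil(total_bytes / QUERY_PAGE_LIMIT_BYTES))


# Worst case from the specs: ~1000 matching rows of up to 25 kB each.
worst_case_pages = estimate_query_pages(1000, 25 * 1000)

# Average case: 700 MB spread over 110k entries is roughly 6.4 kB per item.
average_item_bytes = 700 * 1000 * 1000 // 110_000
average_case_pages = estimate_query_pages(1000, average_item_bytes)

print(worst_case_pages, average_case_pages)
```

Because each page needs the LastEvaluatedKey of the previous one, these requests cannot overlap, so the per-request latency is multiplied by the page count.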
Example data set:
{
"_id": "82ee23ce-d7ff-11eb-bf92-0aa84964df0a",
"_meta": {
"creation_date": 1624877797,
"uuid": "820025c0-d7ff-11eb-a5f4-0aa84964df0a"
},
"some_key.data.id": 63680,
(couple of hundred more simple key => value pairs, nothing special, no huge values or anything like that)
Read capacity chart for the index in question:
Query latency chart:

Related

How to retrieve records efficiently using indices in dynamodb aws

I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social...
My model looks as follows:
type StudentMarks
  @model
  @key(
    name: "filterbyClassAndMarks"
    fields: [
      "gender"
      "classCode"
      "mathsMarks"
      "socialMarks"
      "englishMarks"
      "scienceMarks"
    ]
    queryField: "filterbyClassAndMarks"
  )
  @auth(
    rules: [
      { allow: private, provider: iam, operations: [read] }
      { allow: public, provider: iam, operations: [read] }
    ]
  ) {
  id: ID!
  name: String
  gender: String
  classCode: String
  mathsMarks: String
  socialMarks: String
  englishMarks: String
  scienceMarks: String
}
The GSI, created with gender as the partition key (hash key) and a compound sort/range key made of the classCode, mathsMarks, socialMarks, scienceMarks and englishMarks fields, shows an item count of 0.
I am trying to query using the following graphql:
const listOfStudents = await API.graphql({
query: queries.filterbyClassAndMarks,
variables: {
gender: "m",
classCodeMathsMarksSocialMarksEnglishMarksScienceMarks: {
between: [
{
classCode: "05",
mathsMarks: "06",
scienceMarks: "07",
englishMarks: "04",
socialMarks: "05",
},
{
classCode: "11",
mathsMarks: "90",
scienceMarks: "91",
englishMarks: "95",
socialMarks: "92",
},
],
},
},
authMode: "AWS_IAM",
});
This table with 11 records should return 4 records as shown in green
The cloud watch logs are as below:
{
"logType": "RequestMapping",
"path": [
"filterbyClassAndMarks"
],
"fieldName": "filterbyClassAndMarks",
"resolverArn": "arn:aws:appsync:ap-south-1:488377001042:apis/4tbr6xolzjfctl5tdbiur7jqnu/types/Query/resolvers/filterbyClassAndMarks",
"requestId": "5de45e66-1a5d-44dd-80cc-1a5b93a3c3aa",
"context": {
"arguments": {
"gender": "m",
"classCodeMathsMarksSocialMarksEnglishMarksScienceMarks": {
"between": [
{
"classCode": "05",
"mathsMarks": "06",
"socialMarks": "05",
"englishMarks": "04",
"scienceMarks": "07"
},
{
"classCode": "11",
"mathsMarks": "90",
"socialMarks": "92",
"englishMarks": "95",
"scienceMarks": "91"
}
]
},
},
"stash": {},
"outErrors": []
},
"fieldInError": false,
"errors": [],
"parentType": "Query",
"graphQLAPIId": "4tbr6xolzjfctl5tdbiur7jqnu",
"transformedTemplate": "{\"version\":\"2018-05-29\",\"operation\":\"Query
\",\"limit\":100,\"query\":{\"expression\":\"#gender = :gender AND #sortKey
BETWEEN :sortKey0 AND :sortKey1\",\"expressionNames\":{\"#gender\":\"gender
\",\"#sortKey\":\"classCode#mathsMarks#socialMarks#englishMarks#scienceMarks
\"},\"expressionValues\":{\":gender\":{\"S\":\"m\"},\":sortKey0\":{\"S
\":\"05#06#05#04#07\"},\":sortKey1\":{\"S\":\"11#90#92#95#91\"}}},\"index
\":\"filterbyClassAndMarks\",\"scanIndexForward\":true}"
}
It is showing the scanned count as 0 in the result:
"result": {
"items": [],
"scannedCount": 0
},
And the final response is success with empty result as shown below:
{ data: { filterbyClassAndMarks: { items: [], nextToken: null } } }
Key conditions on compound sort keys are possible according to the AWS documentation, so I do not understand what exactly I am missing. Probably the GSI is not properly configured or written.
Since the result is usually less than 1% of the records in the table and the workload is read intensive, scanning all the records and filtering them is a very naive solution. I need a better solution, either with indices or otherwise.
SECOND EDIT:
Example of the expected behavior when the hash key and compound sort key are applied before filtering. The example is only meant to highlight the expected behavior; it does not indicate the approximate percentage of records skipped thanks to the hash key or compound sort key.
DynamoDB does not excel at supporting ad-hoc queries across an arbitrary list of attributes. If you want to fetch the same item using an undefined (or large) number of attributes (e.g. fetch a student by mathsMarks, socialMarks, scienceMarks, name, gender, etc.), you will be better off using something other than DynamoDB.
EDIT:
You've updated your question with information that fundamentally changes the access pattern. You initially said
I want to fetch records of all male or female students from dynamodb who have secured more than x1 marks in maths, more than x2 marks in science, less than y1 marks in social and less than y2 marks in English...
and later changed this access pattern to say (emphasis mine)
I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social
Partitioning your data on gender and class range might allow you to implement this access pattern. You also removed ... other variables from your example schema, suggesting you might have a fixed list of marks that includes math, science, social, and English. This is less important than gender and class, but hear me out :)
You didn't mention it in your original question, but your data suggests you have a fetch Student by ID access pattern. So I started by defining students by ID:
The partition key is STUDENT#student_id and the sort key is A. I sometimes use "METADATA" or "META" as the sort key, but "A" is nice and short.
To support your primary access pattern, I created a global secondary index named GSI1, with PK and SK attributes of GSI1PK and GSI1SK. I set GSI1PK to STUDENTS#gender and GSI1SK to the class attribute.
This partitions your data by gender and class. To narrow your results even further, you'd need to use filters on the various marks. For example, if you wanted to fetch all male students between class 5 and 9 with specific marks, you could do the following (in DynamoDB pseudocode):
QUERY from GSI1 where PK=STUDENTS#M AND SK BETWEEN 05 and 09
FILTER englishMark BETWEEN 009 and 050 AND
       mathsMark BETWEEN 050 and 075 AND
       scienceMark BETWEEN 045 and 065 AND
       socialMark BETWEEN 020 and 035
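To see why the marks end up in a filter instead of the sort key, here is a small sketch of how BETWEEN behaves on the "#"-joined sort key from the question. It assumes the key is compared as one plain string, which is how string sort keys are ordered; the helper names are made up for illustration, and this is separate from the question of why the index is empty.

```python
def composite_key(class_code, maths, social, english, science):
    # Same "#"-joined layout as the sort key in the resolver log.
    return "#".join([class_code, maths, social, english, science])


low = composite_key("05", "06", "05", "04", "07")
high = composite_key("11", "90", "92", "95", "91")


def in_key_range(key):
    # BETWEEN on a string sort key is an ordinary string comparison
    # (byte order, which matches Python's ordering for ASCII keys).
    return low <= key <= high


# Class 07 with maths mark 99 is outside the intended maths range (06..90),
# yet the whole string still sorts between the two bounds.
outside_marks = composite_key("07", "99", "00", "00", "00")
print(in_key_range(outside_marks))
```

Only the leading component (classCode) is really range-restricted; the later components matter only when everything before them ties with a bound.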
Filtering in DynamoDB doesn't work the way most people think. When filtering, DynamoDB:
1. Reads the items from the table
2. Applies the filter to remove items that don't match
3. Returns the remaining items
This can lead to awful performance (and cost) if you have a large table and are filtering for a very small set of data. For example, if you're executing a scan operation on terabytes of data and applying filters to identify a single item, you're going to have a bad time.
However, there are circumstances where filtering is a fine solution. One example is when you can use the Partition Key and Sort Key to narrow the set of data you're working with before applying a filter. In the above example, I'm dividing your data by gender (perhaps dividing your search space in half) before further narrowing the items by class. The amount of data you're left with might be small enough to filter out the remaining items effectively.
However, I'll point out that gender has rather low cardinality. Perhaps there are more specifics about your access pattern that could help with that. For example, maybe you could group students by primary/secondary school and create a PK of STUDENTS#F#1-5. Maybe you could group them by the school district, or zip code? No detail about access patterns is too specific when working with NoSQL databases!
The point of this exercise is to illustrate two points:
Filtering in DynamoDB is best achieved by selecting a good primary key.
Using the DynamoDB filtering mechanism is best used on smaller subsets of your data. Don't be afraid to use it in the right places!
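The read-then-filter order described above can be illustrated with a tiny in-memory simulation. This is only a sketch with made-up sample data and a hypothetical query_with_filter helper, not the DynamoDB client API; the ScannedCount and Count names mirror the fields a real Query response reports.

```python
from dataclasses import dataclass


@dataclass
class Student:
    gender: str
    class_code: str
    maths: int


def query_with_filter(items, gender, class_low, class_high, maths_low, maths_high):
    # Step 1: the key condition decides which items are read (and paid for).
    read = [
        s for s in items
        if s.gender == gender and class_low <= s.class_code <= class_high
    ]
    # Step 2: the filter only trims what is sent back to the caller.
    returned = [s for s in read if maths_low <= s.maths <= maths_high]
    return {"ScannedCount": len(read), "Count": len(returned), "Items": returned}


students = [
    Student("M", "05", 80),
    Student("M", "07", 40),
    Student("M", "09", 60),
    Student("M", "11", 90),
    Student("F", "06", 70),
    Student("F", "08", 55),
]

result = query_with_filter(students, "M", "05", "09", 50, 75)
print(result["ScannedCount"], result["Count"])
```

The gap between the two counts is the work done for items that were read and then thrown away, which is why narrowing with the keys first matters.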

Single endpoint post array with multiple type of dictionary

I am creating the RESTful endpoints for supporting frontend payload.
My payload is an order that can contain build-your-own dishes and ready-made single dishes.
Problem:
The frontend developer wants to send everything in a single POST. That means the submitted list will contain two types of dictionary:
one for build-your-own dishes and one for ready-made single dishes.
IMO:
He could POST twice, once for each type of payload. That way each endpoint does one thing, which is the approach I prefer.
He has only one reason for wanting to POST everything to a single endpoint.
Question:
What is the best practice for this sort of problem?
Build Your Own Payload:
In short I call it BYO.
1. base_bowl dictates the size and price of the item.
2. base_bowl also determines the number of fishes, toppings and sauces,
because each base_bowl size (S, M or L) has a different quota.
For example:
Size S can have 1 size-S scoop of fishes and 2 size-S scoops of toppings.
Size M can have 2 size-M scoops of fishes and 3 size-M scoops of toppings. If the customer wants more than the quota, the extras must go in extra_fishes and extra_toppings.
Items are referenced by Price id, since the quantity is determined by the number of members in each list.
{
"base_bowl": salad.id, # require=True, Price id
"fishes": [salmon.id, tuna.id],
"extra_fishes": [tofu.id],
"toppings": [tamago.id, mango.id],
"extra_toppings": [rambutan.id],
"premium_toppings": [ikura.id],
"sauces": [shoyu.id, spicy_kimchi.id],
"extra_sauces": [],
"sprinkles": [sesame.id, fried_shalots.id],
"dish_order": 1, # require=True
"note": {
'msg': 'eat here',
},
}
The backend will validate the input and INSERT it into Order and OrderItem.
Ready-Made Dish:
This is very straightforward because it has no implicit logic like BYO. It just adds OrderItems to the Order.
It uses the Menu id, size and qty to determine the price, because the customer is free to choose.
{
'order_items': [
{
'menu_id': has_poink_menu.id,
'size': Price.MenuSize.XL, # 27, 37, 47, 52
'qty': 2, # amount = 52 * 2
},
{
'menu_id': no_poink_menu.id,
'size': Price.MenuSize.L, # 20, 30, 40, 45
'qty': 1 # amount = 40 * 1
}
]
}
My answer is opinionated, but to me a RESTful design is kept much clearer by keeping endpoints specific and well defined. So in your case there may be a BYODishViewSet and ReadyMadeDishViewSet mapped to /api/byodish/ and /api/readymadedish/.
However, if this is part of a larger single model, say an Order model, then you may want to consider using a nested (writable) serializer to wrap up an Order as a single API request-response.
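As a rough illustration of the single-Order option, here is a framework-free sketch of a combined payload and the kind of per-section checks a nested serializer would perform. The byo_items key and the split_order_payload helper are hypothetical names; in Django REST Framework this would be two nested serializers declared with many=True on an Order serializer, not hand-written checks.

```python
def split_order_payload(payload):
    """Validate a combined order payload; return (byo_items, ready_made_items)."""
    byo_items = payload.get("byo_items", [])
    ready_made_items = payload.get("order_items", [])
    errors = []
    # Each section keeps its own required fields, as in the two payloads above.
    for index, item in enumerate(byo_items):
        for field in ("base_bowl", "dish_order"):
            if field not in item:
                errors.append(f"byo_items[{index}]: missing {field}")
    for index, item in enumerate(ready_made_items):
        for field in ("menu_id", "size", "qty"):
            if field not in item:
                errors.append(f"order_items[{index}]: missing {field}")
    if errors:
        raise ValueError("; ".join(errors))
    return byo_items, ready_made_items


# Illustrative ids only; real payloads would carry Price and Menu ids.
example_payload = {
    "byo_items": [{"base_bowl": 1, "fishes": [10, 11], "dish_order": 1}],
    "order_items": [
        {"menu_id": 7, "size": "XL", "qty": 2},
        {"menu_id": 8, "size": "L", "qty": 1},
    ],
}

byo, ready_made = split_order_payload(example_payload)
print(len(byo), len(ready_made))
```

Keeping the two item types under separate keys avoids a single list of mixed dictionaries, so each section can still be validated by one well-defined rule set.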

How to write CouchDB view to get currently active servers given start timestamp and end timestamp of each server?

I have a set of documents, each containing a server name along with the start timestamp and end timestamp of that server, e.g.
[
{
serverName: "Houston",
startTimestamp: "2018/03/07 17:52:13 +000",
endTimestamp: "2018/03/07 18:50:10 +000"
},
{
serverName: "Canberra",
startTimestamp: "2018/03/07 18:48:09 +000",
endTimestamp: "2018/03/07 20:10:00 +000"
},
{
serverName: "Melbourne",
startTimestamp: "2018/03/08 01:43:13 +000",
endTimestamp: "2018/03/08 12:09:10 +000"
}
]
With this data, given a timestamp, I need to get the list of servers that were active at that point in time.
For example, for TS="2018/03/07 18:50:00 +000" the list of active servers from the above data is ["Houston", "Canberra"].
Is it possible to achieve this using only CouchDB views? If so, how do I go about it?
Note: Initially I tried the following approach. In the map function I emit two rows per document:
one with key=doc.startTimestamp and value={"station_add": doc.station}
one with key=doc.endTimestamp and value={"station_rem": doc.station}
My intention was to iterate through these in the reduce function, adding the stations present in "station_add" and removing the stations in "station_rem". But I found that the CouchDB documentation does not mention anything about the ordering of values in the reduce function.
If you can live with fixed periods and don't mind the extra disk space that might be needed for the view results, you can create a view of active servers per hour, for example.
Iterate over the periods between start and end and emit the time that each server was online during this period:
function(doc) {
  var start = new Date(doc.startTimestamp).getTime()
  var end = new Date(doc.endTimestamp).getTime()
  var msPerPeriod = 60*60*1000
  var msOfflineInFirstPeriod = start % msPerPeriod
  var firstPeriod = start - msOfflineInFirstPeriod
  var msOnlineInLastPeriod = end % msPerPeriod
  var lastPeriod = end - msOnlineInLastPeriod
  if (firstPeriod === lastPeriod) {
    // The server was only online within one period.
    emit([new Date(firstPeriod), doc.serverName], [1, msOnlineInLastPeriod - msOfflineInFirstPeriod])
  } else {
    // The server was online over multiple periods.
    emit([new Date(firstPeriod), doc.serverName], [1, msPerPeriod - msOfflineInFirstPeriod])
    for (var period = firstPeriod + msPerPeriod; period < lastPeriod; period += msPerPeriod) {
      emit([new Date(period), doc.serverName], [1, msPerPeriod])
    }
    emit([new Date(lastPeriod), doc.serverName], [1, msOnlineInLastPeriod])
  }
}
If you want the total without the server names, just add a reduce function with the built-in shortcut _sum. You'll get the number of servers online during the period as the first number and the milliseconds that the servers were online in that period as the second number.
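To make the emitted rows and the summed result concrete, here is a Python port of the bucketing logic, run on the Houston and Canberra documents from the question. It is a sketch under a few assumptions: timestamps are passed as milliseconds since midnight to sidestep date parsing, the helper names are made up, and sum_by_period only imitates what _sum would return when grouping on the period alone.

```python
from collections import defaultdict

MS_PER_PERIOD = 60 * 60 * 1000  # one hour, as in the map function


def hms(hours, minutes, seconds):
    """Milliseconds since midnight, standing in for a parsed timestamp."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000


def map_rows(server_name, start_ms, end_ms):
    """Port of the map function: ((period_start_ms, name), [1, ms_online]) rows."""
    first_period = start_ms - start_ms % MS_PER_PERIOD
    last_period = end_ms - end_ms % MS_PER_PERIOD
    if first_period == last_period:
        # The server was only online within one period.
        return [((first_period, server_name), [1, end_ms - start_ms])]
    rows = [((first_period, server_name),
             [1, first_period + MS_PER_PERIOD - start_ms])]
    for period in range(first_period + MS_PER_PERIOD, last_period, MS_PER_PERIOD):
        rows.append(((period, server_name), [1, MS_PER_PERIOD]))
    rows.append(((last_period, server_name), [1, end_ms - last_period]))
    return rows


def sum_by_period(rows):
    """Element-wise sums per period, imitating _sum grouped on the period."""
    totals = defaultdict(lambda: [0, 0])
    for (period, _name), (count, ms_online) in rows:
        totals[period][0] += count
        totals[period][1] += ms_online
    return dict(totals)


rows = (map_rows("Houston", hms(17, 52, 13), hms(18, 50, 10))
        + map_rows("Canberra", hms(18, 48, 9), hms(20, 10, 0)))
totals = sum_by_period(rows)
print(totals[hms(18, 0, 0)])  # servers online and total ms online in the 18:00 hour
```

For the 18:00 hour both servers contribute a row, so the first number is 2, which matches the expected ["Houston", "Canberra"] at hourly granularity.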
You can play with the view if you emit the year, month and day as the first keys. Then you can use the group_level at query time to get a finer or more coarse overview.
Bear in mind that this view might get large on disk, as each row has to be stored, and also the intermediate results for each group level are stored. So you shouldn't set the period duration too small – emitting a row for each second would take a lot of disk space, for example.

PowerBI Custom Visual - Table data binding

Also asked this on the PowerBI forum.
I am trying to change sampleBarChart PowerBI visual to use a "table" data binding instead of current "categorical". First goal is to build a simple table visual, with inputs "X", "Y" and "Value".
Both data bindings are described on the official wiki. This is all I could find:
I cannot find any example visuals which use it and are based on the new API.
From the image above, a table object has "rows", "columns", "totals" and "identities". So it looks like rows and columns are my x/y indexes, and totals are my values?
This is what I tried. (Naming is slightly off as most of it came from existing barchart code)
Data roles:
{ "displayName": "Category1 Data",
"name": "category1",
"kind": 0},
{ "displayName": "Category2 Data",
"name": "category2",
"kind": 0},
{ "displayName": "Measure Data",
"name": "measure",
"kind": 1}
Data view mapping:
"table": {
"rows": {"for": {"in": "category1"}},
"columns": {"for": {"in": "category2"}},
"totals": {"select": [{"bind": {"to": "measure"}}]}
}
Data Point class:
interface BarChartDataPoint {
value: number;
category1: number;
category2: number;
color: string;
};
Relevant parts of my visualTransform():
...
let category1 = categorical.rows;
let category2 = categorical.columns;
let dataValue = categorical.totals;
...
for (let i = 1, len = category1.length; i <= len; i++) {
for (let j = 1, jlen = category2.length; j <= jlen; j++) {
{
barChartDataPoints.push({
category1: i,
category2: j,
value: dataValue[i,j],
color: "#555555"//for now
});
}
...
Test data looks like this:
__1_2_3_
1|4 4 3
2|4 5 5
3|3 6 7 (total = 41)
The code above fills barChartDataPoints with just six data points:
(1; 1; 41),
(1; 2; undefined),
(2; 1; 41),
(2; 2; undefined),
(3; 1; 41),
(3; 2; undefined).
Accessing index zero results in nulls.
Q: Is totals not the right measure to access value at (x;y)? What am I doing wrong?
Any help or direction is very appreciated.
User @RichardL shared this link on the PowerBI forum, which helped quite a lot.
"Totals" is not the right measure to access value at (x;y).
It turns out Columns contain column names, and Rows contain value arrays which correspond to those columns.
From the link above, this is how table structure looks like:
{
"columns":[
{"displayName": "Year"},
{"displayName": "Country"},
{"displayName": "Cost"}
],
"rows":[
[2014, "Japan", 25],
[2015, "Japan", 30],
[2016, "Japan", 18],
[2015, "North America", 14],
[2016, "North America", 30],
[2016, "China", 100]
]
}
You can also view the data as your visual receives it by placing this
window.alert(JSON.stringify(options.dataViews))
in your update() method, or by writing it into the HTML contents of your visual.
This was very helpful but it shows up a few fundamental problems with the data management of PowerBI for a custom visual. There is no documentation and the process from Roles to mapping to visualTransform is horrendous because it takes so much effort to rebuild the data into a format that is usable consistently with D3.
Commenting on user5226582's example: for me, columns is presented in a form where I have to look up the roles property to understand the order of the data in each rows array. displayName offers no certainty. For example, if a user uses the same field in two different dataRoles, it all goes badly awry.
I think the safest approach is to build a new array inside visualTransform using the well-known field names (the "name" property in dataRoles), then iterate over columns, interrogating the roles property to establish an index into the rows array items. Then use that index to populate the new array reliably. D3 then gobbles that up.
I know that's crazy, but at least it means reasonably consistent data and allows for the user selecting the same data field more than once or choosing count instead of column value.
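The role-based lookup described above can be sketched on plain data. This assumes each column carries a roles mapping of data-role name to a boolean, next to displayName; the dictionary layout and the rows_by_role helper are illustrative assumptions, not the exact DataView objects Power BI hands to a visual.

```python
def rows_by_role(table, role_names):
    """Re-key each table row by data-role name instead of column position."""
    index_for_role = {}
    for position, column in enumerate(table["columns"]):
        for role, enabled in column.get("roles", {}).items():
            if enabled and role in role_names:
                # The first column carrying a role wins.
                index_for_role.setdefault(role, position)
    return [
        {role: row[index] for role, index in index_for_role.items()}
        for row in table["rows"]
    ]


# The same field ("Year") is bound to two roles, the case that makes
# displayName unreliable.
table = {
    "columns": [
        {"displayName": "Cost", "roles": {"measure": True}},
        {"displayName": "Year", "roles": {"category1": True, "category2": True}},
    ],
    "rows": [[25, 2014], [30, 2015]],
}

points = rows_by_role(table, {"category1", "category2", "measure"})
print(points)
```

The resulting list of role-keyed records no longer depends on column order, which is the property the answer is after before passing the data to D3.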
All in all, I think this area needs a lot of attention before custom Visuals can really take off.

cloudant index: count number of unique users per time period

A very similar post was made about this issue here. In cloudant, I have a document structure storing when users access an application, that looks like the following:
{"username":"one","timestamp":"2015-10-07T15:04:46Z"}---| same day
{"username":"one","timestamp":"2015-10-07T19:22:00Z"}---^
{"username":"one","timestamp":"2015-10-25T04:22:00Z"}
{"username":"two","timestamp":"2015-10-07T19:22:00Z"}
What I want is to count the number of unique users for a given time period. Ex:
2015-10-07 = {"count": 2} two different users accessed on 2015-10-07
2015-10-25 = {"count": 1} one user accessed on 2015-10-25
2015 = {"count": 2} two different users accessed in 2015
This becomes tricky because, for example, on 2015-10-07 username one has two access records, but should only add 1 to the total of unique users.
I've tried:
function(doc) {
var time = new Date(Date.parse(doc['timestamp']));
emit([time.getUTCFullYear(),time.getUTCMonth(),time.getUTCDay(),doc.username], 1);
}
This suffers from several issues, which are highlighted by Jesus Alva who commented in the post I linked to above.
Thanks!
There's probably a better way of doing this, but off the top of my head ...
You could try emitting an index for each level of granularity:
function(doc) {
  var time = new Date(Date.parse(doc['timestamp']));
  var year = time.getUTCFullYear();
  var month = time.getUTCMonth()+1;
  var day = time.getUTCDate();
  // day granularity
  emit([year,month,day,doc.username], null);
  // year granularity
  emit([year,doc.username], null);
}
// reduce function - `_count`
Day query (2015-10-07):
inclusive_end=true&
start_key=[2015, 10, 7, "\u0000"]&
end_key=[2015, 10, 7, "\uefff"]&
reduce=true&
group=true
Day query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,10,7,"one"],"value":2},
{"key":[2015,10,7,"two"],"value":1}
]}
Year query:
inclusive_end=true&
start_key=[2015, "\u0000"]&
end_key=[2015, "\uefff"]&
reduce=true&
group=true
Query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,"one"],"value":3},
{"key":[2015,"two"],"value":1}
]}
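The "count the number of rows" step can be tried out locally. The sketch below replays the map function and a grouped _count on the four sample documents in plain Python, then counts the grouped rows per period. The function names are made up, and the grouping only imitates reduce=true with group=true; it does not call Cloudant.

```python
from datetime import datetime

DOCS = [
    {"username": "one", "timestamp": "2015-10-07T15:04:46Z"},
    {"username": "one", "timestamp": "2015-10-07T19:22:00Z"},
    {"username": "one", "timestamp": "2015-10-25T04:22:00Z"},
    {"username": "two", "timestamp": "2015-10-07T19:22:00Z"},
]


def map_doc(doc):
    """Same keys as the map function: one per day and one per year."""
    time = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    yield (time.year, time.month, time.day, doc["username"])
    yield (time.year, doc["username"])


def grouped_count(keys):
    """Imitates _count with group=true: one row per distinct key."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts


def unique_users(rows, prefix):
    """Number of grouped rows whose key is the prefix plus a username."""
    return sum(
        1 for key in rows
        if len(key) == len(prefix) + 1 and key[:-1] == prefix
    )


rows = grouped_count(key for doc in DOCS for key in map_doc(doc))
print(unique_users(rows, (2015, 10, 7)),
      unique_users(rows, (2015, 10, 25)),
      unique_users(rows, (2015,)))
```

The reduce value of each row is the number of accesses by that user, while the number of rows is the number of distinct users, which is the figure the question asks for.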