Google Datastore projection query on array of complex objects - google-cloud-platform

I have a simple Datastore kind having the following properties:
id (long)
createdAt (timestamp)
userId (string)
metrics (array of complex objects), each with:
  type of metric
  value of metric
Each stored row in the Datastore might have a different number of metrics as well as different types of metrics.
I have a very specific requirement: query the latest metrics of a user. The problem is that different rows have different metrics, so I can't just take the most recent row; I need to look into the metrics array to retrieve all the data.
I decided to use projection queries. My idea was to create a projection based on the following properties: metrics.type and metrics.value, with distinct on metrics.type and order by createdAt desc.
For a better explanation, a simple example of rows from the Datastore:
1. { "id": 111, "createdAt": "2019-01-01 00:00", "userId" : "user-123", [{ "type" : "metric1", "value" : 123 }, { "type" : "metric2", "value" : 345 }] }
2. { "id": 222, "createdAt": "2019-01-02 00:00", "userId" : "user-123", [{ "type" : "metric3", "value" : 567 }, { "type" : "metric4", "value" : 789 }] }
I expected a projection query with distinct on metrics.type to return the following results:
1. "metric1", 123
2. "metric2", 345
3. "metric3", 567
4. "metric4", 789
but what the query actually returns is:
1. "metric1", 123
2. "metric2", 123
3. "metric3", 123
4. "metric4", 123
So all metrics have the same value, which is incorrect. Basically it happens because of the exploded index: Datastore treats metrics.type and metrics.value as two independent arrays, when in fact they belong to a single array of objects.
Is there any way to make the projection query return what I expect instead of exploding the index? If not, how can I restructure my data so it meets my requirements?

The Cloud Datastore documentation specifically warns about your exact issue.
https://cloud.google.com/datastore/docs/concepts/queries#projections_and_array-valued_properties
One option to solve this is to combine the type and value into a single property. So, have a property called "metric" whose values look like "metric1:123", "metric2:345". Then you will be projecting a single array-valued property.
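For illustration, a rough sketch of that approach with the Node.js client (the kind name Entry, the helper names and the client choice are assumptions for the example, not something from your setup):

import { Datastore } from '@google-cloud/datastore';

const datastore = new Datastore();

// At write time, collapse each {type, value} pair into one string so the entity
// carries a single array-valued property instead of two parallel arrays.
async function saveEntry(userId: string, metrics: { type: string; value: number }[]) {
  await datastore.save({
    key: datastore.key(['Entry']),
    data: {
      userId,
      createdAt: new Date(),
      metric: metrics.map((m) => `${m.type}:${m.value}`), // e.g. "metric1:123"
    },
  });
}

// Projection + distinct on the single combined property.
// Note: a projection query that also filters on userId needs a composite index,
// and a distinct-on query requires its sort order to start with the distinct property.
async function distinctMetrics(userId: string) {
  const query = datastore
    .createQuery('Entry')
    .filter('userId', '=', userId)
    .select('metric')
    .groupBy('metric');

  const [rows] = await datastore.runQuery(query);
  return rows.map((r) => {
    const [type, value] = String(r.metric).split(':');
    return { type, value: Number(value) };
  });
}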

Related

AWS OpenSearch/ElasticSearch match_phrase_prefix query doesn't work for a node

I have an AWS ElasticSearch domain with 4 shards and 3 data nodes.
Shard 3 is allocated to 2 nodes: xQVLroD1RoCShwwzLwXY4g and gn9dYu4pS22gNEDeCg6RrQ.
The first query returns the expected result:
POST carat-prod/_search?preference=_shards:3|_only_nodes:xQVLroD1RoCShwwzLwXY4g
{
"explain": true,
"query": {
"match_phrase_prefix": {
"content": {
"query": "10 Blahblahblah Avenue Apt. 42 Banana, NJ 0700"
}
}
}
}
But the second one returns nothing:
POST carat-prod/_search?preference=_shards:3|_only_nodes:gn9dYu4pS22gNEDeCg6RrQ
{
"explain": true,
"query": {
"match_phrase_prefix": {
"content": {
"query": "10 Blahblahblah Avenue Apt. 42 Banana, NJ 0700"
}
}
}
}
The only difference in the queries is the specified node.
Actually, the target document contains this line (with an additional 1 after NJ 0700):
10 Blahblahblah Avenue Apt. 42 Banana, NJ 07001
If I try to search for
10 Blahblahblah Avenue Apt. 42 Banana, NJ 07001
I get the expected result from both nodes.
Does anyone have any ideas why this might be so?
Please share your index mappings and settings; without these it is very difficult to answer your question.
Some solutions:
Try to refresh the index that you are querying; this may be due to a stale index (see the sketch below).
I don't know how a shard can be on two nodes at a time. A primary shard can only live on one node. Its replicas may live on other nodes, but they don't really matter in this case.
A document can only live on one primary shard, so what you are probably querying are two different nodes, but the document is only present in the primary shard on one node.
Did you give a node/shard preference while indexing the document?
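For example, with the JavaScript client you can force a refresh and list which node each copy of the shard lives on (a sketch; the endpoint is a placeholder and the index name comes from your queries):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://your-domain-endpoint' }); // placeholder endpoint

async function inspect() {
  // Force a refresh so recently indexed documents become searchable.
  await client.indices.refresh({ index: 'carat-prod' });

  // List shard copies (primary vs. replica) and the node each one is allocated to.
  const shards = await client.cat.shards({ index: 'carat-prod', format: 'json' });
  console.log(shards);
}

inspect().catch(console.error);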
HTH

How to retrieve records efficiently using indices in dynamodb aws

I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social...
My model looks as follows:
type StudentMarks
  @model
  @key(
    name: "filterbyClassAndMarks"
    fields: [
      "gender"
      "classCode"
      "mathsMarks"
      "socialMarks"
      "englishMarks"
      "scienceMarks"
    ]
    queryField: "filterbyClassAndMarks"
  )
  @auth(
    rules: [
      { allow: private, provider: iam, operations: [read] }
      { allow: public, provider: iam, operations: [read] }
    ]
  ) {
  id: ID!
  name: String
  gender: String
  classCode: String
  mathsMarks: String
  socialMarks: String
  englishMarks: String
  scienceMarks: String
}
The GSI created with gender as the partition key (hash key) and a compound sort/range key with the classCode, mathsMarks, socialMarks, scienceMarks, and englishMarks fields is showing an item count of 0.
I am trying to query using the following GraphQL:
const listOfStudents = await API.graphql({
  query: queries.filterbyClassAndMarks,
  variables: {
    gender: "m",
    classCodeMathsMarksSocialMarksEnglishMarksScienceMarks: {
      between: [
        {
          classCode: "05",
          mathsMarks: "06",
          scienceMarks: "07",
          englishMarks: "04",
          socialMarks: "05",
        },
        {
          classCode: "11",
          mathsMarks: "90",
          scienceMarks: "91",
          englishMarks: "95",
          socialMarks: "92",
        },
      ],
    },
  },
  authMode: "AWS_IAM",
});
This table with 11 records should return 4 records, as shown in green.
The CloudWatch logs are as follows:
{
  "logType": "RequestMapping",
  "path": ["filterbyClassAndMarks"],
  "fieldName": "filterbyClassAndMarks",
  "resolverArn": "arn:aws:appsync:ap-south-1:488377001042:apis/4tbr6xolzjfctl5tdbiur7jqnu/types/Query/resolvers/filterbyClassAndMarks",
  "requestId": "5de45e66-1a5d-44dd-80cc-1a5b93a3c3aa",
  "context": {
    "arguments": {
      "gender": "m",
      "classCodeMathsMarksSocialMarksEnglishMarksScienceMarks": {
        "between": [
          {
            "classCode": "05",
            "mathsMarks": "06",
            "socialMarks": "05",
            "englishMarks": "04",
            "scienceMarks": "07"
          },
          {
            "classCode": "11",
            "mathsMarks": "90",
            "socialMarks": "92",
            "englishMarks": "95",
            "scienceMarks": "91"
          }
        ]
      }
    },
    "stash": {},
    "outErrors": []
  },
  "fieldInError": false,
  "errors": [],
  "parentType": "Query",
  "graphQLAPIId": "4tbr6xolzjfctl5tdbiur7jqnu",
  "transformedTemplate": "{\"version\":\"2018-05-29\",\"operation\":\"Query\",\"limit\":100,\"query\":{\"expression\":\"#gender = :gender AND #sortKey BETWEEN :sortKey0 AND :sortKey1\",\"expressionNames\":{\"#gender\":\"gender\",\"#sortKey\":\"classCode#mathsMarks#socialMarks#englishMarks#scienceMarks\"},\"expressionValues\":{\":gender\":{\"S\":\"m\"},\":sortKey0\":{\"S\":\"05#06#05#04#07\"},\":sortKey1\":{\"S\":\"11#90#92#95#91\"}}},\"index\":\"filterbyClassAndMarks\",\"scanIndexForward\":true}"
}
It is showing the scanned count as 0 in the result:
"result": {
"items": [],
"scannedCount": 0
},
And the final response is a success with an empty result, as shown below:
Object { data: {…} }
  data: Object { filterbyClassAndMarks: {…} }
    filterbyClassAndMarks: Object { items: [], nextToken: null }
      items: Array []
        length: 0
        <prototype>: Array []
      nextToken: null
Key conditions on compound sort keys are possible as per the AWS documentation... I do not understand what exactly I am missing. Probably the GSI is not properly configured/written.
Since most of the time the result is less than 1% of the records in the table and the workload is read-intensive, scanning/reading all the records and filtering them is a very naive solution. I need a better solution, either with indices or otherwise.
SECOND EDIT:
Example of the expected behavior of applying the hash key and compound sort key before filtering. The example is given to highlight the expected behavior only and does not indicate the approximate percentage of records not read due to the hash key or compound sort key.
DynamoDB does not excel at supporting ad-hoc queries across an arbitrary list of attributes. If you want to fetch the same item using an undefined (or large) number of attributes (e.g. fetch a student by mathsMarks, socialMarks, scienceMarks, name, gender, etc.), you will be better off using something other than DynamoDB.
EDIT:
You've updated your question with information that fundamentally changes the access pattern. You initially said
I want to fetch records of all male or female students from dynamodb who have secured more than x1 marks in maths, more than x2 marks in science, less than y1 marks in social and less than y2 marks in English...
and later changed this access pattern to say (emphasis mine)
I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social
Partitioning your data on gender and class range might allow you to implement this access pattern. You also removed ... other variables from your example schema, suggesting you might have a fixed list of marks including math, science, social, and English. This is less important than gender and class, but hear me out :)
You didn't mention it in your original question, but your data suggests you have a fetch Student by ID access pattern. So I started by defining students by ID:
The partition key is STUDENT#student_id and the sort key is A. I sometimes use "METADATA" or "META" as the sort key, but "A" is nice and short.
To support your primary access pattern, I created a global secondary index named GSI1, with partition and sort key attributes of GSI1PK and GSI1SK. I assigned GSI1PK the value STUDENTS#gender and GSI1SK the class attribute.
This partitions your data by gender and class. To narrow your results even further, you'd need to use filters on the various marks. For example, if you wanted to fetch all male students between class 5 and 9 with specific marks, you could do the following (in DynamoDB pseudocode):
QUERY from GSI1 where PK=STUDENTS#M SK BETWEEN 05 and 09
FILTER englishMark BETWEEN 009 and 050 AND
       mathsMark BETWEEN 050 and 075 AND
       scienceMark BETWEEN 045 and 065 AND
       socialMark BETWEEN 020 and 035
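For illustration, roughly what that could look like with the plain JavaScript SDK rather than a generated Amplify resolver (the table name, index name, and attribute names here are assumptions matching the design above):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query GSI1 (partition key = STUDENTS#<gender>, sort key = class code) and filter on
// the marks. The key condition limits what DynamoDB reads; the filter only trims what
// is returned.
async function maleStudentsClass5To9() {
  const result = await ddb.send(new QueryCommand({
    TableName: 'Students', // assumption
    IndexName: 'GSI1',
    KeyConditionExpression: 'GSI1PK = :pk AND GSI1SK BETWEEN :lo AND :hi',
    FilterExpression:
      'englishMark BETWEEN :e1 AND :e2 AND mathsMark BETWEEN :m1 AND :m2 AND ' +
      'scienceMark BETWEEN :s1 AND :s2 AND socialMark BETWEEN :so1 AND :so2',
    ExpressionAttributeValues: {
      ':pk': 'STUDENTS#M',
      ':lo': '05', ':hi': '09',
      ':e1': '009', ':e2': '050',
      ':m1': '050', ':m2': '075',
      ':s1': '045', ':s2': '065',
      ':so1': '020', ':so2': '035',
    },
  }));
  return result.Items ?? [];
}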
Filtering in DynamoDB doesn't work like most people think. When filtering, DynamoDB:
Reads items from the table
Applies the filter to remove items that don't match
Returns the matching items
This can lead to awful performance (and cost) if you have a large table and are filtering for a very small set of data. For example, if you're executing a scan operation on terabytes of data and applying filters to identify a single item, you're going to have a bad time.
However, there are circumstances where filtering is a fine solution. One example is when you can use the Partition Key and Sort Key to narrow the set of data you're working with before applying a filter. In the above example, I'm dividing your data by gender (perhaps dividing your search space in half) before further narrowing the items by class. The amount of data you're left with might be small enough to filter out the remaining items effectively.
However, I'll point out that gender has rather low cardinality. Perhaps there are more specifics about your access pattern that could help with that. For example, maybe you could group students by primary/secondary school and create a PK of STUDENTS#F#1-5. Maybe you could group them by the school district, or zip code? No detail about access patterns is too specific when working with NoSQL databases!
The point of this exercise is to illustrate two points:
Filtering in DynamoDB is best achieved by selecting a good primary key.
The DynamoDB filtering mechanism is best used on smaller subsets of your data. Don't be afraid to use it in the right places!

How to extract more than label text items in a single annotation using Google NLP

I have created a dataset using Google NLP entity extraction and uploaded my input data (train, test and validation JSONL files) in the NLP format; they are stored in a Google Storage bucket.
Sample Annotation:
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 10,
        "start_offset": 0
      }
    },
    "display_name": "Name"
  }],
  "text_snippet": {
    "content": "JJ's Pizza\n "
  }
}
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 9,
        "start_offset": 0
      }
    },
    "display_name": "City"
  }],
  "text_snippet": {
    "content": "San Francisco\n "
  }
}
Here is the input text for which I want to predict the labels "Name", "City" and "State":
Best J J's Pizza in San Francisco, CA
The result is shown in the following screenshot.
I expected the predicted results to be the following:
Name : JJ's Pizza
City : San Francisco
State: CA
According to the sample annotation you provided, you're setting the whole text_snippet to be a name (or whatever field you want to extract).
This can confuse the model into understanding that all of the text is that entity.
It would be better to have training data similar to the examples in the documentation. There, a big chunk of text is provided and the entities that we want extracted from it are annotated.
As an example, let's say that from these text snippets I tell the model that the italic part is an entity named a, while the bold part is an entity called b:
JJ Pizza
LL Burritos
Kebab MM
Shushi NN
San Francisco
NY
Washington
Los Angeles
Then, when the model reads Best JJ Pizza, it thinks the whole text is a single entity (we trained the model with this assumption), and it will just choose the one it matches best (in this case, it would likely say it's an a entity).
However, if I provide the following text samples (also annotated so that italic is entity a and bold is entity b):
The best pizza place in San Francisco is JJ Pizza.
For a luxurious experience, do not forget to visit LL Burritos when you're around NY.
I once visited Kebab MM, but there are better options in Washington.
You can find Shushi NN in Los Angeles
You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.
The important part about training the model is providing training data as similar to real-life data as possible.
In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 16,
        "start_offset": 6
      }
    },
    "display_name": "Name"
  },
  {
    "text_extraction": {
      "text_segment": {
        "end_offset": 30,
        "start_offset": 21
      }
    },
    "display_name": "City"
  }],
  "text_snippet": {
    "content": "Worst JJ's Pizza in San Francisco\n "
  }
}
Note that the point of a Natural Language ML model is to process natural language. If your inputs are going to be as uniform/simple/short as that, then it might not be worth going the ML route; a simple regex should be enough. Without the natural language part, it is going to be hard to properly train a model. More details in the beginner's guide.
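For instance, if every input really does follow the <ADJECTIVE> <NAME> in <CITY>, <STATE> shape, something as small as this would already cover it (a sketch; the pattern is an assumption about your input format):

// Matches e.g. "Best JJ's Pizza in San Francisco, CA"
const pattern = /^\w+\s+(.+)\s+in\s+(.+),\s+([A-Z]{2})$/;

function extract(text: string) {
  const match = pattern.exec(text.trim());
  if (!match) return null;
  const [, name, city, state] = match;
  return { Name: name, City: city, State: state };
}

console.log(extract("Best JJ's Pizza in San Francisco, CA"));
// -> { Name: "JJ's Pizza", City: "San Francisco", State: "CA" }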

AWS DynamoDB Golang issue with inserting items into table

I've been following Miguel C's tutorial on setting up a DynamoDB table in Golang, but modified my JSON to look like this instead of using movies. I modified the movie struct into a Fruit struct (so there is no extra info), and in my schema I defined the partition key as "Name" and the sort key as "Price". But when I run my code it says
"ValidationException: One of the required keys was not given a value"
despite me printing out the input as
map[name:{
S: "bananas"
} price:{
N: "0.25"
}]
which clearly shows that the String "bananas" and the Number 0.25 both have values in them.
My JSON looks like this:
[
  {
    "name": "bananas",
    "price": 0.25
  },
  {
    "name": "apples",
    "price": 0.50
  }
]
It was a capitalization issue: DynamoDB attribute names are case-sensitive, so the marshalled attributes "name" and "price" did not match the "Name" and "Price" keys defined in the table's key schema. Changing "name" to "Name" (so the marshalled attribute names match the key schema exactly) made it work.

PowerBI Custom Visual - Table data binding

Also asked this on the PowerBI forum.
I am trying to change the sampleBarChart PowerBI visual to use a "table" data binding instead of the current "categorical" one. The first goal is to build a simple table visual, with inputs "X", "Y" and "Value".
Both data bindings are described on the official wiki, and that is all I could find: I cannot find any example visuals which use it and are based on the new API.
From the image above, a table object has "rows", "columns", "totals" and "identities". So it looks like rows and columns are my x/y indexes, and totals are my values?
This is what I tried. (Naming is slightly off as most of it came from existing barchart code)
Data roles:
{ "displayName": "Category1 Data",
"name": "category1",
"kind": 0},
{ "displayName": "Category2 Data",
"name": "category2",
"kind": 0},
{ "displayName": "Measure Data",
"name": "measure",
"kind": 1}
Data view mapping:
"table": {
"rows": {"for": {"in": "category1"}},
"columns": {"for": {"in": "category2"}},
"totals": {"select": [{"bind": {"to": "measure"}}]}
}
Data Point class:
interface BarChartDataPoint {
  value: number;
  category1: number;
  category2: number;
  color: string;
};
Relevant parts of my visualTransform():
...
let category1 = categorical.rows;
let category2 = categorical.columns;
let dataValue = categorical.totals;
...
for (let i = 1, len = category1.length; i <= len; i++) {
    for (let j = 1, jlen = category2.length; j <= jlen; j++) {
        {
            barChartDataPoints.push({
                category1: i,
                category2: j,
                value: dataValue[i,j],
                color: "#555555" //for now
            });
        }
...
Test data looks like this:
  | 1  2  3
1 | 4  4  3
2 | 4  5  5
3 | 3  6  7   (total = 41)
The code above fills barChartDataPoints with just six data points:
(1; 1; 41),
(1; 2; undefined),
(2; 1; 41),
(2; 2; undefined),
(3; 1; 41),
(3; 2; undefined).
Accessing zero indices results in nulls.
Q: Is totals not the right measure to access value at (x;y)? What am I doing wrong?
Any help or direction is very appreciated.
User #RichardL shared this link on the PowerBI forum, which helped quite a lot.
"Totals" is not the right measure to access value at (x;y).
It turns out Columns contain column names, and Rows contain value arrays which correspond to those columns.
From the link above, this is what the table structure looks like:
{
  "columns": [
    { "displayName": "Year" },
    { "displayName": "Country" },
    { "displayName": "Cost" }
  ],
  "rows": [
    [2014, "Japan", 25],
    [2015, "Japan", 30],
    [2016, "Japan", 18],
    [2015, "North America", 14],
    [2016, "North America", 30],
    [2016, "China", 100]
  ]
}
You can also view the data exactly as your visual receives it by placing
window.alert(JSON.stringify(options.dataViews))
in your update() method, or by writing it into the HTML contents of your visual.
This was very helpful, but it highlights a few fundamental problems with the data management of PowerBI for a custom visual. There is no documentation, and the process from roles to mapping to visualTransform is horrendous because it takes so much effort to rebuild the data into a format that is usable consistently with D3.
Commenting on user5226582's example: for me, columns is presented in a form where I have to look up the roles property to be able to understand the order of the data presented in each rows array. displayName offers no certainty. For example, if a user uses the same field in two different dataRoles then it all goes awry.
I think the safest approach is to build a new array inside visualTransform using the well-known field names (the "name" property in dataRoles), then iterate columns, interrogating the roles property to establish an index into the rows array items. Then use that index to populate the new array reliably. D3 then gobbles that up. (See the sketch at the end of this answer.)
I know that's crazy, but at least it means reasonably consistent data and allows for the user selecting the same data field more than once or choosing count instead of column value.
All in all, I think this area needs a lot of attention before custom Visuals can really take off.
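For what it's worth, a minimal sketch of that approach, assuming the data roles from the question (category1, category2, measure) and the table dataview shape shown above; treat it as a starting point rather than a drop-in implementation:

// Assumes the powerbi-visuals typings are available globally, as in the sampleBarChart project.
interface TableDataPoint {
  category1: powerbi.PrimitiveValue;
  category2: powerbi.PrimitiveValue;
  value: powerbi.PrimitiveValue;
}

function visualTransform(options: powerbi.extensibility.visual.VisualUpdateOptions): TableDataPoint[] {
  const table = options.dataViews[0].table;
  if (!table) return [];

  // Map each data role name (the "name" field in dataRoles) to the column index it occupies.
  const roleIndex: { [role: string]: number } = {};
  table.columns.forEach((col, i) => {
    Object.keys(col.roles || {}).forEach((role) => (roleIndex[role] = i));
  });

  // Each row is a value array ordered like table.columns, so index into it via the
  // role lookup instead of relying on displayName.
  return table.rows.map((row) => ({
    category1: row[roleIndex["category1"]],
    category2: row[roleIndex["category2"]],
    value: row[roleIndex["measure"]],
  }));
}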