AWS DynamoDB Golang issue with inserting items into table - amazon-web-services

I've been following Miguel C's tutorial on setting up a DynamoDB table in Go, but I changed the JSON to the fruit data shown below instead of movies. I replaced the Movie struct with a Fruit struct (so there is no info field any more), and in my schema I defined the partition key as "Name" and the sort key as "Price". But when I run my code it fails with
"ValidationException: One of the required keys was not given a value"
even though printing the input gives
map[name:{
S: "bananas"
} price:{
N: "0.25"
}]
which clearly shows that the string "bananas" and the number 0.25 are both set.
My JSON looks like this:
[
{
"name": "bananas",
"price": 0.25
},
{
"name": "apples",
"price": 0.50
}
]

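It was a capitalization issue. DynamoDB attribute names are case-sensitive: the key schema defines "Name" and "Price", but the item was written with "name" and "price", so the required keys were treated as missing. Changing "name" to "Name" fixed it.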
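A minimal sketch of the mismatch, written in Python for illustration only (the original code is Go). The names missing_keys and KEY_SCHEMA are hypothetical; the only behavior assumed from DynamoDB is that attribute names are matched case-sensitively against the key schema.

```python
# Hypothetical helper: report key attributes that an item does not supply.
KEY_SCHEMA = ["Name", "Price"]  # partition key and sort key as defined on the table


def missing_keys(item, key_schema=KEY_SCHEMA):
    """Return the key attribute names absent from item (case-sensitive match)."""
    return [k for k in key_schema if k not in item]


# Item as printed in the question: lowercase attribute names.
bad_item = {"name": {"S": "bananas"}, "price": {"N": "0.25"}}
# Same values, with attribute names matching the key schema.
good_item = {"Name": {"S": "bananas"}, "Price": {"N": "0.25"}}

print(missing_keys(bad_item))
print(missing_keys(good_item))
```

In the Go code, the equivalent check is that the attribute names produced when marshalling the Fruit struct match the key schema exactly.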

Related

Oracle Apex 22.21 - REST data source - nested JSON array - discovery

I need to get APEX Rest Data Source to parse my JSON which has a nested array. I've read that JSON nested arrays are not supported but there must be a way.
I have a REST API that returns data via JSON as per below. On Apex, I've created a REST data source following the tutorial on this Oracle blog link
However, Auto-Discovery does not 'discover' the nested array. It only returns the root level data.
[ {
"order_number": "so1223",
"order_date": "2022-07-01",
"full_name": "Carny Coulter",
"email": "ccoulter2@ovh.net",
"credit_card": "3545556133694494",
"city": "Myhiya",
"state": "CA",
"zip_code": "12345",
"lines": [
{
"product": "Beans - Fava, Canned",
"quantity": 1,
"price": 1.99
},
{
"product": "Edible Flower - Mixed",
"quantity": 1,
"price": 1.50
}
]
},
{
"order_number": "so2244",
"order_date": "2022-12-28",
"full_name": "Liam Shawcross",
"email": "lshawcross5@exblog.jp",
"credit_card": "6331104669953298",
"city": "Humaitá",
"state": "NY",
"zip_code": "98670",
"lines": [
{
"order_id": 5,
"product": "Beans - Green",
"quantity": 2,
"price": 4.33
},
{
"order_id": 1,
"product": "Grapefruit - Pink",
"quantity": 5,
"price": 5.00
}
]
}
]
So in the JSON above, it only 'discovers' the attributes from order_number through zip_code. The 'lines' array, with attributes order_id, product, quantity and price, does not get 'discovered'.
I found this SO question in which Carsten instructs to create the Rest Data Source manually. I've tried changing the Row Selector to "." (a dot) and leaving it blank. That still returns the root level data.
Changing the Row Selector to 'lines' returns only the first element of each 'lines' array.
So in the JSON example above, it would only 'discover':
{
"product": "Beans - Fava, Canned",
"quantity": 1,
"price": 1.99
}
{
"order_id": 5,
"product": "Beans - Green",
"quantity": 2,
"price": 4.33
}
and not the complete array.
This is how the Data Profile is set up when creating Data Source manually.
There's another SO question with a similar situation so I followed some steps such as selecting the data type for 'lines' as JSON Document. I feel I've tried almost every selector & data type. But obviously not enough.
The docs are not very helpful on this subject and it's been difficult finding links on Google, Oracle Blogs, or SO.
My end goal would be to have two tables as below auto synchronizing from the API.
orders
  id pk
  order_number num
  order_date date
  full_name vc(200)
  email vc(200)
  credit_card num
  city vc(200)
  state vc(200)
  zip_code num
lines
  order_id /fk orders
  product vc(200)
  quantity num
  price num
view orders_view orders lines
As you correctly state, REST Data Sources do not support nested arrays: a REST Source can only "extract" one flat table from the JSON response. In your example, the JSON itself is an array of orders, so the Row Selector in the Data Profile would be "." (to select the "root node").
That gives you all the order attributes, but discovery skips the lines array. However, you can manually add a column to the Data Profile with the JSON Document data type and lines as the selector.
As a result, you still get a flat table from the REST Data Source, but that table has a LINES column containing the "JSON fragment" for the order's line items. You could then synchronize the REST Source to a local table ("REST Synchronization") and use some custom code to extract the JSON fragments into an ORDER_LINES child table.
Does that help?
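The answer does not say what the custom code looks like, and in APEX it would more likely be SQL or PL/SQL working on the LINES column. Purely to show the shape of the transformation, here is a Python sketch that splits a response like the one in the question into parent rows and child rows. The function name split_orders is hypothetical, and using order_number as the link between the two tables is an assumption.

```python
import json


def split_orders(payload):
    """Split a JSON array of orders into flat order rows and order-line rows."""
    orders, lines = [], []
    for order in json.loads(payload):
        # Everything except the nested array becomes the parent row.
        orders.append({k: v for k, v in order.items() if k != "lines"})
        for line in order.get("lines", []):
            # Carry the parent's order_number so each line can reference its order.
            lines.append({"order_number": order["order_number"], **line})
    return orders, lines


sample = """[
  {"order_number": "so1223", "city": "Myhiya",
   "lines": [{"product": "Beans - Fava, Canned", "quantity": 1, "price": 1.99},
             {"product": "Edible Flower - Mixed", "quantity": 1, "price": 1.50}]}
]"""

orders, lines = split_orders(sample)
print(orders)
print(lines)
```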

How to retrieve records efficiently using indices in dynamodb aws

I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social...
My model looks as follows:
type StudentMarks
@model
@key(
name: "filterbyClassAndMarks"
fields: [
"gender"
"classCode"
"mathsMarks"
"socialMarks"
"englishMarks"
"scienceMarks"
]
queryField: "filterbyClassAndMarks"
)
@auth(
rules: [
{ allow: private, provider: iam, operations: [read] }
{ allow: public, provider: iam, operations: [read] }
]
) {
id: ID!
name: String
gender: String
classCode: String
mathsMarks: String
socialMarks: String
englishMarks: String
scienceMarks: String
}
The GSI, created with gender as the partition (hash) key and a compound sort (range) key made of the classCode, mathsMarks, socialMarks, scienceMarks and englishMarks fields, shows an item count of 0.
I am trying to query using the following graphql:
const listOfStudents = await API.graphql({
query: queries.filterbyClassAndMarks,
variables: {
gender: "m",
classCodeMathsMarksSocialMarksEnglishMarksScienceMarks: {
between: [
{
classCode: "05",
mathsMarks: "06",
scienceMarks: "07",
englishMarks: "04",
socialMarks: "05",
},
{
classCode: "11",
mathsMarks: "90",
scienceMarks: "91",
englishMarks: "95",
socialMarks: "92",
},
],
},
},
authMode: "AWS_IAM",
});
This table with 11 records should return 4 records as shown in green
The cloud watch logs are as below:
{
"logType": "RequestMapping",
"path": [
"filterbyClassAndMarks"
],
"fieldName": "filterbyClassAndMarks",
"resolverArn": "arn:aws:appsync:ap-south-1:488377001042:apis/4tbr6xolzjfctl5tdbiur7jqnu/types/Query/resolvers/filterbyClassAndMarks",
"requestId": "5de45e66-1a5d-44dd-80cc-1a5b93a3c3aa",
"context": {
"arguments": {
"gender": "m",
"classCodeMathsMarksSocialMarksEnglishMarksScienceMarks": {
"between": [
{
"classCode": "05",
"mathsMarks": "06",
"socialMarks": "05",
"englishMarks": "04",
"scienceMarks": "07"
},
{
"classCode": "11",
"mathsMarks": "90",
"socialMarks": "92",
"englishMarks": "95",
"scienceMarks": "91"
}
]
},
},
"stash": {},
"outErrors": []
},
"fieldInError": false,
"errors": [],
"parentType": "Query",
"graphQLAPIId": "4tbr6xolzjfctl5tdbiur7jqnu",
"transformedTemplate": "{\"version\":\"2018-05-29\",\"operation\":\"Query
\",\"limit\":100,\"query\":{\"expression\":\"#gender = :gender AND #sortKey
BETWEEN :sortKey0 AND :sortKey1\",\"expressionNames\":{\"#gender\":\"gender
\",\"#sortKey\":\"classCode#mathsMarks#socialMarks#englishMarks#scienceMarks
\"},\"expressionValues\":{\":gender\":{\"S\":\"m\"},\":sortKey0\":{\"S
\":\"05#06#05#04#07\"},\":sortKey1\":{\"S\":\"11#90#92#95#91\"}}},\"index
\":\"filterbyClassAndMarks\",\"scanIndexForward\":true}"
}
It is showing the scanned count as 0 in the result:
"result": {
"items": [],
"scannedCount": 0
},
And the final response is success with empty result as shown below:
Object { data: {…} }
  data: Object { filterbyClassAndMarks: {…} }
    filterbyClassAndMarks: Object { items: [], nextToken: null }
      items: Array []
        length: 0
      nextToken: null
Key conditions on compound sort keys are possible according to the AWS documentation... I do not understand what exactly I am missing. Probably the GSI is not configured or written properly.
Since the result is usually less than 1% of the records in the table and the workload is read-intensive, scanning all the records and filtering them is a very naive solution. I need a better solution, either with indices or otherwise.
SECOND EDIT:
Example of expected behavior of applying hash key and compound sort key before filtering.. The example is given to highlight expected behavior only and does not indicate approximate percentage of records not read due to hash key or compound sort key.
DynamoDB does not excel at supporting ad-hoc queries over an arbitrary list of attributes. If you want to fetch the same item using an undefined (or large) number of attributes (e.g. fetch a student by mathsMarks, socialMarks, scienceMarks, name, gender, etc.), you will be better off using something other than DynamoDB.
EDIT:
You've updated your question with information that fundamentally changes the access pattern. You initially said
I want to fetch records of all male or female students from dynamodb who have secured more than x1 marks in maths, more than x2 marks in science, less than y1 marks in social and less than y2 marks in English...
and later changed this access pattern to say (emphasis mine)
I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social
Partitioning your data on gender and class range might allow you to implement this access pattern. You also removed ... other variables from your example schema, suggesting you might have a fixed list of marks that includes math, science, social, and English. This is less important than gender and class, but hear me out :)
You didn't mention it in your original question, but your data suggests you have a fetch Student by ID access pattern. So I started by defining students by ID:
The partition key is STUDENT#student_id and the sort key is A. I sometimes use "METADATA" or "META" as the sort key, but "A" is nice and short.
To support your primary access pattern, I created a global secondary index named GSI1, with PK and SK attributes of GSI1PK and GSI1SK. GSI1PK is STUDENTS#gender and GSI1SK is the class attribute.
This partitions your data by gender and class. To narrow your results even further, you'd need to use filters on the various marks. For example, if you wanted to fetch all male students between class 5 and 9 with specific marks, you could do the following (in DynamoDB pseudocode):
QUERY from GSI1 where PK=STUDENTS#M SK BETWEEN 05 and 09
FILTER englishMark BETWEEN 009 and 050 AND
mathsMark BETWEEN 050 and 075 AND
scienceMark BETWEEN 045 and 065 AND
socialMark BETWEEN 020 and 035
Filtering in DynamoDB doesn't work like most people think. When filtering, DynamoDB:
1. Reads items from the table
2. Applies the filter to remove items that don't match
3. Returns the remaining items
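The order of operations above can be sketched in plain Python. This is only a simulation over an in-memory list, not a call to DynamoDB; query_with_filter and the sample students are hypothetical, while ScannedCount and Count are named after the fields DynamoDB reports for items read and items returned.

```python
def query_with_filter(items, key_condition, filter_expression):
    """Mimic the order of operations: read by key condition, then filter."""
    read = [it for it in items if key_condition(it)]           # items read
    returned = [it for it in read if filter_expression(it)]    # items left after filtering
    return {"Items": returned, "ScannedCount": len(read), "Count": len(returned)}


students = [
    {"gender": "M", "classCode": "05", "mathsMarks": 60},
    {"gender": "M", "classCode": "07", "mathsMarks": 20},
    {"gender": "M", "classCode": "12", "mathsMarks": 70},
    {"gender": "F", "classCode": "06", "mathsMarks": 65},
]

result = query_with_filter(
    students,
    key_condition=lambda s: s["gender"] == "M" and "05" <= s["classCode"] <= "09",
    filter_expression=lambda s: 50 <= s["mathsMarks"] <= 75,
)
print(result["ScannedCount"], result["Count"])
```

The gap between the two counts is the work spent on items that the filter then discards, which is why a tight key condition matters more than the filter.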
This can lead to awful performance (and cost) if you have a large table and are filtering for a very small set of data. For example, if you're executing a scan operation on terabytes of data and applying filters to identify a single item, you're going to have a bad time.
However, there are circumstances where filtering is a fine solution. One example is when you can use the Partition Key and Sort Key to narrow the set of data you're working with before applying a filter. In the above example, I'm dividing your data by gender (perhaps dividing your search space in half) before further narrowing the items by class. The amount of data you're left with might be small enough to filter out the remaining items effectively.
However, I'll point out that gender has rather low cardinality. Perhaps there are more specifics about your access pattern that could help with that. For example, maybe you could group students by primary/secondary school and create a PK of STUDENTS#F#1-5. Maybe you could group them by the school district, or zip code? No detail about access patterns is too specific when working with NoSQL databases!
The point of this exercise is to illustrate two points:
1. Filtering in DynamoDB is best achieved by selecting a good primary key.
2. The DynamoDB filtering mechanism is best used on smaller subsets of your data. Don't be afraid to use it in the right places!
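A related point about the compound sort key in the question: the resolver log shows a single BETWEEN over one concatenated string, and DynamoDB orders string sort keys by their bytes. The sketch below uses Python string comparison as a stand-in for that ordering (equivalent for these ASCII values); composite_key is a hypothetical helper. It suggests that such a BETWEEN bounds the leading component (classCode) but does not bound each mark independently. It is a separate concern from the index showing zero items.

```python
def composite_key(class_code, maths, social, english, science):
    """Join the fields with '#', in the order shown in the logged sort key name."""
    return "#".join([class_code, maths, social, english, science])


low = "05#06#05#04#07"
high = "11#90#92#95#91"

# Class 07 with maths 99: outside the intended maths range (06 to 90)...
inside_by_string = composite_key("07", "99", "99", "99", "99")
# Class 12: outside the class range, so the string comparison excludes it.
outside_by_string = composite_key("12", "50", "50", "50", "50")

# ...yet the first key still sorts between the two bounds.
print(low <= inside_by_string <= high)
print(low <= outside_by_string <= high)
```

This is why the marks end up in a filter in the approach above, with only gender and class in the key condition.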

How to extract more than label text items in a single annotation using Google NLP

I have created a dataset using Google NLP entity extraction and uploaded the input data (train, test and validation JSONL files) in the required NLP format to a Google Cloud Storage bucket.
Sample Annotation:
{
"annotations": [{
"text_extraction": {
"text_segment": {
"end_offset": 10,
"start_offset": 0
}
},
"display_name": "Name"
}],
"text_snippet": {
"content": "JJ's Pizza\n "
}
}
{
"annotations": [{
"text_extraction": {
"text_segment": {
"end_offset": 9,
"start_offset": 0
}
},
"display_name": "City"
}],
"text_snippet": {
"content": "San Francisco\n "
}
}
Here is the input text for which I want to predict the labels "Name", "City" and "State":
Best J J's Pizza in San Francisco, CA
Result in the following screenshot,
I expected the predicted results to be as follows:
Name : JJ's Pizza
City : San Francisco
State: CA
According to the sample annotation you provided, you're labeling the whole text_snippet as a name (or whatever field you want to extract).
This can lead the model to treat any text it sees as one whole entity.
It would be better to have training data similar to the examples in the documentation: a larger chunk of text in which you annotate the entities that you want extracted.
As an example, let's say I train the model on these text snippets, each labeled in full as either an entity named a or an entity named b:
JJ Pizza (a)
LL Burritos (a)
Kebab MM (a)
Shushi NN (a)
San Francisco (b)
NY (b)
Washington (b)
Los Angeles (b)
Then, when the model reads Best JJ Pizza, it treats the whole text as a single entity (we trained the model with this assumption) and just chooses the label that matches best (in this case, it would likely say it's an a entity).
However, if I provide the following text samples (with each annotated entity shown in brackets, followed by its label a or b):
The best pizza place in [San Francisco: b] is [JJ Pizza: a].
For a luxurious experience, do not forget to visit [LL Burritos: a] when you're around [NY: b].
I once visited [Kebab MM: a], but there are better options in [Washington: b].
You can find [Shushi NN: a] in [Los Angeles: b]
You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.
The important part about training the model is providing training data as similar to real-life data as possible.
In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:
{
"annotations": [{
"text_extraction": {
"text_segment": {
"end_offset": 16,
"start_offset": 6
}
},
"display_name": "Name"
},
{
"text_extraction": {
"text_segment": {
"end_offset": 33,
"start_offset": 20
}
},
"display_name": "City"
}],
"text_snippet": {
"content": "Worst JJ's Pizza in San Francisco\n "
}
}
Note that the point of a Natural Language ML model is to process natural language. If your inputs are going to look as similar/simple/short as that, then it might not be worth going the ML route. A simple regex should be enough. Without the natural language part, it is going to be hard to properly train a model. More details in the beginners guide.
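Counting offsets by hand is error-prone, so here is a minimal Python sketch that derives them from the text. The helper name annotate is hypothetical. It assumes that offsets are counted in characters and that end_offset is exclusive, which is what the Name segment above (6 to 16 for the 10-character "JJ's Pizza") suggests; check both assumptions against the documentation before relying on it.

```python
import json


def annotate(content, entities):
    """Build one training record; entities maps display_name -> substring of content."""
    annotations = []
    for display_name, text in entities.items():
        start = content.find(text)
        if start == -1:
            raise ValueError(f"{text!r} not found in content")
        annotations.append({
            "text_extraction": {
                "text_segment": {"start_offset": start, "end_offset": start + len(text)}
            },
            "display_name": display_name,
        })
    return {"annotations": annotations, "text_snippet": {"content": content}}


record = annotate(
    "Worst JJ's Pizza in San Francisco\n",
    {"Name": "JJ's Pizza", "City": "San Francisco"},
)
print(json.dumps(record))
```

Each record printed this way is one line of a JSONL training file.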

Google Datastore projection query on array of complex objects

I have a simple Datastore kind having the following properties:
id (long)
createdAt (timestamp)
userId (string)
metrics (array of complex objects), each with:
  type of metric
  value of metric
Each stored row in the Datastore might have a different amount of metrics as well as different types of metrics.
I have a very specific requirement to query the latest metrics of a user. The problem here is that different rows have different metrics so I can't just take the most recent row, I need to look into metrics array to retrieve all the data.
I decided to use projection queries. My idea was to create a projection based on the following properties: metrics.type, metrics.value and use distinct on metrics.type and adding order by createdAt desc.
For a better explanation, a simple example of rows from the Datastore:
1. { "id": 111, "createdAt": "2019-01-01 00:00", "userId" : "user-123", "metrics" : [{ "type" : "metric1", "value" : 123 }, { "type" : "metric2", "value" : 345 }] }
2. { "id": 222, "createdAt": "2019-01-02 00:00", "userId" : "user-123", "metrics" : [{ "type" : "metric3", "value" : 567 }, { "type" : "metric4", "value" : 789 }] }
I expected a projection query with distinct on metrics.type filter to return the following results:
1. "metric1", 123
2. "metric2", 345
3. "metric3", 567
4. "metric4", 789
but actually what query returns is:
1. "metric1", 123
2. "metric2", 123
3. "metric3", 123
4. "metric4", 123
So all metrics have the same value, which is incorrect. This happens because of an exploded index: Datastore treats metrics.type and metrics.value as two separate arrays, although they come from a single array.
Is there any way to make the projection query return what I expect instead of exploding the index? If not, how can I restructure my data so that it meets my requirements?
The Cloud Datastore documentation specifically warns about your exact issue.
https://cloud.google.com/datastore/docs/concepts/queries#projections_and_array-valued_properties
One option to solve this is to combine both the type and value. So, have a property called "metric" that will have values like "metric1:123", "metric2:345". Then you will be projecting a single array-valued property.
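A minimal Python sketch of that encoding, working on plain lists rather than the Datastore client. The helper names are hypothetical, and it assumes that metric types contain no ":" and that values are integers, as in the example rows.

```python
def encode_metrics(metrics):
    """Turn [{'type': t, 'value': v}, ...] into ['t:v', ...] for one array property."""
    return [f"{m['type']}:{m['value']}" for m in metrics]


def latest_by_type(rows):
    """rows: (createdAt, ['type:value', ...]) pairs; keep the newest value per type."""
    latest = {}
    # Newest rows first, so the first value seen for a type is the most recent one.
    for _, combined in sorted(rows, key=lambda r: r[0], reverse=True):
        for entry in combined:
            metric_type, value = entry.split(":", 1)
            latest.setdefault(metric_type, int(value))
    return latest


rows = [
    ("2019-01-01 00:00", encode_metrics([{"type": "metric1", "value": 123},
                                         {"type": "metric2", "value": 345}])),
    ("2019-01-02 00:00", encode_metrics([{"type": "metric3", "value": 567},
                                         {"type": "metric4", "value": 789}])),
]
print(latest_by_type(rows))
```

With the combined property, each projected value carries its own type, so the pairing cannot be mixed up; selecting the latest value per type is then done in application code.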

kairosdb aggregate group by

I have one year of 15-minute interval data in my kairosdb. I need to do the following things in sequence:
- filter the data using a tag
- group the filtered data by a few tags. I am not specifying tag values because I want the data to be grouped automatically by tag value at runtime.
- once grouped on those tags, sum the 15-minute interval data into monthly totals.
I wrote this query, to be run from a Python script, based on information available on the kairosdb Google Code forum. But the aggregated values seem incorrect and the output looks skewed. I want to understand where I am going wrong. Here is my JSON query:
agg_query = {
"start_absolute": 1412136000000,
"end_absolute": 1446264000000,
"metrics":[
{
"tags": {
"insert_date": ["11/17/2015"]
},
"name": "gb_demo",
"group_by": [
{
"name": "time",
"range_size": {
"value": "1",
"unit": "months"
},
"group_count": "12"
},
{
"name": "tag",
"tags": ["usage_kind","building_snapshot_id","usage_point_id","interval"]
}
],
"aggregators": [
{
"name": "sum",
"sampling": {
"value": 1,
"unit": "months"
}
}
]
}
]
}
For reference: Data is something like this:
[[1441065600000,53488],[1441066500000,43400],[1441067400000,44936],[1441068300000,48736],[1441069200000,51472],[1441070100000,43904],[1441071000000,42368],[1441071900000,41400],[1441072800000,28936],[1441073700000,34896],[1441074600000,29216],[1441075500000,26040],[1441076400000,24224],[1441077300000,27296],[1441078200000,37288],[1441079100000,30184],[1441080000000,27824],[1441080900000,27960],[1441081800000,28056],[1441082700000,29264],[1441083600000,33272],[1441084500000,33312],[1441085400000,29360],[1441086300000,28400],[1441087200000,28168],[1441088100000,28944],[1443657600000,42112],[1443658500000,36712],[1443659400000,38440],[1443660300000,38824],[1443661200000,43440],[1443662100000,42632],[1443663000000,42984],[1443663900000,42952],[1443664800000,36112],[1443665700000,33680],[1443666600000,33376],[1443667500000,28616],[1443668400000,31688],[1443669300000,30872],[1443670200000,28200],[1443671100000,27792],[1443672000000,27464],[1443672900000,27240],[1443673800000,27760],[1443674700000,27232],[1443675600000,27824],[1443676500000,27264],[1443677400000,27328],[1443678300000,27576],[1443679200000,27136],[1443680100000,26856]]
This is a snapshot of some data from Sep and Oct 2015. When I run the query with a start timestamp in Sep, it sums the Sep data correctly, but not the October data.
I believe your group by time will create groups by calendar month (January to December), but your sum aggregator will sum values by a running month starting with your start date... Which seems a bit weird. Could that be the cause of what you see?
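One way to check this hypothesis is to compute calendar-month sums on the client and compare them with what the query returns. A minimal Python sketch, assuming month boundaries in UTC (the function name sum_by_calendar_month is hypothetical, and the sample uses only the first two September and first two October points from the question):

```python
from collections import defaultdict
from datetime import datetime, timezone


def sum_by_calendar_month(points):
    """Sum [timestamp_ms, value] pairs per (year, month), using UTC boundaries."""
    totals = defaultdict(int)
    for ts_ms, value in points:
        dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        totals[(dt.year, dt.month)] += value
    return dict(totals)


sample = [[1441065600000, 53488], [1441066500000, 43400],
          [1443657600000, 42112], [1443658500000, 36712]]
print(sum_by_calendar_month(sample))
```

Note that the query's start_absolute, 1412136000000, corresponds to 2014-10-01 04:00 UTC, so buckets measured from the start time would not line up with UTC calendar months. If the data is meant to follow a local time zone, the boundaries in this sketch would need to shift accordingly.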
What is the data like? What is the aggregated result like?