AWS: Transforming data from DynamoDB before it's sent to Cloudsearch

I'm trying to set up AWS' Cloudsearch with a DynamoDB table. My data structure is something like this:
{
    "name": "John Smith",
    "phone": "0123 456 789",
    "business": {
        "name": "Johnny's Cool Co",
        "id": "12345",
        "type": "contractor",
        "suburb": "Sydney"
    },
    "profession": {
        "name": "Plumber",
        "id": "20"
    },
    "email": "johnsmith@gmail.com",
    "id": "354684354-4b32-53e3-8949846-211384"
}
Importing this data from DynamoDB -> CloudSearch is a breeze; however, I want to be able to index on some of these nested object parameters (like business.name, profession.name etc.).
CloudSearch is pulling in some of the nested fields like suburb, but it seems impossible for it to differentiate between the name in the root of the object and the name within the business and profession objects.
Questions:
1. How do I make these nested parameters searchable? Can I index on business.name or something?
2. If #1 is not possible, can I somehow send my data through a transforming function before it gets to CloudSearch? This way I could flatten all of my objects and give the fields unique names like businessName and professionName.
EDIT:
My solution at the moment is to have a separate DynamoDB table which replicates our users table, but stores it in a CloudSearch-friendly format. However, I don't like this solution at all so any other ideas are totally welcome!

You can use DynamoDB Streams and write a Lambda function that captures changes and adds documents to CloudSearch, flattening them at that point, instead of keeping an additional DynamoDB table.
For example, within my Lambda function I keep a list of the nested fields to index (under a "body" parent in this case) and simply flatten them using their field names. In the case of duplicate sub-field names you can prepend the parent name to create a new field, such as "body-name", as the key.
# ... misc. setup (requests import, CloudSearch endpoint url, awsauth) ...
headers = { "Content-Type": "application/json" }
indexed_fields = ['app', 'name', 'activity']  # fields to flatten

def handler(event, context):  # lambda handler called at each update
    document = {}  # document to be uploaded to cloudsearch
    document['id'] = ...  # your uid, from the dynamo update record likely
    document['type'] = 'add'
    all_fields = {}
    # flatten/pull out info you want indexed
    for record in event['Records']:
        body = record['dynamodb']['NewImage']['body']['M']
        for key in indexed_fields:
            all_fields[key] = body[key]['S']
    document['fields'] = all_fields
    # post update to cloudsearch endpoint
    r = requests.post(url, auth=awsauth, json=document, headers=headers)
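The snippet above assumes all of the indexed sub-fields are strings living directly under a single "body" map. As a rough sketch of the more general flattening idea mentioned above (the flatten helper below is hypothetical, not part of the original answer), prefixing the parent key keeps duplicate sub-field names such as "body-name" unique:

def flatten(image, parent_key=None):
    """Recursively flatten a DynamoDB-typed image (e.g. NewImage), joining
    nested keys with '-' so duplicate sub-field names stay unique."""
    flat = {}
    for key, value in image.items():
        field_name = f"{parent_key}-{key}" if parent_key else key
        # each attribute value looks like {'S': ...}, {'N': ...} or {'M': {...}}
        type_code, raw = next(iter(value.items()))
        if type_code == 'M':   # nested map: recurse with the new prefix
            flat.update(flatten(raw, field_name))
        else:                  # scalar: keep the raw value as-is
            flat[field_name] = raw
    return flat

# e.g. document['fields'] = flatten(record['dynamodb']['NewImage'])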

Related

How do we encrypt the value of a nested dictionary to store in DynamoDB using DynamoDb Encryption Client?

I have the following dictionary
plaintext_item = {
    "website": "https://example.com",
    "description": "This is a sample data",
    "website_username": {
        "testuser1": "password12",
        "testuser2": "password13",
    }
}
In the above dictionary I want to encrypt both passwords, but not their usernames, and store the item in DynamoDB.
What I tried:
This was my first approach, but it didn't work:
actions = AttributeActions(
    default_action=CryptoAction.ENCRYPT_AND_SIGN,
    attribute_actions={
        "website": CryptoAction.DO_NOTHING,
        plaintext_item["website_username"]["testuser1"]: CryptoAction.ENCRYPT_AND_SIGN,
        "description": CryptoAction.DO_NOTHING,
    }
)
Then I tried the second approach below, addressing the nested value with a dotted path the way a DynamoDB update expression would; this didn't work either:
actions = AttributeActions(
    default_action=CryptoAction.ENCRYPT_AND_SIGN,
    attribute_actions={
        "website": CryptoAction.DO_NOTHING,
        "website_username.testuser1": CryptoAction.ENCRYPT_AND_SIGN,
        "description": CryptoAction.DO_NOTHING,
    })
In both of the above cases the whole website_username object gets encrypted and stored. I looked for documentation but was not able to find anything related. I am able to encrypt flat dictionaries like {"a": 2, "b": 3}, but not nested ones.
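For reference, this is roughly the flat-dictionary setup that does work for me; the table name, key alias and primary key attribute are placeholders, and it is only a sketch of the working flat case, where each action targets a whole top-level attribute:

import boto3
from dynamodb_encryption_sdk.encrypted.table import EncryptedTable
from dynamodb_encryption_sdk.identifiers import CryptoAction
from dynamodb_encryption_sdk.material_providers.aws_kms import AwsKmsCryptographicMaterialsProvider
from dynamodb_encryption_sdk.structures import AttributeActions

table = boto3.resource("dynamodb").Table("my-table")                   # placeholder table name
kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id="alias/my-key")  # placeholder CMK alias

# Each action here applies to a whole top-level attribute of the item
actions = AttributeActions(
    default_action=CryptoAction.ENCRYPT_AND_SIGN,
    attribute_actions={"b": CryptoAction.DO_NOTHING},
)

encrypted_table = EncryptedTable(
    table=table,
    materials_provider=kms_cmp,
    attribute_actions=actions,
)
encrypted_table.put_item(Item={"item_id": "1", "a": 2, "b": 3})         # placeholder primary key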

AWS-Console: DynamoDB scan on nested field

I have the below item in a DynamoDB table:
{
    "id": 1,
    "user": {
        "age": "26",
        "email": "testuser@gmail.com",
        "name": "test user"
    }
}
Using the AWS console, I want to scan for all the records whose email address contains gmail.com.
I am trying this in the console but it is giving no results.
I am new to AWS, not sure what's wrong here. Is it not possible to scan on nested fields?
I've been trying to figure this out myself, but it would seem that nested item scans are not supported through the console.
I'm going based off of this thread, which offers some alternative options via the CLI or SDK: https://forums.aws.amazon.com/thread.jspa?messageID=931016
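Not part of the original answer, but as an illustration of the SDK route, a boto3 scan with a filter expression on the nested attribute might look roughly like this (the table name is a placeholder):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

# Filter on the nested user.email attribute. Note this still reads the whole
# table and filters afterwards, so it can be slow/expensive on large tables.
response = table.scan(FilterExpression=Attr("user.email").contains("gmail.com"))
for item in response["Items"]:
    print(item["id"], item["user"]["email"])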

What are the extra values added to DynamoDB streams and how do I remove them?

I am using DynamoDB streams to sync data to Elasticsearch using Lambda
The format of the data (from https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html) looks like:
"NewImage": {
"Timestamp": {
"S": "2016-11-18:12:09:36"
},
"Message": {
"S": "This is a bark from the Woofer social network"
},
"Username": {
"S": "John Doe"
}
},
So, two questions:
1. What is the "S" that the stream attaches? I am assuming it indicates string or stream, but I can't find any documentation on it.
2. Is there an option to exclude this from the stream, or do I have to write code in my Lambda function to remove it?
What you are seeing is the DynamoDB data type descriptors. This is how data is stored in DynamoDB (or at least how it is exposed via the low-level APIs). There are SDKs in various languages that will convert this to plain JSON.
For Python, see TypeSerializer/TypeDeserializer in boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/_modules/boto3/dynamodb/types.html
import json
import decimal
import boto3.dynamodb.types

deserializer = boto3.dynamodb.types.TypeDeserializer()
# 'record' is one entry from event['Records'] in the Lambda handler
dic = {key: deserializer.deserialize(val) for key, val in record['dynamodb']['NewImage'].items()}

def decimal_default(obj):
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError

json.dumps(dic, default=decimal_default)
If you want to index the document in Elasticsearch, you have to do another json.loads() on that string to get back a plain Python dictionary.
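Not from the original answer, but putting those pieces together, the Elasticsearch step might look roughly like this, reusing dic and decimal_default from above (the endpoint, index name and document id are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:443")  # placeholder endpoint

# Round-trip through JSON so Decimal values become plain floats, then index
clean_doc = json.loads(json.dumps(dic, default=decimal_default))
es.index(index="my-index", id=clean_doc["Username"], body=clean_doc)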
The S indicates that the value of the attribute is simply a scalar string (S) attribute type. Each DynamoDB item attribute's key name is always a string, though the attribute value doesn't have to be a scalar string. 'Naming Rules and Data Types' details each attribute data type. A string is a scalar type, which is different from a document type or a set type.
There are different views of a stream record; however, there is no stream view that omits the item's attribute value code while still providing the attribute value. Each possible StreamViewType is explained in 'Capturing Table Activity with DynamoDB Streams'.
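As an illustration (not from the original answer), choosing one of those view types when enabling a stream with boto3 might look like this; the table name is a placeholder:

import boto3

dynamodb = boto3.client("dynamodb")

# StreamViewType can be KEYS_ONLY, NEW_IMAGE, OLD_IMAGE or NEW_AND_OLD_IMAGES;
# all of them deliver attribute values in the typed format shown above.
dynamodb.update_table(
    TableName="my-table",  # placeholder table name
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)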
Have fun!

Not able to add data into DynamoDB using API gateway POST method

I made a serverless API backend in the AWS console which uses API Gateway, DynamoDB, and Lambda functions.
Upon creation I can add data to DynamoDB online by adding a JSON item, which looks like this:
{
    "id": "4",
    "k": "key1",
    "v": "value1"
}
But when I try to add this using Postman, by putting the above JSON data in the body of the POST message, I get a positive response (i.e. no errors), but only the "id" field is added in the database, not "k" or "v".
What is missing?
I think that you need to check your Lambda function.
As you are using Postman to make the API calls, the received event's body will look as follows:
{
    'resource': ...,
    ...
    'body': '{\n\t"id": 1,\n\t"name": "ben"\n}',
    'isBase64Encoded': False
}
As you can see:
'body': '{\n\t"id": 1,\n\t"name": "ben"\n}'
For example, using Python 3 for this case, what I need to do is parse the body from its JSON string so that we are able to use it:
result = json.loads(event['body'])
id = result['id']
name = result['name']
Then update them into DynamoDB:
item = table.put_item(
    Item={
        'id': str(id),
        'name': str(name)
    }
)
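Putting those pieces together, a minimal sketch of such a Lambda handler might look like this (the table name and the "k"/"v" attributes are assumptions based on the question):

import json
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

def handler(event, context):
    # With a proxy integration, API Gateway delivers the POST payload as a JSON string
    result = json.loads(event["body"])

    table.put_item(
        Item={
            "id": str(result["id"]),
            "k": result["k"],
            "v": result["v"],
        }
    )
    return {"statusCode": 200, "body": json.dumps({"message": "item saved"})}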

Map different Sort Key responses to Appsync Schema values

So here is my schema:
type Model {
    PartitionKey: ID!
    Name: String
    Version: Int
    FBX: String
    # ms since epoch
    CreatedAt: AWSTimestamp
    Description: String
    Tags: [String]
}

type Query {
    getAllModels(count: Int, nextToken: String): PaginatedModels!
}

type PaginatedModels {
    models: [Model!]!
    nextToken: String
}
I would like to call 'getAllModels' and have all of its data, and all of its tags, filled in.
But here is the thing: tags are stored via sort keys, like so:
PartitionKey | SortKey
Model-0      | Model-0
Model-0      | Tag-Tree
Model-0      | Tag-Building
Is it possible to transform the 'Tag' sort keys into the Tags: [String] array in the schema via a DynamoDB resolver? Or must I do something extra fancy through a lambda? Or is there a smarter way to do this?
To clarify, are you storing objects like this in DynamoDB:
{ PartitionKey (HASH), Tag (SortKey), Name, Version, FBX, CreatedAt, Description }
and using a DynamoDB Query operation to fetch all rows for a given hash key,
Query #PartitionKey = :PartitionKey
getting back a list of objects, some of which have a different "Tag" value and one of which is "Model-0" (i.e. the same value as the partition key)? I assume that record contains all the other values for the record, e.g.:
[
    { PartitionKey, Tag: 'ValueOfPartitionKey', Name, Version, FBX, CreatedAt, ... },
    { PartitionKey, Tag: 'Tag-Tree' },
    { PartitionKey, Tag: 'Tag-Building' }
]
You can definitely write resolver logic without too much hassle that reduces the list of model objects into a single object with a list of "Tags". Let's start with a single item and see how to implement a getModel(id: ID!): Model query:
First define the request mapping template that will get all rows for a partition key:
{
    "version" : "2017-02-28",
    "operation" : "Query",
    "query" : {
        "expression": "#PartitionKey = :id",
        "expressionValues" : {
            ":id" : {
                "S" : "${ctx.args.id}"
            }
        },
        "expressionNames": {
            "#PartitionKey": "PartitionKey" ## whatever the table hash key is
        }
    },
    ## The limit will have to be sufficiently large to get all rows for a key
    "limit": $util.defaultIfNull(${ctx.args.limit}, 100)
}
Then to return a single model object that reduces "Tag" to "Tags" you can use this response mapping template:
#set($tags = [])
#set($result = {})
#foreach( $item in $ctx.result.items )
    #if($item.PartitionKey == $item.Tag)
        #set($result = $item)
    #else
        $util.qr($tags.add($item.Tag))
    #end
#end
$util.qr($result.put("Tags", $tags))
$util.toJson($result)
This will return a response like this:
{
    "PartitionKey": "...",
    "Name": "...",
    "Tags": ["Tag-Tree", "Tag-Building"]
}
Fundamentally I see no problem with this, but its effectiveness depends upon your query patterns. Extending this to the getAllModels use case is doable, but it will require a few changes and most likely a really inefficient Scan operation, because the table will be sparse of actual model information since many records are effectively just tags. You can alleviate this with GSIs pretty easily, but more GSIs means more $.
As an alternative approach, you can store your Tags in a different "Tags" table. This way you only store model information in the Model table and tag information in the Tag table and leverage GraphQL to perform the join for you. In this approach have Query.getAllModels perform a "Scan" (or Query) on the Model table and then have a Model.Tags resolver that performs a Query against the Tag table (HK: ModelPartitionKey, SK: Tag). You could then get all tags for a model and later create a GSI to get all models for a tag. You do need to consider that now the nested Model.Tag query will get called once per model but Query operations are fast and I've seen this work well in practice.
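Not part of the original answer, but for illustration, the per-model lookup that such a Model.Tags resolver performs against the Tags table is just a key-condition Query; expressed with boto3 (the table and attribute names are assumptions), it would look roughly like this:

import boto3
from boto3.dynamodb.conditions import Key

tags_table = boto3.resource("dynamodb").Table("Tags")  # assumed Tags table name

def tags_for_model(model_partition_key):
    # A Query on the hash key alone returns every tag row for that model
    response = tags_table.query(
        KeyConditionExpression=Key("ModelPartitionKey").eq(model_partition_key)
    )
    return [item["Tag"] for item in response["Items"]]

# e.g. tags_for_model("Model-0") -> ["Tag-Tree", "Tag-Building"]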
Hope this helps :)