What are the extra values added to DynamoDB streams and how do I remove them? - amazon-web-services

I am using DynamoDB Streams to sync data to Elasticsearch using Lambda.
The format of the data (from https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html) looks like:
"NewImage": {
"Timestamp": {
"S": "2016-11-18:12:09:36"
},
"Message": {
"S": "This is a bark from the Woofer social network"
},
"Username": {
"S": "John Doe"
}
},
So, two questions:
What is the "S" that the stream attaches? I am assuming it indicates string or stream, but I can't find any documentation.
Is there an option to exclude this from the stream, or do I have to write code in my Lambda function to remove it?

What you are seeing are DynamoDB data type descriptors. This is how data is stored in DynamoDB (or at least how it is exposed via the low-level APIs). There are SDKs in various languages that will convert this to plain JSON.

For Python, use the TypeDeserializer from boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/_modules/boto3/dynamodb/types.html

import decimal
import json
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
# record is one entry from the Lambda event's ['Records'] list
dic = {key: deserializer.deserialize(val) for key, val in record['dynamodb']['NewImage'].items()}

# DynamoDB numbers deserialize to decimal.Decimal, which json.dumps can't serialize by default
def decimal_default(obj):
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError

json.dumps(dic, default=decimal_default)
If you want to index the result in Elasticsearch, you then need another json.loads() to turn that string back into a plain Python dictionary (now free of Decimal values).
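Putting it together, a minimal Lambda handler for the stream-to-Elasticsearch sync could look like the sketch below. The endpoint and index name are placeholders, and request signing and error handling are omitted:

import decimal
import json
import urllib.request
from boto3.dynamodb.types import TypeDeserializer

ES_URL = "https://my-es-domain.example.com/barks/_doc"  # placeholder endpoint and index
deserializer = TypeDeserializer()

def decimal_default(obj):
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError

def handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'REMOVE':
            continue  # deletes would need their own handling
        image = record['dynamodb']['NewImage']
        doc = {k: deserializer.deserialize(v) for k, v in image.items()}
        body = json.dumps(doc, default=decimal_default).encode('utf-8')
        req = urllib.request.Request(ES_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # add auth/signing as your domain requires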

The S indicates that the value of the attribute is a scalar string (S) attribute type. Each DynamoDB attribute's key name is always a string, but the attribute value doesn't have to be a scalar string; 'Naming Rules and Data Types' details each attribute data type. A string is a scalar type, which is different from a document type or a set type.
There are different views of a stream record; however, there is no stream view that omits the attribute's data type descriptor while still providing the attribute value. Each possible StreamViewType is explained in 'Capturing Table Activity with DynamoDB Streams'.
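For completeness, the view type is something you choose when enabling the stream on the table; a minimal boto3 sketch (the table name is a placeholder) would be:

import boto3

dynamodb = boto3.client('dynamodb')

# Possible StreamViewType values: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
dynamodb.update_table(
    TableName='BarkTable',  # placeholder table name
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)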
Have fun!

Related

API Gateway mapping template converts JSON string to comma-separated key=value pairs

Mapping template noob here.
Using this query string as input:
userid=foo&firstname=bar&lastname=bat
I need to POST data to Kinesis as JSON, thus:
{
    "userid":"foo",
    "firstname":"bar",
    "lastname":"bat"
}
However, according to a consuming Lambda, it arrives at Kinesis translated into comma-separated key=value pairs, thus:
userid=foo, firstname=bar, lastname=bat
Google found some old forum posts outlining the same problem, but no solution.
This is the mapping template I'm using:
{
    #set($data = {
        "userid":"$input.params('userid')",
        "firstname":"$input.params('firstname')",
        "lastname":"$input.params('lastname')"
    })
    "StreamName": "xyz",
    "Data": "$util.base64Encode($data)",
    "PartitionKey": "0"
}
It seems the mapping template is converting the $data dict into a comma-separated list of key=value pairs. Perhaps I should construct a JSON string rather than a dict, but this behavior must be configurable, I'd think. How do I instruct the system to construct JSON rather than key=value pairs?
This wasn't exactly what I was looking for, but I'm able to prevent the conversion to key=value pairs by constructing a string rather than a dict, like so:
{
    #set($data = "{
        ""userid"":""$input.params('userid')"",
        ""firstname"":""$input.params('firstname')"",
        ""lastname"":""$input.params('lastname')""
    }")
    "StreamName": "xyz",
    "Data": "$util.base64Encode($data)",
    "PartitionKey": "0"
}
I'm gonna leave this here since I did have to wrestle with it for a bit. Hopefully when Google sends someone here in future it'll be helpful.
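On the consuming side, a quick way to verify the fix is to decode and parse the record data in the Lambda; a minimal sketch, assuming a Python consumer, might be:

import base64
import json

def handler(event, context):
    for record in event['Records']:
        # Kinesis delivers the Data field base64-encoded
        payload = base64.b64decode(record['kinesis']['data'])
        doc = json.loads(payload)  # raises if the template produced key=value pairs instead of JSON
        print(doc['userid'], doc['firstname'], doc['lastname'])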

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed in a String data field within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, hence why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction, as well as no additional filtering, so it runs on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be needed, so I have added that as well.
For my lambda, I create the following:
exports.handler = async (event) => {
    // Debug
    console.log(JSON.stringify(event))
    // Vars
    const s3Bucket = event.s3Bucket;
    const s3ObjectKey = event.s3ObjectKey;
    const meta = event.metadata;
    // Answer
    const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
    // Remove HTML Tags
    const removeTags = (str) => {
        if ((str === null) || (str === ''))
            return false;
        else
            str = str.toString();
        return str.replace(/(<([^>]+)>)/ig, '');
    }
    // Truncate
    const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
    let result = truncate(removeTags(answer.value.stringValue));
    // Response
    const response = {
        "version": "v0",
        "s3ObjectKey": s3ObjectKey,
        "metadataUpdates": [
            {"name": "sf_answer_preview", "value": {"stringValue": result}}
        ]
    }
    // Debug
    console.log(response)
    // Response
    return response
};
Based on the contract for the lambda described here, it appears pretty straightforward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce), and I strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
    "s3Bucket": "kendrasfdev",
    "s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
    "metadata": {
        "attributes": [
            {
                "name": "_document_title",
                "value": {
                    "stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
                }
            },
            {
                "name": "sf_answer_preview",
                "value": {
                    "stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
                }
            },
            {
                "name": "_data_source_sync_job_execution_id",
                "value": {
                    "stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
                }
            }
        ]
    }
}
The Problem:
When this runs, I am still getting the same field limit error that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that, for some reason, the response in the lambda isn't setting the field value to the new content correctly, and Kendra is still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before that might know what I am doing wrong? This seems pretty common to be able to do things like strip PII information before it gets indexed, so I must be slightly off on my setup somewhere.
Any thoughts?
Since you are still passing the rich text as a metadata field of the document, the character limit still applies, so the document fails at the validation step of the API call and never reaches the enrichment step. A workaround is to somehow append those rich text fields to the body of the document so that your lambda can access them there. But if those fields are auto-generated for your documents from your data sources, that might not be easy.
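Purely as an illustration of that workaround, and assuming the pre-extraction hook lets you modify the staged document in S3 and hand back its key (check the CustomDocumentEnrichmentConfiguration docs for the exact contract), a Python sketch might look like:

import re
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    bucket = event['s3Bucket']
    key = event['s3ObjectKey']

    # Pull the staged document text from S3
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

    # Fold a stripped copy of the rich-text attribute into the body
    attrs = event['metadata']['attributes']
    answer = next((a for a in attrs if a['name'] == 'sf_answer_preview'), None)
    if answer:
        stripped = re.sub(r'<[^>]+>', '', answer['value']['stringValue'])
        body = body + '\n' + stripped

    # Write the modified document back (same key here, purely for illustration)
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode('utf-8'))

    return {
        'version': 'v0',
        's3ObjectKey': key,
        'metadataUpdates': []
    }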

Is there a way to determine the type of an attribute in an AWS DynamoDB item with Node.js?

Each item in my table represents a social media post.
It has the following definition:
{
    "title": string,
    "user_id": string,
    "description": string,
    "likes": set,
    "comments": list
}
When I'm updating an item, I want to overwrite the existing value if it's a string and add to the existing value if it's a set/list. How do I determine the type of an attribute? Is this the best way to design my table?
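For what it's worth, the low-level API exposes the type as the single descriptor key on each attribute value, so one way to branch on it is sketched below (shown with Python's boto3 client for brevity; the Node.js low-level client returns the same shape, and the table and key names are placeholders):

import boto3

dynamodb = boto3.client('dynamodb')

# Low-level GetItem returns attribute values with type descriptors,
# e.g. {'title': {'S': '...'}, 'likes': {'SS': [...]}, 'comments': {'L': [...]}}
item = dynamodb.get_item(
    TableName='posts',            # placeholder table name
    Key={'user_id': {'S': 'u1'}}  # placeholder key schema
)['Item']

def descriptor(attr):
    return next(iter(attr))  # the single key of the value map is the type descriptor

print(descriptor(item['title']))     # 'S'  -> overwrite with a SET update expression
print(descriptor(item['likes']))     # 'SS' -> add to the set with ADD
print(descriptor(item['comments']))  # 'L'  -> append with list_append inside SET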

AWS: Transforming data from DynamoDB before it's sent to Cloudsearch

I'm trying to set up AWS' Cloudsearch with a DynamoDB table. My data structure is something like this:
{
    "name": "John Smith",
    "phone": "0123 456 789",
    "business": {
        "name": "Johnny's Cool Co",
        "id": "12345",
        "type": "contractor",
        "suburb": "Sydney"
    },
    "profession": {
        "name": "Plumber",
        "id": "20"
    },
    "email": "johnsmith#gmail.com",
    "id": "354684354-4b32-53e3-8949846-211384"
}
Importing this data from DynamoDB -> Cloudsearch is a breeze; however, I want to be able to index on some of these nested object parameters (like business.name, profession.name, etc.).
Cloudsearch is pulling in some of the nested objects like suburb, but it seems like it's impossible for it to differentiate between the name in the root of the object and the name within the business and profession objects.
Questions:
How do I make these nested parameters searchable? Can I index on business.name or something?
If #1 is not possible, can I somehow send my data through a transforming function before it gets to Cloudsearch? This way I could flatten all of my objects and give the fields unique names like businessName and professionName
EDIT:
My solution at the moment is to have a separate DynamoDB table which replicates our users table, but stores it in a CloudSearch-friendly format. However, I don't like this solution at all so any other ideas are totally welcome!
You can use DynamoDB Streams and write a function that runs in Lambda to capture changes and add documents to CloudSearch, flattening them at that point, instead of keeping an additional DynamoDB table.
For example, within my Lambda function I have logic that keeps the list of nested fields (within a "body" parent in this case) and I just flatten them with their field name; in the case of duplicate sub-field names, you can append the parent name to create a new field such as "body-name" as the key.
... misc. setup ...
headers = { "Content-Type": "application/json" }
indexed_fields = ['app', 'name', 'activity'] # fields to flatten

def handler(event, context): # lambda handler called at each update
    document = {} # document to be uploaded to cloudsearch
    document['id'] = ... # your uid, from the dynamo update record likely
    document['type'] = 'add'
    all_fields = {}
    # flatten/pull out info you want indexed
    for record in event['Records']:
        body = record['dynamodb']['NewImage']['body']['M']
        for key in indexed_fields:
            all_fields[key] = body[key]['S']
    document['fields'] = all_fields
    # post update to cloudsearch endpoint
    r = requests.post(url, auth=awsauth, json=document, headers=headers)
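As a variation on that, a small helper that flattens an arbitrarily nested image and prefixes sub-field names with their parent (as suggested above) could look like this; the field names in the example comment are illustrative:

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def flatten(image, parent=''):
    # Flatten a deserialized DynamoDB image into CloudSearch-friendly fields
    flat = {}
    for key, value in image.items():
        name = f"{parent}_{key}" if parent else key  # e.g. business_name, profession_name
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# record is one entry from event['Records'] in the stream-triggered Lambda
new_image = record['dynamodb']['NewImage']
fields = flatten({k: deserializer.deserialize(v) for k, v in new_image.items()})
# -> {'name': 'John Smith', 'business_name': "Johnny's Cool Co", 'profession_name': 'Plumber', ...}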

JSON API response and ember model names

A quick question about the JSON API response key "type" matching up with an Ember model name.
If I have a model, say "models/photo.js" and I have a route like "/photos", my JSON API response looks like this
{
    data: [{
        id: "298486374",
        type: "photos",
        attributes: {
            name: "photo_name_1.png",
            description: "A photo!"
        }
    }, {
        id: "298434523",
        type: "photos",
        attributes: {
            name: "photo_name_2.png",
            description: "Another photo!"
        }
    }]
}
I'm under the assumption that my model name should be singular, but this error pops up:
Assertion Failed: You tried to push data with a type 'photos' but no model could be found with that name
This is, of course, because my model is named "photo"
Now in the JSON API spec there is a note that reads "This spec is agnostic about inflection rules, so the value of type can be either plural or singular. However, the same value should be used consistently throughout an implementation."
So, tl;dr: Is the "Ember way" of doing things to have both the model names and the JSON API response key "type" be singular? Or does it not matter as long as they match?
The JSON API serializer expects a plural type; see the payload example from the guides.
Since the modelNameFromPayloadKey function singularizes the key, it also works with a singular type:
// as is
modelNameFromPayloadKey: function(key) {
    return singularize(normalizeModelName(key));
}
but the inverse operation, payloadKeyFromModelName, pluralizes the model name and should be changed if you use a singular type in your backend:
// as is
payloadKeyFromModelName: function(modelName) {
    return pluralize(modelName);
}
It is important to note that the internal Ember Data JSON API format differs a bit from the one used by JSONAPISerializer: store.push expects a singular type, while the JSON API serializer expects a plural one.
From discussion:
"...ED uses camelCased attributes and singular types internally, regardless of what adapter/serializer you're using.
When you're using the JSON API adapter/serializer we want users to be able to use the examples available on jsonapi.org and have it just work. Most users never have to care about the internal format since the serializer handles the work for them.
This is documented in the guides, http://guides.emberjs.com/v2.0.0/models/pushing-records-into-the-store/
..."
Depending on your use case, you might try pushPayload instead of push. As the documentation suggests, it does some normalization; and in my case it covered the "plural vs. singular" problem.