According to the AWS Lambda event filtering documentation, you can filter events by the existence of a key like so: {"foo":[{"exists":true}]}. However, through much trial and error, I found that this only applies to keys whose values are JSON primitives, not nested structures.
e.g. the above filter will work for
{
"foo": "bar"
}
but not for
{
"foo": { "baz": "bar"}
}
To match the latter event, your filter needs to drill down into the nested key, e.g. {"foo":{"baz":[{"exists":true}]}}.
This is especially easy to miss when triggering on DynamoDB Streams events, since the payload is in DynamoDB JSON format.
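For example (the "foo" attribute name and the "S" string type descriptor here are only illustrative), a filter on a DynamoDB Streams event source has to reach through the stream record's dynamodb.NewImage wrapper and the attribute's type descriptor:
{
  "dynamodb": {
    "NewImage": {
      "foo": { "S": [ { "exists": true } ] }
    }
  }
}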
Is this a bug or intentional?
I would like to write a data structure like this to DynamoDB via a step function:
{
"data": { "foo": "bar" },
"id": "TEST-0123",
"rev": "1.0.0",
"source": "test",
"streamId": "test-stream-1",
"timestamp": 9999,
"type": "TestExecuted",
"version": 1
}
Data can be a deeply nested object.
I would prefer not to use Lambda, if possible. So my question is: is it possible to marshal this data into a form that can be inserted into a DynamoDB table?
If your input format has a fixed shape, you can marshal it into DynamoDB JSON in the DynamoDB integration's task parameters with JSON Path references.
If your input has an arbitrary shape, you'll need a Lambda Task (in which case you're probably better off using an SDK to marshal and perform the DynamoDB operation in one go).
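For the fixed-shape case, a sketch of a DynamoDB PutItem Task state might look like this (the table name is a placeholder, and the numeric fields are formatted to strings because DynamoDB's "N" type expects string values):
{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "YOUR_TABLE",
    "Item": {
      "id": { "S.$": "$.id" },
      "rev": { "S.$": "$.rev" },
      "source": { "S.$": "$.source" },
      "streamId": { "S.$": "$.streamId" },
      "timestamp": { "N.$": "States.Format('{}', $.timestamp)" },
      "type": { "S.$": "$.type" },
      "version": { "N.$": "States.Format('{}', $.version)" },
      "data": { "M": { "foo": { "S.$": "$.data.foo" } } }
    }
  },
  "End": true
}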
I have some sample data (simulating real data that I will begin getting soon) that represents user behavior on a website.
The data is broken down into two JSON files for every day of usage. (When I'm getting the real data, I will want to fetch it every day at midnight.) At the bottom of this question are example snippets of what this data looks like, if that helps.
I'm no data scientist, but I'd like to be able to do some basic analysis on this data. I want to be able to see things like how many of the user-generated objects existed on any given day, and the distribution of different attributes that they have/had. I'd also like to be able to visualize which objects are getting edited more, by whom, when and how frequently. That sort of thing.
I think I'd like to be able to make dashboards in Google Data Studio (or similar), which basically means getting this data, in a usable format, into a normal relational database. I'm thinking Postgres on AWS RDS (there isn't so much data that I need something like Aurora, I think, though I'm not terribly opposed).
I want to automate the ingestion of the data (for now from the sample data sets I have stored on S3, but eventually from an API that can be called daily), and I want to automate any reformatting/processing this data needs to produce the types of insights I want.
AWS has so many data science/big data tools that it feels to me like there should be a way to automate this type of data pipeline, but the terminology and concepts are too foreign to me, and I can't figure out what direction to move in.
Thanks in advance for any advice that y'all can give.
Data example/description:
One file is a catalog of all user-generated objects that exist at the time the data was pulled, along with their attributes. It looks something like this:
{
"obj_001": {
"id": "obj_001",
"attr_a": "a1",
"more_attrs": {
"foo": "fred":,
"bar": null
}
},
"obj_002": {
"id": "obj_002",
"attr_a": "b2",
"more_attrs": {
"foo": null,
"bar": "baz"
}
}
}
The other file is an array that lists all the user edits to those objects that occurred in the past day, which resulted in the state from the first file. It looks something like this:
[
{
"edit_seq": 1,
"obj_id": "obj_002",
"user_id": "u56",
"edit_date": "2020-01-27",
"times": {
"foo": null,
"bar": "baz"
}
},
{
"edit_seq": 2,
"obj_id": "obj_001",
"user_id": "u25",
"edit_date": "2020-01-27",
"times": {
"foo": "fred",
"bar": null
}
}
]
It depends on the architecture that you want to deploy. If you want an event-based trigger, I would use SQS (I have used it heavily): as soon as someone drops a file in S3, it can send a message to SQS, which can in turn trigger a Lambda.
Here is a link which can give you some idea: http://blog.zenof.ai/processing-high-volume-big-data-concurrently-with-no-duplicates-using-aws-sqs/
You could build data pipelines using AWS Data Pipeline, for example if you want to read data from S3, apply some transformations, and then load it into Redshift.
You can also have a look at AWS Glue, which has a Spark backend and can also crawl the schema and perform ETL.
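As a rough illustration of the event-driven option (the output bucket name and the CSV layout below are assumptions, not prescriptions), a Lambda triggered by the daily S3 drop could flatten the edits file into rows that are easy to COPY into Postgres or to point Glue/Athena at:
import csv
import io
import json

import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-analytics-bucket"  # assumption: bucket for the flattened CSVs

def lambda_handler(event, context):
    # Triggered by an S3 put notification; each record points at a daily edits file
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        edits = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["edit_seq", "obj_id", "user_id", "edit_date", "foo", "bar"])
        for edit in edits:
            times = edit.get("times", {})
            writer.writerow([
                edit["edit_seq"], edit["obj_id"], edit["user_id"],
                edit["edit_date"], times.get("foo"), times.get("bar"),
            ])

        # Write the flattened file alongside the original, ready for loading into Postgres
        s3.put_object(Bucket=OUTPUT_BUCKET,
                      Key=key.replace(".json", ".csv"),
                      Body=buf.getvalue())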
When I update a record in DynamoDB like so,
UpdateExpression: "set #audioField = :payload",
ExpressionAttributeValues: {
  ":payload": something,
},
where
var something = { "test.com1": {} }
DynamoDB puts a random character in the record like this
{ "test.com1" : { "M" : { } }}
What's up with this? And how do I prevent this?
This is not a random character, this is how DynamoDB stores and represents types.
DynamoDB embeds type information in each value that it stores. See the following for the list of types: https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_AttributeValue.html
Based on the link above, the "M" you are seeing describes the contents of the "test.com1" attribute, which is a map ("M" for map).
The reason you are not seeing these in your other attributes is probably because the SDK is automatically translating this DynamoDB structure into native types for the top-level attributes but not for nested attributes.
What language/SDK are you using? Many SDKs have helpers that you can pass your results through to parse these embedded types and convert them into native types that are easier to work with.
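For example, if Python/boto3 is an option, the TypeDeserializer helper converts the typed representation back into plain native values (the value below mirrors the one in the question):
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

item = {"test.com1": {"M": {}}}  # the typed value as stored by DynamoDB

# Deserialize each attribute back into a native Python type
plain = {k: deserializer.deserialize(v) for k, v in item.items()}
print(plain)  # {'test.com1': {}}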
Consider the following architecture:
write -> DynamoDB table -> stream -> Lambda -> write metadata item to same table
It could be used for many, many awesome situations, e.g. table- and item-level aggregations. I've seen this architecture promoted in several tech talks by official AWS engineers.
But doesn't writing the metadata item add a new item to the stream and run the Lambda again?
How do you avoid an infinite loop? Is there a way to keep the metadata write from appearing in the stream?
Or is spending two stream and Lambda requests inevitable with this architecture (we're charged per request), i.e. exiting the Lambda function early if it's the metadata item?
As triggering an AWS Lambda function from a DynamoDB stream is a binary option (on/off), it's not possible to only trigger the AWS Lambda function for certain writes to the table. So your AWS Lambda function will be called again for the items it just wrote to the DynamoDB table. The important bit is to have logic in place in your AWS Lambda function to detect that it wrote that data and not write it again in that case. Otherwise you'd get the mentioned infinite loop, which would be a really unfortunate situation, especially if it went unnoticed.
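A minimal sketch of that guard (the isMetadata marker attribute and the process helper are placeholders; any attribute or key prefix that only your metadata items carry would do):
def lambda_handler(event, context):
    for record in event["Records"]:
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        # Skip the items this function wrote itself so they are never re-processed
        if "isMetadata" in new_image:
            continue
        process(record)  # your aggregation / metadata write goes here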
Currently DynamoDB does not offer condition-based subscriptions to the stream, so yes, DynamoDB will execute your Lambda function in an infinite loop. Currently the only solution is to limit when your Lambda function does its work: you can use multiple Lambda functions, one of which is there just to check whether the metadata was already written or not. I'm sharing a cloud architecture diagram of how you can achieve it.
A bit late but hopefully people looking for a more demonstrative answer will find this useful.
Suppose you want to process records where you add to an item up to a certain threshold; you could have an if condition that checks this and processes or skips the record, e.g.:
This code assumes you have an attribute "Type" on each of your entities/object types (this was recommended to me by Rick Houlihan himself), but you could also check whether an attribute exists, i.e. "<your-attribute>" in record["dynamodb"]["NewImage"]. It also assumes you are designing with PK and SK as generic primary and sort key names.
import os

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("TABLE_NAME"))  # table name taken from the environment
threshold = int(os.environ.get("THRESHOLD", "0"))

def get_value(pk):
    # Look up the current value of the attribute we aggregate on
    response = table.query(KeyConditionExpression=Key("PK").eq(pk))
    items = response.get("Items", [])
    return items[0].get("<your-attribute>", 0) if items else 0

def your_aggregation_function(record):
    # Your aggregation logic here
    # Write back to the table with a put_item call once done
    pass

def lambda_handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "REMOVE" and record["dynamodb"]["NewImage"]["Type"]["S"] == "<your-entity-type>":
            # Query the table to extract the attribute value
            attribute_value = get_value(record["dynamodb"]["Keys"]["PK"]["S"])
            if attribute_value < threshold:
                # Send to your aggregation function
                your_aggregation_function(record)
Having these conditions in place in the Lambda handler (you could move them to wherever suits your needs) prevents the infinite loop mentioned.
You may want additional checks in the update expression to make sure two (or more) concurrent Lambdas are not writing the same object. I suggest you use a date/timestamp defined in the Lambda and add it to the SK, or, if you can't, have an "EventDate" attribute in your item so that you can add a ConditionExpression or an UpdateExpression with SET if_not_exists(#attribute, :date).
The above will guarantee that your Lambda is idempotent.
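For instance (the table, key, and attribute names below are placeholders), a conditional update along these lines applies the aggregation at most once per event date:
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("YOUR_TABLE")

def apply_once(pk, sk, event_date, delta):
    try:
        table.update_item(
            Key={"PK": pk, "SK": sk},
            # Record the event date and bump the aggregate in one write
            UpdateExpression="SET EventDate = :date ADD AggregateCount :delta",
            # Only apply if we haven't already processed this (or a newer) event date
            ConditionExpression="attribute_not_exists(EventDate) OR EventDate < :date",
            ExpressionAttributeValues={":date": event_date, ":delta": delta},
        )
    except ClientError as e:
        # A failed condition just means another invocation already handled this date
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise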
I would like to filter a list and sort it based on an aggregate; something that is fairly simple to express in SQL, but I'm puzzled about the best way to do it with iterative Map Reduce. I'm specifically using Cloudant's "dbcopy" addition to CouchDB, but I think the approach might be similar with other map/reduce architectures.
Pseudocode SQL might look like so:
SELECT grouping_field, aggregate(*)
FROM data
WHERE #{filter}
GROUP BY grouping_field
ORDER BY aggregate(*), grouping_field
LIMIT page_size
The filter might be looking for a match or it might searching within a range; e.g. field in ('foo', 'bar') or field between 37 and 42.
As a concrete example, consider a dataset of emails; the grouping field might be "List-id", "Sender", or "Subject"; the aggregate function might be count(*), or max(date) or min(date); and the filter clause might consider flags, a date range, or a mailbox ID. The documents might look like so:
{
"id": "foobar", "mailbox": "INBOX", "date": "2013-03-29",
"sender": "foo#example.com", "subject": "Foo Bar"
}
Getting a count of emails with the same sender is trivial:
"map": "function (doc) { emit(doc.sender, null) }",
"reduce": "_count"
And Cloudant has a good example of sorting by count on the second pass of a map reduce.
But when I also want to filter (e.g. by mailbox), things get messy fast.
If I add the filter to the view keys (e.g. the final result looks like {"key": ["INBOX", 1234, "foo#example.com"], "value": null}), then it's trivial to sort by count within a single filter value. But sorting that data by count across multiple filter values would require traversing the entire data set (per key), which is far too slow on large data sets.
Or I could create an index for each potential filter selection; e.g. the final result looks like {"key": [["mbox1", "mbox2"], 1234, "foo#example.com"], "value": null} (for when both "mbox1" and "mbox2" are selected), or {"key": [["mbox1"], 1234, "foo#example.com"], "value": {...}} (for when only "mbox1" is selected). That's easy to query, and fast. But it seems like the disk size of the index will grow exponentially (with the number of distinct filtered fields). And it seems to be completely untenable for filtering on open-ended data, such as date ranges.
Lastly, I could dynamically generate views which handle the desired filters on the fly, only on an as-needed basis, and tear them down after they are no longer being used (to save on disk space). The downsides here are a giant jump in code complexity, and a big up-front cost every time a new filter is selected.
Is there a better way?
I've been thinking about this for nearly a day and I think that there is no better way to do this than what you have proposed. The challenges that you face are the following:
1) The aggregation work (count, sum, etc) can only be done in the CouchDB/Cloudant API via the materialized view engine (mapreduce).
2) While the group_level API provides some flexibility to specify variable granularity at query time, it isn't sufficiently flexible for arbitrary boolean queries.
3) Arbitrary boolean queries are possible in the Cloudant API via the Lucene-based _search API. However, the _search API doesn't support aggregation after the query. Limited support for what you want is only possible in Lucene using faceting, which isn't yet supported in Cloudant. Even then, I believe it may only support count and not sum or more complex aggregations.
I think the best option you face is to use the _search API, make use of sort, group_field, or group_sort, and then do the aggregation on the client. A few sample URLs to test would look like:
GET /db/_design/ddoc/_search/indexname?q=name:mike AND age:[1.2 TO 4.5]&sort=["age","name"]
GET /db/_design/ddoc/_search/indexname?q=name:mike&group_field=mailbox&group_sort=["age","name"]
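And a rough sketch of the client-side aggregation (the account, credentials, database, and index names are placeholders, and it assumes the search index indexes the mailbox field while the docs carry a sender field):
from collections import Counter

import requests

SEARCH_URL = "https://ACCOUNT.cloudant.com/db/_design/ddoc/_search/indexname"

def count_by_sender(mailbox):
    counts = Counter()
    bookmark = None
    while True:
        params = {"q": f"mailbox:{mailbox}", "limit": 200, "include_docs": "true"}
        if bookmark:
            params["bookmark"] = bookmark
        page = requests.get(SEARCH_URL, params=params, auth=("USER", "PASS")).json()
        if not page["rows"]:
            break
        for row in page["rows"]:
            # Tally each returned email by its sender; equivalent to count(*) GROUP BY sender
            counts[row["doc"]["sender"]] += 1
        bookmark = page["bookmark"]
    return counts.most_common()  # sorted by the aggregate, descending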