I would like to write a data structure like this to DynamoDB via a step function:
{
"data": { "foo": "bar" },
"id": "TEST-0123",
"rev": "1.0.0",
"source": "test",
"streamId": "test-stream-1",
"timestamp": 9999,
"type": "TestExecuted",
"version": 1
}
Data can be a deeply nested object.
I would prefer not to use Lambda, if possible. So my question is: is it possible to marshal this data into a form that can be inserted into a DynamoDB table?
If your input format has a fixed shape, you can marshal it into DynamoDB JSON in the DynamoDB integration's task parameters with JSON Path references.
If your input has an arbitrary shape, you'll need a Lambda Task (in which case you're probably better off using an SDK to marshal and perform the DynamoDB operation in one go).
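For example, with the fixed shape above, a PutItem task can spell out the DynamoDB JSON by hand. The following is only a rough sketch under assumptions (a hypothetical table name of Events, and the nested data object written out field by field); States.Format is used because the N type expects a string value:
{
  "Write Event": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
      "TableName": "Events",
      "Item": {
        "id": { "S.$": "$.id" },
        "streamId": { "S.$": "$.streamId" },
        "type": { "S.$": "$.type" },
        "source": { "S.$": "$.source" },
        "rev": { "S.$": "$.rev" },
        "timestamp": { "N.$": "States.Format('{}', $.timestamp)" },
        "version": { "N.$": "States.Format('{}', $.version)" },
        "data": { "M": { "foo": { "S.$": "$.data.foo" } } }
      }
    },
    "End": true
  }
}
Note that the nested data attribute has to be written out explicitly as an M map, which is exactly why an arbitrarily shaped payload pushes you toward a Lambda Task.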
Related
According to the AWS Lambda event filtering documentation, you can filter events by the existence of keys, like so: {"foo":[{"exists":true}]}. However, through much trial and error, I found that this only applies to keys that contain a JSON primitive and not a nested structure.
e.g. the above filter will work for
{
"foo": "bar"
}
but not for
{
"foo": { "baz": "bar"}
}
If you want to match the latter event, your filter needs to drill down into the nested structure, e.g. {"foo":{"baz":[{"exists":true}]}}.
This is especially easy to miss when triggering on DynamoDB Streams events, since the payload is in DynamoDB JSON format.
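For example (a rough sketch, assuming the nested attribute above ends up in the stream record's NewImage), a filter that matches has to descend through the DynamoDB JSON type descriptors all the way to the leaf:
{
  "dynamodb": {
    "NewImage": {
      "foo": {
        "M": {
          "baz": { "S": [ { "exists": true } ] }
        }
      }
    }
  }
}
That is, the exists check only matches once the pattern reaches a key holding a primitive (the "S" type descriptor here).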
Is this a bug or intentional?
I am trying to create a simple logging system using DynamoDB. Using this simple table structure:
{
"userName": "Billy Wotsit",
"messageLog": [
{
"date": "2022-06-08 13:17:03",
"messageId": "j2659afl32btc0feqtbqrf802th296srbka8tto0",
"status": 200
},
{
"date": "2022-06-08 16:28:37.464",
"id": "eb4oqktac8i19got1t70eec4i8rdcman6tve81o0",
"status": 200
},
{
"date": "2022-06-09 11:54:37.457",
"id": "m5is9ah4th4kl13d1aetjhjre7go0nun2lecdsg0",
"status": 200
}
]
}
It is easily possible that the number of items in the message log will run into the thousands.
According to the documentation, an "item" can have a maximum size of 400 KB, which severely limits the number of log elements that can be stored.
What would be the correct way to store this amount of data without resorting to a more traditional SQL approach (which is not really needed)?
Some information on the use cases for your data would help. My primary questions would be:
How often do you need to update/append logs to an entry?
How do you need to retrieve the logs? Do you need all of them? Do you need them per user? Will you filter by time?
Without knowing any more info on your data read/write patterns, a better layout would be to have each log entry be an item in the DB:
The username can be the partition key
The log date can be the sort key.
(optional) If you need to ever retrieve by message id, you can create a secondary index. (Or vice versa with the log date)
With the above structure, you can trivially append log entries and retrieve them efficiently, assuming you've set your sort key and indexes appropriately for your use case; a sketch of the layout follows below. Focus on understanding how the sort key and secondary indexes optimize finding your entries in the DB. You want to avoid scanning through the entire database looking for your items. See Core Components of Amazon DynamoDB for more info.
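As a rough sketch of that layout (attribute names are illustrative), each log entry becomes its own item, keyed by userName (partition key) and date (sort key):
[
  {
    "userName": "Billy Wotsit",
    "date": "2022-06-08 13:17:03",
    "messageId": "j2659afl32btc0feqtbqrf802th296srbka8tto0",
    "status": 200
  },
  {
    "userName": "Billy Wotsit",
    "date": "2022-06-08 16:28:37.464",
    "messageId": "eb4oqktac8i19got1t70eec4i8rdcman6tve81o0",
    "status": 200
  }
]
A Query with a key condition on userName (optionally with a date range on the sort key) then returns one user's logs for a time window without scanning the table, and no single item grows toward the 400 KB limit.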
I am new to coding and learning about AWS services. I am trying to fetch data iteratively from a DynamoDB table in a Step Functions workflow. For example, given a book ID, I want to check the names of the customers that purchased the book. I am primarily trying to get practice using the Map state and to integrate it with DynamoDB.
I created a DDB table with id as the partition key and added cust_name1 and cust_name2 as attributes for the multiple customers of a book. Now, in my Step Functions workflow, I want to use a Map state to query how many people have that book ID. Is this possible? Or is there a better way to use the Map state for this scenario?
I am able to do this with a Task state, but I am trying to figure out how to use a Map state for this scenario.
{
"Comment": "A description of my state machine",
"StartAt": "DynamoDB Get Book Purchases",
"States": {
"DynamoDB Get Book Purchases": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "get-book-data",
"Key": {
"id": {
"S": "121-11-2436"
}
}
},
"ResultPath": "$.DynamoDB",
"End": true
}
}
}
Map (in Step Functions) is designed to apply a function to each element of an array or list. Count is an aggregation function, so using it as part of Map is not that useful.
For your exercise, you could instead allow users to buy multiple books, where the Map state checks the availability of each book in your inventory, as sketched below.
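A minimal sketch of that idea, assuming the state input carries a bookIds array and reusing the get-book-data table from your example (everything else is illustrative):
{
  "Comment": "Look up each purchased book",
  "StartAt": "For Each Book",
  "States": {
    "For Each Book": {
      "Type": "Map",
      "ItemsPath": "$.bookIds",
      "Iterator": {
        "StartAt": "DynamoDB Get Book",
        "States": {
          "DynamoDB Get Book": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",
            "Parameters": {
              "TableName": "get-book-data",
              "Key": {
                "id": { "S.$": "$" }
              }
            },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
With input like {"bookIds": ["121-11-2436", "121-11-2437"]}, the Map state runs one GetItem per book ID and returns the results as an array.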
I am trying to retrieve the secret details using an AWS CLI command, and I am able to get the details. But I am not able to understand the format in which the dates are being returned.
{
"RotationRules": {
"AutomaticallyAfterDays": 90
},
"Name": "indextract/uat",
"VersionIdsToStages": {
"51b23a11-b871-40ec-a5f0-d4d3c90d781e": [
"AWSCURRENT"
],
"1f0b581b-4353-43d0-9163-f3a8a622b479": [
"AWSPREVIOUS"
]
},
"Tags": [],
"RotationEnabled": true,
"LastChangedDate": 1596000798.137,
"LastRotatedDate": 1595914104.584,
"KmsKeyId": "XXX",
"RotationLambdaARN": "XXX",
"LastAccessedDate": 1595980800.0,
"ARN": "XXX",
"Description": "ZZZZ"
}
Can someone please help in interpreting LastRotatedDate? Is there a cast function I can use directly, or on the field after parsing the JSON?
Maybe a Python or a Unix command?
As a second part of the question, my requirement is to get the new password only if it has changed. One way is to make a first API call to get the LastChangedDate and then make the get-secret-value call only if required as per the rotation days.
But this would need 2 API calls. Is there a way to do this in a single call? Maybe by passing an argument like a date and getting a response only if LastChangedDate is beyond the passed argument?
I could not find a way in the docs, so I thought I would ask for suggestions.
LastChangedDate, LastRotatedDate and LastAccessedDate are all timestamp fields, described here.
In other words, they are Unix epoch timestamps. Converting your LastRotatedDate of 1595914104.584 gives 2020-07-28 05:28:24 UTC; on Linux you can check this with date -u -d @1595914104, or in Python with datetime.datetime.fromtimestamp(1595914104.584, tz=datetime.timezone.utc).
For the second question, I have not found a direct way to do it as you describe, but you can handle it with some logic like this:
Keep track of the timestamp at which you know the secret was rotated, perhaps storing it in DynamoDB.
Based on that stored value, decide whether you need to read a new password or not.
You can use Lambda and CloudWatch rules to handle the trigger and the small amount of logic for it.
But again, that logic takes some effort as well.
Could I ask whether you want to reduce the API calls because of a performance concern?
I have some sample data (which is simulating real data that I will begin getting soon), that represents user behavior on a website.
The data is broken down into two JSON files for every day of usage. (When I'm getting the real data, I will want to fetch it every day at midnight.) At the bottom of this question are example snippets of what this data looks like, if that helps.
I'm no data scientist, but I'd like to be able to do some basic analysis on this data. I want to be able to see things like how many of the user-generated objects existed on any given day, and the distribution of different attributes that they have/had. I'd also like to be able to visualize which objects are getting edited more, by whom, when and how frequently. That sort of thing.
I think I'd like to be able to make dashboards in Google Data Studio (or similar), which basically means getting this data, in a usable format, into a normal relational database. I'm thinking Postgres in AWS RDS (there isn't so much data that I need something like Aurora, I think, though I'm not terribly opposed).
I want to automate the ingestion of the data (for now from the sample data sets I have stored on S3, but eventually from an API that can be called daily), and I want to automate any reformatting/processing this data needs to produce the types of insights I want.
AWS has so many data science/big data tools that it feels to me like there should be a way to automate this type of data pipeline, but the terminology and concepts are too foreign to me, and I can't figure out what direction to move in.
Thanks in advance for any advice that y'all can give.
Data example/description:
One file is a catalog of all user-generated objects that exist at the time the data was pulled, along with their attributes. It looks something like this:
{
"obj_001": {
"id": "obj_001",
"attr_a": "a1",
"more_attrs": {
"foo": "fred":,
"bar": null
}
},
"obj_002": {
"id": "obj_002",
"attr_a": "b2",
"more_attrs": {
"foo": null,
"bar": "baz"
}
}
}
The other file is an array that lists all the user edits to those objects that occurred in the past day, which resulted in the state from the first file. It looks something like this:
[
{
"edit_seq": 1,
"obj_id": "obj_002",
"user_id": "u56",
"edit_date": "2020-01-27",
"times": {
"foo": null,
"bar": "baz"
}
},
{
"edit_seq": 2,
"obj_id": "obj_001",
"user_id": "u25",
"edit_date": "2020-01-27",
"times": {
"foo": "fred",
"bar": null
}
}
]
It depends on the architecture that you want to deploy. If you want an event-based trigger, I would use SQS; I have used it heavily. As soon as someone drops a file in S3, it can send a message to SQS, which can in turn trigger a Lambda function.
Here is a link which can give you some idea: http://blog.zenof.ai/processing-high-volume-big-data-concurrently-with-no-duplicates-using-aws-sqs/
You could build data pipelines using AWS Data Pipeline, e.g. if you want to read data from S3, apply some transformations, and then load the results into Redshift.
You can also have a look at AWS Glue, which has a Spark backend and can also crawl the schema and perform ETL.