Read data iteratively from DynamoDB inside a Step Functions workflow

I am new to coding and learning about AWS services. I am trying to fetch data iteratively from a DynamoDB table in a Step Functions workflow. For example, for a given book ID I want to check the names of the customers who purchased that book. I am primarily trying to practice using the Map state and integrate it with DDB.
I created a DDB table with id as the partition key and added cust_name1 and cust_name2 as attributes for the multiple customers of a book. Now, in my Step Functions workflow, I want to use a Map state to query how many people have that book ID. Is this possible? Or is there a better way to use a Map state for this scenario?
I am able to do this in a Task state, but I am trying to figure out how to use a Map state for this scenario.
{
  "Comment": "A description of my state machine",
  "StartAt": "DynamoDB Get Book Purchases",
  "States": {
    "DynamoDB Get Book Purchases": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "get-book-data",
        "Key": {
          "id": {
            "S": "121-11-2436"
          }
        }
      },
      "ResultPath": "$.DynamoDB",
      "End": true
    }
  }
}

Map (in Step Functions) is designed to apply a function to each element of an array or list. Count is an aggregation function, so it is not a natural fit for a Map state.
For your exercise, you could instead allow users to buy multiple books, and have the Map state check the availability of each book in your inventory.
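For example, each Map iteration could invoke a small Lambda task that checks one book. This is only a minimal sketch, assuming a hypothetical book-inventory table keyed on id with a numeric stock attribute (the table, attribute, and field names are made up):

import boto3

dynamodb = boto3.client("dynamodb")

def lambda_handler(event, context):
    # Each Map iteration receives one element of the input array,
    # e.g. {"book_id": "121-11-2436"}.
    book_id = event["book_id"]

    # Hypothetical inventory table keyed on "id".
    response = dynamodb.get_item(
        TableName="book-inventory",
        Key={"id": {"S": book_id}},
    )
    item = response.get("Item")

    return {
        "book_id": book_id,
        "available": bool(item) and int(item.get("stock", {}).get("N", "0")) > 0,
    }

The Map state would then turn an array of book IDs into an array of availability results, which a later state could aggregate.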

Related

Marshalling Dynamo Data in Step functions

I would like to write a data structure like this to DynamoDB via a step function:
{
  "data": { "foo": "bar" },
  "id": "TEST-0123",
  "rev": "1.0.0",
  "source": "test",
  "streamId": "test-stream-1",
  "timestamp": 9999,
  "type": "TestExecuted",
  "version": 1
}
Data can be a deeply nested object.
I would prefer not to use Lambda, if possible. So my question is: is it possible to marshal this data into a form that can be inserted into a DynamoDB table?
If your input format has a fixed shape, you can marshal it into DynamoDB JSON in the DynamoDB integration's task parameters with JSON Path references.
If your input has an arbitrary shape, you'll need a Lambda Task (in which case you're probably better off using an SDK to marshal and perform the DynamoDB operation in one go).
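If you do go the Lambda route for an arbitrary shape, the SDK handles the marshalling for you. A minimal sketch (the events table name is a placeholder):

import boto3

# The high-level Table resource marshals plain Python types (nested dicts,
# lists, strings, ints) into DynamoDB's attribute-value format automatically.
table = boto3.resource("dynamodb").Table("events")  # placeholder table name

def lambda_handler(event, context):
    # "event" is the arbitrarily shaped document passed in by the
    # Step Functions Lambda task, e.g. the JSON from the question.
    # Note: float values must be converted to Decimal before writing.
    table.put_item(Item=event)
    return {"ok": True}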

AWS DynamoDB Max Item size

I am trying to create a simple logging system using DynamoDB, with this simple table structure:
{
  "userName": "Billy Wotsit",
  "messageLog": [
    {
      "date": "2022-06-08 13:17:03",
      "messageId": "j2659afl32btc0feqtbqrf802th296srbka8tto0",
      "status": 200
    },
    {
      "date": "2022-06-08 16:28:37.464",
      "id": "eb4oqktac8i19got1t70eec4i8rdcman6tve81o0",
      "status": 200
    },
    {
      "date": "2022-06-09 11:54:37.457",
      "id": "m5is9ah4th4kl13d1aetjhjre7go0nun2lecdsg0",
      "status": 200
    }
  ]
}
It is easily possible that the number of items in the message log will run into the thousands.
According to the documentation, an "item" can have a maximum size of 400 KB, which severely limits the maximum number of log elements that can be stored.
What would be the correct way to store this amount of data without resorting to a more traditional SQL approach (which is not really needed)?
Some information on the use cases for your data would help. My primary questions would be:
How often do you need to update/ append logs to an entry?
How do you need to retrieve the logs? Do you need all of them? Do you need them per user? Will you filter by time?
Without knowing any more info on your data read/write patterns, a better layout would be to have each log entry be an item in the DB:
The username can be the partition key
The log date can be the sort key.
(Optional) If you ever need to retrieve entries by message ID, you can create a secondary index (or vice versa with the log date).
With the above structure, you can trivially append log entries & retrieve them efficiently assuming you've set your sort key / indexes appropriately to your use case. Focus on understanding how the sort key and secondary indexes optimize finding your entries in the DB. You want to avoid scanning through the entire database looking for your items. See Core Components of Amazon DynamoDB for more info.
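As a rough sketch of that layout (the table name message-log and the attribute names here are assumptions, with userName as the partition key and logDate as the sort key):

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key "userName", sort key "logDate".
table = boto3.resource("dynamodb").Table("message-log")

def append_log(user_name, log_date, message_id, status):
    # Each log entry is its own item, so the 400 KB item limit no longer
    # caps the total number of entries a single user can accumulate.
    table.put_item(
        Item={
            "userName": user_name,
            "logDate": log_date,  # e.g. "2022-06-08 13:17:03"
            "messageId": message_id,
            "status": status,
        }
    )

def logs_for_user(user_name, start, end):
    # A Query on the partition key plus a range condition on the sort key;
    # no table scan involved.
    response = table.query(
        KeyConditionExpression=Key("userName").eq(user_name)
        & Key("logDate").between(start, end)
    )
    return response["Items"]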

AWS Personalize: how to deal with a huge catalog with not enough interaction data

I'm adding a product recommendation feature with Amazon Personalize to an e-commerce website. We currently have a huge product catalog with millions of items. We want to be able to use Amazon Personalize on our item details page to recommend other relevant items to the current item.
Now, as you may be aware, Amazon Personalize relies heavily on user interactions to provide recommendations. However, since we only just started this new line of business, we're not getting enough interaction data. The majority of items in our catalog have no interactions at all. A few items (thousands), though, get interacted with a lot, which then exerts a huge influence on the recommendation results. Hence you will see those few items always get recommended even if they are not relevant to the current item at all, creating very odd recommendations.
I think this is what we usually refer to as a "cold-start" situation, except that usual cold-start problems are about item cold-start or user cold-start, while the problem I am faced with now is a new-business cold-start: we don't have the baseline amount of interaction data to support fully personalized recommendations. In the absence of interaction data for each item, we want the Amazon Personalize service to rely on the item metadata to provide the recommendations. Ideally, we want the service to recommend based on item metadata at first and, once it is getting more interactions, recommend based on item metadata plus interactions.
So far I've done quite a bit of research, only to find one solution: increase explorationWeight when creating the campaign. As this article indicates, higher values for explorationWeight signify higher exploration; new items with low impressions are more likely to be recommended. But it does NOT seem to do the trick for me. It improves the situation a little, but I still often see odd results being recommended because of items with a higher interaction rate.
I'm not sure if there're any other solutions out there to remedy my situation. How can I improve the recommendation results when I have a huge catalog with not enough interaction data?
I'd appreciate it if anyone has any advice. Thank you and have a good day!
The SIMS recipe is typically what is used on product detail pages to recommend similar items. However, given that SIMS only considers the user-item interactions dataset and you have very little interaction data, SIMS will not perform well in this case. At least at this time. Once you have accumulated more interaction data, you may want to revisit SIMS for your detail page.
The user-personalization recipe is a better match here since it uses item metadata to recommend cold items that the user may be interested in. You can improve the relevance of recommendations based on item metadata by adding textual data to your items dataset. This is a new Personalize feature (see blog post for details). Just add your product descriptions to your items dataset as a textual field as shown below and create a solution with the user-personalization recipe.
{
  "type": "record",
  "name": "Items",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {
      "name": "ITEM_ID",
      "type": "string"
    },
    {
      "name": "BRAND",
      "type": [
        "null",
        "string"
      ],
      "categorical": true
    },
    {
      "name": "PRICE",
      "type": "float"
    },
    {
      "name": "DESCRIPTION",
      "type": [
        "null",
        "string"
      ],
      "textual": true
    }
  ],
  "version": "1.0"
}
If you're still using this recipe on your product detail page, you can also consider using a filter when calling GetRecommendations to limit recommendations to the current product's category.
INCLUDE ItemID WHERE Items.CATEGORY IN ($CATEGORY)
Where $CATEGORY is the current product's category. This may require some experimentation to see if it fits with your UX and catalog.
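With boto3, the filter can then be applied at request time roughly as follows (the ARNs and IDs are placeholders, and same-category is an assumed filter created from the expression above):

import boto3

personalize_runtime = boto3.client("personalize-runtime")

# ARNs and IDs below are placeholders.
response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/my-campaign",
    userId="user-42",
    itemId="current-product-id",
    numResults=10,
    filterArn="arn:aws:personalize:us-east-1:123456789012:filter/same-category",
    # The key must match the placeholder in the filter expression ($CATEGORY),
    # and the value is passed as a quoted string.
    filterValues={"CATEGORY": "\"electronics\""},
)
recommended_ids = [item["itemId"] for item in response["itemList"]]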

AWS ASM describe-secret details and date resolution

I am trying to retrieve the secret details using an AWS CLI command, and I am able to get the details. But I am not able to understand the format in which the dates are being returned.
{
  "RotationRules": {
    "AutomaticallyAfterDays": 90
  },
  "Name": "indextract/uat",
  "VersionIdsToStages": {
    "51b23a11-b871-40ec-a5f0-d4d3c90d781e": [
      "AWSCURRENT"
    ],
    "1f0b581b-4353-43d0-9163-f3a8a622b479": [
      "AWSPREVIOUS"
    ]
  },
  "Tags": [],
  "RotationEnabled": true,
  "LastChangedDate": 1596000798.137,
  "LastRotatedDate": 1595914104.584,
  "KmsKeyId": "XXX",
  "RotationLambdaARN": "XXX",
  "LastAccessedDate": 1595980800.0,
  "ARN": "XXX",
  "Description": "ZZZZ"
}
Can someone please help in interpreting LastRotatedDate? Is there a cast function which I can use directly, or apply to the field after parsing the JSON? Maybe a Python or a Unix command?
As a second part of the question, my requirement is to get the new password only if it has changed. One way is to make a first API call to get the LastChangedDate and then make a get-secret-value call only if required according to the rotation days.
But this would need two API calls. Is there a way to do this in a single call? Maybe by passing an argument like a date and getting a response only if LastChangedDate is later than the passed argument?
I could not find a way in the docs, so I thought I would ask for suggestions.
LastChangedDate, LastRotatedDate and LastAccessedDate are all timestamp fields described here.
In other words, they are Unix epoch timestamps (seconds since 1970-01-01 UTC, with fractional milliseconds). I have tried the conversion with your timestamps; for example, your LastRotatedDate of 1595914104.584 corresponds to 2020-07-28 05:28:24 UTC.
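A minimal Python sketch of the conversion, using the values from the output above (on a Linux shell, GNU date -u -d @1595914104 gives the same answer):

from datetime import datetime, timezone

# Epoch timestamps (seconds, with fractional milliseconds) copied from
# the describe-secret output above.
last_rotated = 1595914104.584
last_changed = 1596000798.137

print(datetime.fromtimestamp(last_rotated, tz=timezone.utc))
# 2020-07-28 05:28:24.584000+00:00
print(datetime.fromtimestamp(last_changed, tz=timezone.utc))
# 2020-07-29 05:33:18.137000+00:00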
For the second question, I have not found a direct way to do exactly what you describe, but you can handle it with some logic along these lines (a rough sketch follows below):
Keep track of the timestamp at which you know the secret was last rotated, perhaps storing it in DynamoDB.
Based on that stored value, decide whether you need to read a new password or not.
You can use Lambda and CloudWatch Events rules to handle the trigger and the small amount of logic for it.
But again, that logic takes some effort as well.
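A rough sketch of that logic, assuming you have the previously seen LastChangedDate stored somewhere (the function and variable names are just for illustration):

import boto3

secrets = boto3.client("secretsmanager")

def fetch_if_changed(secret_id, last_seen_change):
    # last_seen_change is the epoch timestamp you stored (in DynamoDB,
    # a parameter, a file, ...) the last time you fetched the secret.
    details = secrets.describe_secret(SecretId=secret_id)
    last_changed = details["LastChangedDate"].timestamp()  # boto3 returns a datetime

    if last_changed <= last_seen_change:
        # Nothing has changed; skip the get-secret-value call.
        return None, last_seen_change

    value = secrets.get_secret_value(SecretId=secret_id)["SecretString"]
    return value, last_changed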
Could I ask whether it is because of a performance concern that you want to avoid calling the API every time?
Thanks,

What AWS services & workflows can I use to ingest and analyze JSON daily usage data from an API?

I have some sample data (which is simulating real data that I will begin getting soon), that represents user behavior on a website.
The data is broken down into two JSON files for every day of usage. (When I'm getting the real data, I will want to fetch it every day at midnight.) At the bottom of this question are example snippets of what this data looks like, if that helps.
I'm no data scientist, but I'd like to be able to do some basic analysis on this data. I want to be able to see things like how many of the user-generated objects existed on any given day, and the distribution of the different attributes that they have/had. I'd also like to be able to visualize which objects are getting edited more, by whom, when, and how frequently. That sort of thing.
I think I'd like to be able to make dashboards in Google Data Studio (or similar), which basically means getting this data, in a usable format, into a normal relational database. I'm thinking Postgres on AWS RDS (there isn't so much data that I need something like Aurora, I think, though I'm not terribly opposed).
I want to automate the ingestion of the data (for now the sample data sets I have stored on S3, but eventually from an API that can be called daily). And I want to automate any reformatting/processing this data needs to get the types of insights I want.
AWS has so many data science/big data tools that it feels to me like there should be a way to automate this type of data pipeline, but the terminology and concepts are too foreign to me, and I can't figure out what direction to move in.
Thanks in advance for any advice that y'all can give.
Data example/description:
One file is a catalog of all user-generated objects that exist at the time the data was pulled, along with their attributes. It looks something like this:
{
  "obj_001": {
    "id": "obj_001",
    "attr_a": "a1",
    "more_attrs": {
      "foo": "fred",
      "bar": null
    }
  },
  "obj_002": {
    "id": "obj_002",
    "attr_a": "b2",
    "more_attrs": {
      "foo": null,
      "bar": "baz"
    }
  }
}
The other file is an array that lists all the user edits to those objects that occurred in the past day, which resulted in the state from the first file. It looks something like this:
[
  {
    "edit_seq": 1,
    "obj_id": "obj_002",
    "user_id": "u56",
    "edit_date": "2020-01-27",
    "times": {
      "foo": null,
      "bar": "baz"
    }
  },
  {
    "edit_seq": 2,
    "obj_id": "obj_001",
    "user_id": "u25",
    "edit_date": "2020-01-27",
    "times": {
      "foo": "fred",
      "bar": null
    }
  }
]
It depends on the architecture you want to deploy. If you want an event-based trigger, I would use SQS (I have used it heavily): as soon as someone drops a file in S3, it can send a message to SQS, which can in turn trigger a Lambda function; a minimal sketch of this wiring is shown after the link below.
Here is a link which can give you some idea: http://blog.zenof.ai/processing-high-volume-big-data-concurrently-with-no-duplicates-using-aws-sqs/
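A minimal sketch of that wiring, assuming S3 "object created" notifications go to an SQS queue that a Lambda consumes; it flattens the catalog file from the question, and the actual load into Postgres is left as a comment:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The Lambda is subscribed to the SQS queue; each SQS message body
    # carries an S3 event notification.
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            catalog = json.loads(body)

            # Flatten the per-day catalog into rows a relational table could
            # hold; inserting into Postgres (e.g. with psycopg2) would go here.
            rows = [
                (obj_id, obj.get("attr_a"), json.dumps(obj.get("more_attrs", {})))
                for obj_id, obj in catalog.items()
            ]
            print(f"{key}: {len(rows)} objects")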
You could build data pipelines using AWS Data Pipeline, for example if you want to read data from S3, apply some transformations, and then load the results into Redshift.
You can also have a look at AWS Glue, which has a Spark backend and can also crawl the schema and perform ETL.