Do I need to update the item CSV in AWS Personalize? - amazon-web-services

I'm trying out AWS Personalize and following its documentation.
I've uploaded the dataset files (interactions, users, items) to S3, then created a solution and a campaign.
I also implemented the PutEvents API call in Java.
The GetRecommendations API call works well.
Now I'm wondering whether I need to update the dataset files, especially the item CSV.

For very basic recommendations, you are done at this point.
Since you are using the PutEvents call, all real-time events are added to the Interactions dataset that way. The interactions created by manual import and those created by PutEvents calls are kept separate from each other. You can see both in the Datasets section of the Personalize web console.
You may still want to update the dataset files using the dataset import job feature, but note that an import replaces your existing dataset. In general I would recommend doing it only when:
You have just created a fresh/bigger/better dump of your database with interactions.
You have found that your previous interactions dataset was invalid.
The schema of the dataset changed (in which case you are pretty much forced to do it).
The User or Item dataset changed or improved. It's actually a good idea to refresh these often, so Personalize can produce better recommendations. Keep in mind that this also requires retraining the Solution so that the new Items/Users are included when recommendations are generated.
So for interactions you usually don't want to update the dataset. For the other datasets it might even be a good idea to create an automatic import mechanism.
Keep in mind that the Items and Users datasets are used only by Personalize recipes that support metadata. Otherwise they are simply ignored.

Related

feeding real time data into aws personalize

I want to feed real-time data into AWS Personalize to build a recommendation engine. In the online resources and guides I've read, the training data (user-interaction data, user data and item data) is provided up front when the recommendation engine is created.
However, I have an app that gathers data, and I want to feed that real-time data into AWS Personalize. Is it possible to build the recommendation engine without providing any data at first, and then stream real-time data from my app later with the PutEvents, PutItems and PutUsers APIs from the aws-sdk? I'm quite new to this, so I'm confused about this initial step.
I want to know if building the recommendation engine is possible without providing any data at first and then streaming real-time data from my app later with the PutEvents, PutItems and PutUsers APIs from the aws-sdk?
Yes, it is possible. You just need to adjust the sequence of creating resources.
Interaction data is required for all Personalize recipes before a recommender can be created that provides recommendations. However, if you don't have interaction data (or enough data; see quotas and limits) to start with, you can create a dataset group and an interactions dataset, feed interactions to the dataset using the PutEvents API (see recording events page), and then create a domain recommender or custom solution when enough data has been ingested.
The minimum amount of interaction data (and potentially item metadata) required before you can train a model/recommender depends on the recipe that you select. Generally speaking, you will need 1000 interactions across 25 distinct users where each of those users has 2+ interactions. The domain recommenders also require specific event types. Check the docs linked above. The quality and relevance of recommendations will improve as you collect more data and retrain.
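To make the thresholds above concrete, here is a minimal sketch of a pre-flight check you could run on your own interaction log before creating a recommender or solution. The function name and the `(user_id, item_id)` tuple format are my own assumptions, and the default numbers are just the general figures quoted above; the recipe you pick may require more.

```python
from collections import Counter


def meets_minimum(interactions, min_records=1000, min_users=25, min_per_user=2):
    """Check the general minimums quoted above (hypothetical helper).

    interactions: iterable of (user_id, item_id) pairs.
    """
    # Count how many interactions each user has.
    per_user = Counter(user_id for user_id, _ in interactions)
    total = sum(per_user.values())
    # Users with at least `min_per_user` interactions.
    qualifying_users = sum(1 for n in per_user.values() if n >= min_per_user)
    return total >= min_records and qualifying_users >= min_users
```

For example, 1000 interactions spread evenly over 25 users pass the check, while 1000 interactions from a single user do not.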

AWS Personalize - Recommender retraining questions

I'm a new user of AWS Personalize and have a few questions about recommender retraining.
Currently I'm working with an E-Commerce dataset group and using an e-commerce use-case recommender. With this setup, I can't create a campaign, right?
If I understand correctly, there is no need to retrain the model when using the recommender above, right? I've read in many docs that a retraining process exists only when we use custom resources and create a campaign.
So when I add new event data, the recommender will apply the new data directly to recommendations, right? If so, we don't need to worry about the retraining process for the e-commerce use case, right? (Following this doc.)
That's all of my questions.
Currently I'm working with an E-Commerce dataset group and using an e-commerce use-case recommender. With this setup, I can't create a campaign, right?
The recommenders for domain dataset groups automatically manage the inference endpoint for you. So the step of creating a campaign is not necessary. The service handles this.
If I understand correctly, there is no need to retrain the model when using the recommender above, right? I've read in many docs that a retraining process exists only when we use custom resources and create a campaign.
Correct. Training and retraining is managed by the service for domain recommenders.
So when I add new event data, the recommender will apply the new data directly to recommendations, right? If so, we don't need to worry about the retraining process for the e-commerce use case, right?
You can send in new event data in two ways. First, an event tracker can be used to incrementally stream in new events. In this case, Personalize will use new events to adjust recommendations in near-real-time to match the user's evolving intent (retraining is not necessary for this). Personalize will also persist those new events in the incremental interactions dataset so they are included in the next retraining.
The other way you can send in new event data is with a bulk import of the interactions dataset. Since bulk imports replace the previous bulk import, your bulk files need to include all interaction history you want to train on and not just new interactions. Bulk imports of the interactions dataset are included in the next retraining.
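As a rough illustration of the first option (streaming through an event tracker), the sketch below builds a PutEvents request and sends it with a client you pass in. I'm assuming the boto3 `personalize-events` client here; the helper names, the default `"click"` event type and all IDs are hypothetical placeholders, and the tracking ID is the one you get when you create the event tracker.

```python
from datetime import datetime, timezone


def build_put_events_request(tracking_id, user_id, session_id, item_id,
                             event_type="click"):
    """Build keyword arguments for a single-event PutEvents call."""
    return {
        "trackingId": tracking_id,  # from your event tracker
        "userId": user_id,
        "sessionId": session_id,
        "eventList": [
            {
                "eventType": event_type,
                "itemId": item_id,
                "sentAt": datetime.now(timezone.utc),
            }
        ],
    }


def send_event(events_client, **kwargs):
    """Send one event; events_client is assumed to be
    boto3.client("personalize-events")."""
    events_client.put_events(**build_put_events_request(**kwargs))
```

Keeping the request builder separate from the call makes it easy to inspect or log exactly what is sent before wiring in a real client.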

Automating dynamodb scripts

As we used to do with RDBMS SQL scripts, I want to do something similar with my DynamoDB table.
Currently it's very difficult to track changes from environment to environment (dev, QA, prod). We make changes directly via the console.
What I want is to keep the table data/JSON in Git version control, so that whenever a dev makes a change we can just run a script that migrates the respective changes to the DynamoDB table, e.g. update/create/delete the tables and add/remove/update the records.
But I haven't been able to find a proper way or guide to achieve this. We use JavaScript/Node.js as our base language.
Any help with this scenario would be appreciated.
Thanks
ref : https://forums.aws.amazon.com/thread.jspa?threadID=342538
As far as I can tell, you described two separate issues:
Changing the tables "structure"
Updating records after the update
Before I go into my answer, remember that DynamoDB is a NoSQL database and your previous RDBMS was a relational database. Operational tasks can differ very much for both types of databases.
1. Automating changes to the "structure" of the table
For this you can check out Infrastructure as Code tools like Terraform, CloudFormation or Pulumi.
But since DynamoDB is a NoSQL database, you can only do a few things, like setting your hash and sort key and defining indices. Adding "fields" to DynamoDB is not done with those tools because, except for the hash and sort key, there are no fields. Everything else does not follow an explicit (SQL) schema.
2. Updating records after an update
If you do not have a lot of records, you could write yourself a simple tool or script to do the relevant work using the AWS SDK and run that during your CI/CD pipeline. A simple approach would be to have a "migrations" folder and if there is a file in it, the pipeline will execute it. So after the migration is done, just remove the file again. Not great, but pragmatic.
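A minimal sketch of the "migrations folder" idea, written in Python for brevity (the same shape works in Node.js). Everything here is an assumption of mine rather than an established tool: each script in the folder is expected to define a `migrate(table)` function, and `table` is whatever handle you pass in, for example a boto3 DynamoDB `Table` resource.

```python
import runpy
from pathlib import Path


def run_migrations(folder, table):
    """Run every *.py script in `folder` in name order.

    Each script is assumed to define migrate(table) (hypothetical convention).
    Returns the names of the scripts that were executed.
    """
    executed = []
    for script in sorted(Path(folder).glob("*.py")):
        # Load the script and call its migrate() entry point.
        namespace = runpy.run_path(str(script))
        namespace["migrate"](table)
        executed.append(script.name)
    return executed
```

The pipeline would call `run_migrations("migrations", table)` on each deploy; once a migration has run in every environment, you delete its file from the repository as described above.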
If you have a lot of records, this won't work that well anymore, at least if you want a deployment without downtime. In that case you will have to update your software so it can work with both the old and new versions of the record structure, while you gradually update all records in the background (using a script etc.). Once all the records are updated, you can remove the code paths that handle the old structure.
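To illustrate handling old and new record shapes side by side, here is a small sketch in Python. The `schema_version` attribute and the example change (splitting `name` into `first_name` and `last_name`) are hypothetical; the point is only that readers upgrade records on the fly while a background job rewrites the stragglers.

```python
CURRENT_VERSION = 2


def upgrade(record):
    """Return the record in the newest shape (records without a version are v1)."""
    if record.get("schema_version", 1) >= CURRENT_VERSION:
        return record
    # Hypothetical v1 -> v2 change: split `name` into two attributes.
    first, _, last = record.get("name", "").partition(" ")
    upgraded = {k: v for k, v in record.items() if k != "name"}
    upgraded.update(first_name=first, last_name=last,
                    schema_version=CURRENT_VERSION)
    return upgraded


def backfill(records, save):
    """Background job: rewrite only the records still in the old shape."""
    migrated = 0
    for record in records:
        if record.get("schema_version", 1) < CURRENT_VERSION:
            save(upgrade(record))
            migrated += 1
    return migrated
```

Application code calls `upgrade()` on every record it reads, so it never sees the old shape. When `backfill()` reports zero migrated records on a full scan, the v1 branch can be deleted.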

Single table db architecture with AWS Amplify

By default, the AWS Amplify transformers create a table for each GraphQL type.
But according to the DynamoDB documentation, it's best practice to
Keep the number of tables as small as possible
Keep entries that are often queried together in the same table
I have the impression that the Amplify way of doing things contradicts the statements above.
I am new to both NoSQL and Amplify
Can someone suggest ways to address those issues?
I think we're in a bit of a transition or gray area here. I'm very new to Amplify and have been investigating a move to a single-table design, as there are sources (below) indicating that it has always been possible, but that you'd have to write everything in VTL templates. But in 2020 they released direct Lambda resolver support: https://youtu.be/EOQqi6Yun7g?t=960 (clip)
However, it seems like you lose access to the @auth directive (and probably others, because you're no longer going to use @model) along with a lot of the nice out-of-the-box functionality that's available with Amplify's multi-table approach.
At this point, since I'm developing a new app, I'm going to stick with the default multi-table design to speed up getting the app functional.
Trying to implement the single-table design seems to go against what the Amplify team recommends and requires more manual work. You'd have to manually create custom Lambda functions (AppSync), code the DynamoDB queries for each data access element, and manage authorization through some other means which I'm not aware of at this time. Maybe someone can chime in here...
Single table vs multi table info
Using Amplify with single table:
https://youtu.be/EOQqi6Yun7g
Single vs Multi Clip:
https://youtu.be/1WF_wped808?t=1251 (clip)
https://www.alexdebrie.com/posts/dynamodb-single-table/ (towards bottom)
https://youtu.be/EOQqi6Yun7g?t=1288 (clip)
Example single table design by Alex Debrie:
https://gist.github.com/dabit3/96dc51e688b18a7d40fc534331758c56
More Discussion:
https://stackoverflow.com/a/56438716/1956540
Basic Setup steps
I set up a single table by following the instructions below. Again, you don't use @model for this. Also, I think you have to include a type Query {} in your schema for it to compile, but I could be wrong here.
So the basic steps are:
Create a single table (amplify add storage)
amplify push
Create your schema in the schema.graphql file.
Create supporting lambda function (amplify add function)
Note: if you look at the example here, I believe you can create an entry point that routes to all other methods: https://gist.github.com/dabit3/96dc51e688b18a7d40fc534331758c56#lambda
Add the DynamoDB query code in the function.
amplify push
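For the "single entry point that routes to all other methods" idea in the steps above, a rough sketch of such a Lambda handler follows. I'm assuming the event carries `typeName`, `fieldName` and `arguments` keys, as the payload of Amplify's @function directive does; a differently configured direct Lambda resolver may nest these elsewhere. The field names and the two resolver functions are hypothetical placeholders for where your DynamoDB queries would go.

```python
def get_item(arguments):
    # Placeholder: a real resolver would query the single DynamoDB table here.
    return {"id": arguments["id"]}


def list_items(arguments):
    # Placeholder: a real resolver would run a Query on the table here.
    return []


# Map (GraphQL type, field) to the function that resolves it.
ROUTES = {
    ("Query", "getItem"): get_item,
    ("Query", "listItems"): list_items,
}


def handler(event, context):
    """Single Lambda entry point that dispatches on the GraphQL field."""
    key = (event["typeName"], event["fieldName"])
    resolver = ROUTES.get(key)
    if resolver is None:
        raise ValueError(f"No resolver for {key[0]}.{key[1]}")
    return resolver(event.get("arguments") or {})
```

With one function per table access pattern registered in `ROUTES`, adding a new query means adding one entry to the schema and one to the map, rather than creating another Lambda.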
Complete steps for Setting up a single Table:
https://catalog.us-east-1.prod.workshops.aws/workshops/53b10bf8-2271-4ab4-bfd2-39e878a90dc8/en-US/lab2/1-vtl (both "Connecting to an existing DynamoDB table" and "Direct Lambda Resolver" steps)
Not trying to be negative about Amplify, it is awesome, I love what they are doing with this product. I just think it's very new to everyone and I'm hoping this post is no longer valid next year and we continue to see great progress from the team.

The correct way to remove or update Item

I am building a recommendation system for a classified ads website; ads are added and deleted daily.
My idea is to use PutItems to add new ads with a field called status = 0. If a user deletes the ad, I will call the same PutItems API with the same ITEM_ID to update the stored item, and use a filter to select only ads with status = 0 when generating recommendations.
Is that correct? Will the PutItems API update the existing ad? And is there any way to delete an item?
Currently there is no way to remove items that were already added to Datasets.
Your workaround looks good; however, from my experience of working with Personalize, the filter might decrease the quality of your recommendations.
To understand why, this is more or less the algorithm that Personalize uses for filtering recommendations:
Get recommended items for user
Filter recommendations using filter expression
Return first N recommended items left after filtering
Because the filtering is done after the recommendations are generated, Personalize simply fills the recommendations list with items that were further down the recommended list.
And there is a problem with that approach: items lower on the list have a lower "Score" value, which indicates the accuracy of a recommendation. That's why you will generally end up with worse recommendations, though how much worse depends on how many ads that no longer have status = 0 were recommended before being filtered out.
To check your recommendation scores, simply get recommendations in the Personalize web UI. It will return a list of recommendations with scores.
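Here is a toy model of the three steps above, just to show why post-filtering pulls in lower-scored items. It is a simplification of the behaviour as I described it, not Personalize's actual implementation, and the item IDs and scores are made up.

```python
def recommend(ranked, is_allowed, n):
    """ranked: list of (item_id, score) pairs sorted by score, highest first.

    Filter first, then return the first n items that are left.
    """
    return [(item, score) for item, score in ranked if is_allowed(item)][:n]


# Made-up ranking for one user.
ranked = [("a", 0.30), ("b", 0.25), ("c", 0.20), ("d", 0.15), ("e", 0.10)]
deleted = {"a", "b"}  # ads that were removed by their owners

unfiltered = recommend(ranked, lambda item: True, 3)
filtered = recommend(ranked, lambda item: item not in deleted, 3)
```

Here `unfiltered` is a, b, c while `filtered` is c, d, e: the two removed ads are replaced by the items with the lowest scores in the list, so the more deleted ads rank highly, the further down the list the results come from.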
Better approach
If your ads are updated daily, you can definitely work around it by following these steps:
Create a Lambda function that is triggered every 24 hours
The Lambda fetches all of the ads and puts them into an S3 bucket as a CSV file. It should exclude ads that are no longer available (those whose status is no longer 0)
Call the CreateDatasetImportJob API using any AWS SDK of your choice and point it at the data stored in the S3 bucket
Personalize will start the import job. When it finishes, all of the items are replaced with the newest dump
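The steps above could look roughly like this inside the Lambda. This is a sketch under assumptions: the clients are passed in and assumed to be boto3 `s3` and `personalize` clients, the `CATEGORY` column is a hypothetical metadata field that must match your own Items schema, and an ad counts as available while its status is 0, as in the question.

```python
import csv
import io


def active_ads_csv(ads):
    """Render the still-available ads (status == 0) as an Items CSV.

    ads: iterable of dicts with ITEM_ID, CATEGORY (hypothetical) and status.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ITEM_ID", "CATEGORY"])
    for ad in ads:
        if ad["status"] == 0:
            writer.writerow([ad["ITEM_ID"], ad["CATEGORY"]])
    return buffer.getvalue()


def refresh_items(s3, personalize, ads, bucket, key,
                  dataset_arn, role_arn, job_name):
    """Upload the CSV to S3 and start a dataset import job for it."""
    body = active_ads_csv(ads).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return personalize.create_dataset_import_job(
        jobName=job_name,
        datasetArn=dataset_arn,
        dataSource={"dataLocation": f"s3://{bucket}/{key}"},
        roleArn=role_arn,
    )
```

The IAM role passed as `role_arn` needs read access to the bucket, and the job name has to be unique per import, so appending the date to it is a simple option.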
However, it has some downsides.
If you are not using the User-Personalization (aws-user-personalization) recipe, then after each import of Items you need to update your Solution by creating a new Solution Version. Otherwise it won't include the changes made by the Items dataset import job.
Creating a new Solution Version is quite slow and expensive, which is why I would recommend the User-Personalization recipe if you want to use this approach. And since the HRNN recipes are marked as legacy, it's a good idea to migrate anyway.
If you are using the User-Personalization recipe, then according to the AWS documentation:
Amazon Personalize automatically updates your latest solution version every two hours to include new data. Your campaign automatically uses the updated solution version. For more information see Automatic Updates.
So pretty much all of the work is done on the Personalize side, and you don't have to worry about retraining the Solution after each Items import job.
And the last problem...
Since the documentation for the User-Personalization recipe says that your solution will be updated within two hours, you might end up recommending items that are no longer available for a short period of time. If you are updating items daily, that might be a significant problem.
To fix that case, I would recommend simply also using the filter approach that you mentioned. That way you get the benefits of both approaches and your recommendations are always valid.
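To combine the two, the recommendation call just needs to pass the filter as well. A minimal sketch, assuming `runtime` is a boto3 `personalize-runtime` client and that a filter selecting status = 0 items has already been created in the dataset group; the function name and ARNs are placeholders.

```python
def get_valid_recommendations(runtime, campaign_arn, user_id, filter_arn,
                              num_results=10):
    """Fetch recommendations with the status filter applied.

    Returns only the item IDs, in the order Personalize ranked them.
    """
    response = runtime.get_recommendations(
        campaignArn=campaign_arn,
        userId=user_id,
        numResults=num_results,
        filterArn=filter_arn,  # filter assumed to exist already
    )
    return [item["itemId"] for item in response["itemList"]]
```

With the daily import keeping the Items dataset small and the filter catching ads deleted since the last import, the filter only has to remove a handful of items, so the drop in scores described earlier stays small.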