I am building recommendation system for classified ads website , ads are added and deleted daily.
What I thought of is to use PutItems to add new ads and make field called status = 0 , if user deleted the ad , I will use the same PutItem API with the same ITEM_ID to update the stored Item, and use filter to select only ads with status = 0 when generation recommendation.
Is that correct ? will the PutItems API update the existing ad ? and is there anyway to delete the Item ?
Currently there is no way to remove items that were already added to Datasets.
Your workaround looks good, however from my experience with working with Personalize, the filter might decrease your recommendations quality.
To understand why, this is the more or less algorithm, that Personalize uses for filtering recommendations:
Get recommended items for user
Filter recommendations using filter expression
Return first N recommended items left after filtering
Because the filtering is done after getting recommendations, it means, that Personalize will simply fill recommendations list with items, that were somewhere down on the recommended list.
And there is a problem with that approach - items lower on the list, have lower "Score" value, which indicates accuracy of recommendations. That's why you will end up with in general worse recommendations, but it will depend how many ads that have status = 0 were recommended, before filtering out them.
To check your recommendations scores, simply get recommendations in Personalize web UI. It will return list of recs with scores.
Better approach
If your ads are updated daily, then you can definitely workaround it by following those steps:
Create a Lambda function, that is triggered every 24 hours
Lambda will fetch all of the ads and put them into S3 bucket as CSV file. It should exclude ads that are no longer available (status = 0)
Call CreateDatasetImportJob API using any AWS SDK of your choice and provide the data which is stored on S3 bucket
Personalize will start import job. When it finishes, all of the items are replaced with the newest dump
However it has some downsides.
If you are not using the User-Personalization (aws-user-personalization) Recipe, then after each import of Items, you need to update your Solution by creating new Solution Version. Otherwise it won't include changes made by items dataset import job.
Creating a new Solution Version is quite slow and expensive, that's why I would recommend to use User-Personalization Recipe, if you want to use this approach and since HRNN Recipes are marked as legacy, it's a good idea to migrate anyways.
If you are using User-Personalization Recipe, then according to AWS documentation:
Amazon Personalize automatically updates your latest solution version every two hours to include new data. Your campaign automatically uses the updated solution version. For more information see Automatic Updates.
So pretty much all of the work is done on Personalize side and you don't have to worry about Solution retraining after each Items import job.
And the last problem...
Since for User-Personalization Recipe documentation claims, that your solution will be updated within two hours, then you might end up with recommending items, that are not available, for some short period of time. If you are updating items daily, it might be a significant problem.
To fix that case, I would recommend simply using Filter approach, that you mentioned. Thanks to this, you have benefits of both approaches
and your recommendations are always valid.
Related
I am having an AWS personalize solution with 3 successful solution versions trained with trainingMode = "FULL" using both AWS console and AWS SDK python. However when I tried to create a new solution version with trainingMode="UPDATE" as shown in the code below:
import boto3
personalize = boto3.client('personalize')
personalize.create_solution_version(solutionArn = solution_arn, trainingMode = "UPDATE")
I've got back the following exception:
Exception has occurred: InvalidInputException
An error occurred (InvalidInputException) when calling the CreateSolutionVersion operation: There should be updates to at least one dataset after last active solution version with training mode set to FULL.
Anyone experiencing this issue, is there anything I miss to be able to train with an UPDATE mode?
The purpose of trainingMode="UPDATE" is to process new items added to the items dataset (via PutItems or a bulk upload) as well as impression data for new interactions added to the interactions since the last FULL/UPDATE training. The new items and impressions are used to update the exploration feature for solutions created with the aws-user-personalization recipe. That is, to bring in new/cold items for exploration and to adjust probabilities for existing cold items in further exploration. Note that UPDATE only brings in new items and impression data and does not retrain the model.
Therefore, if there are no dataset updates since the last FULL/UPDATE, there is no value in creating a new solution version with UPDATE.
Finally, keep in mind that Personalize automatically updates solution versions created with the aws-user-personalization recipe every two hours at no cost. This essentially does an UPDATE for you.
With User-Personalization, Amazon Personalize automatically updates the latest model (solution version) every two hours behind the scenes to include new data without creating a new solution version. With each update, Amazon Personalize updates the solution version with the latest item information and adjusts the exploration according to implicit feedback from users. This allows Amazon Personalize to gauge item quality based on new interactions for already explored items and continually update item exploration.
If you create a solution version with UPDATE, you will be charged for the server hours to perform the update. Practically speaking, the only time you would need to manually create a solution version with UPDATE is when you do not want to wait for the next automatic update.
I'm trying to use AWS personalize, and following their documents.
So I've uploaded dataset files(interaction, user, item) to S3, then created a solution and a campaign.
And I implemented PutEvents API using java.
GetRecommendations API call works good.
At this moment I'm curious I need to update dataset files, especially item csv.
In general it's done at this point for very basic recommendations.
Since you are using PutEvents call, then all of the real-time events are added to Interactions dataset this way. Interactions datasets created by manual import and by PutEvents calls are separated from themselves. You can actually see them in Personalize Datasets web console.
Still you might want to update dataset files, using dataset import job feature, but it's going to replace your existing dataset. In general I would recommend using it only when:
You just created a fresh/bigger/better dump of your database with Interactions.
You've found, that your previous interactions dataset was invalid.
The schema of dataset changed (pretty much you are forced to do it then).
User or Item dataset changed/improved, it's actually a good idea to refresh it often, so Personalize can produce better recommendations. Keep in mind, that it also requires retraining of the Solution, so the new Items/Users will be included during the recommendations generation.
So for interactions you usually don't want to update dataset. For other datasets it might be a good idea to even create an automatic import mechanism.
Keep in mind, that Items and Users datasets are used only with Personalize Recipes, that support metadata. Otherwise they are simply ignored.
Pretty new to the dynamoDb and the whole AWS, it's very exciting but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a mobile react native app which stores into a dynamoDb table one row each time the users are doing a search. (the database is a search history with a UUID and then the search criteria). On average we have a few thousands new searches into the table every day. The table has just a primary key which is the search id.
The app is quite new but we are reaching the few hundred thousand rows in the table already and can expect having a million in the following months. The data is plain simple data with unique id and string and numbers in the other attributes. No connection, no relationship, etc... That's already when I felt maybe DynamoDb may not have been the best choice but still, I read everywhere it can be suitable for anything if properly managed.
Next to this there is a webapp dashboard which -thanks to a rest api using nodejs lambdas- queries the dynamoDB to make statistics about the searches: how many searches per day, list of last searches... the problem is DynamoDb is not really suitable to query hundred thousands of data (the 1mb limit, query limitations, credits...).
When I do a scan I get only 3000 searches. I tried to make a loop on the scan using the last index requested but after a few test I did not get data and I blocked the maximum throughput. It seems really clear that I don't have the right approach to bring all these searches to my web app. So now what would be the right approach? My ideas are the following but I am open to more experienced one:
Switching to a SQL database (using the aws migration ?). Will it really be easier then?
creating lambdas to execute scheduled jobs every night to make statistics every day so that I don't have to query the full database all the time but just some of the most recent searches and the statistics rows? Is it doable? any node.js / lambdas tutorial you may know regarding this?
better management of indexes? I am still very lost regarding those.
Looking forward to your opinions.
Add another layer to take care for full text search.
For example, with Elasticsearch, or Algolia or other similars.
Notes:
Elasticsearch may be cost you a lot if compare the cost on dynamodb
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
I am essentially trying to build a website where members can post blog entries and i want to record unique and overall page views for the different posts in absolute terms as well as over different time-frames e.g., last 24h, last week etc.
My initial approach was to use the date as primary key and the blogPostId as secondary key, i could then add all the posts visited during a given day. If i then include the userIds as an attribute i should then be able to a)get unique page views and b)overall page views (which might include duplicate visits by a specific user) for a given day. Finally, i would then pull the primary key for let's say the last 7 days and extract the most popular post.
As far as i can tell this should work fine as long as there aren't too many entries, however, i'm sceptical if this will scale. More specifically, if the number of blog posts increases a lot for a given interval, or if i want to find the all-time most viewed post i'd essentially have to read the whole table.
Has anyone an idea how i could implement this more efficiently?
DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton or traffic it might end up being expensive.
Personally, I would consider using redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?
DynamoDB can be used to iterate and create this feature quickly.
Nonetheless, this is a feature for Amazon Kinesis Data Streams, which will let you ingest data and then manipulate it to your needs.
Know that Kinesis can become expensive if you try to be as frugal as possible.
But, if you start receiving a lot of traffic, Kinesis will work as a Queue and let you manipulate the data before ingesting it to DynamoDB (Or another Data Store) (It will be cheaper than sending all those write requests).
Another limitation you'd like to check out is that DynamoDB will only return up to 1MB per Query.
Amazon recommends you use Redshift to handle all those operations as it is more suited to perform aggregation and calculation across Data warehouses.
I am looking for a generic method of filtering a series of sitecore items based on the users current profile, I found one promising example:
How do I trigger a profile in Sitecore DMS?
However a few critical references are missing which is a shame as it looks to be a suitably generic function
Resources.Settings.AnalyticsUserProfileEnableSwitch I assume to simply be a boolean switch
The killer is ApplyUserProfile(filter)
Please keep in mind that user profiles are NOT the same thing as profiles in DMS. In DMS this is in reference to Analytics profiles related not to the specific user, but in visiting profiles... i.e. Marketing personas.
If you want to filter items based on user profiles, you simply get the Sitecore.Context.User.Profile and get whatever the property is and implement your logic to how you want to filter.
If you want to filter items based on DMS profiles, then that's something that's going to be difficult to do due to the fact that personas are not entered into the Analytics database real time. Those really aren't something you'll even be aware of at run time and therefore it's going to be difficult to categorize the persona at run time. You could, however, use the rules system to do some filtering based on other criteria (such as using the Engagement plans or something else)... but without more information, that's about as much as can be said.