AWS Machine Learning Retrain Model - amazon-web-services

I have some models created in AWS Machine Learning from a CSV file in S3.
After a lot of searching, I haven't found a good way to retrain my models.
I would like to know if there is any option to retrain my models with new data, or if I need to create a new one each time.

Amazon ML provides a set of APIs (and SDKs) that let you programmatically build a pipeline that takes new data from S3 and generates the datasource and the ML models from it.
All the components, including datasources, ML models, and evaluations, are immutable, so if you want to retrain, you need to recreate them. This also lets you roll back to a previous model if the new model does not perform better than the old one.
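As a rough illustration of such a pipeline, here is a minimal sketch in Python. It assumes `ml_client` is a `boto3.client("machinelearning")` instance; the ID and name formats, the S3 URLs, and the `retrain` helper itself are hypothetical choices for this example, not something prescribed by the service.

```python
import time


def retrain(ml_client, data_s3_url, schema_s3_url, model_type="BINARY", suffix=None):
    """Create a fresh datasource and a fresh ML model, and return their IDs.

    `ml_client` is assumed to be boto3.client("machinelearning").
    The "ds-"/"ml-" ID prefixes are illustrative only.
    """
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    datasource_id = f"ds-{suffix}"
    model_id = f"ml-{suffix}"

    # Datasources are immutable, so every run registers a new one.
    ml_client.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName=f"training data {suffix}",
        DataSpec={
            "DataLocationS3": data_s3_url,
            "DataSchemaLocationS3": schema_s3_url,
        },
        ComputeStatistics=True,  # needed when the datasource is used for training
    )

    # Models are immutable too: train a new one on the new datasource.
    ml_client.create_ml_model(
        MLModelId=model_id,
        MLModelName=f"model {suffix}",
        MLModelType=model_type,
        TrainingDataSourceId=datasource_id,
    )
    return datasource_id, model_id


# Example call (requires AWS credentials; bucket and paths are placeholders):
#   import boto3
#   retrain(boto3.client("machinelearning"),
#           "s3://my-bucket/data/", "s3://my-bucket/data.csv.schema")
```

Because the old model is left untouched, rolling back is just a matter of pointing your application at the previous model ID again.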

Related

AWS Personalize - Recommender retraining questions

I'm a new user of AWS Personalize, and I have a few questions about recommender retraining.
Currently, I'm working with an e-commerce dataset group and using the e-commerce use-case recommender. With this setup, I can't create a campaign, right?
If I understand correctly, there is no need to retrain the model when I use the recommender above, right? The docs I've read describe a retraining process only for custom resources with a campaign.
So, when I add new event data incrementally, the recommender will apply the new data directly to recommendations, right? If so, that means we don't need to worry about the retraining process for the e-commerce use case, right? I'm following these docs.
That's all of my questions.
Currently, I'm working with an e-commerce dataset group and using the e-commerce use-case recommender. With this setup, I can't create a campaign, right?
The recommenders for domain dataset groups automatically manage the inference endpoint for you. So the step of creating a campaign is not necessary. The service handles this.
If I understand correctly, there is no need to retrain the model when I use the recommender above, right? The docs I've read describe a retraining process only for custom resources with a campaign.
Correct. Training and retraining are managed by the service for domain recommenders.
So, when I add new event data incrementally, the recommender will apply the new data directly to recommendations, right? If so, that means we don't need to worry about the retraining process for the e-commerce use case, right?
You can send in new event data two ways. First, an event tracker can be used to incrementally stream in new events. In this case, Personalize will use new events to adjust recommendations in near-real-time to match the user's evolving intent (retraining is not necessary for this). Personalize will also persist those new events in the incremental interactions dataset so they are included in the next retraining.
The other way you can send in new event data is with a bulk import of the interactions dataset. Since bulk imports replace the previous bulk import, your bulk files need to include all interaction history you want to train on and not just new interactions. Bulk imports of the interactions dataset are included in the next retraining.
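For the event-tracker route, a minimal sketch in Python might look like the following. It assumes `events_client` is a `boto3.client("personalize-events")` instance and that you already have a tracking ID from your event tracker; the `send_event` helper and the example values are hypothetical, and the event type must match whatever your dataset group expects.

```python
from datetime import datetime, timezone


def send_event(events_client, tracking_id, user_id, session_id, item_id,
               event_type, sent_at=None):
    """Stream a single interaction to an Amazon Personalize event tracker.

    `events_client` is assumed to be boto3.client("personalize-events").
    """
    event = {
        "eventType": event_type,
        "itemId": item_id,
        # Timestamp of the interaction; defaults to "now" in UTC.
        "sentAt": sent_at or datetime.now(timezone.utc),
    }
    events_client.put_events(
        trackingId=tracking_id,
        userId=user_id,
        sessionId=session_id,
        eventList=[event],
    )
    return event


# Example call (requires AWS credentials; all values are placeholders):
#   import boto3
#   send_event(boto3.client("personalize-events"),
#              "my-tracking-id", "user-1", "session-1", "item-42", "View")
```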

Do I need to update item csv in AWS personalize?

I'm trying to use AWS Personalize and am following its documentation.
So I've uploaded the dataset files (interactions, users, items) to S3, then created a solution and a campaign.
I also implemented the PutEvents API call in Java.
The GetRecommendations API call works well.
At this point I'm wondering whether I need to update the dataset files, especially the items CSV.
In general, you are done at this point for very basic recommendations.
Since you are using the PutEvents call, all of the real-time events are added to the Interactions dataset this way. Interactions datasets created by manual import and by PutEvents calls are kept separate from each other. You can actually see them in the Personalize Datasets web console.
You might still want to update the dataset files using the dataset import job feature, but it will replace your existing dataset. In general, I would recommend using it only when:
- You just created a fresh/bigger/better dump of your database with Interactions.
- You've found that your previous Interactions dataset was invalid.
- The schema of the dataset changed (you are pretty much forced to do it then).
- The Users or Items dataset changed or improved. It's actually a good idea to refresh these often so Personalize can produce better recommendations. Keep in mind that this also requires retraining the Solution so that the new Items/Users are included when recommendations are generated.
So for interactions you usually don't want to update the dataset. For the other datasets it might be a good idea to even create an automatic import mechanism.
Keep in mind that the Items and Users datasets are used only by Personalize recipes that support metadata. Otherwise they are simply ignored.
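An automatic import mechanism along those lines could be sketched as below, in Python. It assumes `personalize` is a `boto3.client("personalize")` instance; the helper names, the polling interval, and all ARNs and paths are hypothetical placeholders.

```python
import time


def wait_for_import(personalize, job_arn, poll_seconds=60):
    """Poll a dataset import job until it is ACTIVE (or has failed)."""
    while True:
        job = personalize.describe_dataset_import_job(datasetImportJobArn=job_arn)
        status = job["datasetImportJob"]["status"]
        if status == "ACTIVE":
            return
        if status == "CREATE FAILED":
            raise RuntimeError(f"dataset import failed: {job_arn}")
        time.sleep(poll_seconds)


def refresh_and_retrain(personalize, dataset_arn, s3_path, role_arn,
                        solution_arn, job_name, poll_seconds=60):
    """Re-import a dataset from S3, then train a new solution version.

    `personalize` is assumed to be boto3.client("personalize").
    """
    # The bulk import replaces the previously imported data for this dataset.
    job = personalize.create_dataset_import_job(
        jobName=job_name,
        datasetArn=dataset_arn,
        dataSource={"dataLocation": s3_path},
        roleArn=role_arn,
    )
    job_arn = job["datasetImportJobArn"]
    wait_for_import(personalize, job_arn, poll_seconds)

    # Retrain so that new items/users are reflected in recommendations.
    version = personalize.create_solution_version(
        solutionArn=solution_arn,
        trainingMode="FULL",
    )
    return job_arn, version["solutionVersionArn"]
```

Once the new solution version has finished training, the campaign still has to be pointed at it (for example with `update_campaign`); that step is left out of the sketch.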

How to automate predictions with a trained model in google cloud

I have data from web users in Firestore.
I have inserted some of this data into Google BigQuery in order to run a machine learning model.
I have experience training machine learning models, but I don't have experience obtaining predictions for new data once the model is trained.
I have read that I can upload the trained model to Google Cloud Storage and then deploy it on AI Platform, but I don't know the process I have to follow: new data is going to keep being inserted into BigQuery, and I want to make predictions on this new data and then write those predictions back to Firestore.
I think this could be done with Dataflow (Apache Beam) or Cloud Composer (Airflow), where I can automate the process and schedule it to run every week, but I don't have experience with these technologies. Can anyone recommend which technology would be better for this particular case, so I can look up how to use it?
Another possibility: could I save the model in AI Platform or in Google Cloud Storage, and use Cloud Functions to call the saved model, make predictions, and save them in Firestore?
BigQuery ML supports external TensorFlow models.
TensorFlow model importing. This feature allows you to create BigQuery
ML models from previously-trained TensorFlow models, then perform
prediction in BigQuery ML. See the CREATE MODEL statement for
importing TensorFlow models for more information.
So what you want to achieve is:
- Get a table in BigQuery
- Build out a feature set for your model (SELECT statements)
- CREATE MODEL in BigQuery (rerun this to retrain)
- Run ML.PREDICT (or equivalent) to get predictions on new data
As new data arrives in BigQuery you can:
- retrain the model (externally or internally, depending on the type of algorithm you have)
- use the new rows in predictions
https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
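The steps above can be sketched roughly as follows, in Python. This is only an illustration: the helper names, the dataset/model/table names, and the GCS path are hypothetical, and `bq_client` is assumed to be a `google.cloud.bigquery.Client`. The SQL is built with plain string formatting, so it should only be used with trusted names.

```python
def import_model_sql(model, gcs_path):
    """SQL that imports a saved TensorFlow model from GCS into BigQuery ML."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (MODEL_TYPE='TENSORFLOW', MODEL_PATH='{gcs_path}')"
    )


def predict_sql(model, features_query):
    """SQL that runs ML.PREDICT over the rows returned by `features_query`."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model}`, ({features_query}))"


def run(bq_client, sql):
    """Execute a statement and return its rows as a list.

    `bq_client` is assumed to be google.cloud.bigquery.Client().
    """
    return list(bq_client.query(sql).result())


# Example flow (requires GCP credentials; all names are placeholders):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   run(client, import_model_sql("my_dataset.my_model", "gs://my-bucket/model/*"))
#   rows = run(client, predict_sql(
#       "my_dataset.my_model",
#       "SELECT feature_a, feature_b FROM `my_dataset.new_users`"))
```

Rerunning the import statement after the model has been retrained externally replaces the model in place, so the prediction query does not need to change.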
For this you need two services:
- One that serves your model for prediction
- One that gets the predictions and stores the results in Firestore
Personally, I don't recommend storing your model in AI Platform today (a new release should happen by the end of the month, but today, it's a no!). I wrote an article on hosting a TensorFlow model in Cloud Run. It should work with other frameworks, but I have only built a TensorFlow model, and that is what I used for my tests.
The best solution, if your new data is in BigQuery and your model is in TensorFlow, is to load your model into BigQuery. The prediction is free of charge; you only pay for the data in your query (I'm also writing an article on this, but I'm waiting for the new AI Platform release so I can provide a correct comparison between both solutions).
After getting the predictions (result of BigQuery + call to Cloud Run, OR result of BigQuery with the predict clause), you have to iterate over the results to store them in Firestore. I recommend a batch write to Firestore.
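A batch write of that kind might be sketched like this in Python. It assumes `db` is a `google.cloud.firestore.Client`; the `write_predictions` helper, the collection name, and the cap of 500 writes per batch are assumptions for this example, so check the cap against the current Firestore quotas.

```python
def write_predictions(db, collection, predictions, batch_size=500):
    """Write (doc_id, data) pairs to Firestore in batches.

    `db` is assumed to be google.cloud.firestore.Client().
    `batch_size` is an assumed per-batch cap; adjust it to current quotas.
    Returns the number of documents written.
    """
    batch = db.batch()
    pending = 0
    written = 0
    for doc_id, data in predictions:
        batch.set(db.collection(collection).document(doc_id), data)
        pending += 1
        if pending == batch_size:
            # Flush a full batch and start a new one.
            batch.commit()
            written += pending
            batch = db.batch()
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        batch.commit()
        written += pending
    return written


# Example call (requires GCP credentials; names are placeholders):
#   from google.cloud import firestore
#   write_predictions(firestore.Client(), "predictions",
#                     [("user-1", {"score": 0.93}), ("user-2", {"score": 0.12})])
```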
I have read that I can upload this trained model in Google cloud storage
If you want to do this you can use Dataflow. You can write a pipeline that reads data from BigQuery and writes them to GCS.
(I am not sure I understand how you want your job to interact with AI platform and Firestore)

Create datasource and ML models periodically in Amazon Machine Learning

I have created a data source and trained the machine learning model in Amazon Machine Learning. The data resides in S3 which is used for creating the data source. However, my application has new data added to S3 every second, thus I need a way in which I can generate the data source and train the model periodically.
Is there a way in which I can achieve this?
Any help is appreciated.
Yes. You need to do a few things:
- Make sure your data source points to the prefix in S3 (bucket/data/) rather than to a single file (bucket/data/data.csv).
- Write a script that you run regularly to create a new model against this data (unfortunately, you can't update the model). Here's a sample script which does this using boto: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/updatemodel/updatemodel.py
- Tag your new model, and make sure your clients find the model to use via tags.
- Delete your old models (mostly to avoid confusion).
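The tagging and cleanup steps could be sketched as follows, in Python. This is not the linked script: the helper names, the `status=current` tag, and the idea of grouping models by a shared name prefix are assumptions made for this example. `ml_client` is assumed to be a `boto3.client("machinelearning")` instance, and pagination of `describe_ml_models` is ignored for brevity.

```python
def _live_models(ml_client, name_prefix):
    """List models whose name starts with `name_prefix`, skipping deleted ones."""
    response = ml_client.describe_ml_models(FilterVariable="Name", Prefix=name_prefix)
    return [m for m in response.get("Results", []) if m.get("Status") != "DELETED"]


def tag_as_current(ml_client, model_id, tag_key="status", tag_value="current"):
    """Mark a freshly trained model so that clients can discover it."""
    ml_client.add_tags(
        Tags=[{"Key": tag_key, "Value": tag_value}],
        ResourceId=model_id,
        ResourceType="MLModel",
    )


def find_current_model(ml_client, name_prefix, tag_key="status", tag_value="current"):
    """Return the ID of the first live model carrying the tag, or None."""
    for model in _live_models(ml_client, name_prefix):
        tags = ml_client.describe_tags(
            ResourceId=model["MLModelId"], ResourceType="MLModel"
        )["Tags"]
        if {"Key": tag_key, "Value": tag_value} in tags:
            return model["MLModelId"]
    return None


def delete_old_models(ml_client, name_prefix, keep_model_id):
    """Delete every live model with the prefix except `keep_model_id`."""
    deleted = []
    for model in _live_models(ml_client, name_prefix):
        if model["MLModelId"] != keep_model_id:
            ml_client.delete_ml_model(MLModelId=model["MLModelId"])
            deleted.append(model["MLModelId"])
    return deleted
```

A scheduled job (cron, for instance) would create the new model, call `tag_as_current` on it, and then call `delete_old_models`, while clients call `find_current_model` instead of hard-coding a model ID.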

update datasource used for Amazon machine learning ML model

I am using Amazon Machine Learning to create ML models for my applications. I have created a datasource and an ML model corresponding to that datasource. However, new data keeps getting added in my application, so I have to update the data file in S3 that the datasource uses. So the question is: how can I update the datasource for that data file without changing the datasource ID, and how can I update the ML model for that datasource without changing the ML model ID?
I know that there are methods in Boto3 to update a datasource or ML model; however, as far as I know, they only update the name of those objects.
Any help would be appreciated.
You cannot do that. Amazon ML datasources are immutable, save for the human-readable name attribute. Instead, when you have new data, create a new datasource that points at the same data file(s) in S3, and then train a new ML model using that datasource.