How to automate predictions with a trained model in Google Cloud - google-cloud-platform

I have data from web users in Firestore.
I have inserted some of this data in Google BigQuery in order to run a machine learning model.
I have experience in training Machine Learning models, but I don't have experience in obtaining predictions for new data once the model is trained.
I have read that I can upload this trained model to Google Cloud Storage and then deploy it to AI Platform, but I don't know the process I have to follow, because new data is going to be inserted into BigQuery; with this new data I want to make predictions and then take those predictions and put them back into Firestore.
I think that it could be done with Dataflow (Apache Beam) or Cloud Composer (Airflow), where I can automate this process and schedule it to run every week, but I don't have experience using these technologies. Can anyone recommend which technology would be better for this particular case, so I can look up information on how to use it?
One possibility could be to save the model in AI Platform or in Google Cloud Storage and then, with Cloud Functions, call this saved model, make predictions, and save them in Firestore?

BigQuery ML supports external TensorFlow models.
TensorFlow model importing. This feature allows you to create BigQuery
ML models from previously-trained TensorFlow models, then perform
prediction in BigQuery ML. See the CREATE MODEL statement for
importing TensorFlow models for more information.
So what you want to achieve is:
- Get a table in BigQuery
- Build out a feature set for your model (SELECT statements)
- CREATE MODEL in BigQuery (rerun this to re-train)
- Run ML.PREDICT (or equivalent) to get predictions on new data
As new data arrives in BigQuery you can:
- retrain the model (externally or internally, depending on the type of algorithm you have)
- use the new rows in predictions
https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
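A minimal sketch of the CREATE MODEL import and ML.PREDICT steps, driven from the BigQuery Python client and assuming a TensorFlow SavedModel already exported to GCS; the dataset, table, and bucket names are placeholders, not from the question.

```python
# Sketch only: import a previously trained TensorFlow model into BigQuery ML
# and score new rows with ML.PREDICT. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Import (or re-import after retraining) the SavedModel from Cloud Storage.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.web_user_model`
    OPTIONS (MODEL_TYPE = 'TENSORFLOW',
             MODEL_PATH = 'gs://my-bucket/exported_model/*')
""").result()

# Score the rows that arrived since the last run, directly inside BigQuery.
predictions = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.web_user_model`,
                    (SELECT * FROM `my_dataset.new_user_rows`))
""").result()

for row in predictions:
    print(dict(row))
```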

To do this you need two services:
- one for the prediction, which serves your model
- one for getting the prediction and storing the result in Firestore
Personally, I don't recommend storing your model in AI Platform today (a new release should happen by the end of the month, but today it's not there yet!). I wrote an article on hosting a TensorFlow model in Cloud Run. It should work with other frameworks, but I had only built a TensorFlow model, and I used it for my tests.
The best solution, if your new data is in BigQuery and your model is in TensorFlow, is to load your model into BigQuery. The prediction is free of charge; you only pay for the data scanned by your query (I'm also writing an article on this, but I'm waiting for the new AI Platform release to provide a correct comparison between both solutions).
After getting the prediction (the result of BigQuery + a call to Cloud Run, OR the result of BigQuery with the ML.PREDICT clause), you have to iterate over the results to store them in Firestore. I recommend a batch write to Firestore, as sketched below.
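A minimal sketch of that batch write, assuming `predictions` is an iterable of dicts with `user_id` and `score` keys; the collection and field names are placeholders.

```python
# Sketch only: write prediction results to Firestore in batches.
from google.cloud import firestore

db = firestore.Client()
batch = db.batch()
writes_in_batch = 0

for pred in predictions:
    doc_ref = db.collection("user_predictions").document(pred["user_id"])
    batch.set(doc_ref, {"score": pred["score"]})
    writes_in_batch += 1
    # A single Firestore batch is limited to 500 operations, so flush and restart.
    if writes_in_batch == 500:
        batch.commit()
        batch = db.batch()
        writes_in_batch = 0

if writes_in_batch:
    batch.commit()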

I have read that I can upload this trained model to Google Cloud Storage
If you want to do this you can use Dataflow. You can write a pipeline that reads data from BigQuery and writes it to GCS.
(I am not sure I understand how you want your job to interact with AI Platform and Firestore.)
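If you do go the Dataflow route, here is a minimal Apache Beam (Python SDK) sketch of such a pipeline; the project, bucket, and query are placeholders, not from the question.

```python
# Sketch only: read rows from BigQuery and write them to Cloud Storage as JSON lines.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `my_dataset.predictions`", use_standard_sql=True)
        # Rows arrive as dicts; default=str keeps dates/timestamps serializable.
        | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
        | "WriteToGCS" >> beam.io.WriteToText(
            "gs://my-bucket/exports/predictions", file_name_suffix=".json")
    )
```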

Related

Where to best store data on Google Cloud?

If my end goal is to run a machine learning model on some CSV data, where should I best store my data file?
In a bucket,
in BigQuery, or
as a dataset under Vertex AI?
It seems that these three options can lead to overlap/redundancies in storage. Is there a practical reason why a basic CSV would have so many options for storage?
If your goal is to train an ML model in Vertex AI, the best way is to store the data in a Vertex AI dataset.
Vertex AI datasets make data discoverable from a central place and provide the ability to annotate and label the data within the UI. You can load your CSV data into a dataset from wherever it resides, i.e. in GCS, BigQuery, or local storage.
Is there a practical reason why a basic CSV would have so many options for storage? It depends on the requirements. If someone only wants to query and visualize the data, they don't need to create a Vertex AI dataset; they can directly upload the data to BigQuery and get insights.
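For example, a minimal sketch of creating a tabular dataset from a CSV already sitting in GCS, using the google-cloud-aiplatform SDK; the project, region, and paths are placeholders.

```python
# Sketch only: register a CSV in GCS as a Vertex AI tabular dataset.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="my-training-data",
    gcs_source=["gs://my-bucket/training/data.csv"],
)
print(dataset.resource_name)  # usable later as the training input for a Vertex AI job
```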

Is there a way to connect PBI to a Databricks cluster that is not running?

In my scenario, Databricks is performing read and writing transformations in Delta tables. We have PBI connected to the Databricks cluster that needs to be running most of the time, which is expensive.
Knowing that delta tables are in a container, what would be the best way in terms of cost x performance to feed PBI from delta tables?
If your data set size is under the maximum allowed size in Power BI (100 GB, I guess) and a daily refresh is enough, you can just load everything into your Power BI model.
https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/
If you want to save costs, maybe you don't need transactions and can save the data as CSV in the data lake; then loading everything into Power BI and refreshing daily is really easy.
If you want to save costs and query new incoming data all the time using DirectQuery, consider using Azure SQL. It has really competitive prices, starting from 5 EUR/USD. Integration with Databricks is also perfect: write in append mode and it does all the magic, as sketched below.
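A minimal Databricks (PySpark) sketch of that append, assuming the Delta table is mounted in the workspace and the Azure SQL credentials live in a secret scope; the server, database, table, and secret names are placeholders.

```python
# Sketch only, run inside a Databricks notebook where `spark` and `dbutils` exist.
# Read the Delta table and append the rows to an Azure SQL table over JDBC.
df = spark.read.format("delta").load("/mnt/datalake/gold/returns")

(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb")
   .option("dbtable", "dbo.returns")
   .option("user", dbutils.secrets.get("kv", "sql-user"))
   .option("password", dbutils.secrets.get("kv", "sql-password"))
   .mode("append")   # keep existing rows, add the new ones
   .save())
```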
Another option to consider is to create an Azure Synapse workspace and use serverless SQL compute to query the delta lake files. This is a pay-per-TB-consumed pricing model, so you don't have to have your Databricks cluster running all the time. It's a great way to load Power BI import models.

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project which crunches data and does a lot of processing, so I chose to work with BigQuery as it has good support for running analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. I was looking into other alternatives and realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems. First I have to create a pipeline to move the data to a GCS bucket and then move it to Cloud SQL.
Is there a better way to manage this? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples which use "Create Job From SQL" to process and move data to Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at his portfolio and show the graph with the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
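As a minimal sketch, assuming the goal is to land the table in Cloud Storage first (from there the file can be imported into Cloud SQL, e.g. with gcloud sql import csv); the project, dataset, table, and bucket names are placeholders.

```python
# Sketch only: export a BigQuery table to Cloud Storage with the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()

extract_job = client.extract_table(
    "my-project.my_dataset.monthly_returns",       # source table (placeholder)
    "gs://my-bucket/exports/monthly_returns.csv",  # destination in GCS (placeholder)
    location="US",
)
extract_job.result()  # wait for the export to finish
```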

AWS Machine Learning Retrain Model

I have some models created in AWS Machine Learning with an S3 CSV file.
After a lot of searching, I haven't found a good way to retrain my models.
I would like to know if there is any option to retrain my models with new data, or if I need to create a new one each time.
Amazon ML provides a set of APIs (and SDKs) that allow you to programmatically create a pipeline that takes new data from S3 and generates the datasource and the ML models from it.
All the components, including datasources, ML models, evaluations, etc., are immutable; if you want to retrain, you need to recreate them. This allows you to roll back to a previous model if the performance of the new model is not better than that of the old model.
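A minimal sketch of that kind of pipeline with boto3 against the (legacy) Amazon Machine Learning API; the IDs, bucket, and schema locations are placeholders.

```python
# Sketch only: recreate the datasource and the ML model from new data in S3.
import boto3

ml = boto3.client("machinelearning")

# New datasource pointing at the freshly uploaded CSV and its schema file.
ml.create_data_source_from_s3(
    DataSourceId="ds-new-week",
    DataSpec={
        "DataLocationS3": "s3://my-bucket/training/data.csv",
        "DataSchemaLocationS3": "s3://my-bucket/training/data.csv.schema",
    },
    ComputeStatistics=True,
)

# New model trained on that datasource (models are immutable, so create, not update).
ml.create_ml_model(
    MLModelId="ml-new-week",
    MLModelType="BINARY",   # or REGRESSION / MULTICLASS, depending on your problem
    TrainingDataSourceId="ds-new-week",
)
```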

Create datasource and ML models periodically in Amazon Machine Learning

I have created a data source and trained the machine learning model in Amazon Machine Learning. The data resides in S3 which is used for creating the data source. However, my application has new data added to S3 every second, thus I need a way in which I can generate the data source and train the model periodically.
Is there a way in which I can achieve this?
Any help is appreciated.
Yes. You need to do a few things:
- make sure your data source points to the prefix in S3: bucket/data/ rather than bucket/data/data.csv
- write a script that you run regularly to create a new model (you unfortunately can't update the model) against this data. Here's a sample script which does this using boto: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/updatemodel/updatemodel.py
- tag your new model and make sure your clients find the model to use via tags (see the sketch after this list)
- delete your old models (mostly to avoid confusion)
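A minimal sketch of the tagging and cleanup steps with boto3 (the linked script covers the full retrain flow); the model IDs and tag values are placeholders.

```python
# Sketch only: tag the new model so clients can discover it, then drop the old one.
import boto3

ml = boto3.client("machinelearning")

# Tag the freshly created model; clients look up the current model by this tag.
ml.add_tags(
    ResourceId="ml-new-week",
    ResourceType="MLModel",
    Tags=[{"Key": "stage", "Value": "production"}],
)

# Remove the model it replaces to avoid confusion.
ml.delete_ml_model(MLModelId="ml-previous-week")
```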