Is it possible to provide an argument to a GCP Vertex AI training pipeline that means it will take data from a new data source/location?
For example if you want to use one GBQ table for one run and another for the next run, is it possible to run the pipeline using the new table? Or should you create a different pipeline for the new table?
Let's suppose I have a table in BigQuery and I create a dataset on VertexAI based on it. I train my model. A while later, the data gets updated several times in BigQuery.
But can I simply go to my model and get redirected to the exact version of the data it was trained on?
Using time travel, I can still access the historical data in BigQuery. But I didn't manage to go to my model and figure out on which version of the data it was trained and look at that data.
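For illustration, here is a minimal sketch of the kind of time travel query meant here. The table name and the training timestamp are hypothetical, and you would have to record the training time yourself. Note also that time travel only reaches back a limited window (seven days by default).

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a BigQuery query that reads `table` as it existed at `as_of`."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Normalize to UTC so the timestamp literal is unambiguous.
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00:00")
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}'"
    )


# Hypothetical values: the time the training job was started, noted by hand.
trained_at = datetime(2023, 1, 15, 9, 30, tzinfo=timezone.utc)
print(time_travel_query("my_project.my_dataset.basetable", trained_at))
```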
On the Vertex AI page for creating a dataset from BigQuery there is this statement:
The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.
So there is no copy or clone of the table prepared automatically for you.
Another fact is that you usually don't need the whole base table to create the dataset; you probably subselect based on date or other WHERE conditions. Essentially, you filter your base table, and your new dataset is only a subset of it.
The recommended way is to create a BigQuery dataset that holds your table sources; let's call it vertex_ai_dataset. In this dataset you store all the tables that back a Vertex AI dataset. Make sure to version them, and do not update them.
So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).
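The BASETABLE -> SELECT -> WRITE flow could be sketched as a small helper that builds the statement for a given version. The function name, the base table name, and the WHERE condition are made up for illustration; only vertex_ai_dataset and dataset_for_model_v1 come from the text above.

```python
def versioned_table_sql(
    base_table: str,
    version: int,
    where: str,
    target_dataset: str = "vertex_ai_dataset",
    prefix: str = "dataset_for_model",
) -> str:
    """Build a CREATE TABLE ... AS SELECT statement for one frozen version."""
    if version < 1:
        raise ValueError("version must be >= 1")
    target = f"{target_dataset}.{prefix}_v{version}"
    # Plain CREATE TABLE (no OR REPLACE): the statement fails if the version
    # already exists, so an existing training table is never overwritten.
    return (
        f"CREATE TABLE `{target}` AS\n"
        "SELECT *\n"
        f"FROM `{base_table}`\n"
        f"WHERE {where}"
    )


# Hypothetical base table and filter.
print(versioned_table_sql("dataset.basetable", 1, "event_date < '2023-01-01'"))
```

Bumping the version number for each new training run gives every model a table that never changes after training.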
Another option is to SNAPSHOT the base table whenever you issue a TRAIN action. But be aware that these snapshots need to be maintained and cleaned up as well.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname
CLONE dataset.basetable;
Other parameters and a guide are here.
You could also automate this by observing the Vertex AI train event (it should be documented here) and using Eventarc to start a Cloud Workflow that automatically creates a BigQuery table snapshot for you.
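One way to keep the cleanup manageable is to name each snapshot after the training run and give it an expiration. The sketch below only builds the SQL text; the run id, the retention period, and the helper name are assumptions for illustration, not something Vertex AI provides.

```python
from datetime import datetime, timedelta, timezone


def snapshot_sql(
    base_table: str,
    snapshot_dataset: str,
    run_id: str,
    now: datetime,
    keep_days: int = 90,
) -> str:
    """Build a CREATE SNAPSHOT TABLE statement for one training run."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Snapshot name = base table name + run id, e.g. basetable_run42.
    table_name = base_table.split(".")[-1]
    snapshot = f"{snapshot_dataset}.{table_name}_{run_id}"
    expires = (now + timedelta(days=keep_days)).astimezone(timezone.utc)
    expires_literal = expires.strftime("%Y-%m-%d %H:%M:%S+00:00")
    # The expiration lets BigQuery drop old snapshots automatically.
    return (
        f"CREATE SNAPSHOT TABLE `{snapshot}`\n"
        f"CLONE `{base_table}`\n"
        f"OPTIONS (expiration_timestamp = TIMESTAMP '{expires_literal}')"
    )


# Hypothetical run id; in practice it could be the training job's id.
print(
    snapshot_sql(
        "dataset.basetable",
        "dataset_to_store_snapshots",
        "run42",
        now=datetime.now(timezone.utc),
    )
)
```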
I want to move a dataset to another region, but there are some Pub/Sub subscriptions with Dataflow templates loading the tables within the dataset. How can I do this without interrupting the Dataflow jobs, or interrupting them as little as possible?
Is it possible to do this in these steps?
Create a temporary dataset with a temporary name in a new region
Copy original dataset to the temporary dataset
Delete old original dataset
Create a new dataset with the original name in the new region
Copy the temporary dataset to the new dataset with the original name
I'm open to suggestions :D
You can use the copy dataset feature, which is in preview for now. One interesting capability of the feature is cross-region copy.
You can perform the same process, but more easily!
About your Dataflow pipeline, I think it won't work as is. The location is an important piece of information when you write to BigQuery. Have a try, but I'm pretty sure that you will have to update it.
I have data from web users in Firestore.
I have inserted some of this data into Google BigQuery in order to run a machine learning model.
I have experience in training machine learning models, but I don't have experience in obtaining predictions for new data once the model is trained.
I have read that I can upload the trained model to Google Cloud Storage and then put it in AI Platform, but I don't know the process I have to follow. New data is going to be inserted into BigQuery; I want to make predictions on this new data and then write those predictions back to Firestore.
I think that it could be done with Dataflow (Apache Beam) or Cloud Composer (Airflow), where I can automate this process and schedule it to run every week, but I don't have experience in using these technologies. Can anyone recommend which technology would be better for this particular case, so that I can look up information on how to use it?
One possibility could be to save the model in AI Platform or in Google Cloud Storage, and use Cloud Functions to call this saved model, make predictions, and save them in Firestore?
BigQuery ML supports external TensorFlow models.
TensorFlow model importing. This feature allows you to create BigQuery
ML models from previously-trained TensorFlow models, then perform
prediction in BigQuery ML. See the CREATE MODEL statement for
importing TensorFlow models for more information.
So what you want to achieve is
Get a table in BigQuery
Build out a feature set for your model (select statements)
CREATE MODEL in BigQuery (rerun this to re-train)
Run the ML.PREDICT (or equivalent) to get predictions on new data
As new data arrives into BigQuery you can
- retrain the model (externally or internally, depending on the type of algorithm you have)
- use the new row in predictions
https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
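As a rough sketch of the import and predict steps for an external TensorFlow model, the helpers below build the two BigQuery ML statements. The model name, bucket path, and feature query are placeholders, and the exported model is assumed to be a SavedModel stored in Cloud Storage.

```python
def import_tf_model_sql(model: str, gcs_path: str) -> str:
    """Build the CREATE MODEL statement that imports a TensorFlow model."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (MODEL_TYPE = 'TENSORFLOW', MODEL_PATH = '{gcs_path}')"
    )


def predict_sql(model: str, feature_query: str) -> str:
    """Build an ML.PREDICT query over the rows returned by `feature_query`."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model}`, (\n"
        f"{feature_query}\n"
        "))"
    )


# Placeholder names: adjust the dataset, bucket, and feature columns.
print(import_tf_model_sql("my_dataset.user_model", "gs://my-bucket/exported_model/*"))
print(
    predict_sql(
        "my_dataset.user_model",
        "SELECT user_id, feature_a, feature_b FROM `my_dataset.new_users`",
    )
)
```

Rerunning the first statement after retraining replaces the imported model; the second can be scheduled to score newly arrived rows.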
For doing this you need 2 services:
One for prediction, which serves your model
One for getting the prediction and storing the result in firestore
Personally, I don't recommend storing your model in AI Platform today (a new release should happen by the end of the month, but today, it's a no!). I wrote an article on hosting a TensorFlow model in Cloud Run. It should work with another framework, but I have only built a TensorFlow model, and I used it for my tests.
The best solution, if your new data is in BigQuery and your model is in TensorFlow, is to load your model into BigQuery. The prediction is free of charge; you only pay for the data in your query (I'm also writing an article on this, but I'm waiting for the new AI Platform release to provide a correct comparison between both solutions).
After getting the predictions (result of BigQuery + call to Cloud Run, OR result of BigQuery with the predict clause), you have to iterate over the results to store them in Firestore. I recommend a batch write to Firestore.
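A minimal sketch of the batching step follows. It splits the prediction rows into groups and hands each group to a commit function. The batch size of 500 is an assumption based on the long-documented limit for Firestore batched writes, so check the current quota. The `commit_batch` callable is a stand-in; with the google-cloud-firestore client it would typically create `db.batch()`, call `batch.set(doc_ref, data)` for each row, and then `batch.commit()`.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# One prediction row: (document id, fields to store). Hypothetical shape.
Prediction = Tuple[str, Dict[str, object]]


def chunked(rows: Iterable[Prediction], size: int = 500) -> Iterator[List[Prediction]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[Prediction] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def store_predictions(
    rows: Iterable[Prediction],
    commit_batch: Callable[[List[Prediction]], None],
    size: int = 500,
) -> int:
    """Send rows to `commit_batch` in chunks; return the number of rows sent."""
    written = 0
    for batch in chunked(rows, size):
        commit_batch(batch)
        written += len(batch)
    return written


# Stand-in for the real Firestore commit: just record each batch size.
sizes: List[int] = []
fake_rows = [(f"user_{i}", {"score": i / 10}) for i in range(1201)]
total = store_predictions(fake_rows, lambda batch: sizes.append(len(batch)))
print(total, sizes)
```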
I have read that I can upload this trained model in Google cloud storage
If you want to do this you can use Dataflow. You can write a pipeline that reads data from BigQuery and writes them to GCS.
(I am not sure I understand how you want your job to interact with AI platform and Firestore)
We have a scenario where we need to have a BigQuery read later in the pipeline (not at PBegin). Is there a way to implement it?
We are trying to run a sequence of steps where we load a PCollection into a BigQuery table and then, after the load, fetch data from that table (with some filters) for our next steps. We are able to do this with multiple pipelines, each having a BigQueryIO read at its start. However, it would be easier for our batch control if we could have it all in a single Dataflow pipeline (loading entire BigQuery tables initially and working completely off that PCollection is expensive).
I have some models created in AWS Machine Learning from an S3 CSV file.
After a lot of searching I haven't found a good way to retrain my models.
I would like to know if there is any option to retrain my models with new data, or if I need to create a new one each time.
Amazon ML provides a set of APIs (and SDKs) that let you programmatically create a pipeline that takes new data from S3 and generates the datasource and the ML models from it.
All the components, including datasources, ML models, evaluations, etc., are immutable, so if you want to retrain, you need to recreate them. This also allows you to roll back to a previous model if the new model does not perform better than the old one.