Cannot use tensorboard with Vertex AI Custom job

I'm trying to launch a custom training job on Vertex AI through XManager. When running custom jobs with TensorBoard enabled, I get a TensorBoard instance under Experiments -> TensorBoard Instances and an OPEN TENSORBOARD button on the custom job page. However, the button leads to an empty page that says "Not found: TensorboardExperiment".
I observed this behaviour when running my own custom job and when running XManager's cifar10_tensorflow example. Note that in both cases the job runs to completion without problems.
I can visualise the logs locally via the standard tensorboard package, passing the Cloud Storage directory containing the experiment logs as the log directory.
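For reference, the local workaround is just the standard TensorBoard CLI pointed at the bucket (a minimal sketch; the bucket path is a placeholder and this assumes TensorFlow's GCS support is installed locally):
# MY_BUCKET and the path are placeholders, not values from the issue
tensorboard --logdir=gs://MY_BUCKET/path/to/experiment-logs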
I can upload experiment logs to Vertex AI TensorBoard manually using:
tb-gcp-uploader --tensorboard_resource_name TENSORBOARD_INSTANCE_NAME \
  --logdir=LOG_DIR \
  --experiment_name=TB_EXPERIMENT_NAME \
  --one_shot=True
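For illustration, a filled-in invocation might look like the following sketch (the project number, location, TensorBoard ID, bucket path, and experiment name are placeholders, not values from the issue):
# all identifiers below are illustrative placeholders
tb-gcp-uploader \
  --tensorboard_resource_name projects/123456789/locations/us-central1/tensorboards/1234567890 \
  --logdir=gs://MY_BUCKET/experiment-logs \
  --experiment_name=my-experiment \
  --one_shot=True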
For more details check out the discussion: https://github.com/deepmind/xmanager/issues/15

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster using one of the outdated versions (for example, 1.3.94-debian10) that contain the Apache Log4j 2 vulnerability. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (it is just a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following error: ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster. This makes sense, since it protects the cluster.
I did some research and discovered that I will have to create a custom image with that version and generate the cluster from it. The thing is, I have tried to read the documentation and find some tutorials, but I still can't understand how to get started or how to run the generate_custom_image.py file, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you
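As a rough, hedged sketch only: custom Dataproc images are typically built with the generate_custom_image.py script from the GoogleCloudDataproc/custom-images repository, invoked roughly like the following (the image name, zone, bucket, customization script, and whether the backend still accepts the vulnerable base version are all assumptions here, not verified):
# clone the custom-images tooling (repository name assumed)
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images
# build a custom image from the outdated base version; all values are placeholders
python generate_custom_image.py \
  --image-name=my-dataproc-1-3-94 \
  --dataproc-version=1.3.94-debian10 \
  --customization-script=my-customization.sh \
  --zone=us-east1-b \
  --gcs-bucket=gs://MY_BUCKET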

GCP Vertex AI "Enable necessary APIs" when already enabled

I am new to GCP's Vertex AI and suspect I am running into an error from my lack of experience, but Googling the answer has brought me no fruitful information.
I created a Jupyter Notebook in AI Platform but wanted to schedule it to run at a set period of time. So I was hoping to use Vertex AI's Execute function. At first, when I tried accessing Vertex AI, I was unable to do so because the API had not been enabled in GCP. My IT team then enabled the Vertex AI API and I can now utilize Vertex AI.
I uploaded my notebook to a JupyterLab instance in Vertex, and when I click on the Execute button, I get an error message saying I need to "Enable necessary APIs", specifically for Vertex AI API. I'm not sure why this is considering it's already been enabled. I try to click Enable, but it just spins and spins, and then I can only get out of it by closing or reloading the tab.
One other thing I want to call out in case it's a settings issue is that currently my Managed Notebooks tab says "PREVIEW" in the Workbench. I started thinking maybe this was an indicator that there was a separate feature that needed to be enabled to use Managed Notebooks (which is where I can access the Execute button from). When I click on the User-Managed Notebooks and open JupyterLab from there, I don't have the Execute button.
The GCP account I'm using does have billing enabled.
Can anyone point me in the right direction to getting the Execute button to work?
Based on @JamesS's comments, the issue was solved by adding the necessary permissions to his individual account, since that is the account configured on the OP's Managed Notebook instance, which has an access mode of "Single user only".
Based on my testing when replicating the scenario, the "Enable necessary APIs" message box will keep appearing when the user has no "Vertex AI User" role assigned. In conclusion, below are the minimum roles required to create a scheduled run on a Managed Notebook instance (a gcloud sketch for granting them follows the list).
Notebooks Admin - for access to the notebook instance and opening it through Jupyter; the user will also be able to run code in the notebook.
Vertex AI User - so that the user can create a scheduled run on the notebook instance, since scheduled runs are created through the Vertex AI API itself.
Storage Admin - creating a scheduled run requires a Cloud Storage bucket location where the job will be saved.
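As a minimal sketch of granting those roles with gcloud (PROJECT_ID and USER_EMAIL are placeholders for the OP's project and the notebook user's account):
# grant the three roles listed above to the single user of the notebook
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" --role="roles/notebooks.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" --role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" --role="roles/storage.admin"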
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

Creating Google Cloud Image fails with "Could not fetch resource: Internal error"

I'm trying to set up a private Redash instance with Google Cloud. Step 1 is to add the Redash image to your account so you can boot a VM with it.
When adding the image through Google Cloud Shell, my shell times out before the process completes.
When adding the image through the Console UI, it loads and loads then disappears without a trace.
When adding the image through the gcloud CLI, I finally get a response:
➜ gcloud compute images create "redash" --source-uri gs://redash-images/redash.8.0.0-b32245-1.tar.gz
ERROR: (gcloud.compute.images.create) Could not fetch resource:
- Internal error. Please try again or contact Google Support. (Code: '-527xxxxxxxxxx759')
(x = hidden number)
I have extremely slow internet, so I'm thinking this could potentially be the issue. I've contacted Google Support but have received no response.
I reproduced this and executed the command gcloud compute images create "redash" --source-uri gs://redash-images/redash.8.0.0-b32245-1.tar.gz. For me it also took a long time to execute, so I killed it with CTRL + C, but when I checked Compute Engine > Images, a redash image had been created with the same timestamp as when I executed the command. From this experiment I assume that even though the command is interrupted, the image creation may continue running in the background. I suggest you check the Images section in Compute Engine.
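To check this without the Console UI, a quick sketch using gcloud (the image name matches the command above):
# list any image named "redash" in the project, then show its details
gcloud compute images list --filter="name=redash"
gcloud compute images describe redash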
This is an issue with the payment status of Redash's GCP account.
If this issue reproduces, I recommend telling the Redash admins to check their GCP account's payment status.
The following URL discusses this issue with the Redash community:
https://discuss.redash.io/t/cant-pull-redash-image-on-google-cloud/9486

GCP run a prediction of a model every day

I have a .py file containing all the instructions to generate the predictions for some data.
The data is taken from BigQuery and the predictions should be inserted into another BigQuery table.
Right now the code is running on an AI Platform Notebook, but I want to schedule its execution every day. Is there any way to do it?
I ran into AI Platform Jobs, but I can't understand what my code should do and how it should be structured. Is there any step-by-step guide to follow?
You can schedule a Notebook execution using different options:
nbconvert
Different variants of the same technology:
nbconvert: Provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file.
papermill: a Python package for parameterizing and executing Jupyter Notebooks. (Uses nbconvert --execute under the hood; a short usage sketch follows this group.)
notebook executor: a tool that can be used to schedule the execution of Jupyter notebooks from anywhere (local, GCE, GCP Notebooks) to the Cloud AI Deep Learning VM. You can read more about the usage of this tool here. (Uses the gcloud SDK and papermill under the hood.)
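As a quick illustration of the nbconvert/papermill route above (a minimal sketch; the notebook file names and the injected parameter are placeholders):
# execute a notebook and save the result with nbconvert
jupyter nbconvert --to notebook --execute my_notebook.ipynb --output my_notebook_out.ipynb
# or execute it with papermill, injecting a parameter
papermill my_notebook.ipynb my_notebook_out.ipynb -p run_date 2021-07-01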
KubeFlow Fairing
Is a Python package that makes it easy to train and deploy ML models on Kubeflow. Kubeflow Fairing can also be extended to train or deploy on other platforms. Currently, Kubeflow Fairing has been extended to train on Google AI Platform.
AI Platform Notebook Executor
There are two core functions of the Scheduler extension:
Ability to submit a Notebook to run on AI Platform's Machine Learning Engine as a training job with a custom container image. This allows you to experiment and write your training code in a cost-effective single-VM environment, but scale out to an AI Platform job to take advantage of superior resources (i.e. GPUs, TPUs, etc.).
Scheduling a Notebook for recurring runs follows the exact same sequence of steps, but requires a crontab-formatted schedule option (see the example below).
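For instance, a daily recurring run would use a standard unix-cron expression like the following (the time is an arbitrary example):
# run every day at 06:00
0 6 * * *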
Nova Plugin: This is the predecessor of the Notebook Scheduler project. Allows you to execute notebooks directly from your Jupyter UI.
Notebook training
A Python package that allows users to run a Jupyter notebook on Google Cloud AI Platform Training Jobs.
GCP runner: Allows running any Jupyter notebook function on Google Cloud Platform
Unlike all other solutions listed above, it allows running training for the whole project, not a single Python file or Jupyter notebook.
It allows running any function with parameters; moving from local execution to the cloud is just a matter of wrapping the function in a gcp_runner.run_cloud(<function_name>, …) call.
This project is production-ready without any modifications.
Supports execution on local (for testing purposes), AI Platform, and Kubernetes environments. A full end-to-end example can be found here:
https://www.github.com/vlasenkoalexey/criteo_nbdev
tensorflow_cloud (Keras for GCP): provides APIs that allow you to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
Update July 2021:
The recommended option in GCP is the Notebook Executor, which is already available in EAP.

Custom code containers for google cloud-ml for inference

I am aware that it is possible to deploy custom containers for training jobs on Google Cloud, and I have been able to get this running using the following command:
gcloud ai-platform jobs submit training infer name --region some_region --master-image-uri=path/to/docker/image --config config.yaml
The training job completed successfully and the model was obtained. Now I want to use this model for inference, but the issue is that part of my code has system-level dependencies, so I have to make some modifications to the architecture in order to get it running all the time. This was the reason for using a custom container for the training job in the first place.
The documentation is only available for the training part; the inference part with custom containers (if it is possible) has not been explored, to the best of my knowledge.
The documentation for the training part is available at this link.
My question is, is it possible to deploy custom containers for inference purposes on google cloud-ml?
This response refers to using Vertex AI Prediction, the newest platform for ML on GCP.
Suppose you wrote the model artifacts out to cloud storage from your training job.
The next step is to create the custom container and push to a registry, by following something like what is described here:
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements
This section describes how you pass the model artifact directory to the custom container to be used for inference (a small sketch follows the link below):
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#artifacts
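As a small hedged sketch of that step: inside the serving container, the artifact directory is exposed through the AIP_STORAGE_URI environment variable, so a container start-up script might copy the files locally before loading the model (the local /models/ destination is an assumption for illustration):
# AIP_STORAGE_URI is set by Vertex AI to the Cloud Storage directory holding the
# model artifacts; /models/ is just an illustrative local destination
gsutil -m cp -r "${AIP_STORAGE_URI}/*" /models/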
You will also need to create an endpoint in order to deploy the model:
https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api#aiplatform_deploy_model_custom_trained_model_sample-gcloud
Finally, you would use gcloud ai endpoints deploy-model ... to deploy the model to the endpoint:
https://cloud.google.com/sdk/gcloud/reference/ai/endpoints/deploy-model
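Putting the linked steps together, a hedged sketch of the gcloud flow (the project, bucket, image, region, display names, and the ENDPOINT_ID/MODEL_ID values are placeholders, not the OP's actual resources):
# 1) Upload the model, pointing at the custom serving container and the
#    training artifacts in Cloud Storage (all names are placeholders)
gcloud ai models upload \
  --region=us-central1 \
  --display-name=my-custom-model \
  --container-image-uri=gcr.io/MY_PROJECT/my-inference-image \
  --artifact-uri=gs://MY_BUCKET/model-artifacts/

# 2) Create an endpoint to host the deployed model
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=my-endpoint

# 3) Deploy the uploaded model to the endpoint
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=my-deployed-model \
  --machine-type=n1-standard-4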