cloud-ml predict sometimes fails with the same input - google-cloud-ml

I'm using Google cloud machine learning to get prediction on an image.
I have created a model and a version in Cloud ML with my training data, but when I try to get a prediction with gcloud beta ml predict, sometimes Cloud ML gives me the correct results, while other times, using the same command and the same files, I get a 502 server error, as you can see here.
I saw this post and I know that predict is actually an alpha feature and that it sometimes gives problems. Could this be my case?
If so, is there an alternative? Or will there be a new release in the near future?

Note that the online prediction feature is now in Beta.
I'd suggest trying your scenario to see if you can get past the failed requests on startup.
In particular, please ensure the request to deploy a model version is complete before issuing requests.
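If you still hit intermittent 502s right after deployment, one way to follow that advice is to poll the version's state and only start predicting once it reports READY. The snippet below is just a sketch using the Cloud ML REST client (googleapiclient): the project, model, and version names are placeholders, and the instance format depends entirely on what your model expects.

import time
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
version_name = 'projects/MY_PROJECT/models/MY_MODEL/versions/MY_VERSION'  # placeholders

# Wait until the deployed version reports READY (it is CREATING right after deployment).
while ml.projects().models().versions().get(name=version_name).execute().get('state') != 'READY':
    time.sleep(10)

# Then send the prediction request; the instance format depends on your model's inputs.
response = ml.projects().predict(
    name='projects/MY_PROJECT/models/MY_MODEL',
    body={'instances': [{'image_bytes': {'b64': '...'}}]}  # placeholder instance
).execute()
print(response)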

Related

How to use SageMaker Experiments trackers and trial components?

I'm completely confused about how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata and metrics.
Shall I create Trial Components manually with the SDK, or let the SageMaker Estimator's fit call create them for me?
After creating my experiment and trial, I use the code below:
job.fit(inputs,
        experiment_config={
            "ExperimentName": reg_experiment.experiment_name,
            "TrialName": trial1.trial_name,
            "TrialComponentDisplayName": "training-with-RF1",
        },
        wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
from smexperiments import tracker, trial

my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
    trial_name=trial_name,
    experiment_name=mnist_experiment.experiment_name,
    sagemaker_boto_client=sm)
When I create trials manually like that, they show up as separate, empty trials, distinct from the trials created by the SageMaker jobs.
I'm confused because the AWS blog post says we have to create Trials manually; however, SageMaker training jobs seem to be creating those trials on our behalf...
I'm completely confused by this service. Can someone please help?
The best way to do this is to create an Experiment, a Trial and then pass the experiment config to the Training Job. The training job will automatically create a Trial Component and add it to the Trial.
Depending on the type of training job you are using, some metrics will automatically be tracked in the Trial Component. You can set this up through metric_definitions regex in the Estimator.
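For example, a rough sketch of an Estimator with metric_definitions (the image URI, role, and regex below are placeholders, not something prescribed by the service):

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,        # your training image (placeholder)
    role=role,                  # your execution role (placeholder)
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Each regex is run against the job's logs to extract the metric value.
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "validation accuracy: ([0-9\\.]+)"},
    ],
)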
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or from the Python script itself using subprocess.call) and import the Tracker object. You can then use the Tracker to log metrics from the training script to the Trial Component.
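As a sketch of what that can look like inside the training script (assuming sagemaker-experiments is installed in the container; the parameter and metric names are just illustrative):

from smexperiments.tracker import Tracker

# Tracker.load() with no arguments picks up the Trial Component that the
# training job created from the experiment_config passed to fit().
with Tracker.load() as tracker:
    tracker.log_parameter("n_estimators", 100)
    tracker.log_metric("train:rmse", 0.42)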
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html

Custom Model for Batch Prediction on Vertex.ai

I want to run batch predictions inside Google Cloud's Vertex AI using a custom-trained model. I was able to find documentation to get online prediction working with a custom-built Docker image by setting up an endpoint, but I can't seem to find any documentation on what the Dockerfile should be for batch prediction. Specifically, how does my custom code get fed the input, and where does it put the output?
The documentation I've found is here; it certainly looks possible to use a custom model, and when I tried it, it didn't complain, but eventually it did throw an error. According to the documentation, no endpoint is required for running batch jobs.
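For reference, submitting a batch job against an uploaded custom-container model looks roughly like the sketch below (using the google-cloud-aiplatform SDK; the project, model resource name, and GCS paths are placeholders, and the instance/output format still depends on your container's prediction route):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")  # placeholder

# Batch prediction reads JSONL instances from GCS, sends them to the container's
# prediction route, and writes the results back to GCS; no endpoint is involved.
batch_job = model.batch_predict(
    job_display_name="my-batch-prediction",
    gcs_source="gs://my-bucket/batch-inputs/instances.jsonl",
    gcs_destination_prefix="gs://my-bucket/batch-outputs/",
    machine_type="n1-standard-4",
)
batch_job.wait()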

google ai platform model requires more memory than allowed

I am trying to build on top of this example. I am trying to deploy the shap explainer as a custom prediction routine on google AI platform. Unfortunately when I create the version I get the following error:
Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
Furthermore, instead of text I am working with images, and the SHAP explainer is a GradientExplainer.
I have sent a message to Google support, but the only result I have achieved is "The Free Trial status of project xxxxx has been upgraded and you are no longer entitled to free technical support." Sweet. I have also tried with a non-free-trial account and I get the same problem, so it is not related to the account being on a free trial; it just happened that I sent the email from my personal account. Any suggestions?

Failed to run the inference graph - what could be wrong?

I am trying to deploy a locally trained model. I followed all of the instructions here for model preparation and I managed to deploy it.
However, when I try to get predictions, online prediction responds with a 502 server error and batch prediction returns ('Failed to run the inference graph', 1).
Is there a way to get a better error message to narrow down what's wrong?
Thanks
The error message indicated that the failure occurred when running the session for the inference graph. It might be possible to uncover what is happening with some code that uses the model locally. One way to test it is to create a small input dataset and feed it to the inference graph to check whether you can run the session locally.
You may refer to local_predict.py in samples/mnist/deployable/ in the SDK for how to do that. Here is an example use:
python local_predict.py --input=/path/to/my/local/files --model_dir=/path/to/modeldir.
Note that --model_dir points to where the TensorFlow meta graph proto and checkpoint files are saved; they are generated by training. Here is the doc on how to train a model: https://cloud.google.com/ml/docs/how-tos/training-models. The model dir can be on GCS as well.
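If you prefer to test without the sample script, a bare-bones sketch of running the exported graph in a local TF 1.x session looks something like this (the export file names, tensor names, and input shape are assumptions; substitute whatever your training code actually produced):

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # Load the exported meta graph and restore the latest checkpoint.
    saver = tf.train.import_meta_graph("/path/to/modeldir/export.meta")
    saver.restore(sess, tf.train.latest_checkpoint("/path/to/modeldir"))

    # Tensor names are placeholders; inspect your graph for the real ones.
    inputs = sess.graph.get_tensor_by_name("input:0")
    outputs = sess.graph.get_tensor_by_name("output:0")

    # Feed one tiny example to check that the session runs at all.
    print(sess.run(outputs, feed_dict={inputs: [[0.0] * 784]}))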
Thanks for bringing this up. We're continually working to improve the overall experience of the service including error reporting.

How to set diskSourceImage in google data flow pipeline

I've been trying to use custom-made images to run my Google Dataflow pipeline. Given the information from https://cloud.google.com/compute/docs/reference/latest/images, I've tested the following code snippets:
DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
options.setDiskSourceImage("ubuntu-1504-vivid-v20150911");
options.setDiskSourceImage("projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");
options.setDiskSourceImage("https://www.googleapis.com/compute/beta/projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");
All of the above attempts led to the following error in my pipeline:
(b9c7b66a676906f4): Unable to create VMs. Causes: (b9c7b66a67690aef): Error: Message: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': '[edited]'. Must be the URL to a Compute resource of the correct type HTTP Code: 400
Using a custom disk image with Dataflow is not a viable option. The flag diskSourceImage is deprecated and will be removed in a future SDK release. The reason it is no longer supported is because the Dataflow service relies on versioned resources in the VM image. So Dataflow needs control of the VM image so that we can upgrade it as necessary. If users supply their own custom images we have no way of keeping them in sync with the requirements of the Dataflow service.
If your custom VM image is based off a Dataflow image then you would be able to execute jobs using that custom image until the next release of a Dataflow VM image. There is no reasonable way in which you would be able to keep your custom images in sync with Dataflow's VM images so that you would be able to keep this working.
If you would like to customize the VM image, please let us know why (e.g., send us an email at dataflow-feedback@google.com) so we can either suggest an alternative solution or else consider supporting your use case in the future.
There's a subtle issue with setDiskSourceImage -- it uses 'beta' instead of the current 'v1' version for Compute Engine. If you try the following, it should work:
options.setDiskSourceImage("https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");