Failed to run the inference graph - what could be wrong? - google-cloud-ml

I am trying to deploy a locally trained model. I followed all of the instructions here for model preparation and I managed to deploy it.
However when I try to get the predictions, the online prediction responds with 502 Server error and the batch prediction returns ('Failed to run the inference graph', 1)
Is there a way to get a better error message to narrow down what's wrong?
Thanks

The error message indicated it occurred when running the session for the inference graph. It might be possible to uncover what is be happening with some code to use the model locally. One way to test it is to create a small input dataset and feed it to the inference graph to check if you can run the session locally.
You may refer the local_predict.py in the samples/mnist/deployable/ in SDK about how to do that. Here is an example use:
python local_predict.py --input=/path/to/my/local/files --model_dir=/path/to/modeldir.
Note that the model_dir points to where the tensorflow meta graph proto and checkpoint files are saved. They are generated by training. Here is the doc link about how to train a model. https://cloud.google.com/ml/docs/how-tos/training-models. The model dir can be on GCS as well.
Thanks for bringing this up. We're continually working to improve the overall experience of the service including error reporting.

Related

How to use SageMaker Experiments trackers and trial components?

I'm completely confused with how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata and metrics.
Shall I create Trial components manually with the SDK or let SM Estimator fit create them for me??
after creating my experiment and trial, I use the below code
job.fit(inputs,
experiment_config={
"ExperimentName": reg_experiment.experiment_name,
"TrialName": trial1.trial_name,
"TrialComponentDisplayName": "training-with-RF1"},
wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
trial_name=trial_name,
experiment_name=mnist_experiment.experiment_name,
sagemaker_boto_client=sm)
When I create trials like that manually, they appear as separate empty trials than the trials created by SageMaker jobs, see below.
I'm confused because the AWS blog post says we have to create Trials manually, however SageMaker Training jobs seem to be creating those trials on our behalf...
I'm completely confused by this service, can someone please help?
The best way to do this is to create an Experiment, a Trial and then pass the experiment config to the Training Job. The training job will automatically create a Trial Component and add it to the Trial.
Depending on the type of training job you are using, some metrics will automatically be tracked in the Trial Component. You can set this up through metric_definitions regex in the Estimator.
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or in the python script using subprocess.call) and import the Tracker object. You can use the Tracker to log metrics from the training script to the Trial Component.
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html

AWS Sagemaker - Custom Training Job not saving Model output

I'm running a training job using AWS SageMaker and i'm using a custom Estimator based on an available docker image from AWS. I wanted to get some feedback on whether my process is correct or not prior to deployment.
I'm running the training job in a docker container using 'local' in a SageMaker notebook instance and the training job runs successfully. However, after the job completes and saves the model to opt/model/models within the docker image, once the docker container exits, the model saved from training is lost. Ideally, i'd like to use the model for inference, however, I'm not sure about the best way of doing it. I have also tried the training job after pushing the image to ECR, but the same thing happens.
It is my understanding that the docker state is lost, once the image exits, as such, is it possible to persist the model that was produced in training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another docker image for inference. Is this expected behaviour and the correct way of doing it?
I am fairly new to using SageMaker but i'd like to do it according to best practices. I've looked at a lot of the AWS documents and followed the tutorials but it doesn't seem to mention explicitly if this is how it should be done.
Thanks for any feedback on this.
You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3. To make inferences using that model, you can either use a real-time inference endpoint for real time predictions, or a batch transformer to run inferences in batch mode. In both cases, you'll have to point the configuration to the container for inference and the model artifacts. the amazon-sagemaker-examples repository has examples for common frameworks, especially, the scikit-learn example has detailed explanations.
Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.

AWS Sagemaker Pytorch does not run properly

I am currently trying to train a model using pytorch on AWS Sagemaker but can't get it to run properly. My main question now is: Is there some workflow step I'm missing? Any help is greatly appreciated.
I managed to get the code running on colab or on a local machine for example but not on sagemaker.
In short the program should: Setup a pytorch model, load the train data from a file system and perform train epochs.
For this, I am trying the following:
The code files (dataloaders/help functions etc) with the "entry point" are stored at Sagemaker Studio in the folder "code".
enter image description here
The train files are stored in a s3 bucket and are transfered in "file mode".
I then call the estimator in a python notebook as this:
estimator = PyTorch(entry_point='entry.py',
role=role,
py_version='py3',
source_dir = "code",
output_path = "s3://XXXXX/XXXXXX/XXXX",
framework_version='1.3.1',
instance_count=1,
instance_type='ml.g4dn.2xlarge',
hyperparameters={
'epochs': 5,
'backend': 'gloo'
})
inputs = "s3://XXXXX/XXXXX"
estimator.fit({'training': inputs})
In the output I can see, that the train instance is prepared and the data is downloaded but then the problem arises:
For some reason the program jumps right into the train method. The outputs of the first steps which should take place before a train epoch, Network whitening for example, are shown after or during the train step. After one train epoch the program freezes without any error message until I manually stop the instance.
Thanks for any help.
Your script gets stuck after one epoch.
There's no error messages to review, nor the code itself.
I suggest to try and troubleshoot this in a fast manner using SageMaker Local mode (e.g., instance_type='local_gpu'). This will allow you to retry different configurations in seconds instead of minutes. And potentially remote debug it.
Note: SageMaker local requires docker support, so you'll need to run this either on your laptop, or on a SageMaker notebook instance like 'ml.g4dn.2xlarge', but not on a SageMaker Studio instance). And potentially remote debug it.

Custom Model for Batch Prediction on Vertex.ai

I want to run batch predictions inside Google Cloud's vertex.ai using a custom trained model. I was able to find documentation to get online prediction working with a custom built docker image by setting up an endpoint, but I can't seem to find any documentation on what the Dockerfile should be for batch prediction. Specifically how does my custom code get fed the input and where does it put the output?
The documentation I've found is here, it certainly looks possible to use a custom model and when I tried it didn't complain, but eventually it did throw an error. According to the documentation no endpoint is required for running batch jobs.

cloud-ml predict sometimes fails providing the same input

I'm using Google cloud machine learning to get prediction on an image.
I have created a model and a version in cloud-ml with my training data, but when I try to get a prediction with gcloud beta ml predict sometimes cloud-ml gives me the correct results, while some other times using the same command and the same files I encounter a server error 502, as you can see here.
I saw this post and I know that predict is actually an alpha feature and sometimes it gives problems. Can it be my case?
If so, is there alternative? Or, will there be a new release in the near future?
Note that the online prediction feature is now in Beta.
I'd suggest trying your scenario to see if you can get past the failed requests on startup.
In particular, please ensure the request to deploy a model version is complete before issuing requests.