Import of exported Vertex-AI AutoML model in production fails

I want to deploy, in a production project, a Vertex-AI model that was trained in a separate training project.
----TRAINING PRJ----- --------PRODUCTION PRJ---------
Train > test > export > import > deploy | batch predict
I follow these instructions and get a success email:
Vertex AI finished uploading model "xxxxxx".
Operation State: Succeeded
but when I try to test the model with a Batch prediction I always get a failed message:
Due to an error, Vertex AI was unable to batch predict using model "TEST".
Operation State: Failed with errors Error
Messages: INTERNAL
Note that deploying the model to an endpoint and testing it with a JSON request does return the expected response.
I tried several container types besides the one suggested here, including the one stated in the exported model's environment.json container_uri; the batch prediction always fails with the message INTERNAL.
Any clue?
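For context, the import step is equivalent to this Python SDK sketch (project, region, bucket path, and display name are placeholders, not my actual values; the container URI is the one taken from the exported model's environment.json):

from google.cloud import aiplatform

aiplatform.init(project="production-project", location="europe-west4")  # placeholders

# Import (upload) the exported AutoML model into the production project.
model = aiplatform.Model.upload(
    display_name="TEST",
    artifact_uri="gs://production-bucket/exported-model/",  # placeholder path
    serving_container_image_uri="CONTAINER_URI_FROM_ENVIRONMENT_JSON",
)
model.wait()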

Related

Vertex AI on GCloud web interface: Unable to test model

Following the starter tutorial "Train a Tabular Model", I get the following error at the step of testing the model with the deployed endpoint (the full message is quoted below).
The dataset used for training the model is provided by the Google tutorial at this Cloud Storage location: cloud-ml-tables-data/bank-marketing.csv
Error message:
The prediction did not succeed due to the following error: Deployed
model xxxxx does not support explanation.
Official Vertex tutorial (Tabular data)
What I believe is the old version of the tutorial (not on Vertex), but almost the same
When you deploy your model, you should enable the "feature attributions" option under Explainability options, as you can see here. By default the option is not enabled. I know the tutorial does not mention this, but it should. You get the same error if the model does not have feature attributions enabled and you run gcloud ai endpoints explain ENDPOINT_ID.
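If you are deploying with the Python SDK instead of the console checkbox, the equivalent is to pass an explanation spec at deploy time. A minimal sketch for a tabular model follows; the input/output names and the attribution method are illustrative assumptions, not something the tutorial specifies:

from google.cloud import aiplatform
from google.cloud.aiplatform import explain

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model("MODEL_ID")  # placeholder

# Sampled Shapley feature attributions; proto-plus messages accept plain dicts.
parameters = explain.ExplanationParameters(
    {"sampled_shapley_attribution": {"path_count": 10}}
)
# Input/output names must match the model's actual signature; these are illustrative.
metadata = explain.ExplanationMetadata(
    inputs={"features": explain.ExplanationMetadata.InputMetadata()},
    outputs={"predicted_value": explain.ExplanationMetadata.OutputMetadata()},
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    explanation_parameters=parameters,
    explanation_metadata=metadata,
)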

Error when invoking pre-trained model: NotFittedError("Vocabulary not fitted or provided")

I'm new to AWS SageMaker and I'm trying to deploy a simple pre-trained model to SageMaker to create endpoint and then make predictions.
The model is a sklearn linear regression model; the input is a vectorized sparse matrix derived from a string of text (a customer's review), and the output is the star-rating value (1 to 5).
I trained the model locally and exported its artifact to a model.joblib file.
Then I wrote an inference.py file and zipped it together with the model.joblib file into a model.tar.gz file, which is then uploaded to S3 for model registration and endpoint creation.
However, when I invoke the endpoint on a sample text, the following error is returned in the CloudWatch log:
File "/miniconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 498, in _check_vocabulary
raise NotFittedError("Vocabulary not fitted or provided")
I understand this to mean that SageMaker is complaining about the trained model artifact being not fitted, and that there is no problem with the other parts (such as the inference.py file). However, the pre-trained model was fitted before exporting.
I'm not sure which part went wrong, so I haven't posted more code to avoid clutter.
Thank you.
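For what it's worth, this error usually points at an unfitted CountVectorizer/TfidfVectorizer being instantiated at inference time, i.e. the fitted vocabulary was never shipped inside model.joblib. A minimal sketch of one way to rule that out, assuming the text was vectorized with TfidfVectorizer (my assumption; the question doesn't say): bundle the fitted vectorizer and the regressor into a single sklearn Pipeline before dumping.

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data for illustration.
reviews = ["great product", "terrible quality", "okay value"]
stars = [5, 1, 3]

# Bundling the vectorizer and the regressor in one Pipeline means joblib.dump
# persists the fitted vocabulary together with the model coefficients.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("regressor", LinearRegression()),
])
pipeline.fit(reviews, stars)

# model.joblib now contains the fitted vectorizer; inference.py can simply call
# joblib.load("model.joblib").predict(["some review text"]) without re-creating
# (and thereby un-fitting) the vectorizer.
joblib.dump(pipeline, "model.joblib")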

VertexAI Batch Inference Failing for Custom Container Model

I'm having trouble executing VertexAI's batch inference, despite endpoint deployment and inference working perfectly. My TensorFlow model has been trained in a custom Docker container with the following arguments:
aiplatform.CustomContainerTrainingJob(
    display_name=display_name,
    command=["python3", "train.py"],
    container_uri=container_uri,
    model_serving_container_image_uri=container_uri,
    model_serving_container_environment_variables=env_vars,
    model_serving_container_predict_route='/predict',
    model_serving_container_health_route='/health',
    model_serving_container_command=[
        "gunicorn",
        "src.inference:app",
        "--bind",
        "0.0.0.0:5000",
        "-k",
        "uvicorn.workers.UvicornWorker",
        "-t",
        "6000",
    ],
    model_serving_container_ports=[5000],
)
I have predict and health endpoints essentially defined as below:
@app.get("/health")
def health_check_batch():
    return 200

@app.post("/predict")
def predict_batch(request_body: dict):
    pred_df = pd.DataFrame(request_body['instances'],
                           columns=request_body['parameters']['columns'])
    # do some model inference things
    return {"predictions": predictions.tolist()}
As described, when training a model and deploying to an endpoint, I can successfully hit the API with JSON schema like:
{"instances":[[1,2], [1,3]], "parameters":{"columns":["first", "second"]}}
This also works when using the endpoint Python SDK and feeding in instances/parameters as functional arguments.
However, I've tried performing batch inference with a CSV file and a JSONL file, and every time it fails with an Error Code 3. I can't find logs on why it failed in Logs Explorer either. I've read through all the documentation I could find and have seen others successfully invoke batch inference, but haven't been able to find a guide. Does anyone have recommendations on the batch file structure or the structure of my APIs? Thank you!
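One thing worth checking, as an assumption based on the documented batch input formats: batch prediction does not forward the endpoint-style request body verbatim. With JSONL input, each line of the file is a single instance, and Vertex groups lines into {"instances": [...]} requests with no per-request "parameters" object, so a handler that reads request_body['parameters']['columns'] would throw, which could surface as the opaque error code 3. A sketch of a matching input file and job submission (paths and names are placeholders):

# batch_input.jsonl -- one instance per line, no "parameters" wrapper:
#   [1, 2]
#   [1, 3]

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model("MODEL_ID")  # placeholder
job = model.batch_predict(
    job_display_name="batch-test",
    gcs_source="gs://my-bucket/batch_input.jsonl",       # placeholder paths
    gcs_destination_prefix="gs://my-bucket/batch_output/",
    instances_format="jsonl",
    predictions_format="jsonl",
    machine_type="n1-standard-4",
)
job.wait()

If the column names are genuinely needed per request, the handler would have to tolerate a missing "parameters" key (for example by falling back to a column list baked into the container); the SDK's model_parameters argument might also help, though I haven't verified how a custom container receives it.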

Creating custom model on Google Vertex AI

For an internship, I am supposed to use Google's managed ML platform Vertex AI to build an end-to-end machine learning workflow. Although I followed the tutorial completely, when I run a training job I see this error message:
Training pipeline failed with error message: There are no files under "gs://dps-fuel-bucket/mpg/model" to copy.
Based on the tutorial, we should not have to create a /model directory in the bucket ourselves; the model-saving code should create this directory and save the final result there:
# Export model and save to GCS
model.save(BUCKET + '/mpg/model')
I added this directory manually but still face this error.
Does anybody have any idea? Thanks in advance :)
If you're using a pre-built container, ensure that your model artifacts have filenames that exactly match the following examples:
TensorFlow SavedModel: saved_model.pb
scikit-learn: model.joblib or model.pkl
XGBoost: model.bst, model.joblib, or model.pkl
Reference: Vertex-AI Model Import
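For the tutorial's Keras model, here is a sketch of a save call that produces that layout (assuming BUCKET includes the gs:// scheme; without the scheme, Keras writes to a local path instead of GCS, which would also explain "no files to copy"):

import tensorflow as tf

BUCKET = 'gs://dps-fuel-bucket'  # must include the gs:// scheme

# Tiny stand-in for the tutorial's MPG regression model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(9,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Saving in SavedModel format writes saved_model.pb (plus variables/ and
# assets/) under gs://dps-fuel-bucket/mpg/model/ -- the layout the pre-built
# TensorFlow container expects.
model.save(BUCKET + '/mpg/model')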

AutoML training pipeline job failed. Where can I find the logs?

I am using Vertex AI's AutoML to train a model and it fails with the error message shown below. Where can I find the logs for this job?
Training pipeline failed with error message: Job failed. See logs for details.
I had the same issue just now and raised a case with Google, who told me how to find the error logs.
In GCP Log Explorer, you need a filter of resource.type = "ml_job" (make sure your time range is set correctly, too!)
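The same filter works from Python with the Cloud Logging client; a minimal sketch (project ID and timestamp are placeholders):

from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder
# Same filter Logs Explorer uses; narrow the window to your job's run time.
entries = client.list_entries(
    filter_='resource.type="ml_job" AND timestamp>="2023-01-01T00:00:00Z"'
)
for entry in entries:
    print(entry.payload)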