Vertex AI batch prediction location - google-cloud-platform

When I initiate a batch prediction job on Vertex AI in Google Cloud, I have to specify a Cloud Storage bucket location. Suppose I provide the bucket location 'my_bucket/prediction/'; the prediction files are then stored in something like gs://my_bucket/prediction/prediction-test_model-2022_01_17T01_46_39_898Z, which is a subdirectory within the bucket location I provided. The prediction files are stored within that subdirectory and are named:
prediction.results-00000-of-00002
prediction.results-00001-of-00002
Is there any way to programmatically get the final export location from the batch prediction name, id or any other parameter as shown below in the details of the batch prediction job?

You cannot derive it from those parameters alone, because if you run the same job multiple times, new folders based on the execution date will be created. However, you can get it from the API using your job ID (don't forget to set the credentials via GOOGLE_APPLICATION_CREDENTIALS if you are not running in the Cloud SDK):
Get the output directory from the Vertex AI batch prediction API using the job ID:
curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) "https://us-central1-aiplatform.googleapis.com/v1/projects/[PROJECT_NAME]/locations/us-central1/batchPredictionJobs/[JOB_ID]"
Output (get the value of gcsOutputDirectory):
{
...
"gcsOutputDirectory": "gs://my_bucket/prediction/prediction-test_model-2022_01_17T01_46_39_898Z"
...
}
EDIT: Getting batchPredictionJobs via Python API:
from google.cloud import aiplatform

# -------
def get_batch_prediction_job_sample(
    project: str,
    batch_prediction_job_id: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    client_options = {"api_endpoint": api_endpoint}
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    name = client.batch_prediction_job_path(
        project=project, location=location, batch_prediction_job=batch_prediction_job_id
    )
    response = client.get_batch_prediction_job(name=name)
    print("response:", response)
# -------

get_batch_prediction_job_sample("[PROJECT_NAME]", "[JOB_ID]", "us-central1", "us-central1-aiplatform.googleapis.com")
Check details about it here
Check the API repository here

Just adding a cherry on top of @ewertonvsilva's answer...
If you are following Google's example on programmatically getting the batch prediction, the response object returned by response = client.get_batch_prediction_job(name=name) has the output_info attribute that you need. All you need to do is call response.output_info.gcs_output_directory once the prediction job is complete.
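For example, a minimal sketch (a hypothetical helper reusing the same placeholder project and job ID as above) that returns the export location directly once the job has completed:

from google.cloud import aiplatform

def get_batch_prediction_output_dir(project: str, batch_prediction_job_id: str,
                                    location: str = "us-central1") -> str:
    # Same gapic client as in the example above
    client = aiplatform.gapic.JobServiceClient(
        client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
    )
    name = client.batch_prediction_job_path(
        project=project, location=location, batch_prediction_job=batch_prediction_job_id
    )
    response = client.get_batch_prediction_job(name=name)
    # The actual export location chosen by Vertex AI (the timestamped subdirectory)
    return response.output_info.gcs_output_directory

print(get_batch_prediction_output_dir("[PROJECT_NAME]", "[JOB_ID]"))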

Export from BigQuery to CSV based on client id

I have a BigQuery table filled with product data for a series of clients. The data has been flattened using a query. I want to export the data for each client to a Google Cloud Storage bucket in csv format - so each client has its own individual csv.
There are just over 100 clients, each with a client_id and the table itself is 1GB in size. I've looked into querying the table using a cloud function, but this would cost over 100,000 GB of data. I've also looked at importing the clients to individual tables directly from the source, but I would need to run the flattening query on each - again incurring a high data cost.
Is there a way of doing this that will limit data usage?
Have you thought about Dataproc?
You could write a simple PySpark script that loads the data from BigQuery and writes it to the bucket, splitting by client_id, something like this:
"""
File takes 3 arguments:
BIGQUERY-SOURCE-TABLE
desc: table being source of data in BiqQuery
format: project.dataset.table (str)
BUCKET-DEST-FOLDER
desc: path to bucket folder where CSV files will be stored
format: gs://bucket/folder/ (str)
SPLITER:
desc: name of column on which spit will be done during data saving
format: column-name (str)
"""
import sys
from pyspark.sql import SparkSession
if len(sys.argv) != 4:
raise Exception("""Usage:
filename.py BIGQUERY-SOURCE-TABLE BUCKET-DEST-FOLDER SPLITER"""
)
def main():
spark = SparkSession.builder.getOrCreate()
df = (
spark.read
.format("bigquery")
.load(sys.argv[1])
)
(
df
.write
.partitionBy(sys.argv[2])
.format("csv")
.option("header", True)
.mode("overwrite").
save(sys.argv[3])
)
if __name__ == "__main__":
main()
You will need to:
Save this script in a Google Cloud Storage bucket,
Create a Dataproc cluster (it only needs to live for the duration of the job),
Run the command below,
Delete the Dataproc cluster.
Let's say you have the following architecture:
bigquery: myproject:mydataset.mytable
bucket: gs://mybucket/
dataproc cluster: my-cluster
So you will need to run the following command:
gcloud dataproc jobs submit pyspark gs://mybucket/script-from-above.py \
--cluster my-cluster \
--region [region-of-cluster] \
--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
-- \
myproject:mydataset.mytable gs://mybucket/destination/ client_id
This will save the data in gs://mybucket/destination/ split by client_id, and you will get folders named:
client_id=1
client_id=2
...
client_id=n
As mentioned by @Mr.Batra, you can create partitions on your table based on client_id to regulate cost and the amount of data queried.
Implementing a Cloud Function and looping over each client_id without partitions will cost more, since each
SELECT * FROM table WHERE client_id=xxx query will scan the full table.
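A minimal sketch of that idea, using clustering on client_id rather than partitioning (the table names are assumptions): create a clustered copy of the table once, and queries filtered on client_id then scan only the relevant blocks instead of the whole table.

from google.cloud import bigquery

client = bigquery.Client()

# Create a clustered copy of the source table; per-client queries will only
# scan the blocks belonging to that client_id.
ddl = """
CREATE TABLE `myproject.mydataset.mytable_clustered`
CLUSTER BY client_id
AS SELECT * FROM `myproject.mydataset.mytable`
"""
client.query(ddl).result()  # wait for the DDL job to finish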

Authenticating model upload to VertexAI job from Cloud Scheduler

I am trying to run a custom training job on Vertex AI. The goal is to train a model, save the model to Cloud Storage and then upload it to Vertex AI as a Vertex AI Model object. When I run the job from my local workstation, it runs, but when I run the job from Cloud Scheduler it fails. Details below.
Python Code for the job:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import os
import pickle
from google.cloud import storage
from google.cloud import aiplatform

print("FITTING THE MODEL")
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# fit the model
model.fit(X, y)

print("SAVING THE MODEL TO CLOUD STORAGE")
if 'AIP_MODEL_DIR' not in os.environ:
    raise KeyError(
        'The `AIP_MODEL_DIR` environment variable has not been ' +
        'set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training'
    )
artifact_filename = 'model' + '.pkl'
# Save model artifact to local filesystem (doesn't persist)
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(model, model_file)
# Upload model artifact to Cloud Storage
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)

print("UPLOADING MODEL TO VertexAI")
# Upload the model to Vertex AI
project = "..."
location = "..."
display_name = "custom_model"
artifact_uri = model_directory
serving_container_image_uri = "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest"
description = "test model"
sync = True

aiplatform.init(project=project, location=location)
model = aiplatform.Model.upload(
    display_name=display_name,
    artifact_uri=artifact_uri,
    serving_container_image_uri=serving_container_image_uri,
    description=description,
    sync=sync,
)
model.wait()
print("DONE")
Running from Local Workstation:
I set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the location of the Compute Engine default service account keys I have downloaded on my local workstation. I also set the AIP_MODEL_DIR environment variable to point to a cloud storage bucket. After I run the script, I can see the model.pkl file being created in the cloud storage bucket and the Model object being created in VertexAI.
Triggering the training job from Cloud Scheduler:
This is what I ultimately want to achieve: run the custom training job periodically from Cloud Scheduler. I have converted the Python script above into a Docker image and uploaded it to Google Artifact Registry. The job specification for Cloud Scheduler is below; I can provide additional details if required. The service account email in the oauth_token is the same one whose keys I use to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. When I run this (either from my local workstation or directly in a Vertex AI notebook), I can see that the Cloud Scheduler job gets created and keeps triggering the custom job. The custom job is able to train the model and save it to Cloud Storage. However, it is not able to upload it to Vertex AI, and I get the error messages status = StatusCode.PERMISSION_DENIED and {..."grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}. I cannot figure out what the authentication issue is, because in both cases I am using the same service account.
job = {
    "name": f'projects/{project_id}/locations/{location}/jobs/test_job',
    "description": "Test scheduler job",
    "http_target": {
        "uri": f'https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs',
        "http_method": "POST",
        "headers": {
            "User-Agent": "Google-Cloud-Scheduler",
            "Content-Type": "application/json; charset=utf-8"
        },
        "body": "...",  # the custom training job body
        "oauth_token": {
            "service_account_email": "...",
            "scope": "https://www.googleapis.com/auth/cloud-platform"
        }
    },
    "schedule": "* * * * *",
    "time_zone": "Africa/Abidjan"
}
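For completeness, a minimal sketch of how this job specification can be submitted with the google-cloud-scheduler Python client (the project, region, service account and custom job body values are placeholder assumptions):

import json
from google.cloud import scheduler_v1

project_id = "my-project"                                            # assumption
location = "us-central1"                                             # assumption
service_account_email = "my-sa@my-project.iam.gserviceaccount.com"   # assumption
custom_job_body = {"displayName": "test_custom_job"}                 # placeholder for the real custom job spec

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{project_id}/locations/{location}"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/test_job",
    description="Test scheduler job",
    http_target=scheduler_v1.HttpTarget(
        uri=f"https://{location}-aiplatform.googleapis.com/v1/{parent}/customJobs",
        http_method=scheduler_v1.HttpMethod.POST,
        headers={"Content-Type": "application/json; charset=utf-8"},
        body=json.dumps(custom_job_body).encode("utf-8"),
        oauth_token=scheduler_v1.OAuthToken(
            service_account_email=service_account_email,
            scope="https://www.googleapis.com/auth/cloud-platform",
        ),
    ),
    schedule="* * * * *",
    time_zone="Africa/Abidjan",
)
created = client.create_job(parent=parent, job=job)
print(created.name)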

How to know path "gs://bucket1/folder_x" existing or not in GCP bucket

Is there a gsutil command that can tell me whether the path gs://bucket1/folder1_x/folder2_y/ exists or not? Is there a 'ping'-like command in gsutil?
I use Jenkins parameters folder_x and folder_y, whose values are input by the user and joined by the pipeline. Currently, if the directory exists, the pipeline shows success; but if the path is wrong, the pipeline is interrupted and shows failure.
I tried gsutil stat and gsutil -q stat; they can test gs://bucket1/folder1_x/folder2_y/file1, but not a directory.
pipeline {
    stages {
        stage('Check existing dirs') {
            steps {
                script {
                    if (params['Action'] == "List_etl-output") {
                        def Output_Data = "${params['Datasource']}".toString().split(",").collect{ "\"" + it + "\"" }
                        def Output_Stage = "${params['Etl_Output_Stage']}".toString().split(",").collect{ "\"" + it + "\"" }
                        for (folder1 in Output_Data) {
                            for (folder2 in Output_Stage) {
                                sh(script: """
                                    gsutil ls -r gs://bucket1/*/$Data/$Stage
                                """)
                            }
                        }
                    }
                }
            }
        }
    }
}
I used gsutil to check whether the path gs://bucket1/*/$Data/$Stage is available or not. $Data and $Stage are given by user input, and the Jenkins pipeline is interrupted when the path is not available. I want gsutil to skip the wrong path when it is not available.
Directories don't exist in Cloud Storage; they are only a graphical representation. All the blobs are stored at the bucket root, and their names are composed of the full path (the / characters that you interpret as directories are simply part of the name). This is also why you can only search by prefix.
To answer your question, you can use exactly that: search on the prefix. If there is at least 1 element, the 'folder' exists, because there is at least 1 blob with this prefix. Here is an example in Python (I don't know your language; I can adapt it to several languages if you need):
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket1')
if len(list(bucket.list_blobs(prefix='folder_x/'))):
    print('there is a file in the "directory"')
else:
    print('No file with this path, so no "directory"')
Here is the same example in Groovy:
import com.google.cloud.storage.Bucket
import com.google.cloud.storage.Storage
import com.google.cloud.storage.StorageOptions
Storage storage = StorageOptions.getDefaultInstance().service
Bucket bucket = storage.get("bucket1")
System.out.println(bucket.list(Storage.BlobListOption.prefix("folder_x/")).iterateAll().size())
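If you want the Jenkins pipeline to branch instead of being interrupted, a minimal sketch is to wrap the same prefix check in a small script (hypothetical name check_gcs_prefix.py) that only reports an exit status, so the sh step can use returnStatus: true and decide what to do:

# check_gcs_prefix.py (hypothetical name)
import sys
from google.cloud import storage

def prefix_exists(bucket_name: str, prefix: str) -> bool:
    client = storage.Client()
    # max_results=1: we only need to know whether at least one blob matches
    blobs = client.list_blobs(bucket_name, prefix=prefix, max_results=1)
    return any(True for _ in blobs)

if __name__ == "__main__":
    bucket_name, prefix = sys.argv[1], sys.argv[2]
    sys.exit(0 if prefix_exists(bucket_name, prefix) else 1)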

How to input fsx for lustre to Amazon Sagemaker?

I am trying to set up Amazon SageMaker to read our dataset from our Amazon FSx for Lustre file system.
We are using the SageMaker API, and previously we were reading our dataset from S3, which worked fine:
estimator = TensorFlow(
    entry_point='model_script.py',
    image_uri='some-repo:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
estimator.fit({
    'training': f"s3://bucket_name/data/{hyperparameters['dataset']}/"
})
But now that I'm changing the input data source to the FSx for Lustre file system, I'm getting an error saying that the file input should be s3:// or file://. I was following these docs (FSx for Lustre):
estimator = TensorFlow(
    entry_point='model_script.py',
    # image_uri='some-docker:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
fsx_data_folder = FileSystemInput(file_system_id='fs-1',
                                  file_system_type='FSxLustre',
                                  directory_path='/fsx/data',
                                  file_system_access_mode='ro')
estimator.fit(f"{fsx_data_folder}/{hyperparameters['dataset']}/")
Throws the following error:
ValueError: URI input <sagemaker.inputs.FileSystemInput object at 0x0000016A6C7F0788>/dataset_name/ must be a valid S3 or FILE URI: must start with "s3://" or "file://"
Does anyone understand what I am doing wrong? Thanks in advance!
I was (quite stupidly, it was late ;)) treating the FileSystemInput object as a string instead of an object. The error complained that the concatenation of object + string is not a valid URI pointing to a location in S3.
The correct way to do it is to make a FileSystemInput object out of the entire path to the dataset. Note that fit now takes this object and will mount it at data_dir = "/opt/ml/input/data/training".
fsx_data_obj = FileSystemInput(
    file_system_id='fs-1',
    file_system_type='FSxLustre',
    directory_path='/fsx/data/{dataset}',
    file_system_access_mode='ro'
)
estimator.fit(fsx_data_obj)
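Inside model_script.py the mounted channel then shows up as a local directory. A minimal sketch of reading it (assuming the default channel name 'training'; SM_CHANNEL_TRAINING is the environment variable SageMaker sets for that channel):

import os

# SageMaker mounts the FSx directory for the "training" channel at this path
data_dir = os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
print(os.listdir(data_dir))  # dataset files are available as ordinary local files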

How to get URI of a blob in a google cloud storage (Python)

If I have a Blob object how can I get the URI (gs://...)?
The documentation says I can use self_link property to get the URI, but it returns the https URL instead (https://googleapis.com...)
I am using python client library for cloud storage.
Thank you
Since you are not sharing with us how exactly you are trying to achieve this, I did a quick script in Python to get this info.
There is no specific method on Blob to get the URI as gs:// in Python, but you can try to script this by using the path_helper:
import pprint

from google.cloud import storage

def get_blob_URI():
    """Prints out the gs:// URI of a blob."""
    # bucket_name = 'your-bucket-name'
    storage_client = storage.Client()
    bucket_name = 'YOUR_BUCKET'
    blob_name = 'YOUR_OBJECT_NAME'
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    link = blob.path_helper(bucket_name, blob_name)
    pprint.pprint('gs://' + link)
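Alternatively, since the gs:// URI is just the bucket name followed by the blob name, a minimal sketch that builds it directly (bucket and object names are placeholders):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('YOUR_BUCKET')
blob = bucket.blob('YOUR_OBJECT_NAME')
gs_uri = f"gs://{blob.bucket.name}/{blob.name}"  # bucket name + object name
print(gs_uri)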
If you want to use the gsutil tool, you can also get all the gs:// URIs of a bucket with the command gsutil ls gs://bucket.