Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Each process then has to wait for its job to finish before moving on to the next step. To do this, I tried adding the --stream-logs flag to the command above, which streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?

I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ...
}
job_spec = {'jobId': 'your_job_name', 'trainingInput': training_inputs}
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
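Now I can set up a loop that checks the job state every 60 seconds until the job reaches a terminal state:
import time

job_name = '{}/jobs/{}'.format(project_id, job_spec['jobId'])
state = None
# a new job goes through QUEUED and PREPARING before RUNNING, so wait for a terminal state
while state not in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=job_name)
    state = status_req.execute()['state']
    print(state)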

Regarding the error message: you are indeed exceeding a Cloud Logging quota. One option is to request a quota increase.
As for a smarter way to check the status of a job without streaming logs: you can poll it periodically by running gcloud ai-platform jobs describe <job_name>, or write a Python script that checks the status, as explained in the documentation.
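The CLI polling idea could be sketched roughly like this. It is only a sketch: the function names and the 60-second default are my own choices, and it assumes gcloud is installed and authenticated on the machine running it. The get_state and sleep parameters exist only so the loop can be exercised without calling gcloud.

```python
import subprocess
import time

# States after which an AI Platform job will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def describe_state(job_name):
    """Ask gcloud for the job's current state and return it as a string."""
    result = subprocess.run(
        ["gcloud", "ai-platform", "jobs", "describe", job_name,
         "--format=value(state)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_job(job_name, get_state=describe_state, poll_seconds=60,
                 sleep=time.sleep):
    """Block until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state(job_name)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

Each poll is a single describe call against the AI Platform API rather than a log read, so it does not count against the Cloud Logging read quota.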

Related

Unable to drain/cancel Dataflow job. It keeps pending state

Some jobs are stuck in the Pending state and I can't cancel them.
How do I cancel these jobs?
The web console shows this:
"The graph is still being analyzed."
All logs are "No entries found matching current filter."
Job status: "Starting..."
No cancel button has appeared yet.
There are no instances in the Compute Engine tab.
Here is what I did.
I created a streaming job. It was a simple template job, Pub/Sub subscription to BigQuery. I set machineType to e2-micro because it was just a test.
I also tried to drain and cancel the job with gcloud, but that doesn't work.
$ gcloud dataflow jobs drain --region asia-northeast1 JOBID
Failed to drain job [...]: (...): Workflow modification failed. Causes: (...):
Operation drain not allowed for JOBID.
Job is not yet ready for draining. Please retry in a few minutes.
Please ensure you have permission to access the job and the `--region` flag, asia-northeast1, matches the job's
region.
This is the job list:
$ gcloud dataflow jobs list --region asia-northeast1
JOB_ID NAME TYPE CREATION_TIME STATE REGION
JOBID1 pubsub-to-bigquery-udf4 Streaming 2021-02-09 04:24:23 Pending asia-northeast1
JOBID2 pubsub-to-bigquery-udf2 Streaming 2021-02-09 03:20:35 Pending asia-northeast1
...other jobs...
Please let me know how to stop/cancel/delete these streaming jobs.
Job IDs:
2021-02-08_20_24_22-11667100055733179687
2021-02-08_20_24_22-11667100055733179687
WebUI:
https://i.stack.imgur.com/B75OX.png
https://i.stack.imgur.com/LzUGQ.png

In my experience, jobs sometimes get stuck: they keep running, they cannot be cancelled, or their pipeline graph does not show up. The best way to handle this is to leave them in that state, as long as they are not affecting your solution by exceeding the maximum number of concurrent jobs. They will be cancelled automatically or by the Google team, since Dataflow is a Google-managed service.

In the GCP console's Dataflow UI, if you have running Dataflow jobs, you will see a "STOP" button.
Press the STOP button.
Once the job has stopped successfully, its status is updated accordingly. (I was too slow to stop the job on the first try, so I had to test it again. :) )

Cloud Tasks Conditional Execution

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have been completed successfully. So I need some way of reading / being notified of the statuses of Tasks triggered. But I see no way of doing this in GCP's documentation. Using Node.js SDK to create tasks and Cloud Functions as task handlers if at all that helps.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).

Cloud Tasks only triggers tasks; the only condition you can define is a time. You have to code the check yourself when Task C runs.
Here is an example process:
Task A runs and, at the end, writes to Firestore that it has completed
Task B runs and, at the end, writes to Firestore that it has completed
Task C starts and checks in Firestore whether A and B have completed.
If not, the task exits with an error
If yes, it continues the process
You have to configure the retry settings of Task C's queue so that the task is retried on error.
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
There is no other workflow management solution for now.
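The check inside Task C could look roughly like the sketch below. A plain dict stands in for the Firestore document of one run, so the logic can be shown without a database. The field names ("task_a", "task_b"), the "completed" marker and the 503 status code are assumptions made for illustration.

```python
def missing_tasks(run_status, required):
    """Return the required task names that are not marked completed."""
    return [name for name in required if run_status.get(name) != "completed"]


def handle_task_c(run_status, required=("task_a", "task_b")):
    """Return an (HTTP status, body) pair for the Task C handler."""
    missing = missing_tasks(run_status, required)
    if missing:
        # A non-2xx response tells Cloud Tasks the attempt failed, so the
        # task is retried according to the queue's retry configuration.
        return 503, "waiting for: " + ", ".join(missing)
    # The real work of Task C would go here.
    return 200, "task C done"
```

In a real handler, run_status would be read from Firestore using the RunID that groups the tasks of one run.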

Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]

def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)

processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))

isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200:  # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)

if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))

################ TASK B
def doTaskB():
    pass

doTaskB()

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed or missed altogether if the interval between status changes is smaller than the monitoring polling rate
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there is some way to assign hooks or callbacks to run when the job status changes.
I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?

You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that you can narrow the Stackdriver query further. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the Job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
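A rough sketch of that lookup follows. It assumes the recipient is a Pub/Sub-triggered function that receives the message as a dict with a base64-encoded "data" field, and that the sink delivers the log entry as JSON. The function names are made up for this example; the client is the one built with discovery.build('ml', 'v1').

```python
import base64
import json


def job_name_from_event(event):
    """Build 'projects/<project_id>/jobs/<job_id>' from the exported log entry."""
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    labels = entry["resource"]["labels"]
    return "projects/{}/jobs/{}".format(labels["project_id"], labels["job_id"])


def fetch_state(cloudml, job_name):
    """Return the job's state, e.g. 'SUCCEEDED'.

    cloudml is assumed to be a client from
    googleapiclient.discovery.build('ml', 'v1').
    """
    return cloudml.projects().jobs().get(name=job_name).execute()["state"]
```

Asking the API for the state avoids depending on the exact wording of the log message.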
Reading state from the message text
The Pub/Sub message will contain a string telling what happened to the job. You probably want to act when the job ends. Look for these strings in jsonPayload.message:
"Job completed successfully."
"Job cancelled."
"Job failed."

I implemented a Terraform module that does what @htappen described. I hope it helps you. My real hope, though, is that Google adds this feature to AI Platform itself.
https://github.com/sfujiwara/terraform-google-ai-platform-notification

I think you can programmatically publish a Pub/Sub message at the end of your training job code. Something like this:
import json
from google.cloud import pubsub_v1

# publish job complete message
# (args is assumed to be the training script's parsed command-line arguments)
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
# publish() returns a future; wait on it so the message is sent before the job exits
client.publish(topic, data_bytes).result()
Then you can set up a Cloud Function to be triggered by the same Pub/Sub topic.

You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In on_epoch_end, you could send yourself a notification on job progress, and in on_train_end, one when the job finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback

How to retrieve current workers count for job in GCP dataflow using API

Does anyone know if it is possible to get the current worker count for an active job running in GCP Dataflow?
I wasn't able to do it using the API provided by Google.
One thing that I was able to get is CurrentVcpuCount but it is not what I need.
Thanks in advance!

The current number of workers in a Dataflow job is displayed in the message logs, under autoscaling. For example, I ran a quick job and got the following messages when displaying the job logs in my Cloud Shell:
INFO:root:2019-01-28T16:42:33.173Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:02.166Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:05.385Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-01-28T16:43:05.433Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
Now, you can query these messages by using the projects.jobs.messages.list method in the Dataflow API, setting the minimumImportance parameter to JOB_MESSAGE_BASIC.
You will get a response similar to the following:
...
"autoscalingEvents": [
  {...} //other events
  {
    "currentNumWorkers": "1",
    "eventType": "CURRENT_NUM_WORKERS_CHANGED",
    "description": {
      "messageText": "(fcfef6769cff802b): Worker pool started.",
      "messageKey": "POOL_STARTUP_COMPLETED"
    },
    "time": "2019-01-28T16:43:02.130129051Z",
    "workerPool": "Regular"
  },
To build on this, you could write a Python script that parses the response and reads only the currentNumWorkers parameter from the last element of the autoscalingEvents list, which gives the latest (hence the current) number of workers in the job.
Note that if this parameter is not present, it means that the number of workers is zero.
Edit:
I wrote a quick Python script that retrieves the current number of workers from the message logs, using the API I mentioned above:
from google.oauth2 import service_account
import googleapiclient.discovery

credentials = service_account.Credentials.from_service_account_file(
    filename='PATH-TO-SERVICE-ACCOUNT-KEY/key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
service = googleapiclient.discovery.build(
    'dataflow', 'v1b3', credentials=credentials)

project_id = "MY-PROJECT-ID"
job_id = "DATAFLOW-JOB-ID"
messages = service.projects().jobs().messages().list(
    projectId=project_id,
    jobId=job_id
).execute()

try:
    print("Current number of workers is " + messages['autoscalingEvents'][-1]['currentNumWorkers'])
except (KeyError, IndexError):
    # no autoscaling events, or no currentNumWorkers field: treat as zero workers
    print("Current number of workers is 0")
A couple of notes:
The scopes are the permissions needed on the service account key you are referencing (in the from_service_account_file function) in order to call the API. This line is needed to authenticate to the API. You can use any one of the scopes in this list; to keep things easy on my side, I just used a service account key with project/owner permissions.
If you want to read more about the Python API Client Libraries, check this documentation and these samples.
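The parsing step can also be pulled out into a small helper that works on the response dict alone, which makes it easy to try without calling the API. This is only a sketch; the helper name is my own, and it follows the rule stated above that a missing currentNumWorkers means zero workers.

```python
def current_num_workers(response):
    """Return the worker count from the last autoscaling event, or 0 if absent."""
    events = response.get("autoscalingEvents", [])
    if not events:
        return 0
    # The API returns the count as a string such as "1", so convert it.
    return int(events[-1].get("currentNumWorkers", 0))
```

With the script above, this would be called as current_num_workers(messages).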

Airflow: need advice when running a lot of instances per task

This is my first post on Stack, and it is about Airflow. I need to implement a DAG which will:
1/ Download files from an API
2/ Upload them into Google Cloud Storage
3/ Insert them into BigQuery
The thing is that step 1 involves calling about 170 accounts. If any error is raised during the download, I want my DAG to automatically retry from the failed step. Therefore I implemented a loop around my tasks, like this:
dag = DAG('my_dag', default_args=DEFAULT_ARGS)
for account in accounts:
    t1 = PythonOperator(task_id='download_file_' + account['id'],
                        python_callable=download_files(account),
                        dag=my_dag)
    t2 = FileToGoogleCloudStorageOperator(task_id='upload_file_' + account['id'],
                                          google_cloud_storage_conn_id='gcs_my_conn',
                                          src='file_' + account['id'] + '.json',
                                          bucket='my_bucket',
                                          dag=my_dag)
    t3 = GoogleCloudStorageToBigQueryOperator(task_id='insert_bq',
                                              bucket='my_bucket',
                                              google_cloud_storage_conn_id='gcs_my_conn',
                                              bigquery_conn_id='bq_my_conn',
                                              src='file_' + account['id'],
                                              destination_project_dataset_table='my-project:my-dataset.my-table',
                                              source_format='NEWLINE_DELIMITED_JSON',
                                              dag=my_dag)
    t2.set_upstream(t1)
    t3.set_upstream(t2)
So in the UI, I see about 170 instances of each task. When I run the DAG manually, Airflow does nothing as far as I can see. The DAG doesn't initialize or queue any task instances. I guess this is due to the number of instances involved, but I don't know how to work around it.
How should I manage so many task instances?
Thanks,
Alex

How are you running airflow currently? Are you sure the airflow scheduler is running?
You can also run airflow list_dags to ensure the dag can be compiled. If you are running airflow using Celery you should take care that your dag shows up using list_dags on all nodes running airflow.

Alex, it would be easier to post here. I saw you have DEFAULT_ARGS with retries, which applies at the DAG level; you can also set retries at the task level. The setting lives in BaseOperator, and since every operator inherits from BaseOperator, you can use it on any of them. You can find more detail here: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/python_operator.py and https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1864. If you check BaseOperator in models.py, it has retries and retry_delay, so you can do something like this:
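from datetime import timedelta

t1 = PythonOperator(task_id='download_file_' + account['id'],
                    # pass the function itself; download_files(account) would run it at DAG parse time
                    python_callable=download_files,
                    op_args=[account],
                    retries=3,
                    retry_delay=timedelta(seconds=300),
                    dag=my_dag)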
