Unable to drain/cancel Dataflow job, it stays in a pending state - google-cloud-platform

Some jobs remain in a pending state and I can't cancel them.
How do I cancel these jobs?
The web console shows the following:
"The graph is still being analyzed."
All logs show "No entries found matching current filter."
Job status: "Starting..."
No cancel button has appeared yet.
There are no instances in the Compute Engine tab.
Here is what I did:
I created a streaming job from a simple template (Pub/Sub subscription to BigQuery). I set machineType to e2-micro because it was just a test.
I also tried to drain and cancel the jobs with gcloud, but it doesn't work.
$ gcloud dataflow jobs drain --region asia-northeast1 JOBID
Failed to drain job [...]: (...): Workflow modification failed. Causes: (...):
Operation drain not allowed for JOBID.
Job is not yet ready for draining. Please retry in a few minutes.
Please ensure you have permission to access the job and the `--region` flag, asia-northeast1, matches the job's
region.
This is the jobs list:
$ gcloud dataflow jobs list --region asia-northeast1
JOB_ID NAME TYPE CREATION_TIME STATE REGION
JOBID1 pubsub-to-bigquery-udf4 Streaming 2021-02-09 04:24:23 Pending asia-northeast1
JOBID2 pubsub-to-bigquery-udf2 Streaming 2021-02-09 03:20:35 Pending asia-northeast1
...other jobs...
Please let me know how to stop/cancel/delete these streaming jobs.
Job IDs:
2021-02-08_20_24_22-11667100055733179687
WebUI:
https://i.stack.imgur.com/B75OX.png
https://i.stack.imgur.com/LzUGQ.png

In my personal experience, some instances occasionally get stuck: they keep running, they cannot be canceled, or the graphical Dataflow pipeline never appears. The best way to handle this kind of issue is to leave them in that state, unless it impacts your solution by exceeding the maximum number of concurrent runs. They will be canceled automatically or by the Google team, since Dataflow is a Google-managed service.

In the GCP console's Dataflow UI, if you have running Dataflow jobs, you will see a "STOP" button, just like in the image below.
Press the STOP button.
When you successfully stop your job, you will see a status like the one below. (I was too slow to stop the job on the first try, so I had to test it again. :) )
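If the STOP/Cancel button never appears in the console, one more thing you could try, purely as a sketch and with no guarantee that a job stuck in "Pending" will accept the state change, is to request cancellation through the Dataflow REST API with the google-api-python-client. The project ID below is a placeholder; the region and job ID are the ones from the question.
# Sketch only: ask the Dataflow API to cancel the job.
# Assumes google-api-python-client is installed and application-default credentials are set up.
from googleapiclient import discovery

project = 'your-project'   # placeholder: your GCP project ID
region = 'asia-northeast1'
job_id = '2021-02-08_20_24_22-11667100055733179687'

dataflow = discovery.build('dataflow', 'v1b3')
request = dataflow.projects().locations().jobs().update(
    projectId=project,
    location=region,
    jobId=job_id,
    body={'requestedState': 'JOB_STATE_CANCELLED'},
)
print(request.execute())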

Related

Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Then it has to wait for the job to finish before moving on to the next step. To do this, I tried adding the --stream-logs parameter to the command above. That way, it streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery

# Describe the training job
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ... remaining training inputs ...
}
job_spec = {'jobId': 'your_job_name', 'trainingInput': training_inputs}
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

# Build the AI Platform client and submit the job
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
Now I can set up a loop that checks the job state every 60 seconds:
import time

# Poll until the state changes from RUNNING
job_name = job_spec['jobId']
state = 'RUNNING'
while state == 'RUNNING':
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=f'{project_id}/jobs/{job_name}')
    state = status_req.execute()['state']
    print(state)
Regarding the error message you are experiencing, you are indeed exceeding a Cloud Logging quota; what you can do is request a quota increase.
On the other hand, regarding a smarter way to check the status of a job without streaming logs, you can check the status once in a while by running gcloud ai-platform jobs describe <job_name> or create a Python script to check the status, as explained in the documentation.
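As a hedged extension of the loop above (a sketch, not an official helper): AI Platform jobs also report states such as QUEUED and PREPARING before RUNNING, so a loop that only waits while the state is RUNNING can exit too early. Polling until the job reaches a terminal state is more robust; wait_for_job below is a hypothetical name.
import time

# Sketch: block until the job reaches a terminal state.
# Reuses the cloudml, project_id and job_spec objects defined above.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}

def wait_for_job(cloudml, project_id, job_id, poll_seconds=60):
    while True:
        job = cloudml.projects().jobs().get(
            name='{}/jobs/{}'.format(project_id, job_id)).execute()
        print(job['state'])
        if job['state'] in TERMINAL_STATES:
            return job
        time.sleep(poll_seconds)

final_job = wait_for_job(cloudml, project_id, job_spec['jobId'])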

How to clear aws batch job history in dashbord

In the AWS Batch Job queues dashboard, it shows the failed and succeeded job counts for the last 24 hours. Is it possible to reset the counter to zero?
No, it's not possible to clear jobs. Batch keeps finished jobs around for at least a day (and in my experience occasionally up to a few weeks), and there's no API or console mechanism to accelerate the process.

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed or missed altogether if the interval between status changes is smaller than the monitoring polling rate
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there is some way to assign hooks or callbacks to run when the job status changes.
I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?
You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that Stackdriver lets you narrow the query further. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message.
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the Job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
Reading state from the message text
The Pub/Sub message will contain a string telling you what happened to the job. You probably want to react when the job ends. Look for these strings in jsonPayload.message:
"Job completed successfully."
"Job cancelled."
"Job failed."
I implemented a Terraform module as @htappen said. I'd be happy if it helps you. But my real hope is that Google updates AI Platform with the same feature.
https://github.com/sfujiwara/terraform-google-ai-platform-notification
I think you can programmatically publish a PubSub message at the end of your training job code. Something like this:
import json
from google.cloud import pubsub_v1

# publish a job-complete message (args comes from the training job's argument parser)
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
client.publish(topic, data_bytes)
Then you can set up a Cloud Function to be triggered by the same Pub/Sub topic.
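A possible shape for that Cloud Function, as a sketch only (the ACTION and SAVED_MODEL_DIR keys simply mirror the payload published above):
import base64
import json

# Sketch of a Cloud Function subscribed to the same topic as the snippet above.
def on_job_complete(event, context):
    data = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    if data.get('ACTION') == 'JOB_COMPLETE':
        # e.g. kick off deployment of the model saved under SAVED_MODEL_DIR
        print('Training finished, model saved to', data['SAVED_MODEL_DIR'])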
You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In on_epoch_end, you can send yourself a notification on job progress, and in on_train_end, another one when training finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback
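For example, a minimal sketch of that idea; the notify() helper is hypothetical, standing in for whatever Pub/Sub publish, webhook call, or email you use:
import tensorflow as tf

def notify(text):
    # Hypothetical helper: publish to Pub/Sub, call a webhook, send an email, etc.
    print(text)

progress_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: notify(
        'epoch {} done, loss={:.4f}'.format(epoch, (logs or {}).get('loss', float('nan')))),
    on_train_end=lambda logs: notify('training finished'),
)

# model.fit(x, y, epochs=10, callbacks=[progress_callback])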

Google Cloud Scheduler to start a task after a specific time every day, but only if a Pub/Sub message arrives

Is it possible to achieve interoperability between a scheduler and a pub/sub in the Google Cloud, so that a task is triggered after a specific time every day, but only if a message arrives?
UPDATED:
An example would be a task scheduled for 10:00 am that waits for a msg (a prerequisite).
At 10:00 the msg has not arrived. The job is not triggered. The msg arrives at 11:00. The job is triggered. (It can then send a msg to start the task to be executed)
At 09:00 the msg arrives. The job is not executed. At 10:00 the job is triggered.
The msg never arrives. The job is never executed.
Your puzzle seems to be an excellent match for using Cloud Tasks. At a high level, I would imagine you writing a Cloud Function that subscribes to the topic that is being published upon. The Cloud Function would contain your processing logic:
Received after 10:00am, run your job immediately.
Received before 10:00am, use Cloud Tasks to post a task to run your job at 10:00am.
... and that's it.
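A rough sketch of that Cloud Function, assuming a Pub/Sub-triggered function, an existing Cloud Tasks queue, and an HTTP target that runs the job; every name below is a placeholder, and the 10:00 cutoff is computed in UTC for simplicity:
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

PROJECT = 'your-project'                 # placeholder: GCP project ID
LOCATION = 'asia-northeast1'             # placeholder: Cloud Tasks queue location
QUEUE = 'daily-job-queue'                # placeholder: existing Cloud Tasks queue
JOB_URL = 'https://example.com/run-job'  # placeholder: HTTP target that runs the job

def handle_prerequisite_msg(event, context):
    # Triggered by the prerequisite Pub/Sub message.
    now = datetime.datetime.utcnow()
    ten_am = now.replace(hour=10, minute=0, second=0, microsecond=0)

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(PROJECT, LOCATION, QUEUE)
    task = {'http_request': {'http_method': tasks_v2.HttpMethod.POST, 'url': JOB_URL}}

    if now < ten_am:
        # Message arrived before 10:00 -> schedule the task for 10:00.
        ts = timestamp_pb2.Timestamp()
        ts.FromDatetime(ten_am)
        task['schedule_time'] = ts
    # Otherwise leave schedule_time unset and the task runs immediately.
    client.create_task(parent=parent, task=task)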
Google's recommended practice is to use Google Cloud Composer for such tasks.
You can use Cloud Composer for a variety of use cases, including batch processing, real-time/stream processing, and cron job / scheduled task style processing.
https://cloud.google.com/composer/
Under the hood, Composer runs Apache Airflow on a managed GKE cluster. So it is not only an orchestration tool, it also gives you the ability to run code using DAGs (which is essentially a cloud function). Have a look at some example DAG triggers below:
https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf
So essentially if you create a conditional DAG trigger then it should do the trick.
Hope this helps.

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings I see that a few tasks failed during the night without any apparent reason.
When checking the logs in the UI, I see nothing, and I see no logs either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that those tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a better machine (with more memory).
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.