Get status of scheduler job from python - google-cloud-platform

I have a scheduled job running on Cloud Scheduler, and I would like to get its status ("Success", "Failed") from Python. There is a Python client for Cloud Scheduler, but I can't find documentation on how to get the status.

You can get the status with the library like this:
from google.cloud.scheduler import CloudSchedulerClient
client = CloudSchedulerClient()
print(client.list_jobs(parent="projects/PROJECT_ID/locations/LOCATION"))
I chose list_jobs, but you can also use get_job.
In the JSON object that you receive, there is a status field. If it is empty (meaning no error), the latest call succeeded. If not, the latest call failed and the field contains the gRPC error code.
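For example, a minimal sketch using get_job to check the result of the last execution (PROJECT_ID, LOCATION and JOB_ID are placeholders for your own values):
from google.cloud.scheduler import CloudSchedulerClient

client = CloudSchedulerClient()
job = client.get_job(name="projects/PROJECT_ID/locations/LOCATION/jobs/JOB_ID")

# job.status holds the result of the last attempted execution:
# an empty status (code 0) means success, otherwise it contains the gRPC error code.
if job.status.code == 0:
    print("Last execution succeeded")
else:
    print(f"Last execution failed with gRPC code {job.status.code}: {job.status.message}")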

Related

How to use logger in DAG callbacks with Airflow running on Google Composer?

We are running Apache Airflow in a Google Cloud Composer environment. This runs a pre-built Airflow on Kubernetes; our image version is composer-2.0.32-airflow-2.3.4.
In my_dag.py, we can use the logging module to log something, and the output is visible under "Logs" in Cloud Composer.
import logging
log = logging.getLogger("airflow")
log.setLevel(logging.INFO)
log.info("Hello Airflow logging!")
However, when using the same logger in a callback (e.g. on_failure_callback of a DAG), the log lines do not appear anywhere - not in the Airflow worker logs, nor in the airflow-scheduler or dag-processor-manager logs. I am triggering a DAG failure by setting a short (e.g. 5 minute) timeout, and I confirmed that the callback is indeed running by making an HTTP request to a webhook inside the callback. The webhook is called, but the logs are nowhere to be found.
Is there a way to log something in a callback, and find the logs somewhere in Airflow?
Unfortunately, logs written in the on_failure_callback method do not appear in the DAG task logs (webserver), but normally they are written to Cloud Logging.
In Cloud Logging, select the Cloud Composer Environment resource, then the location (europe-west1) and, finally, the name of the composer environment: composer-log-error-example.
Then select the airflow-worker logs.
Also, for logging in Airflow DAGs and in methods called by on_failure_callback, I usually use the Python logging module directly, without any other initialization, and it works well:
import logging

def task_failure_alert(context):
    # This message ends up in the airflow-worker logs in Cloud Logging.
    logging.info("Hello Airflow logging!")

Sparkmagic+livy on EMR:Invalid status code '500' from <> with error payload: "java.lang.NullPointerException"

When trying to create a pyspark session via sparkmagic+livy, it suddenly returns
Invalid status code '500' from https:<the livy server>:8998 with error payload: "java.lang.NullPointerException"
The same configs worked just a few hours ago, and I also tried a minimal session creation; the result is the same.
I tried to launch a session via the REST API (instead of the sparkmagic notebook), and the result is the same: java.lang.NullPointerException
Queries to the Livy endpoint, for example getting the sessions list, work fine in both curl and the Python requests library, so I know the Livy server is up and running. Why could a NullPointerException be returned? Are there any Livy (or other) logs I can check?
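For reference, a minimal sketch of the two REST calls described above (the host is a placeholder; auth and TLS options will depend on your cluster):
import requests

LIVY_URL = "https://<the-livy-server>:8998"        # placeholder endpoint
HEADERS = {"Content-Type": "application/json"}

# Listing sessions -- the call that reportedly still works
resp = requests.get(f"{LIVY_URL}/sessions", headers=HEADERS)
print(resp.status_code, resp.json())

# Creating a minimal PySpark session -- the call that returns the 500 / NullPointerException
resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}, headers=HEADERS)
print(resp.status_code, resp.text)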

Google Cloud Composer Airflow sqlalchemy OperationalError causing DAG to hang forever

I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently; we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The reason behind the issue is that SQLAlchemy uses a session per thread and creates a callable session that can be reused later in the Airflow code. If there are significant delays between queries within a session, MySQL may close the connection; the connection timeout is set to approximately 10 minutes.
Solutions:
- Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function. A sketch is shown after this list.
- Do not use a single long-running function. Instead, move all database queries to separate functions, so that there are multiple functions with the airflow.utils.db.provide_session decorator. In this case, sessions are automatically closed after retrieving query results.
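A minimal sketch of the provide_session pattern mentioned above (the query is illustrative; on Airflow 2.x the decorator is imported from airflow.utils.session instead of airflow.utils.db):
from airflow.models import TaskInstance
from airflow.utils.db import provide_session   # airflow.utils.session.provide_session on Airflow 2.x

@provide_session
def count_failed_task_instances(dag_id, session=None):
    # `session` is injected by the decorator and closed when the function returns,
    # so the connection is not held open long enough for MySQL to drop it.
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.state == "failed")
        .count()
    )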

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed or missed altogether if the interval between status changes is shorter than the polling interval
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there is some way to assign hooks or callbacks to run when the job status changes.
I've also considered adding calls to cloud functions directly within the training task python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shutdown unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?
You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that you can narrow the query further in Stackdriver. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message.
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
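A minimal sketch of that lookup, assuming the google-api-python-client library and Application Default Credentials (the function name is illustrative):
from googleapiclient import discovery

def get_job_state(project_id, job_id):
    # project_id and job_id come from the Pub/Sub message's resource labels.
    ml = discovery.build("ml", "v1", cache_discovery=False)
    name = f"projects/{project_id}/jobs/{job_id}"
    job = ml.projects().jobs().get(name=name).execute()
    return job["state"]   # e.g. "SUCCEEDED", "FAILED", "CANCELLED"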
Reading state from the message text
The Pub/Sub message will contain a string telling what happened to the job. You probably want to act when the job ends. Look for these strings in jsonPayload.message:
"Job completed successfully."
"Job cancelled."
"Job failed."
I implemented a Terraform module, as @htappen said. I hope it helps you. But my real hope is that Google updates AI Platform with the same feature.
https://github.com/sfujiwara/terraform-google-ai-platform-notification
I think you can programmatically publish a PubSub message at the end of your training job code. Something like this:
import json

from google.cloud import pubsub_v1

# publish a job-complete message (args comes from the training job's argument parser)
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
client.publish(topic, data_bytes)
Then you can setup a cloud function to be triggered by the same pubsub topic.
You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In the on_epoch_end callback you can send yourself a notification on job progress, and in on_train_end a notification when training finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback
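A minimal sketch of that workaround (notify is a hypothetical helper that would publish to Pub/Sub, call a webhook, etc.):
import tensorflow as tf

def notify(text):
    print(text)   # replace with your actual notification mechanism

progress_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: notify(f"epoch {epoch} done, loss={logs['loss']:.4f}"),
    on_train_end=lambda logs: notify("training finished"),
)

# model.fit(x_train, y_train, epochs=10, callbacks=[progress_callback])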

Can Control-M execute a http service endpoint to GET job status?

I am very new to control-m and wanted to ask if control-m supports this scenario:
We have an HTTP web service that runs a long-running job, e.g.
http://myserver/runjob?jobname=A
This will then start job A on the server and return a job id. I return the job id so I can get the status of the job from the server whenever I want to. The job has many statuses, e.g. Waiting, In progress, Error.
I want the Control-M job status to be updated as soon as the job on the server updates. For that, I have created a web service URL:
http://localhost/getjobsatus?jobid=1
This URL request will get the status of job id 1.
Can Control-M poll a web service URL for a job status, and can I call a web service to run a job and get its id back?
Apologies for asking this basic level question. Any help will be really appreciated.
Welcome to the Control-M community :-)
You can implement 2 Control-M WebServices jobs (available with BPI – Business Process Integration Suite), one to submit your job and get its ID, and one to track its status.
Alternatively you can implement this in 1 Control-M OS type job using the ctmsubmit command inside a script…
Feel free to join our Control-M online community