Propagating error messages in Google Cloud Platform (GCP)

I am building a near-real-time service. The input is a Cloud Storage bucket and blob path to a photo image. This horizontally scalable service is made up of multiple components, including ML models running on k8s and Google Cloud Functions, each of which can fail for a variety of reasons. The ML models are independent and run in parallel. Each component is triggered by a Pub/Sub push subscription on a topic unique to that component. Running the entire flow for one photo may take 15 seconds.
I want to return a meaningful error message to the service requester telling it which component failed, if there is a failure. Essentially, I want to report which image failed and where it failed.
What is the recommended practice for returning an error back to the requester?

There is no built-in service for this. But because you already use Pub/Sub for asynchronous calls, I propose using it to push the error back as well.
You can do this in two flavors.
First, create a Pub/Sub topic for the errors, let's say 'error_topic'.
1. Without message customization
In the Pub/Sub message, the requester identifies itself in an attribute (let's say an attribute named 'requester').
In the consumer service, if an error occurs, return an error code (500, for example) for a push subscription, or a NACK for a pull subscription.
Configure the Pub/Sub subscription with a retry policy and a dead-letter topic (the dead-letter topic being 'error_topic').
Then create one subscription per requester on 'error_topic' (use the filter capability for this) and consume the messages in the requester services, as sketched below.
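A rough sketch of that setup with the Python client library (the component topic and subscription names here are hypothetical, not part of the answer; dead-lettering also requires the Pub/Sub service account to have publish rights on 'error_topic'):

from google.cloud import pubsub_v1

PROJECT = "<your_project_id>"

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()

component_topic = publisher.topic_path(PROJECT, "ml-component-topic")  # hypothetical
error_topic = publisher.topic_path(PROJECT, "error_topic")

# Subscription feeding the component, retried a few times before dead-lettering to error_topic
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(PROJECT, "ml-component-sub"),
        "topic": component_topic,
        "dead_letter_policy": {
            "dead_letter_topic": error_topic,
            "max_delivery_attempts": 5,
        },
    }
)

# One filtered subscription per requester on the error topic
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(PROJECT, "error-sub-requester-a"),
        "topic": error_topic,
        "filter": 'attributes.requester = "requester-a"',
    }
)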
2. With message customization
In the Pub/Sub message, the requester identifies itself in an attribute (let's say an attribute named 'requester').
The consumer service that raises the error creates a new message with custom information and copies the 'requester' attribute value into an attribute of the message published to 'error_topic' (let's say an attribute named 'original_requester').
Then create one subscription per requester on 'error_topic' (use the filter capability for this) and consume the messages in the requester services.
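For this second flavor, the failing component could publish the custom error message itself. A hedged sketch (the payload fields and helper name are illustrative, not part of the answer):

import json
from google.cloud import pubsub_v1

PROJECT = "<your_project_id>"
publisher = pubsub_v1.PublisherClient()
error_topic = publisher.topic_path(PROJECT, "error_topic")

def report_failure(original_attributes, component_name, image_path, reason):
    """Publish a custom error message, propagating the requester identity.

    original_attributes: dict of attributes from the incoming Pub/Sub message.
    """
    payload = json.dumps({
        "failed_component": component_name,
        "image": image_path,
        "reason": reason,
    }).encode("utf-8")
    publisher.publish(
        error_topic,
        payload,
        original_requester=original_attributes.get("requester", "unknown"),
    )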

Related

Google Cloud Pub/Sub retrieve message by ID

Problem: My use case is that I want to publish thousands of messages to Google Cloud Pub/Sub with a 5-minute retention period, but only retrieve specific messages by their ID. So a Cloud Function will retrieve one message by ID using the Node.js SDK, and all the untreated messages will be deleted by the retention policy. All the current examples I have found handle arbitrary messages from the subscriber.
Is it possible to just pull one message by ID (or any other metadata) and close the connection?
There is no way to retrieve individual messages by ID, no. It doesn't really fit the expected use cases for Cloud Pub/Sub, where publishers and subscribers are meant to be decoupled, meaning the subscriber inherently doesn't know the message IDs prior to receiving the messages.
You may instead want to transmit the messages via whatever mechanism you are using to make the subscribers aware of the message IDs. Or, if you know at publish time which messages will ultimately need to be retrieved, you could add an attribute to the message to indicate this and use filtering.
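For example, a hedged sketch of the attribute-plus-filter approach with the Python client library (the attribute, subscription, and correlation ID names are made up):

from google.cloud import pubsub_v1

PROJECT = "<your_project_id>"
TOPIC = "<your_topic_name>"

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

# Publisher side: tag the messages you know you will need to retrieve later
publisher.publish(topic_path, b"payload", needs_retrieval="true", correlation_id="order-123")

# Subscriber side: a subscription that only ever receives the tagged messages
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(PROJECT, "tagged-messages-sub"),
        "topic": topic_path,
        "filter": 'attributes.needs_retrieval = "true"',
    }
)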

How should I handle asynchronous processes that occur after API calls in AWS?

I'm designing the backend for a website that uses API Gateway and Lambda to handle API requests, many of which target a MySQL DB on RDS. Some processes need to happen asynchronously but I'm debating which is best practice or cleaner.
In the given scenario, every time a user creates a new row in a certain table, let's say an email also needs to be sent asynchronously. There are many other scenarios similar to this but this will set precedent.
Option 1: In the Lambda that handles the API request, first write to the MySQL instance to add the new row. When the response from MySQL comes back successful, write to something like SQS, which will later be read by another Lambda that sends an email. When SQS confirms the record was added to the queue, send a 201 response saying the REST API call was successful.
Option 2: In the Lambda that handles the API request, write to the MySQL instance to add the new row. When the response from MySQL comes back successful, send a 201 response saying the REST API call was successful. Then set up a DMS (Data Migration Service) task that runs indefinitely to send database modification binlogs to a Kinesis stream, which triggers a Lambda that handles all DB changes, reads the change as a new row in a certain table, and sends an email.
Option 1:
less infrastructure
more direct tracking of logic from an API call
one extra HTTP call (to SQS), delaying response times for an API backing a web page
Option 2:
more infrastructure (dms task, replication instance)
scaling out shards may mean losing ordering when processing binlog events, and ordering is a requirement here
side question: are you able to choose the hash key for Kinesis for DMS tasks from MySQL?
a single codebase for reacting to all modifications in the DB may actually make following logic in code simpler
Is this the tradeoff or am I missing something? What is best practice in this scenario?
Option 1 in my view seems most logical, but I would replace SQS and the second Lambda with SNS. So, a modified Option 1 could be:
Option 1: In the Lambda that handles the API request, first write to the MySQL instance to add the new row. When the response from MySQL comes back successful, publish a confirmation message to SNS, which sends the email. When the response from SNS is successful, send a 201 response saying the REST API call was successful.
This should be faster, cheaper, and easier to implement than using SQS and a second Lambda for sending the email.
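A rough sketch of that Lambda handler in Python with boto3 (the environment variable, payload shape, and handler name are placeholders, and the MySQL insert is elided):

import json
import os
import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["EMAIL_TOPIC_ARN"]  # hypothetical environment variable

def create_row_handler(event, context):
    payload = json.loads(event["body"])

    # ... insert the new row into MySQL here; raise on failure so no email is triggered ...

    # Publish the confirmation; SNS fans out to email (or to a subscribed email-sending Lambda)
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"action": "ROW_CREATED", "row": payload}),
    )

    return {"statusCode": 201, "body": json.dumps({"status": "created"})}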

Move Google Pub/Sub Messages Between Topics

How can I bulk move messages from one topic to another in GCP Pub/Sub?
I am aware of the Dataflow templates that provide this; however, restrictions unfortunately do not allow me to use the Dataflow API.
Any suggestions on ad-hoc movement of messages between topics (besides copying and pasting them one by one)?
Specifically, the use case is for moving messages in a deadletter topic back into the original topic for reprocessing.
You can't use snapshots, because snapshots can only be applied to subscriptions of the same topic (to avoid message ID overlapping).
The easiest way is to write a function that pulls from your subscription. Here is how I would do it:
Create a topic (named, for example, "transfer-topic") with a push subscription. Set the acknowledgement deadline to 10 minutes.
Create an HTTP-triggered Cloud Function (or a Cloud Run service) and use it as the endpoint of that push subscription. When you deploy it, set the timeout to 9 minutes for Cloud Functions and to 10 minutes for Cloud Run. The processing is the following:
Read a chunk of messages (for example 1,000) from the dead-letter pull subscription
Publish the messages (in bulk mode) to the initial topic
Acknowledge the messages on the dead-letter subscription
Repeat until the pull subscription is empty
Return code 200.
The overall process:
Publish a message to the transfer-topic
The message triggers the function/Cloud Run service with an HTTP push
The process pulls the messages and republishes them to the initial topic
If the timeout is reached, the function crashes and Pub/Sub retries the HTTP request (with exponential backoff).
If all the messages are processed, an HTTP 200 response code is returned and the process stops (and the message in the transfer-topic subscription is acked).
This process lets you handle a very large number of messages without worrying about the timeout.
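A minimal sketch of that function body in Python, assuming hypothetical names ("dead-letter-sub" for the dead-letter pull subscription, "initial-topic" for the original topic) and the google-cloud-pubsub client library:

from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT = "<your_project_id>"
DEAD_LETTER_SUB = "dead-letter-sub"   # hypothetical subscription name
INITIAL_TOPIC = "initial-topic"       # hypothetical topic name

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()

def drain_dead_letter(request):
    """HTTP entry point, used as the push endpoint of the transfer-topic subscription."""
    sub_path = subscriber.subscription_path(PROJECT, DEAD_LETTER_SUB)
    topic_path = publisher.topic_path(PROJECT, INITIAL_TOPIC)
    while True:
        # 1. Read a chunk of messages from the dead-letter pull subscription
        try:
            response = subscriber.pull(
                request={"subscription": sub_path, "max_messages": 1000},
                timeout=30.0,
            )
        except exceptions.DeadlineExceeded:
            break  # nothing left to pull
        if not response.received_messages:
            break  # the subscription is empty
        # 2. Republish them to the initial topic and wait for the publishes to complete
        futures = [
            publisher.publish(topic_path, m.message.data, **dict(m.message.attributes))
            for m in response.received_messages
        ]
        for f in futures:
            f.result()
        # 3. Acknowledge the republished messages on the dead-letter subscription
        subscriber.acknowledge(
            request={
                "subscription": sub_path,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )
    # 4. Everything processed: ack the transfer-topic message by returning 200
    return ("", 200)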
I suggest that you use a Python script for that.
You can use the Pub/Sub Python client library to read the messages and publish them to another topic, like below:
from google.cloud import pubsub_v1

# Defining parameters
PROJECT = "<your_project_id>"
SUBSCRIPTION = "<your_current_subscription_name>"
NEW_TOPIC = "projects/<your_project_id>/topics/<your_new_topic_name>"

# Creating clients for publishing and subscribing.
# Adjust max_messages in the batch settings for your purpose.
subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient(
    batch_settings=pubsub_v1.types.BatchSettings(max_messages=500),
)

# Get your messages. Adjust max_messages for your purpose.
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 500}
)

# Publish your messages to the new topic and wait for the publishes to complete.
publish_futures = [
    publisher.publish(NEW_TOPIC, msg.message.data)
    for msg in response.received_messages
]
for future in publish_futures:
    future.result()

# Ack the old subscription if necessary.
ack_ids = [msg.ack_id for msg in response.received_messages]
subscriber.acknowledge(
    request={"subscription": subscription_path, "ack_ids": ack_ids}
)
Before running this code you will need to install the Pub/Sub client library in your Python environment. You can do that by running pip install google-cloud-pubsub.
One approach to executing your code is using Cloud Functions. If you decide to use it, pay attention to two points:
The maximum time your function can take to run is 9 minutes. If this timeout is exceeded, your function will terminate without finishing the job.
In Cloud Functions you can just put google-cloud-pubsub on a new line of your requirements file instead of running a pip command.

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed or missed altogether if the interval between status changes is smaller than the monitoring polling rate
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there were some way to assign hooks or callbacks that run when the job status changes.
I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?
You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other services:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that Stackdriver can narrow the query down further. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
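If you prefer to create the sink programmatically, here is a hedged sketch with the google-cloud-logging client (the sink and topic names are placeholders; the sink's writer identity still needs publish rights on the topic):

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="<your_project_id>")

log_filter = 'resource.type = "ml_job" AND resource.labels.task_name = "service"'
destination = "pubsub.googleapis.com/projects/<your_project_id>/topics/<your_topic_name>"

sink = client.sink("ai-platform-job-events", filter_=log_filter, destination=destination)
if not sink.exists():
    sink.create()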
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message.
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
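A small sketch of that call with the google-api-python-client library (the helper name is mine, not part of the answer):

from googleapiclient import discovery

def get_job_state(project_id, job_id):
    """Return the AI Platform job state, e.g. SUCCEEDED, FAILED or CANCELLED."""
    ml = discovery.build("ml", "v1", cache_discovery=False)
    name = "projects/{}/jobs/{}".format(project_id, job_id)
    job = ml.projects().jobs().get(name=name).execute()
    return job["state"]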
Reading state from the message text
The Pub/Sub message will contain a string telling what happened to the job. You probably only care about when the job ends. Look for these strings in jsonPayload.message:
"Job completed successfully."
"Job cancelled."
"Job failed."
I implemented a Terraform module along the lines of what @htappen said. I'm happy if it helps you. But my real hope is that Google updates AI Platform with the same feature.
https://github.com/sfujiwara/terraform-google-ai-platform-notification
I think you can programmatically publish a Pub/Sub message at the end of your training job code. Something like this:
import json

from google.cloud import pubsub_v1

# publish a job-complete message ('args' comes from the training job's argument parser)
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
client.publish(topic, data_bytes)
Then you can set up a Cloud Function to be triggered by the same Pub/Sub topic.
You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In on_epoch_end you could send yourself a notification on job progress, and in on_train_end a notification when it finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback
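A minimal sketch of that approach (the notify helper is a stand-in for whatever notification mechanism you use, such as a Pub/Sub publish):

import tensorflow as tf

def notify(text):
    # Replace with your own notification mechanism (Pub/Sub publish, email, chat webhook, ...)
    print(text)

progress_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: notify(
        "Epoch {} finished, loss={:.4f}".format(epoch, logs["loss"])
    ),
    on_train_end=lambda logs: notify("Training finished."),
)

# model.fit(x_train, y_train, epochs=10, callbacks=[progress_callback])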

Monitoring the status of a google pub/sub submitted job

I am new to the Google Compute/Google App Engine platform. I am currently migrating a Python Flask application that uses Celery for async tasks to Google Compute/Google App Engine. However, the docs say I should use Google Pub/Sub instead of Celery. In my application, whenever I run an async task, I have a page to monitor the status of the job, using the same principle as http://blog.miguelgrinberg.com/post/using-celery-with-flask. I have checked the documentation for Google Pub/Sub, but I am at a loss how to implement the same thing with it. Can anybody help or point me in the right direction?
You might be able to use psq for this, which is designed to look like celery. From a general Cloud Pub/Sub perspective, you would follow these steps:
Create a topic for your status update messages.
In the async task whose status you want to monitor, periodically publish a message with the status. This message can be in any format of your choosing that indicates percentage completion or a specific message to display (see the sketch after this list).
Create a subscription for your monitoring page that will receive messages on the topic.
In your monitoring page (or a background process that will supply the data to your monitoring page), pull messages from the subscription.
Process the messages and update the state of your jobs for your monitoring page.
Ack the messages you pulled and processed.
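A hedged sketch of steps 2 and 4-6 with the Python client library (the project, topic, subscription names, and payload fields are placeholders):

import json
import time
from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT = "<your_project_id>"
STATUS_TOPIC = "task-status"                  # hypothetical topic name
STATUS_SUBSCRIPTION = "task-status-monitor"   # hypothetical subscription name

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Step 2: inside the async task, periodically publish a status update.
def report_progress(task_id, percent_complete):
    topic_path = publisher.topic_path(PROJECT, STATUS_TOPIC)
    payload = json.dumps({
        "task_id": task_id,
        "percent": percent_complete,
        "timestamp": time.time(),   # used later to sequence updates
    }).encode("utf-8")
    publisher.publish(topic_path, payload)

# Steps 4-6: in the monitoring page (or a background process), pull, process, and ack.
def poll_status_updates(max_messages=100):
    sub_path = subscriber.subscription_path(PROJECT, STATUS_SUBSCRIPTION)
    try:
        response = subscriber.pull(
            request={"subscription": sub_path, "max_messages": max_messages},
            timeout=10.0,
        )
    except exceptions.DeadlineExceeded:
        return []  # no updates available right now
    updates = [json.loads(m.message.data) for m in response.received_messages]
    if response.received_messages:
        subscriber.acknowledge(
            request={
                "subscription": sub_path,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )
    return updates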
A couple of things to keep in mind in this workflow:
Cloud Pub/Sub guarantees at-least-once delivery. That means you could potentially receive the same message more than once.
Cloud Pub/Sub does not provide any guarantees on ordering. Therefore, if you are periodically publishing status updates, your subscriber could potentially receive them out of order. For your case, you'll probably want to include some sort of timestamp or strictly increasing identifier in your messages to sequence your status updates per task. If you keep track of the most recent status update received, then you can disregard older messages and ack them immediately.
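Continuing the hypothetical payload from the sketch above, keeping only the newest update per task could look like this:

# latest_status maps task_id -> (timestamp, percent); older or duplicate
# updates are simply ignored and can be acked immediately.
latest_status = {}

def apply_update(update):
    task_id = update["task_id"]
    seen = latest_status.get(task_id)
    if seen is None or update["timestamp"] > seen[0]:
        latest_status[task_id] = (update["timestamp"], update["percent"])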