It was not clear to me how to use Cloud Run on a Pub/Sub topic for medium-length tasks (within Cloud Run's time limit, of course).
Let's look at this example taken from the tutorials[1]:
app.post('/', (req, res) => {
  if (!req.body) {
    const msg = 'no Pub/Sub message received'
    console.error(`error: ${msg}`)
    res.status(400).send(`Bad Request: ${msg}`)
    return
  }
  if (!req.body.message) {
    const msg = 'invalid Pub/Sub message format'
    console.error(`error: ${msg}`)
    res.status(400).send(`Bad Request: ${msg}`)
    return
  }
  const pubSubMessage = req.body.message
  const name = pubSubMessage.data
    ? Buffer.from(pubSubMessage.data, 'base64').toString().trim()
    : 'World'
  console.log(`Hello ${name}!`)
  res.status(204).send()
})
My question is: should it return HTTP 204 only after the task finishes? Otherwise, will the task be terminated abruptly?
[1] https://cloud.google.com/run/docs/tutorials/pubsub
My question is: should it return HTTP 204 only after the task finishes? Otherwise, will the task be terminated abruptly?
You do not have a choice. If you return before your task finishes, the instance's CPU is throttled to (practically) zero and nothing more will happen in your Cloud Run instance.
In your example, you are just processing a Pub/Sub message and extracting the name. If you return before that work is finished, no name will be processed.
Cloud Run is designed around an HTTP request/response model. Processing begins when you receive an HTTP request (GET, POST, PUT, etc.) and ends when your code returns an HTTP response (or simply returns with no response). You might try to create background threads, but there is no guarantee that they will execute once your main function returns.
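In other words, do all of the work inside the request handler and only send the response once that work is done. A minimal sketch in Python/Flask (the same pattern applies to the Express handler above); archive_files is a hypothetical stand-in for your real workload:
import base64

from flask import Flask, request

app = Flask(__name__)

def archive_files(payload):
    # Hypothetical placeholder for the real (possibly slow) work.
    pass

@app.route("/", methods=["POST"])
def index():
    envelope = request.get_json()
    if not envelope or "message" not in envelope:
        return "Bad Request: invalid Pub/Sub message", 400

    data = envelope["message"].get("data")
    payload = base64.b64decode(data).decode("utf-8") if data else ""

    # Do ALL of the work here, synchronously, while the request is still open...
    archive_files(payload)

    # ...and only then acknowledge the push delivery with a 2xx response.
    return "", 204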
Related
Situation:
I'm trying to have a single message in Pub/Sub processed by exactly 1 instance of Cloud Run. Additional messages will be processed by another instance of Cloud Run. Each message triggers a heavy computation that runs for around 100s in the Cloud Run instance.
Currently, Cloud Run is configured with maximum concurrent requests = 1 and min/max instances of 0/5. The subscription is set to a 600 s ack deadline.
Issue:
Each message seems to be triggering multiple instances of Cloud Run to be spun up. I believe this is due to high CPU utilization causing Cloud Run to spin up additional instances to help process. Unfortunately, these new instances attempt to process the same exact message, causing unintended results.
Question:
Is there a way to force Cloud Run to only have 1 instance process a single message, regardless of CPU utilization and other potential factors?
Relevant Code Snippet:
import base64
import json
from fastapi import FastAPI, Request, Response
app = FastAPI()
#app.post("/")
async def handleMessage(request: Request):
envelope = await request.json()
# Basic data validation
if not envelope:
msg = "no Pub/Sub message received"
print(f"error: {msg}")
return Response(content=msg, status_code=400)
if not isinstance(envelope, dict) or "message" not in envelope:
msg = "invalid Pub/Sub message format"
print(f"error: {msg}")
return Response(content=msg, status_code=400)
message = envelope["message"]
if isinstance(message, dict) and "data" in message:
data = json.loads(base64.b64decode(message["data"]).decode("utf-8").strip())
try:
# Do computationally heavy operations here
# Will run for about 100s
return Response(status_code=204)
except Exception as e:
print(e)
Thanks!
I've found the issue.
Apparently, Pub/Sub guarantees "at least once" delivery, which means it is possible for it to deliver a message to a subscriber more than once. The onus is therefore on the subscriber, which in my case is Cloud Run, to handle such scenarios (idempotency) gracefully.
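One common way to achieve that idempotency is to record each Pub/Sub messageId and skip any message that has already been processed. A rough sketch, assuming Firestore is used as the deduplication store (the processed_messages collection name is made up for illustration):
from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def already_processed(message_id: str) -> bool:
    # Atomically claim this messageId; create() fails if the document exists,
    # which tells us another instance already handled (or is handling) it.
    try:
        db.collection("processed_messages").document(message_id).create(
            {"claimed_at": firestore.SERVER_TIMESTAMP}
        )
        return False
    except AlreadyExists:
        return True
In the handler above, call already_processed(envelope["message"]["messageId"]) before starting the heavy computation and return 204 immediately for duplicates, so a redelivered message is acknowledged without being reprocessed.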
In my application, I have implemented Google Cloud Tasks so that my users can receive a notification when their ToDo item is due.
My main issue is that when my Cloud Task fires, I still see it in the Cloud Tasks console. Do tasks delete themselves once they have fired? For my application, I want the tasks to delete themselves once they are done.
I noticed this line in the documentation: "you can also fine-tune the configuration for the task, like scheduling a time in the future when it should be executed or limiting the number of times you want the task to be retried if it fails". The thing is, my task is not failing, and yet I see the number of retries at 4.
Firebase Cloud Functions code:
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { CloudTasksClient } = require('@google-cloud/tasks');

admin.initializeApp();
const log = functions.logger.log; // logging helper

exports.firestoreTtlCallback = functions.https.onRequest(async (req, res) => {
  try {
    const payload = req.body;
    let entry = (await admin.firestore().doc(payload.docPath).get()).data();
    let tokens = (await admin.firestore().doc(`/users/${payload.uid}`).get()).get('tokens');
    await admin.messaging().sendMulticast({
      tokens,
      notification: {
        title: "App",
        body: entry['text']
      }
    }).then((response) => {
      log('Successfully sent message:');
      log(response);
    }).catch((error) => {
      log('Error in sending Message');
      log(error);
    });

    const taskClient = new CloudTasksClient();
    // get() returns a Promise, so it has to be awaited (and data() read)
    // before the stored task name can be destructured.
    let { expirationTask } = (await admin.firestore().doc(payload.docPath).get()).data();
    await taskClient.deleteTask({ name: expirationTask });
    await admin.firestore().doc(payload.docPath).update({ expirationTask: admin.firestore.FieldValue.delete() });

    // status() alone does not send the response; without send() the request
    // never completes and Cloud Tasks records the attempt as failed.
    res.status(200).send();
  } catch (err) {
    log(err);
    res.status(500).send(err);
  }
});
According to the documentation, a task can be deleted if it is scheduled or dispatched, but it cannot be deleted once it has completed successfully or permanently failed.
The task attempt has succeeded if the app's request handler returns an HTTP response code in the range [200 - 299].
The task attempt has failed if the app's handler returns a non-2xx response code or Cloud Tasks does not receive a response before the deadline, which is:
For HTTP tasks, 10 minutes. The deadline must be in the interval [15 seconds, 30 minutes].
For App Engine tasks, 0 indicates that the request has the default deadline. The default deadline depends on the scaling type of the service: 10 minutes for standard apps with automatic scaling, 24 hours for standard apps with manual and basic scaling, and 60 minutes for flex apps.
Failed tasks are retried according to the queue's retry configuration. Check the retry configuration set in your queue.yaml file (or on the queue itself), and adjust it there if you want different behavior, for example by limiting the number of retry attempts.
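If you want a task to run at most once, one option is to set the queue's retry configuration so that max_attempts is 1. A rough sketch with the google-cloud-tasks Python client (project, location, and queue names are placeholders; double-check the field names against the client version you use):
from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    # Placeholder project/location/queue names.
    name=client.queue_path("my-project", "us-central1", "my-queue"),
    retry_config=tasks_v2.RetryConfig(max_attempts=1),  # a single attempt, no retries
)

client.update_queue(
    queue=queue,
    update_mask=field_mask_pb2.FieldMask(paths=["retry_config.max_attempts"]),
)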
The task will be pushed to the worker as an HTTP request. If the worker or the redirected worker acknowledges the task by returning a successful HTTP response code ([200 - 299]), the task will be removed from the queue as per this documentation. If any other HTTP response code is returned or no response is received, the task will be retried according to the following:
User-specified throttling: retry configuration, rate limits, and the queue's state.
System throttling: to prevent the worker from being overloaded, Cloud Tasks may temporarily reduce the queue's effective rate. User-specified settings will not be changed.
I have a Cloud Function which publishes a message to Pub/Sub, and that triggers a Cloud Run service to perform an archive-file process. When there are large files, my Cloud Run Python code takes some time to process the data, and it looks like Pub/Sub is retrying the message after 20 seconds (the default acknowledgement deadline), which triggers another instance of my Cloud Run service. I've increased the acknowledgement deadline to 600s and redeployed everything, but it is still retrying the message after 20 seconds. Am I missing anything?
Cloud Function publishing the message code:
# Publishes a message
try:
    publish_future = publisher.publish(topic_path, data=message_bytes)
    publish_future.result()  # Verify the publish succeeded
    return 'Message published.'
except Exception as e:
    print(e)
    return (str(e), 500)
Here is the Pub/Sub subscription config (screenshot omitted):
Logging showing a second instance being triggered after 20s (screenshot omitted):
Cloud Run code:
#app.route("/", methods=["POST"])
def index():
envelope = request.get_json()
if not envelope:
msg = "no Pub/Sub message received"
print(f"error: {msg}")
return f"Bad Request: {msg}", 400
if not isinstance(envelope, dict) or "message" not in envelope:
msg = "invalid Pub/Sub message format"
print(f"error: {msg}")
return f"Bad Request: {msg}", 400
pubsub_message = envelope["message"]
if isinstance(pubsub_message, dict) and "data" in pubsub_message:
#Decode base64 event['data']
event_data = base64.b64decode(pubsub_message['data']).decode('utf-8')
message = json.loads(event_data)
#logic to process data/archive
return ("", 204)
You should be able to control the retries by setting the minimumBackoff retry policy on the subscription. You can set the minimumBackoff time to its maximum of 600 seconds, matching your ack deadline, so that redelivered messages will be more than 600 seconds old. This should lower the number of occurrences you see.
To handle duplicates, making your subscriber idempotent is recommended. You need to add some kind of check in your code to see whether the messageId was processed before.
From the documentation on at-least-once delivery:
Typically, Pub/Sub delivers each message once and in the order in which it was published. However, messages may sometimes be delivered out of order or more than once. In general, accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages. You can achieve exactly once processing of Pub/Sub message streams using the Apache Beam programming model. The Apache Beam I/O connectors let you interact with Cloud Dataflow via controlled sources and sinks. You can use the Apache Beam PubSubIO connector (for Java and Python) to read from Cloud Pub/Sub. You can also achieve ordered processing with Cloud Dataflow by using the standard sorting APIs of the service. Alternatively, to achieve ordering, the publisher of the topic to which you subscribe can include a sequence token in the message.
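To set the minimumBackoff mentioned above, you can update the subscription's retry policy. A sketch with the google-cloud-pubsub Python client (project and subscription names are placeholders):
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2, field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

subscription = pubsub_v1.types.Subscription(
    name=subscription_path,
    retry_policy=pubsub_v1.types.RetryPolicy(
        minimum_backoff=duration_pb2.Duration(seconds=600),  # match the 600s ack deadline
        maximum_backoff=duration_pb2.Duration(seconds=600),
    ),
)

subscriber.update_subscription(
    request={
        "subscription": subscription,
        "update_mask": field_mask_pb2.FieldMask(paths=["retry_policy"]),
    }
)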
I'm implementing my own webhooks service which will send out events to subscribed webhooks.
Overview of architecture:
events are pushed onto an SQS queue
a lambda function is triggered by SQS messages (event source mapping)
for each event, I make outgoing http requests to subscribed webhooks
non-2xx responses must be retried with exponential backoff (in such an event, I change the message visibility on the received message)
since Lambdas that are invoked by SQS automatically delete the message upon successful completion, I throw an error at the end of the function to prevent the automatic delete
As far as I can tell, the call to change the message visibility is succeeding. I'm wondering if there's something else baked into Lambdas that are invoked by SQS. Upon failure of the Lambda, is it internally changing the message visibility again? Or do Lambdas that are invoked by SQS not respect message visibility changes (which really wouldn't make sense to me)? Curious if anyone has any insight into this problem. I was quite surprised to find out that Lambda automatically deletes messages upon success, since it makes my particular use case feel a little clunky: throwing an error to fail the Lambda function just to prevent the message from being deleted.
Thanks in advance!
The nature of the SQS integration with Lambda is that the integration controls the polling of the messages. The mechanism used to determine if the message should be deleted is that the response from the lambda is not an error. It's not clearly stated in the documentation, but I believe when an error occurs the integration is going to set the visibility timeout to zero in order to make it immediately available for another process to pick it up. So in your example you are setting it to some number that allows you to retry, but when you return an error the integration is setting the timeout back to zero. If you need to have more control over that process you probably should not use the integration.
Update: This is actually not the case. I was failing to properly wait for the call that adjusts the timeout to finish, so the Lambda was exiting before that request completed. The timeout I'm setting on the message within the Lambda is being respected. I'm then throwing an error to prevent the message from being deleted.
Since SQS triggers for lambdas process messages in batches, using await in a standard forEach doesn't work (forEach does not await the callbacks if they are async). To bypass that, you can create your own version of an async forEach:
var AWS = require('aws-sdk');

exports.handler = async function(event, context) {
  var sqs = new AWS.SQS({apiVersion: '2012-11-05'});
  await asyncForEach(event.Records, async record => {
    const { receiptHandle } = record;
    const sqsParams = {
      QueueUrl: '<YOUR_QUEUE_URL>', /* required */
      ReceiptHandle: receiptHandle, /* required */
      VisibilityTimeout: '<in seconds>' /* required */
    };
    try {
      let res = await sqs.changeMessageVisibility(sqsParams).promise();
      console.log(res);
    } catch (err) {
      console.log(err, err.stack);
      throw new Error('Fail.');
    }
  });
  return {};
};

async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    await callback(array[index], index, array);
  }
}
Is it possible to invoke an AWS Step Functions workflow from an API Gateway endpoint and wait for the response (until the workflow completes, returning the results of the final step)?
From the documentation I was able to find that Step Functions are asynchronous by nature and have a final callback at the end. I need the API invocation's response to carry the end results of the Step Functions flow, without polling.
I guess that's not possible.
It's async, and there's also the API Gateway timeout.
You don't need to get the results by polling; you can combine Lambda, Step Functions, SNS, and WebSockets to get your results in real time.
If you want to push a notification to a client (web browser) and you don't want to manage your own infrastructure (scaling socket servers, etc.), you could use AWS IoT. This tutorial may help you get started:
http://gettechtalent.com/blog/tutorial-real-time-frontend-updates-with-react-serverless-and-websockets-on-aws-iot.html
If you only need to send the result to a backend (a web service endpoint, for example), SNS should be fine.
This will probably work: create an HTTP "gateway" server that dispatches requests to your Step Functions workflow, then holds onto the request object until it receives a notification that allows it to send a response.
The gateway server will need to add a correlation ID to the payload, and the step workflow will need to carry that through.
One plausible way to receive the notification is with SQS.
Some pseudocode that's vaguely Node/Express flavoured:
const cache = new Cache();   // pick your favourite cache library
const gatewayId = guid();    // this lets us scale horizontally

const subscription = subscribeToQueue({
  filter: { gatewayId },
  topic: topicName,
});

httpServer.post( (req, res) => {
  const correlationId = guid();
  cache.add(correlationId, res);
  submitToStepWorkflow(gatewayId, correlationId, req);
});

subscription.onNewMessage( message => {
  // pop the cached *response* object for this correlationId and complete it
  const res = cache.pop(message.attributes.correlationId);
  res.send(extractResponse(message));
  res.end();
});
(The hypothetical queue reading API here is completely unlike aws-sdk's SQS API, but you get the idea)
So at the end of your step workflow, you just need to publish a message to SQS (perhaps via SNS) ensuring that the correlationId and gatewayId are preserved.
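For example, the final state of the workflow could be a Lambda that publishes the result with both IDs attached as message attributes. A rough sketch in Python with boto3 (the queue URL and helper name are placeholders):
import json
import boto3

sqs = boto3.client("sqs")

def publish_result(gateway_id, correlation_id, result):
    # Hypothetical helper: the gateway's subscription filters on gatewayId
    # and uses correlationId to look up the cached HTTP response object.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/gateway-results",  # placeholder
        MessageBody=json.dumps(result),
        MessageAttributes={
            "gatewayId": {"DataType": "String", "StringValue": gateway_id},
            "correlationId": {"DataType": "String", "StringValue": correlation_id},
        },
    )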
To handle failure, and avoid the cache filling up with orphaned response objects, you'd probably want to set an expiry time on the cache and handle expiry events:
cache.onExpiry( (key, res) => {
  res.status(504);
  res.send(gatewayTimeoutMessage());
  res.end();
});
This whole approach only makes sense for workflows that you expect to normally complete within browser and proxy timeouts, of course.