GCP PubSub mysteriously/silently failing with Cloud Functions - google-cloud-platform

I have about a dozen or so GCF functions (Python) which run in series, once a day. In order to keep the correct sequence, I use PubSub. So for example:
topic1 triggers function1 -> function1 runs -> function1 writes a message to topic2 -> topic2 triggers function2 -> function2 runs -> etc.
This use case is low throughput and a very straightforward (I thought) way to use GCF and PubSub together to each other's advantage. The functions use pubsub_v1 in Python to publish messages. There are no problems with IAM, permissions, etc. The code looks like:
from google.cloud import pubsub_v1
# Publish message
publisher = pubsub_v1.PublisherClient()
topic2 = publisher.topic_path('my-project-name', 'topic2_id')
publish_message = '{short json message to be published}'
print('sending message ' + publish_message)
publisher.publish(topic2, publish_message.encode("utf-8"))
And I deploy function1 and other functions using:
gcloud functions deploy function1 --entry-point=my_python_function --runtime=python37 \
--trigger-topic=topic1 --memory=4096MB --region=us-central1 \
--source="url://source-repository-with-my-code"
However, recently I have started to see some really weird behaviour. Basically, function1 runs, the logs look great, message has seemingly been published to topic2...then nothing. function2 doesn't begin execution or show anything in the logs to suggest it's been triggered. No logs suggesting either success or failure. So essentially it seems that either:
the message from function1 to topic2 is not getting published, despite function1 finishing with Function execution took 24425 ms, finished with status: 'ok'
the message from function1 to topic2 is getting published, but topic2 is not triggering function2.
Is this expected behaviour for PubSub? These failures seem completely random. I went months with everything working very reliably, and now suddenly I have no idea whether the messages are going to be delivered or not. It also seems really difficult to track the lifespan of these PubSub messages to see where exactly they're going missing. I've read in the docs about dead letter topics etc, but I don't really understand how to set up something that makes it easy to track.
Is it normal for very low frequency, short messages to "fail" to be delivered?
Is there something I'm missing or something I should be doing, e.g. in the publisher.publish() call to ensure more reliable delivery?
Is there a transparent way to see what's going on and see where these messages are going missing? Setting up a new subscription which I can view in the console and see which messages are being delivered and which are failing, something like that?
If I need 100% (or close to that) reliability, should I be ditching GCF and PubSub? What's better?

The issue here is that you aren't waiting for publisher.publish to actually succeed. This method returns a future and may not complete synchronously. If you want to ensure the publish has completed successfully, you need to call result() on the value returned from publish:
future = publisher.publish(topic2, publish_message.encode("utf-8"))
future.result()
You will also want to ensure that you have "Retry on failure" enabled on your cloud function by passing the --retry argument to gcloud functions deploy. That way, if the publish fails, the message from topic1 will be redelivered to the cloud function to be tried again.
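Putting those two pieces together, a minimal sketch of what function1 might look like (the entry point, project, and topic names are taken from your question; the 60-second timeout is an arbitrary choice for this sketch):
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic2 = publisher.topic_path('my-project-name', 'topic2_id')

def my_python_function(event, context):
    # ... the actual work of function1 goes here ...
    publish_message = '{short json message to be published}'
    print('sending message ' + publish_message)
    future = publisher.publish(topic2, publish_message.encode("utf-8"))
    # result() blocks until Pub/Sub acknowledges the publish and returns the
    # server-assigned message ID; it raises if the publish failed or the
    # timeout expires, which fails the function invocation.
    message_id = future.result(timeout=60)
    print('published message ' + message_id + ' to topic2')
With that change, redeploying with --retry added to your existing gcloud functions deploy command means a failed publish surfaces as a function error and the triggering message from topic1 is redelivered rather than silently lost.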

Related

Is my approach right when using Cloud Functions, Pub/Sub and Dead-letter queues/topics?

I'm developing my first microservice, and I chose to deploy it as a Cloud Functions service with messaging done via Pub/Sub.
The Cloud Functions service is triggered by events (published messages) on a Pub/Sub topic, the microservice processes the message, and so far so good. I know that Cloud Functions guarantees the acknowledgement and delivery of messages, and that's good.
The Cloud Functions service has automatic retrying: if I throw an exception in the code, a new execution of the function occurs. In order to avoid looping executions on consecutive failures, I introduced an if conditional that checks the 'age' of the message, and since I don't want to simply discard the message, I send/publish it to another Pub/Sub topic that I've named "my-dead-letter-queue".
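For illustration, the age check looks roughly like this (a sketch only, with a placeholder project name and threshold; it assumes a background, Pub/Sub-triggered Python function where context.timestamp carries the publish time as an RFC 3339 string):
import base64
from datetime import datetime, timezone

from google.cloud import pubsub_v1

MAX_AGE_SECONDS = 3600  # placeholder threshold
publisher = pubsub_v1.PublisherClient()
dead_letter_topic = publisher.topic_path('my-project', 'my-dead-letter-queue')

def handle_message(event, context):
    # context.timestamp is assumed to look like '2021-01-01T12:00:00.123Z'.
    published_at = datetime.fromisoformat(context.timestamp.replace('Z', '+00:00'))
    age_seconds = (datetime.now(timezone.utc) - published_at).total_seconds()

    if age_seconds > MAX_AGE_SECONDS:
        # Too old: park the message in the dead-letter topic instead of
        # letting it loop through retries forever.
        publisher.publish(dead_letter_topic, base64.b64decode(event['data'])).result()
        return

    data = base64.b64decode(event['data']).decode('utf-8')
    # ... normal processing of `data`; raising an exception here lets the
    # automatic retry (if enabled) redeliver the message ...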
As I am unsure about everything now, I ask you: Is my approach good enough? What would you do instead considering Cloud Functions microservices and Pub/Sub messaging?
Yes, your approach is good if you want the rule for sending bad messages to a dead letter topic to be based on the message age.
If you want to base it on the number of failures (after 5 failures, put the message in the dead letter topic), you can't achieve that with a Cloud Function directly plugged into Pub/Sub. You need to create an HTTP function and then a Pub/Sub push subscription, on which you can set a dead letter topic (a minimum of 5 failed delivery attempts before the message is automatically sent to the dead letter topic).
The advantage of that second solution is that you don't have to process the message and push it back to Pub/Sub in your Cloud Function (all processing time costs money); it's automatic, and therefore you save money ;)
The approach you are trying to use works as long as every exception results in the message being inserted into your dead letter topic. It will keep your exception handling working without problems in the future, but if you want to throw more kinds of exceptions in there, you should consider changing how you manage the exceptions.
Here you can see how to publish messages from the gcloud command line.
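For the push-subscription route above, the dead letter policy is attached to the subscription itself. A rough sketch with a recent google-cloud-pubsub client (project, topic, subscription, and endpoint names are placeholders):
from google.cloud import pubsub_v1

project = 'my-project'  # placeholder
subscriber = pubsub_v1.SubscriberClient()

subscription_path = subscriber.subscription_path(project, 'my-push-subscription')
topic_path = f'projects/{project}/topics/my-topic'
dead_letter_path = f'projects/{project}/topics/my-dead-letter-queue'

with subscriber:
    subscriber.create_subscription(
        request={
            'name': subscription_path,
            'topic': topic_path,
            # Push deliveries go to the HTTP-triggered function's URL.
            'push_config': pubsub_v1.types.PushConfig(
                push_endpoint='https://REGION-my-project.cloudfunctions.net/my-http-function'
            ),
            # After 5 failed delivery attempts (the minimum allowed),
            # Pub/Sub forwards the message to the dead letter topic.
            'dead_letter_policy': pubsub_v1.types.DeadLetterPolicy(
                dead_letter_topic=dead_letter_path,
                max_delivery_attempts=5,
            ),
        }
    )
Note that the Pub/Sub service account also needs permission to publish to the dead letter topic and to subscribe on the source subscription, otherwise forwarding won't happen.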

AWS Lambda: is there a way that I can watch the live log printed by a function while it is executing

I am new to AWS. I have just developed a Lambda function (Python) which prints messages while executing. However, I am not sure where I can watch the log printed out while the function is executing.
I found the CloudWatch log for the function, but it seems that the log is only available after the function has completed.
Hope you can help,
many thanks
You are correct -- the print() messages will be available in CloudWatch Logs.
It is possible that a long-running function might show logs before it has completed (I haven't tried that), but AWS Lambda functions only run for a maximum of 15 minutes and most complete in under one second. It is not expected that you would need to view logs while a function is running.
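If you do want to follow the log while a long-running invocation is still going, one option is to poll CloudWatch Logs yourself. A rough sketch with boto3 (the function name is a placeholder; Lambda writes to the /aws/lambda/<function-name> log group; pagination is omitted for brevity):
import time

import boto3

logs = boto3.client('logs')
log_group = '/aws/lambda/my-function'  # placeholder function name

# CloudWatch Logs timestamps are in milliseconds since the epoch.
start_time = int(time.time() * 1000)
while True:
    response = logs.filter_log_events(logGroupName=log_group, startTime=start_time)
    for event in response.get('events', []):
        print(event['message'], end='')
        start_time = max(start_time, event['timestamp'] + 1)
    time.sleep(2)
Newer releases of the AWS CLI also offer a tail-style command for CloudWatch Logs, which avoids writing the polling loop yourself.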

Can I schedule a retry of a cloud function in nodejs?

When subscribing to a topic, I handle the ack/nack myself and can easily call message.nack(millisecondsToNextRetry).
I would like to do the same thing in cloud functions using nodejs, i.e. under certain circumstances retry the function after a specified time.
Anyone know a good solution or workaround when triggering a cloud function from pub/sub?
Cloud Functions will retry automatically if you enable that by configuration. Retries will happen if your function throws an exception, returns a rejected promise, or times out. You won't be able to control the schedule of retries.

AWS Lambda Stops Running Randomly

Has anyone ever seen a Lambda function stop randomly?
I've got a method that subscribes to an SNS topic that is published to every hour. I recently had 4 messages come through to the subscriber Lambda, and 3 of the 4 worked perfectly.
For those three, CloudWatch gives me all of the console logs I have logged, I get responses from all of the APIs the method reaches out to, and they end with a success message. But for the fourth message, the console log is written to CloudWatch and then I get the "Request End" log immediately following. Nothing after that: no console.logs, no error from Lambda, no insight as to why it would have stopped, no timeout error; it just stopped.
I've never seen this kind of behavior before and have yet to have a Lambda function stop working without logging the error (everything that runs is wrapped in a try/catch that has reliably logged the errors until now).
Has anyone ever come across this and have any insight as to what it may be?

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith: it fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (We're using AWS, obviously, and we've got a PagerDuty account.)
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an HTTP-based service that will read that file and calculate whether the timestamp is valid, i.e. has been updated in the last hour. This could be a simple PHP or Node.js script. This process is exposed to the public web, e.g. https://example.com/heartbeat.php. The script returns an HTTP response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the URL and notify us via its PagerDuty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
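As an illustration of that pattern (the script could be PHP or Node.js as noted above; this sketch happens to use Python with Flask and boto3, and the bucket and key names are placeholders):
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError
from flask import Flask

app = Flask(__name__)
s3 = boto3.client('s3')

BUCKET = 'my-heartbeat-bucket'      # placeholder
KEY = 'hourly-task/last-run.txt'    # placeholder, overwritten by the scheduled task
MAX_AGE = timedelta(hours=1)

@app.route('/heartbeat')
def heartbeat():
    try:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
    except ClientError:
        # No evidence the task has ever run (or the file was deleted).
        return 'FAIL: heartbeat file not found', 500
    age = datetime.now(timezone.utc) - head['LastModified']
    if age <= MAX_AGE:
        return f'OK: last run {age} ago', 200
    return f'STALE: last run {age} ago', 500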
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in the same way. We've learned the hard way that critical cron-type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and it helps us sleep better at night.
Note: we always have a daily system test event that triggers a PagerDuty notification at 9am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge it, then it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have a robust monitoring infrastructure.
OpsGenie has a heartbeat service which is basically a watchdog timer. You can configure it to call you if you don't ping it within x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.