Pub/Sub messages from snapshot not processed in a Dataflow streaming pipeline


We have a Dataflow job consuming from Pub/Sub and writing into BigQuery in streaming mode. Due to a permissions issue the pipeline got stuck and the messages were not consumed. We restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem and redeployed the pipeline with a new subscription to the topic, and all new events are consumed in streaming without a problem.
For all the unacked messages accumulated (20M) in the first subscription, we created a snapshot.
This snapshot was then connected to the new subscription via the UI, using the Replay messages dialog.
In the metrics dashboard we see that the unacked messages spike to 20M and then get consumed.
[Screenshot: subscription spike]
But then the events are not sent to BigQuery. Checking the Dataflow job metrics, we can see a spike in the Duplicate message count within the read-from-Pub/Sub step.
[Screenshot: Dataflow Duplicate counter]
The messages are less than 3 days old. Does anybody know why this happens? Thanks in advance.
The pipeline uses Apache Beam SDK 2.39.0 and Python 3.9, with Streaming Engine and Runner v2 enabled.

How long does it take to process a Pub/Sub message? Is it a long process?
If so, Pub/Sub may redeliver messages, depending on the subscription configuration and delays. See Subscription retry policy.
Dataflow can work around that, as it acknowledges messages at the source after a successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform, it may resolve the source duplicates.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance

Related

long running cloud run process and pubsub message retry

I have a Cloud Run service which runs for up to 60 minutes. Pub/Sub is the trigger for execution of the Cloud Run service.
The Pub/Sub retry policy is configured to the maximum (600s).
When a message is published, Cloud Run starts executing. The complete execution takes around 60 minutes, but after 600s Pub/Sub starts retrying the message because it has not received an acknowledgment from Cloud Run, which causes the Cloud Run service to execute again and again.
How can I handle the Pub/Sub retry so that Cloud Run does not execute again and again?

I was thinking of using Cloud Tasks or Cloud Workflows as a proxy for your long-running Cloud Run service. Unfortunately both services have a max timeout of 1800s (30 minutes). By the way, the upcoming callback feature of Cloud Workflows will have a 12h timeout. In the meantime I would create a proxy as a Cloud Function triggered by the Pub/Sub message, which will be acknowledged immediately; the function calls your Cloud Run service asynchronously with the Pub/Sub message and returns right away.

With push subscriptions, such as what you'd use with a Cloud Run service, the maximum ack deadline for a message is indeed 600s. If using pull, one can call ModifyAckDeadline to extend the deadline for a message. In fact, the client libraries for Cloud Pub/Sub do this automatically for up to a configured amount of time (default is 60m).
There is not going to be a way to extend the deadline if using a push subscription. Therefore, your options are:
1. Switch to a pull subscription. You could potentially do this via Cloud Run, though it would not be the best fit. More likely, you want to spin up a job in an environment that can keep it running without any kind of trigger, e.g., GKE. If you switch to pull, you can extend the ack deadline, though note that duplicates are still possible, even if the ack deadline has not expired or the message has already been acknowledged. They should be rare, but you still have to account for them.
2. When you receive the message, persist it somewhere, either on disk or in a database, and then acknowledge the message once persisted. Once you are actually done processing the message an hour later, you remove it from this persistent storage. Of course, you could just persist the message instead of publishing it via Pub/Sub and rely on the persistence layer's notification mechanisms to learn of the new message. For example, if you write to GCS, you could use Cloud Storage notifications via Pub/Sub. In this case, you probably want a periodic read from your storage to see if there are any messages that have not been processed for some period of time and, if so, reprocess them. For example, if you store the time at which processing started along with the message, and more than some amount of time has passed since then while the message is still present, you could start the processing over again.
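The persist-then-acknowledge option can be sketched roughly as follows. This is only an illustration using SQLite from the Python standard library as the persistent store. The table name `pending`, the helper names, and the `ack` callback are all hypothetical and are not part of any Pub/Sub API; in a real service `ack` would be whatever acknowledges the message (for push, returning a success status).

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the (hypothetical) store of messages that are still being processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "message_id TEXT PRIMARY KEY, payload BLOB, started_at REAL)"
    )
    return conn


def persist_then_ack(conn, message_id, payload, ack, now=None):
    """Persist the message first, then acknowledge it.

    Returns True if the message is new, False if it was already stored
    (for example a redelivered duplicate), so the caller can skip the work.
    """
    now = time.time() if now is None else now
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO pending VALUES (?, ?, ?)",
            (message_id, payload, now),
        )
    ack()  # safe to ack: the message is now durable on our side
    return cur.rowcount == 1


def mark_done(conn, message_id):
    """Remove the message once the long-running processing has finished."""
    with conn:
        conn.execute("DELETE FROM pending WHERE message_id = ?", (message_id,))


def find_stale(conn, max_age_seconds, now=None):
    """Return IDs whose processing started more than max_age_seconds ago."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT message_id FROM pending WHERE started_at < ? ORDER BY started_at",
        (now - max_age_seconds,),
    ).fetchall()
    return [row[0] for row in rows]
```

A periodic job would call `find_stale` with a threshold somewhat larger than the expected processing time (an hour in the question) and restart processing for whatever it returns.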

GCP Dataflow Pub/Sub to Text Files on Cloud Storage

I'm referring to the Google-provided Dataflow template Pub/Sub to Text Files on Cloud Storage.
The messages, once read by Dataflow, don't get acknowledged. How do we ensure that messages consumed by Dataflow are acknowledged and not available to any other subscriber?
To reproduce and test it, create 2 jobs from the same template and you will see both jobs processing the same message.

Firstly, the messages are correctly acknowledged.
Then, to demonstrate this, and to show how your reproduction is wrong, I would like to focus on Pub/Sub behavior.
One or several publishers publish messages in a topic
One or several subscriptions can be created on a topic
All the messages published in a topic are copied in each subscription
A subscription can have one or several subscribers.
Each subscriber receives a subset of the messages in the subscription.
Go back to your template. You specify only a topic, not a subscription. When your Dataflow job is running, go to the subscriptions page and you will see that a new subscription has been created.
-> When you start a PubSub to TextFiles template, a subscription is automatically created on the provided topic
Therefore, if you create 2 jobs, you will have 2 subscriptions, and thus all the messages published in the topic are copied in each subscription. That's why you get the same messages twice.
Now, keep your job up and go to the subscription. Here you can see the number of messages in the queue and the unacked messages. You should see 0 in the unacked message graph.
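To make the topic/subscription model above concrete, here is a toy in-memory simulation. It is not the Pub/Sub client library: the `Topic`, `Subscription`, and `drain` names are made up for illustration, and the round-robin split between subscribers is an assumption of this sketch (real Pub/Sub does not promise any particular distribution).

```python
from collections import deque
from itertools import cycle


class Subscription:
    """Toy subscription: holds its own copy of every message published after it exists."""

    def __init__(self, name):
        self.name = name
        self.backlog = deque()

    def pull(self):
        # Returns the next message, or None when the backlog is empty.
        return self.backlog.popleft() if self.backlog else None


class Topic:
    """Toy topic: copies each published message into every attached subscription."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self, name):
        sub = Subscription(name)
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        for sub in self.subscriptions:
            sub.backlog.append(message)


def drain(subscription, subscriber_names):
    """Split one subscription's backlog across its subscribers (round robin here)."""
    received = {name: [] for name in subscriber_names}
    for name in cycle(subscriber_names):
        message = subscription.pull()
        if message is None:
            break
        received[name].append(message)
    return received


topic = Topic()
job1_sub = topic.subscribe("job-1-sub")  # what each template job creates for itself
job2_sub = topic.subscribe("job-2-sub")
for i in range(4):
    topic.publish(f"msg-{i}")

# Two subscriptions: each job sees all four messages.
print(drain(job1_sub, ["job-1"]))
# One subscription shared by two subscribers: each sees only a subset.
print(drain(job2_sub, ["worker-x", "worker-y"]))
```

To have two jobs share the work instead of duplicating it, both would need to read from the same subscription rather than each creating its own from the topic.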

GCP - how to add alert on number of messages sent to a pubsub dead letter queue?

My application processes messages from a Pub/Sub topic, and if processing fails the message is sent to a separate DLQ topic. I want to set up an alert in Monitoring so that when 30k messages have been sent to the DLQ during a day it notifies me and I can check why my service is not working.
I tried to set up some policies in GCP, but I don't know, and couldn't find anywhere in the docs, how to set up a metric of daily processed messages on a topic.
Can anyone help me ?

You can create a new alert policy like this
PubSub subscription/unacked messages.
You can add a filter on your subscription name if you have several subscriptions in your project.
Add the notification channel that you want, an email in my case. After a few minutes, you can see the first alert
And the email
EDIT
For the acked messages, you can do this
I never tried an aggregation over 1 day, but it should be OK.

Please check the following GCP community tutorials which outline how to create an alert-based event archiver with Stackdriver and Cloud Pub/Sub
https://cloud.google.com/community/tutorials/cloud-pubsub-drainer

Push vs Pull for GCP Dataflow

I want to know what type of subscription one should create in Pub/Sub in order to handle high-frequency data from a Pub/Sub topic.
I will be ingesting data in Dataflow at 100+ messages per second.
Does the choice between a pull and a push subscription really matter, and how will it affect speed?

If you consume the Pub/Sub subscription with Dataflow, only a pull subscription is available:
either you create one yourself and pass it as a parameter of your Dataflow pipeline,
or you specify only the topic in your Dataflow pipeline and Dataflow creates the pull subscription by itself.
In both cases, Dataflow will process the messages in streaming mode.
The difference
If you create the subscription yourself, all the messages will be stored and kept (up to 7 days by default) and will be consumed when the Dataflow pipeline starts.
If you let Dataflow create the subscription, only the messages that arrive AFTER the subscription creation will be consumed by the Dataflow pipeline. If you don't want to lose any message, it's not the recommended solution. If you don't care about old messages, it's a good choice.
High frequency
Then, 100 messages per second is absolutely not high frequency. A single Pub/Sub topic can ingest up to 1,000,000 messages per second. Don't worry about that!
Push VS Pull
The model is different.
With a push subscription, you have to specify an HTTP endpoint (on GCP or elsewhere) that consumes the messages. It's a webhook pattern. If the endpoint's platform scales automatically with the traffic (Cloud Run or Cloud Functions, for example), the message rate can go very high! The HTTP return code serves as the message acknowledgment.
With a pull subscription, the client needs to open a connection to the subscription and then pull the messages. The client needs to explicitly acknowledge the messages. Several clients can be connected at the same time. With the client library, messages are consumed over gRPC, which is more efficient (in terms of network bandwidth) for receiving and consuming messages.
Security point of view
With push, it's Pub/Sub that must authenticate to the HTTP endpoint, if the endpoint requires authentication
With pull, it's the client that needs to be authenticated on the PubSub subscription.
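As a rough sketch of the "HTTP return code stands for acknowledgment" point in the push model: the function below takes the raw body of a push request and returns the status code the endpoint would send back. The function name `handle_push` and the `process` callback are hypothetical; the envelope layout (a JSON object with a `message` field whose `data` is base64-encoded) is the standard push format. Wiring it into an actual web framework is left out.

```python
import base64
import json


def handle_push(body: bytes, process) -> int:
    """Decode a Pub/Sub push request body and return the HTTP status to send.

    A success status (204 here) acknowledges the message; an error status
    leaves it unacknowledged, so Pub/Sub will deliver it again later.
    """
    try:
        envelope = json.loads(body)
        message = envelope["message"]
        data = base64.b64decode(message.get("data", ""))
        attributes = message.get("attributes", {})
    except (ValueError, KeyError, TypeError, AttributeError):
        # Malformed request. Note that a non-success code means redelivery;
        # some services prefer to ack bad payloads to avoid endless retries.
        return 400
    try:
        process(data, attributes)
    except Exception:
        # Processing failed: do not ack, let Pub/Sub retry.
        return 500
    return 204  # success without a body: the message is acknowledged


# Example request body, as Pub/Sub would POST it to the endpoint.
example_body = json.dumps(
    {
        "message": {
            "data": base64.b64encode(b"hello").decode("ascii"),
            "messageId": "1",
            "attributes": {"source": "demo"},
        },
        "subscription": "projects/my-project/subscriptions/my-sub",
    }
).encode("utf-8")

print(handle_push(example_body, lambda data, attrs: print(data, attrs)))
```

With pull there is no equivalent of this return code: the client calls acknowledge explicitly once it is done with the message.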

Dataflow job drain does not end and system latency grows for a long time

We have a dataflow pipeline that ends with either
send a PubSub message to "Done" topic OR
send a PubSub message to "DLQ" or "RETRY" topic
Here is the graph for the data pipeline:
and here is the system latency issue, although all 6 elements were processed successfully:
In scenarios where some messages are sent to both topics, Dataflow does not recognize a successful end; system latency grows and draining gets stuck!

We found out that the incident happens when a Pub/Sub topic is missing; specifically, in our case we had forgotten to create the DLQ topic.

We Keep Coding
