GCP Dataflow Pub/Sub to Text Files on Cloud Storage - google-cloud-platform

I'm referring to the Google-provided Dataflow template Pub/Sub to Text Files on Cloud Storage.
The messages, once read by Dataflow, don't get acknowledged. How do we ensure that messages consumed by Dataflow are acknowledged and are not available to any other subscriber?
To reproduce and test this, create 2 jobs from the same template and you will see that both jobs process the same messages.

Firstly, the messages are correctly acknowledged.
To demonstrate this, and why your reproduction test is misleading, I would like to focus on Pub/Sub behavior:
One or several publishers publish messages to a topic.
One or several subscriptions can be created on a topic.
All the messages published to a topic are copied into each subscription.
A subscription can have one or several subscribers.
Each subscriber receives a subset of the messages in the subscription.
Go back to your template: you specify only a topic, not a subscription. While your Dataflow job is running, go to the subscriptions page and you will see that a new subscription has been created.
-> When you start a Pub/Sub to Text Files template, a subscription is automatically created on the provided topic.
Therefore, if you create 2 jobs, you will have 2 subscriptions, and thus all the messages published to the topic are copied into each subscription. That's why you see the same messages twice.
Now, keep your job up and go to the subscription. There you can see the number of messages in the queue and the unacked messages; you should see 0 in the unacked messages graph.
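To check this outside the console, here is a minimal sketch using the google-cloud-pubsub Python client (the project and topic names are placeholders, not values from your setup) that lists the subscriptions attached to a topic; with 2 jobs running from the template you should see 2 template-created subscriptions:

```python
# Minimal sketch: list the subscriptions attached to a topic to confirm that
# each Dataflow job started from the template added its own subscription.
# "my-project" and "my-topic" are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Each entry is the full resource name of a subscription on this topic.
for subscription in publisher.list_topic_subscriptions(request={"topic": topic_path}):
    print(subscription)
```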

Related

Pub/Sub messages from snapshot not processed in a Dataflow streaming pipeline

We have a Dataflow pipeline consuming from Pub/Sub and writing into BigQuery in streaming. Due to a permissions issue the pipeline got stuck and the messages were not consumed; we restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem and re-deployed the pipeline with a new subscription to the topic, and all new events are consumed in streaming without a problem.
For all the unacked messages accumulated (20M) in the first subscription, we created a snapshot.
This snapshot was then connected to the new subscription via the UI using the Replay messages dialog.
In the metrics dashboard we see that the unacked messages spike to 20M and then they get consumed.
[screenshot: subscription unacked messages spike]
But then the events are not sent to BigQuery; checking the Dataflow job metrics, we can see a spike in the Duplicate message count within the "read from pubsub" step.
[screenshot: Dataflow duplicate message counter]
The messages are < 3 days old; does anybody know why this happens? Thanks in advance.
The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9, with Streaming Engine and Runner v2 enabled.
How long does it take to process a Pub/Sub message? Is it a long-running process?
In that case, Pub/Sub may redeliver messages, according to subscription configuration/delays. See Subscription retry policy.
Dataflow can work around that, as it acknowledges from the source after a successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform, it may resolve source duplications.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance
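As a rough illustration, and not the asker's actual pipeline (the project, subscription, and table names below are placeholders), this is roughly where a Reshuffle would sit, right after the Pub/Sub read:

```python
# Sketch of adding a Reshuffle after the Pub/Sub read in a streaming pipeline.
# Project, subscription, and table names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-subscription"
        )
        # Force a shuffle so the source checkpoint (and hence the Pub/Sub ack)
        # happens before the heavier downstream steps, which helps de-duplicate
        # redelivered messages close to the source.
        | "Reshuffle" >> beam.Reshuffle()
        | "ParseJson" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```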

GCP Pub/Sub: Life of a Message

I'm trying to learn about GCP Pub/Sub and I have a question about the life of a message in Pub/Sub. I used this article as my reference, and in this article they say:
Once at least one subscriber for each subscription has acknowledged the message, Pub/Sub deletes the message from storage.
So my first question is: for example, I have a Subscription A which is connected to Subscriber X and Subscriber Y. According to the docs, when Subscriber X receives the message and sends an ACK to Subscription A, Pub/Sub will delete the message from storage without considering whether Subscriber Y received the message or not. In other words, Pub/Sub doesn't care whether all subscribers have received the message; as soon as one subscriber acknowledges it, Pub/Sub will delete the message from storage. Am I right, please?
Then, in the following part, the article says:
Once all subscriptions on a topic have acknowledged a message, the message is asynchronously deleted from the publish message source and from storage.
And I feel a little bit confused here. What I understood is that if, for instance, I have a topic with N subscriptions and each subscription has M subscribers, Pub/Sub just needs to know that, for each subscription, at least one subscriber has acknowledged the message, and then it will delete the message from storage. Am I right, please?
I also found that in the documentation we have two concepts: Publishing Forwarder and Subscribing Forwarder. So may I ask some last questions:
What is the relationship between Subscription, Publishing Forwarder and Subscribing Forwarder? (For example, does a Subscription consist of only one Publishing Forwarder and one Subscribing Forwarder?)
Is the relationship between Publishing Forwarder and Subscribing Forwarder one-to-one, one-to-many, many-to-one, or many-to-many?
Can a Subscriber be associated with many Subscriptions or not?
Once a Subscriber consumes a message (here I assume this message is not duplicated, has no copy, and is unique), is it possible for this Subscriber to re-consume/re-read exactly this message?
If I misunderstand something, please point it out for me; I really appreciate that.
Thank you guys !!!
Quite a bit to unpack here. It is best not to think of a subscription as attaching to subscribers, and to understand that these two things are different. A subscription is a named entity that wants to receive all messages published to a topic. A subscriber is an actual client running to receive and process messages on behalf of a subscription. A topic can have many subscriptions. A subscription can have many subscribers.

If there are multiple subscribers in a subscription, then, assuming there are no duplicate deliveries and subscribers ack all messages received, each message published to a topic will be delivered to one subscriber for the subscription. This is called load balancing: the processing of messages is spread out over many subscribers.

If a topic has multiple subscriptions, each with one subscriber, then every subscriber will receive all messages. This is called fan out: each subscriber receives the complete set of messages published. Of course, it is possible to combine these two and have more than one subscriber for each subscription, in which case each message will be delivered to one subscriber for each subscription.
Forwarders are just the servers that are responsible for delivering messages. A publishing forwarder receives messages from publishers and a subscribing forwarder sends messages to subscribers. All of the relationships along the path of delivering a message, from publisher to publishing forwarder, publishing forwarder to subscribing forwarder, and subscribing forwarder to subscriber, can be many-to-many relationships.
A subscriber is associated with a single subscription. However, a running job could have multiple subscribers within it, e.g., one could instantiate the subscriber client library several times on different subscriptions.
All of the above assumed an important caveat: assuming there are no duplicate deliveries. In general, Cloud Pub/Sub guarantees at least once delivery. That means that even a message that was properly acked by a subscriber could be redelivered--either to the same subscriber or a different subscriber--in which case the subscriber needs to ack the message on the subsequent delivery. Generally, duplicate rates should be very low, in the 0.1% range for a well-behaved subscriber that is acking messages before the ack deadline expires.
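For what it's worth, here is a minimal streaming-pull subscriber sketch with the Python client (project and subscription names are placeholders). Running several copies of it against the same subscription gives you load balancing; pointing copies at different subscriptions of the same topic gives you fan out:

```python
# Minimal streaming-pull subscriber sketch; project and subscription names
# are placeholders. Multiple copies on one subscription load-balance messages;
# distinct subscriptions on the same topic fan them out.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received: {message.data!r}")
    # Ack even if this turns out to be a redelivery of an already-processed
    # message, since Cloud Pub/Sub is at-least-once.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Listen for 60 seconds, then shut down.
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```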

GCP - how to add alert on number of messages sent to a pubsub dead letter queue?

I have an application which processes messages from a Pub/Sub topic, and if processing fails the message is sent to a separate DLQ topic. I want to set up an alert in Monitoring so that if 30k messages are sent to the DLQ during a day, it notifies me and I can check why my service is not working.
I tried to set up some policies in GCP, but I couldn't find anywhere in the docs how to set up a metric of daily processed messages on a topic.
Can anyone help me?
You can create a new alert policy like this, based on the metric
Pub/Sub Subscription > Unacked messages.
You can add a filter on your subscription name if you have several subscriptions in your project.
Add the notification channel that you want (an email in my case). After a few minutes, you can see the first alert,
and the email notification.
EDIT
For the acked messages, you can do this
I never tried an aggregation over 1 day, but it should be OK.
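If you would rather alert on the publish count of the DLQ topic itself, the underlying metric is topic/send_message_operation_count. Below is a minimal sketch with the Cloud Monitoring Python client (project and topic names are placeholders) that pulls that metric for the last day; an alert policy threshold can be built on the same metric and filter:

```python
# Sketch: sum messages published to a (placeholder) DLQ topic over the last
# 24 hours, using the Cloud Monitoring metric that an alert policy would use.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"      # placeholder
DLQ_TOPIC_ID = "my-dlq-topic"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 24 * 3600},  # last 24 hours
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/topic/send_message_operation_count" '
            f'AND resource.labels.topic_id = "{DLQ_TOPIC_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each point is a per-sampling-period delta; summing them over the window
# gives the number of messages published to the DLQ topic in the last day.
total = sum(point.value.int64_value for series in results for point in series.points)
print(f"Messages sent to {DLQ_TOPIC_ID} in the last 24h: {total}")
```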
Please check the following GCP community tutorial, which outlines how to create an alert-based event archiver with Stackdriver and Cloud Pub/Sub:
https://cloud.google.com/community/tutorials/cloud-pubsub-drainer

Does Google Cloud (GCP) Pub/Sub supports feature similar to ConsumerGroups as in Kafka

Trying to decide between Google Cloud (GCP) Pub/Sub and a managed Kafka service.
In a recent update, Pub/Sub added support for replaying messages that were processed before, which is a welcome change.
One feature I am not able to find in their documentation is whether we can have something similar to Kafka's consumer groups, i.e. have groups of subscribers, each processing data from the same topic, and be able to re-process the data from the beginning for one subscriber (consumer group) while the others are not affected by it.
eg:
Lets say you have a Topic as StockTicks
And you have two consumer groups
CG1: with two consumers
CG2: with another two consumers
In Kafka I can read messages independently between these groups, but can I do the same thing with Pub/Sub?
And Kafka allows you to replay messages from the beginning; can I do the same with Pub/Sub? I am OK if I can't replay the messages that were published before the CG was created, but can I replay the messages that were submitted after a CG/subscribers were created?
Cloud Pub/Sub's equivalent of a Kafka consumer group is a subscription. Subscribers are the equivalent of consumers. This answer spells out the relationship between subscriptions and subscribers in a little more detail.
Your example in Cloud Pub/Sub terms would have a single topic, StockTicks, with two subscriptions (call them CG1 and CG2). You would bring up four subscribers, two that fetch messages for the subscription CG1 and two that fetch messages for the CG2 subscription. Acking and replay would be independent on CG1 and CG2, so if you were to seek back on CG1, it would not affect the delivery of messages to subscribers for CG2 at all.
Keep in mind with Cloud Pub/Sub that only messages published after a subscription is successfully created will be delivered to subscribers on that subscription. Therefore, if you create a new subscription, you won't get all of the messages published since the beginning of time; you will only get messages published from that point on.
If you seek back on a subscription, you can only get up to 7 days of messages to replay (assuming the subscription was created at least 7 days ago) since that is the max retention time for messages in Cloud Pub/Sub.
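To make the StockTicks example concrete, here is a minimal sketch with the Python client (the project name is a placeholder, and retain_acked_messages is an extra assumption so that seek can also replay already-acked messages): it creates the two subscriptions and then seeks CG1 back one hour without affecting CG2.

```python
# Sketch: two subscriptions (CG1, CG2) on one topic behave like two Kafka
# consumer groups; seeking CG1 back in time replays messages only for CG1.
# The project name is a placeholder; the topic/subscription names come from
# the StockTicks example above.
import datetime

from google.cloud import pubsub_v1
from google.protobuf import timestamp_pb2

project_id = "my-project"  # placeholder

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, "StockTicks")

# Two subscriptions play the role of two Kafka consumer groups.
for name in ("CG1", "CG2"):
    subscriber.create_subscription(
        request={
            "name": subscriber.subscription_path(project_id, name),
            "topic": topic_path,
            # Retain acked messages so a seek can replay messages that were
            # already processed (within the retention window).
            "retain_acked_messages": True,
        }
    )

# Replay the last hour of messages for CG1 only; CG2 is unaffected.
one_hour_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
seek_time = timestamp_pb2.Timestamp()
seek_time.FromDatetime(one_hour_ago)
subscriber.seek(
    request={
        "subscription": subscriber.subscription_path(project_id, "CG1"),
        "time": seek_time,
    }
)
```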

Should Dataflow consume events from a Pub/Sub topic or subscription? [duplicate]

This question already has an answer here:
Dataflow Template Cloud Pub/Sub Topic vs Subscription to BigQuery
(1 answer)
Closed 3 years ago.
I am looking to stream events from Pub/Sub into BigQuery using Dataflow. I see that there are two templates for doing this in GCP: one where Dataflow reads messages from a topic, and one from a subscription.
What are the advantages of using a subscription here, rather than just consuming the events from the topic?
Core concepts
Topic: A named resource to which messages are sent by publishers.
Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
According to the core concepts, the difference is rather simple:
Use a Topic when you would like to publish messages from Dataflow to Pub/Sub (that is, to a given topic).
Use a Subscription when you would like to consume messages coming from Pub/Sub in Dataflow.
Thus, in your case, go for a subscription.
More info:
Keep in mind that Pub/Sub manages topics using its own message store. However, the Cloud Pub/Sub Topic to BigQuery template is particularly useful when you would like to move these messages into BigQuery as well (and possibly perform your own analysis).
The Cloud Pub/Sub Topic to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Cloud Pub/Sub topic and writes them to a BigQuery table. You can use the template as a quick solution to move Cloud Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Cloud Pub/Sub and converts them to BigQuery elements.
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery
Disclaimer: Comments and opinions are my own and not the views of my employer.
Both the Topic to BigQuery and Subscription to BigQuery templates consume messages from Pub/Sub and stream them into BigQuery.
If you use the Topic to BigQuery template, Dataflow will create a subscription behind the scenes for you that reads from the specified topic. If you use the Subscription to BigQuery template, you will need to provide your own subscription.
You can use Subscription to BigQuery templates to emulate the behavior of a Topic to BigQuery template by creating multiple subscription-connected BigQuery pipelines reading from the same topic.
For new deployments, using the Subscription to BigQuery template is preferred. If you stop and restart a pipeline using the Topic to BigQuery template, a new subscription will be created, which may cause you to miss some messages that were published while the pipeline was down. The Subscription to BigQuery template doesn't have this disadvantage, since it uses the same subscription even after the pipeline is restarted.
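For readers writing their own (non-template) pipeline, the same trade-off shows up in the Beam Pub/Sub connector. Here is a sketch with placeholder resource names showing both read modes side by side (in practice you would pick one):

```python
# Sketch of the two read modes in Apache Beam's Pub/Sub connector;
# resource names are placeholders, and a real pipeline would use only one.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Option 1 - read from a topic: a subscription is created behind the
    # scenes for this pipeline, so a restarted pipeline starts from a fresh
    # subscription and can miss messages published while it was down.
    from_topic = p | "ReadFromTopic" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/my-topic"
    )

    # Option 2 - read from a subscription you manage: messages published
    # while the pipeline is down wait in the subscription and are delivered
    # once the pipeline is restarted against the same subscription.
    from_subscription = p | "ReadFromSubscription" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-subscription"
    )
```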