Messages lost and duplicated in GCP Pub/Sub - google-cloud-platform

I'm running into an issue reading from GCP Pub/Sub in Dataflow: when I publish a large number of messages in a short period of time, Dataflow receives most of the sent messages, but some messages are lost and some others are duplicated. The strangest part is that the number of lost messages is exactly the same as the number of duplicated messages.
In one of the experiments, I sent 4,000 messages in 5 seconds; 4,000 messages were received in total, but 9 messages were lost and exactly 9 messages were duplicated.
The way I detect the duplicates is via logging: I log every message published to Pub/Sub along with the message ID generated by Pub/Sub, and I also log each message right after reading it from PubsubIO in a ParDo transform (a sketch of that transform follows the pipeline code below).
The way I read from Pub/Sub in Dataflow is with org.apache.beam.sdk.io.PubsubIO:
public interface Options extends GcpOptions, DataflowPipelineOptions {

    // PUBSUB URL
    @Description("Pubsub URL")
    @Default.String("https://pubsub.googleapis.com")
    String getPubsubRootUrl();
    void setPubsubRootUrl(String value);

    // TOPIC
    @Description("Topic")
    @Default.String("projects/test-project/topics/test_topic")
    String getTopic();
    void setTopic(String value);

    ...
}
public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);
    options.setRunner(DataflowRunner.class);
    ...

    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply(PubsubIO
            .<String>read()
            .topic(options.getTopic())
            .withCoder(StringUtf8Coder.of())
        )
        .apply("Logging data coming out of Pubsub", ParDo
            .of(some_logging_transformation)
        )
        .apply("Saving data into db", ParDo
            .of(some_output_transformation)
        );

    pipeline.run().waitUntilFinish();
}
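The logging transform is elided above (some_logging_transformation); a minimal sketch of what such a DoFn could look like for this pipeline is shown here. The class name and log format are hypothetical, not taken from the question.

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogPubsubMessageFn extends DoFn<String, String> {
    private static final Logger LOG = LoggerFactory.getLogger(LogPubsubMessageFn.class);

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Log the payload exactly as it comes out of PubsubIO so it can be
        // matched against the publisher-side logs to detect losses and duplicates.
        LOG.info("Received from Pubsub: {}", c.element());
        c.output(c.element());
    }
}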
Is this a known issue in Pub/Sub or PubsubIO?
UPDATE:
I tried 4,000 requests with the Pub/Sub emulator: no missing data and no duplicates.
UPDATE #2:
I ran some more experiments and found that the duplicated messages take their message_id from the missing ones. Because the issue has drifted quite a bit from its origin, I decided to post another question with detailed logs as well as the code I used to publish and receive messages.
link to the new question: Google Cloud Pubsub Data lost

I talked with an engineer from the Google Pub/Sub team. It seems to be caused by a thread-safety issue with the Python client. Please refer to the accepted answer of Google Cloud Pubsub Data lost for the response from Google.

Related

GCP PubSub understanding filters

In my understanding, Pub/Sub filters are supposed to reduce the number of messages delivered to a specific subscription. We currently observe behaviour that we didn't expect.
Assume there is a Pub/Sub topic "XYZ" and a subscription to that topic "XYZ-Sub" with a filter attributes.someHeader = "x".
There are 2 messages published to that topic:
The first one with attributes.someHeader = "a", the second one with attributes.someHeader = "x".
I expect that only message 2 will be delivered to the subscription, as message 1 does not match the filter.
If that is not the case and both messages still get delivered (which is what we currently observe):
The GCP console shows a rising number of unacked messages on the subscription when no client is connected. Pulling these messages in the GCP console removes them without showing any received messages, which makes me assume that the filters are applied when pulling messages.
Are filters evaluated in the Pub/Sub client and not at the topic/subscription level?
What is the point of using filters with Pub/Sub?
Will the delivery of the unwanted message (the bytes of the message) be billed?
Filtering in Cloud Pub/Sub only delivers messages that match the filter to subscribers. The filters are applied in the Pub/Sub service itself, not in the client. They allow you to limit the set of messages delivered to subscribers when the subscriber only wants to process a subset of the messages.
In your example, only the message with attributes.someHeader = "x" should be delivered. However, note that, as the documentation explains, the backlog metrics might include messages that don't match the filter. Such messages will not be delivered to subscribers, but may still show up in the backlog metrics for a time.
You do get charged the Pub/Sub message delivery price for messages that are filtered out and not delivered. However, you do not pay any network fees for them, nor do you end up paying for any compute to process messages you do not receive.
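For reference, the filter is part of the subscription itself and is set when the subscription is created. A minimal sketch with the Java client library (the project name is a placeholder; the topic, subscription, and filter mirror the question):

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.SubscriptionName;
import com.google.pubsub.v1.TopicName;

public class CreateFilteredSubscription {
    public static void main(String[] args) throws Exception {
        // Placeholder project name; topic and subscription as in the question.
        TopicName topic = TopicName.of("my-project", "XYZ");
        SubscriptionName subscription = SubscriptionName.of("my-project", "XYZ-Sub");

        try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
            // The filter is evaluated by the Pub/Sub service itself; messages that
            // don't match are never delivered to this subscription.
            client.createSubscription(
                Subscription.newBuilder()
                    .setName(subscription.toString())
                    .setTopic(topic.toString())
                    .setFilter("attributes.someHeader = \"x\"")
                    .setAckDeadlineSeconds(10)
                    .build());
        }
    }
}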

Google Cloud Pub/Sub retrieve message by ID

Problem: My use case is that I want to publish thousands of messages to Google Cloud Pub/Sub with a 5-minute retention period, but only retrieve specific messages by their ID. A Cloud Function will retrieve one message by ID using the Node.js SDK, and all the untreated messages will be deleted by the retention policy. All the current examples show how to handle arbitrary messages from the subscriber.
Is it possible to pull just one message by ID (or any other metadata) and then close the connection?
There is no way to retrieve individual messages by ID, no. It doesn't really fit into the expected use cases for Cloud Pub/Sub where the publishers and subscribers are meant to be decoupled, meaning the subscriber inherently doesn't know the message IDs prior to receiving the messages.
You may instead want to transmit the messages via whatever mechanism you are using to make the subscribers aware of the message IDs. Or, if you know at publish time which messages will ultimately need to be retrieved, you could add an attribute to the message to indicate this and use filtering (a sketch follows).
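For example, messages that are known at publish time to need later retrieval could be tagged with an attribute and routed through a filtered subscription. A sketch with the Java client library (the same idea applies with the Node.js SDK; the project, topic, and attribute names are hypothetical):

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishWithRetrievalAttribute {
    public static void main(String[] args) throws Exception {
        // Placeholder project/topic names.
        Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                .setData(ByteString.copyFromUtf8("payload"))
                // Hypothetical attribute marking messages that must be retrievable later.
                // A subscription with the filter attributes.needsRetrieval = "true"
                // would then receive only these messages.
                .putAttributes("needsRetrieval", "true")
                .build();
            publisher.publish(message).get();
        } finally {
            publisher.shutdown();
        }
    }
}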

Consuming messages from Google Pub/Sub and publishing them to Kafka

I am trying to consume Google Pub/Sub messages using the synchronous pull API, which is available in the Apache Beam Google Pub/Sub IO connector library.
I want to write the consumed messages to Kafka using KafkaIO. I want to use FlinkRunner to execute the job, since we run this application outside GCP.
The problem I am facing is that the consumed messages are not getting ACKed in GCP Pub/Sub. I have confirmed that the local Kafka instance has the messages consumed from GCP Pub/Sub. The GCP Dataflow documentation indicates that a data bundle gets finalized when the pipeline is terminated with a data sink, which is Kafka in my case.
But since the code is running on Apache Flink and not GCP Dataflow, I think some sort of callback related to ACKing the committed messages is not getting fired.
What am I doing wrong here?
pipeline
    .apply("Read GCP PubSub Messages", PubsubIO.readStrings()
        .fromSubscription(subscription)
    )
    .apply(ParseJsons.of(User.class))
    .setCoder(SerializableCoder.of(User.class))
    .apply("Filter-1", ParDo.of(new FilterTextFn()))
    .apply(AsJsons.of(User.class).withMapper(new ObjectMapper()))
    .apply("Write to Local Kafka",
        KafkaIO.<Void, String>write()
            .withBootstrapServers("127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094")
            .withTopic("test-topic")
            .withValueSerializer(StringSerializer.class)
            .values()
    );
The Beam documentation on the PubsubIO class mentions this:
Checkpoints are used both to ACK received messages back to Pubsub (so that they may be retired on the Pubsub end), and to NACK already consumed messages should a checkpoint need to be restored (so that Pubsub will resend those messages promptly).
The ACKs are not specific to Dataflow; you should have the same behavior on Dataflow. The ACKs are sent on checkpoints. Usually the checkpoints correspond to the windows that you set on your streaming flow.
But you didn't set a window! By default, the window is global, and it closes only at the end, if you stop your job gracefully (and even then, I'm not sure about this). Anyway, a better solution is to use fixed windows (for example, of 5 minutes) so that the messages are ACKed for each of these windows.
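A sketch of what adding the suggested fixed window to the pipeline above could look like (5-minute windows; only the windowing step is new, the rest of the pipeline is unchanged):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

pipeline
    .apply("Read GCP PubSub Messages", PubsubIO.readStrings()
        .fromSubscription(subscription))
    // Cut the unbounded stream into 5-minute fixed windows instead of relying
    // on the default global window that only closes when the job ends.
    .apply("Fixed 5 min windows",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(ParseJsons.of(User.class))
    ... // rest of the pipeline unchanged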
The way I fixed this was by following Guillaume Blaquiere's (https://stackoverflow.com/users/11372593/guillaume-blaquiere) suggestion of looking at checkpoints. Even after adding the Window.into() function in the pipeline, the source Pub/Sub subscription endpoint did not receive ACKs.
The problem was in my Flink server configuration: I had not specified the checkpoint configuration. Without these parameters, checkpoints are disabled.
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-1.9.3/state/checkpoints/
These configs should go in the flink_home/conf/flink-conf.yaml.
After adding these entries and restarting Flink, all the backlogged (unACKed) messages went to 0 in the GCP Pub/Sub monitoring chart.

Move Google Pub/Sub Messages Between Topics

How can I bulk move messages from one topic to another in GCP Pub/Sub?
I am aware of the Dataflow templates that provide this, but unfortunately restrictions do not allow me to use the Dataflow API.
Any suggestions on ad-hoc movement of messages between topics (besides copying and pasting them one by one)?
Specifically, the use case is for moving messages in a deadletter topic back into the original topic for reprocessing.
You can't use snapshots, because snapshots can only be applied to subscriptions of the same topic (to avoid message ID overlapping).
The easiest way is to write a function that pulls from your subscription. Here is how I would do it:
Create a topic (named, for example, "transfer-topic") with a push subscription. Set the timeout to 10 minutes.
Create a Cloud Function with an HTTP trigger invoked by the Pub/Sub push subscription (or a Cloud Run service). When you deploy it, set the timeout to 9 minutes for the Cloud Function and to 10 minutes for Cloud Run. The processing content is the following:
Read a chunk of messages (for example, 1,000) from the dead-letter pull subscription
Publish the messages (in bulk mode) into the initial topic
Acknowledge the messages on the dead-letter subscription
Repeat this until the pull subscription is empty
Return code 200
The global process:
Publish a message to the transfer-topic
The message triggers the function/Cloud Run service with an HTTP push
The process pulls the messages and republishes them into the initial topic
If the timeout is reached, the function crashes and Pub/Sub retries the HTTP request (with exponential backoff)
If all the messages are processed, the HTTP 200 response code is returned and the process stops (and the message in the transfer-topic subscription is ACKed)
This process allows you to process a very large number of messages without worrying about the timeout. A sketch of the core loop follows.
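A sketch of the pull/republish/acknowledge loop with the Java client library, under the assumptions above (project, subscription, topic, and batch size are placeholders; synchronous pull is used):

import com.google.cloud.pubsub.v1.Publisher;
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.AcknowledgeRequest;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PullRequest;
import com.google.pubsub.v1.PullResponse;
import com.google.pubsub.v1.ReceivedMessage;
import com.google.pubsub.v1.TopicName;
import java.util.ArrayList;
import java.util.List;

public class DeadLetterTransfer {
    public static void main(String[] args) throws Exception {
        // Placeholder names.
        String deadLetterSub = ProjectSubscriptionName.format("my-project", "deadletter-sub");
        Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "initial-topic")).build();

        try (SubscriberStub subscriber =
                GrpcSubscriberStub.create(SubscriberStubSettings.newBuilder().build())) {
            while (true) {
                // 1. Read a chunk of messages from the dead-letter pull subscription.
                PullResponse response = subscriber.pullCallable().call(
                    PullRequest.newBuilder()
                        .setSubscription(deadLetterSub)
                        .setMaxMessages(1000)
                        .build());
                if (response.getReceivedMessagesList().isEmpty()) {
                    // Subscription looks empty: the function would return HTTP 200 here.
                    break;
                }

                // 2. Republish the messages to the initial topic, then
                // 3. acknowledge them on the dead-letter subscription.
                // Publishing before acknowledging means a crash can cause duplicates,
                // but never data loss.
                List<String> ackIds = new ArrayList<>();
                for (ReceivedMessage msg : response.getReceivedMessagesList()) {
                    publisher.publish(msg.getMessage());
                    ackIds.add(msg.getAckId());
                }
                publisher.publishAllOutstanding();

                subscriber.acknowledgeCallable().call(
                    AcknowledgeRequest.newBuilder()
                        .setSubscription(deadLetterSub)
                        .addAllAckIds(ackIds)
                        .build());
            }
        } finally {
            publisher.shutdown();
        }
    }
}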
I suggest that you use a Python script for that.
You can use the Pub/Sub client library to read the messages and publish them to another topic, like below:
from google.cloud import pubsub
from google.cloud.pubsub import types

# Defining parameters
PROJECT = "<your_project_id>"
SUBSCRIPTION = "<your_current_subscription_name>"
NEW_TOPIC = "projects/<your_project_id>/topics/<your_new_topic_name>"

# Creating clients for publishing and subscribing. Adjust the max_messages for your purpose
subscriber = pubsub.SubscriberClient()
publisher = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(max_messages=500),
)

# Get your messages. Adjust the max_messages for your purpose
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
response = subscriber.pull(subscription_path, max_messages=500)

# Publish your messages to the new topic
for msg in response.received_messages:
    publisher.publish(NEW_TOPIC, msg.message.data)

# Ack the old subscription if necessary
ack_ids = [msg.ack_id for msg in response.received_messages]
subscriber.acknowledge(subscription_path, ack_ids)
Before running this code you will need to install the Pub/Sub client library in your Python environment. You can do that by running pip install google-cloud-pubsub.
One approach to executing your code is to use Cloud Functions. If you decide to use it, pay attention to two points:
The maximum time that your function can take to run is 9 minutes. If this timeout gets exceeded, your function will terminate without finishing the job.
In Cloud Functions you can just put google-cloud-pubsub in a new line of your requirements file instead of running a pip command.

How to poll for a specific number of messages using the Amazon SQS Spring Cloud annotation @SqsListener

Need: to poll/listen for 10 messages at a time (a specific count, not based on polling time).
The existing code base, using Spring Cloud AWS Messaging, polls based on time by default.
It is now required to poll based on the number of messages.
I am looking for an annotation-based configuration approach similar to the code below:
@SqsListener(value = "xxx-sqs", deletionPolicy = ON_SUCCESS)
public void auditProcessor(String json) throws IOException {
    log.info("*********** Inside Listener ***********");
    log.info("JSON Data " + json);
}
Any help on this would be appreciated.
The number of messages requested from SQS can be configured in SimpleMessageListenerContainerFactory (a sketch follows below).
That being said, SQS works in a way that even if you ask for 10 messages, you have no guarantee that you will get 10, even if there are more messages in the queue. You can only be sure that you will not get more than 10.
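For reference, a sketch of such a factory bean with Spring Cloud AWS (the configuration class and bean names are arbitrary):

import com.amazonaws.services.sqs.AmazonSQSAsync;
import org.springframework.cloud.aws.messaging.config.SimpleMessageListenerContainerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SqsListenerConfig {

    @Bean
    public SimpleMessageListenerContainerFactory simpleMessageListenerContainerFactory(AmazonSQSAsync amazonSqs) {
        SimpleMessageListenerContainerFactory factory = new SimpleMessageListenerContainerFactory();
        factory.setAmazonSqs(amazonSqs);
        // Request up to 10 messages per poll; SQS may still return fewer.
        factory.setMaxNumberOfMessages(10);
        // Optional: long-polling wait time in seconds.
        factory.setWaitTimeOut(20);
        return factory;
    }
}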