Is there a way I can consume Google PubSub messages using synchronous pull in an Apache Beam job? - google-cloud-platform

I have already gone through the client library provided by Google in the doc below. That client library only polls messages from PubSub; it will not poll continuously unless we build an unbounded source connector around it.
https://cloud.google.com/pubsub/docs/pull#synchronous_pull
Since the source connector I'm trying to build is an unbounded source, I would need to take care of the checkpoint marker, implement a PubSub reader and a PubSub split source, and implement ACK and NACK logic. I believe it will take a good amount of time to create my own unbounded source connector. Right now PubSubIO (the Beam API) only supports asynchronous pull. So is there any way I can just implement ACK and NACK logic on top of the PubSubIO API provided by Apache Beam? Is there an already developed API that is more suitable for this kind of use case?
With synchronous pull, you can acknowledge the intended message and NACK the consumed message in case of any parsing failure.

The feature that you expect doesn't exist, and should not exist.
The current behavior, async pull, gets the message, and as soon as the message is persisted (in the worker or in a sink in the pipeline, whichever comes first), the message is ACKed.
In your case, you expect to ACK the message manually according to the pipeline processing. However, PubSub gives you at most 10 minutes to ACK a message. Imagine you build a pipeline with windows of 15 minutes (or more). You would need to wait until the window has been processed to ACK the messages; impossible!
The correct design, in your case, is to manage your errors in your pipeline.
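As a rough illustration of what managing errors inside the pipeline can look like, here is a minimal dead-letter sketch in plain Java with no Beam dependency. The parse step catches its own failures and routes the bad payload to a separate output instead of throwing, so no manual NACK is needed for a message that cannot be parsed. The names DeadLetterRouter, Routed, and route are hypothetical and exist only for this example; in an actual Beam job the two lists would more likely be two tagged outputs of a ParDo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a pipeline step with a main output and a dead-letter output.
final class DeadLetterRouter {

    // Holds the successfully parsed values and the raw payloads that failed to parse.
    static final class Routed<T> {
        final List<T> parsed = new ArrayList<>();
        final List<String> deadLetters = new ArrayList<>();
    }

    // Applies the parser to every payload. A failure is recorded, never rethrown,
    // so one bad message cannot stall the rest of the stream.
    static <T> Routed<T> route(List<String> payloads, Function<String, T> parser) {
        Routed<T> out = new Routed<>();
        for (String payload : payloads) {
            try {
                out.parsed.add(parser.apply(payload));
            } catch (RuntimeException e) {
                out.deadLetters.add(payload);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Routed<Integer> routed = route(Arrays.asList("1", "oops", "3"), Integer::parseInt);
        System.out.println("parsed=" + routed.parsed + " deadLetters=" + routed.deadLetters);
    }
}
```

The dead-letter output can then be written to its own sink (another topic, a table, a file) and inspected or replayed later, independently of when PubSub receives its ACK.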

Related

Can you do batch pull messages with Google Pub Sub?

Trying to optimize our application by doing batch pulling. Pub Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub Sub?
Streaming Pull and the Pull RPC both seem to allow the subscriber to consume only one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. Pretty new to GCP in general.
The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
Use Pull directly, which allows one to process all of the messages in a batch at a time. Note that using pull effectively requires many simultaneously outstanding pull requests and replacing requests that return with new requests immediately.
In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them.
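The second option (re-batching inside the user callback) could look roughly like the following. This is a stdlib-only sketch, not the client library's API: MessageBatcher and its parameters are hypothetical names, and the caller passes in the clock reading to keep the example deterministic. Note the simplification that the time limit is only checked when a new message arrives; a real subscriber would also want a timer that calls flush() on a quiet stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects messages handed over one at a time and releases
// them as a batch once the buffer is full or the oldest entry has waited long enough.
final class MessageBatcher<T> {
    private final int maxSize;
    private final long maxWaitMillis;
    private final Consumer<List<T>> batchHandler;
    private final List<T> buffer = new ArrayList<>();
    private long firstArrivalMillis;

    MessageBatcher(int maxSize, long maxWaitMillis, Consumer<List<T>> batchHandler) {
        this.maxSize = maxSize;
        this.maxWaitMillis = maxWaitMillis;
        this.batchHandler = batchHandler;
    }

    // Intended to be called from the per-message callback.
    synchronized void add(T message, long nowMillis) {
        if (buffer.isEmpty()) {
            firstArrivalMillis = nowMillis;
        }
        buffer.add(message);
        if (buffer.size() >= maxSize || nowMillis - firstArrivalMillis >= maxWaitMillis) {
            flush();
        }
    }

    // Hands the buffered messages to the handler, which would process them
    // together (for example one database write) and only then ack each of them.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        batchHandler.accept(batch);
    }

    public static void main(String[] args) {
        MessageBatcher<String> batcher =
            new MessageBatcher<String>(2, 1000L, batch -> System.out.println("batch: " + batch));
        batcher.add("a", 0L);
        batcher.add("b", 5L);
    }
}
```

Whatever wait limit is chosen has to stay comfortably below the subscription's ack deadline, since messages sitting in the buffer have not been acked yet.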
You can probably implement that by using Dataflow (Apache Beam). You can have a running streaming job where you group, window, and transform messages according to your requirements. The results of processing can be saved in batches or streamed further. It probably only makes sense if the number of messages is really big.

Consuming messages from Google Pubsub and publishing it to Kafka

I am trying to consume Google PubSub messages using synchronous PULL API. This is available in Apache Beam Google PubSub IO connector library.
I want to write the consumed messages to Kafka using KafkaIO. I want to use FlinkRunner to execute the job, since we run this application outside GCP.
The problem I am facing is that the consumed messages are not getting ACK'd in GCP PubSub. I have confirmed that the local Kafka instance has the messages consumed from GCP PubSub. The documentation in GCP DataFlow indicates that the data bundle gets finalized when the pipeline is terminated with a data sink, which is Kafka in my case.
But since the code is running in Apache Flink and not GCP DataFlow, I think some sort of callback related to ACK'ing the committed message is not getting fired.
What am I doing wrong here?
pipeline
    .apply("Read GCP PubSub Messages",
        PubsubIO.readStrings()
            .fromSubscription(subscription))
    .apply(ParseJsons.of(User.class))
    .setCoder(SerializableCoder.of(User.class))
    .apply("Filter-1", ParDo.of(new FilterTextFn()))
    .apply(AsJsons.of(User.class).withMapper(new ObjectMapper()))
    .apply("Write to Local Kafka",
        KafkaIO.<Void, String>write()
            .withBootstrapServers("127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094")
            .withTopic("test-topic")
            .withValueSerializer(StringSerializer.class)
            .values());
The Beam documentation for the PubSub IO class mentions this:
Checkpoints are used both to ACK received messages back to Pubsub (so that they may be retired on the Pubsub end), and to NACK already consumed messages should a checkpoint need to be restored (so that Pubsub will resend those messages promptly).
The ACKs are not tied to Dataflow; you should see the same behavior on Dataflow. The ACKs are sent on checkpoints. Usually the checkpoints are the windows that you set on your stream.
But you didn't set a window! By default, the window is global, and it closes only at the end, if you stop your job gracefully (and even then, I'm not sure about this). Anyway, a better solution is to use fixed windows (for example, of 5 minutes) so that the messages are ACKed at each of these windows.
The way I fixed this was by following Guillaume Blaquiere's (https://stackoverflow.com/users/11372593/guillaume-blaquiere) suggestion of looking at checkpoints. Even after adding the Window.into() function to the pipeline, the source PubSub subscription did not receive ACKs.
The problem was that I had not specified any checkpoint configuration in the Flink server configuration. Without these parameters, checkpoints are disabled.
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-1.9.3/state/checkpoints/
These settings should go in flink_home/conf/flink-conf.yaml.
After adding these entries and restarting Flink, the backlog of unACKed messages went to 0 in the GCP PubSub monitoring chart.

Google Cloud PubSub Message Delivered More than Once before reaching deadline acknowledgement time

Background:
We configured a Cloud PubSub topic for communication between multiple App Engine services, with push-based subscribers. We set the acknowledgement deadline to 600 seconds.
Issue:
We have observed that PubSub pushed the same message twice (more than twice for some other topics) to its subscribers. In the logs I can see that the pushes happened just 1 second apart. Since we configured ackDeadline to 600 seconds, PubSub should ideally reattempt delivery only after 600 seconds.
Need answers to the following:
Why was the same message delivered more than once within 1 second?
Does PubSub not honor the ackDeadline configuration before reattempting message delivery?
References:
- https://cloud.google.com/pubsub/docs/subscriber
Message redelivery can happen for a couple of reasons. First of all, it is possible that a message got published twice. Sometimes the publisher will get back an error like a deadline exceeded, meaning the publish took longer than anticipated. The message may or may not have actually been published in this situation. Often, the correct action is for the publisher to retry the publish and in fact that is what the Google-provided client libraries do by default. Consequently, there may be two copies of the message that were successfully published, even though the client only got confirmation for one of them.
Secondly, Google Cloud Pub/Sub guarantees at-least-once delivery. This means that occasionally, messages can be redelivered, even if the ackDeadline has not yet passed or an ack was sent back to the service. Acknowledgements are best effort and most of the time, they are successfully processed by the service. However, due to network glitches, server restarts, and other regular occurrences of that nature, sometimes the acknowledgements sent by the subscriber will not be processed, resulting in message redelivery.
A subscriber should be designed to be resilient to these occasional redeliveries, generally by ensuring that operations are idempotent, i.e., that the results of processing the message multiple times are the same, or by tracking and catching duplicates. Alternatively, one can use Cloud Dataflow as a subscriber to remove duplicates.
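The "tracking and catching duplicates" option can be sketched as below. This is a minimal in-memory illustration, not a production design: DedupingSubscriber, onMessage, and the bounded ID cache are hypothetical, and a real deployment with several subscriber instances would need shared storage for the seen IDs. It also only covers redelivery of the same message; two copies created by a retried publish would presumably carry different message IDs, so catching those means keying on an application-level identifier instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical wrapper that runs a handler at most once per remembered message ID.
final class DedupingSubscriber {
    private final Map<String, Boolean> seen;
    private final Consumer<String> handler;

    DedupingSubscriber(final int maxTrackedIds, Consumer<String> handler) {
        this.handler = handler;
        // Bounded map: once more than maxTrackedIds are stored, the oldest ID is dropped.
        this.seen = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxTrackedIds;
            }
        };
    }

    // Returns true if the handler ran, false if the ID was a known duplicate.
    // In both cases the caller should acknowledge the message.
    synchronized boolean onMessage(String messageId, String payload) {
        if (seen.containsKey(messageId)) {
            return false;
        }
        handler.accept(payload);
        // Recorded only after the handler succeeds, so a failed attempt can be retried.
        seen.put(messageId, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        DedupingSubscriber subscriber =
            new DedupingSubscriber(1000, payload -> System.out.println("handled " + payload));
        System.out.println(subscriber.onMessage("id-1", "hello"));
        System.out.println(subscriber.onMessage("id-1", "hello"));
    }
}
```

Because the cache is bounded, a duplicate that arrives after its ID has been evicted is processed again, which is why making the handler itself idempotent remains the more robust of the two approaches.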

AWS SQS Boto3 sending messages to dead letter manually

So I am building a small application that uses SQS. I have a simple handler process that determines if a given message is considered processed, marked for retry (to be re-queued) or is not able to be processed (should be sent to dead letter).
However based on the docs it would appear the only way to truly send a message to DL is by using a redrive policy which operates over # of receives a message has racked up. Because of the nature of my application, I could have several valid retries if my process isn't ready to handle a given message, but there are also times I may want to DL a message I have just received. Does AWS/Boto3 not provide a way to mark a specific message for DL?
I know I can just send the message myself to another queue I consider my own DL, I would just rather use AWS' built in tools for this.
I don't believe there is any limitation that would prevent you from sending the message to the dead-letter queue yourself.
So just read the message from the queue, and if you know it needs to go to the DLQ directly, send it to the DLQ and remove it from the regular queue.
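The order of the two steps matters: copy to the DLQ first, delete from the source queue second, so a crash in between leaves a duplicate rather than a lost message. The sketch below models that with two in-memory deques standing in for the queues; ManualDeadLetter, Outcome, and handleNext are hypothetical names, and nothing here calls AWS. With Boto3 the DEAD branch would correspond to a send_message on the DLQ followed by a delete_message on the source queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Function;

// In-memory stand-in for a source queue and a self-managed dead-letter queue.
final class ManualDeadLetter {
    enum Outcome { PROCESSED, RETRY, DEAD }

    final Deque<String> mainQueue = new ArrayDeque<>();
    final Deque<String> deadLetterQueue = new ArrayDeque<>();

    // Handles the message at the head of the main queue, if there is one.
    void handleNext(Function<String, Outcome> handler) {
        String message = mainQueue.peekFirst(); // "receive": the message is not removed yet
        if (message == null) {
            return;
        }
        switch (handler.apply(message)) {
            case PROCESSED:
                mainQueue.pollFirst();            // delete from the source queue
                break;
            case DEAD:
                deadLetterQueue.addLast(message); // copy to the DLQ first...
                mainQueue.pollFirst();            // ...then delete from the source queue
                break;
            case RETRY:
                // Modeled as moving the message to the back; with SQS, simply not
                // deleting it makes it visible again after the visibility timeout.
                mainQueue.addLast(mainQueue.pollFirst());
                break;
        }
    }

    public static void main(String[] args) {
        ManualDeadLetter queues = new ManualDeadLetter();
        queues.mainQueue.addLast("bad");
        queues.handleNext(message -> Outcome.DEAD);
        System.out.println("main=" + queues.mainQueue + " dlq=" + queues.deadLetterQueue);
    }
}
```

This keeps the redrive policy available as a safety net for messages that fail repeatedly, while letting the handler dead-letter a message on first receipt when it already knows the message cannot be processed.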

does MSMQ have "lock until expire" functionality similar to Amazon SQS?

I've been using AWS SQS, which has a nice feature that when a message is claimed from the queue it locks for a period of time. During this lock if it is processed successfully the message is marked as completed. If the processing fails (and no response is received from the message processor), after a period of time the lock expires and the message is available for another processor to pick up.
Now I have a requirement to use queues outside of SQS (mostly for latency reasons, but potentially for cost reasons too). I'm really looking for a queue provider that has the same characteristic. MSMQ would be the obvious choice for me, since it's already installed and we use it elsewhere, but I can't find any functionality that handles failed messages in the same way.
Does MSMQ allow for this, or is there an easy way to replicate it?
Alternatively, is there another lightweight, open-source messaging service that does?
MSMQ does this already. If you read a message within a transaction and the transaction aborts then the message will reappear in the queue.