Push vs Pull for GCP Dataflow - google-cloud-platform

I want to know what type of subscription one should create in GCP Pub/Sub in order to handle high-frequency data from a Pub/Sub topic.
I will be ingesting data in Dataflow at 100+ messages per second.
Does the choice between a pull and a push subscription really matter, and how will it affect throughput?

If you consume a Pub/Sub subscription with Dataflow, only pull subscriptions are available:
either you create one yourself and pass it as a parameter of your Dataflow pipeline,
or you specify only the topic in your Dataflow pipeline and Dataflow creates the pull subscription by itself.
In both cases, Dataflow processes the messages in streaming mode.
The difference
If you create the subscription yourself, all the messages are stored and kept (up to 7 days by default) and are consumed when the Dataflow pipeline starts.
If you let Dataflow create the subscription, only the messages that arrive AFTER the subscription creation are consumed by the Dataflow pipeline. If you can't afford to lose a message, it's not the recommended solution. If you don't care about the old messages, it's a good choice.
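For illustration, here is a minimal Apache Beam (Python) sketch of the two read options; the project, topic, and subscription names are placeholders, not values from the question:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names.
SUBSCRIPTION = "projects/my-project/subscriptions/my-sub"
TOPIC = "projects/my-project/topics/my-topic"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Option 1: pass your own subscription; messages already retained
    # in it (up to the retention period) are consumed when the job starts.
    messages = p | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)

    # Option 2 (instead of the above): pass only the topic; Dataflow
    # creates its own pull subscription, so only messages published
    # after that creation are consumed.
    # messages = p | "Read" >> beam.io.ReadFromPubSub(topic=TOPIC)

    messages | "Print" >> beam.Map(lambda b: print(b.decode("utf-8")))
```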
High frequency
Then, 100 messages per second is absolutely not high frequency. A single Pub/Sub topic can ingest up to 1,000,000 messages per second. Don't worry about that!
Push VS Pull
The model is different.
With a push subscription, you have to specify an HTTP endpoint (on GCP or elsewhere) that consumes the messages. It's a webhook pattern. If the platform behind the endpoint scales automatically with the traffic (Cloud Run or Cloud Functions, for example), the message rate can go very high! The HTTP return code serves as the message acknowledgment.
With a pull subscription, the client opens a connection to the subscription and pulls the messages. The client needs to acknowledge the messages explicitly. Several clients can be connected at the same time. With the client library, the messages are consumed over the gRPC protocol, which is more efficient (in terms of network bandwidth) for receiving and consuming them.
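As a rough sketch of the pull model outside Dataflow, with the google-cloud-pubsub client library (project and subscription IDs are placeholders):

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription IDs.
subscription_path = subscriber.subscription_path("my-project", "my-sub")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received: {message.data!r}")
    message.ack()  # the client acknowledges explicitly

# Streaming pull over gRPC; several such clients can be connected
# to the same subscription at the same time.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # listen for 30 s in this sketch
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```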
Security point of view
With push, it's Pub/Sub that has to authenticate to the HTTP endpoint, if the endpoint requires authentication.
With pull, it's the client that needs to be authenticated on the Pub/Sub subscription.

Related

Pub/Sub messages from snapshot not processed in a Dataflow streaming pipeline

We have a Dataflow pipeline consuming from Pub/Sub and writing into BigQuery in streaming. Due to a permissions issue the pipeline got stuck and the messages were not consumed. We restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem, redeployed the pipeline with a new subscription to the topic, and all the new events are consumed in streaming without a problem.
For all the unacked messages accumulated (20M) in the first subscription, we created a snapshot
This snapshot was then connected to the new subscription via the UI using Replay messages dialog
In the metrics dashboard we see that the unacked messages spike to 20M and then they get consumed
[screenshot: subscription spike]
But then the events are not sent to BigQuery. Checking the Dataflow job metrics, we can see a spike in the duplicate message count within the "read from pubsub" step.
[screenshot: Dataflow duplicate message counter]
The messages are < 3 days old. Does anybody know why this happens? Thanks in advance.
The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9 with Streaming Engine and Runner v2 enabled.
How long does it take to process a Pub/Sub message? Is it a long process?
In that case, Pub/Sub may redeliver messages, according to subscription configuration/delays. See Subscription retry policy.
Dataflow can work around that, as it acknowledges the source after a successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform, it may resolve source duplications; see the sketch below.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance
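A minimal Beam (Python) sketch of that suggestion; the subscription, dataset, and table names are placeholders, and the BigQuery table is assumed to already exist:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub"
        )
        # Reshuffle forces a shuffle/checkpoint; Dataflow acknowledges the
        # source after it succeeds, which can absorb Pub/Sub redeliveries.
        | "Checkpoint" >> beam.Reshuffle()
        | "Decode" >> beam.Map(lambda b: {"payload": b.decode("utf-8")})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```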

google cloud pubsublite client on a serverless service

First of all, I wanted to tag this post with google-cloud-pubsub-lite, but it's not created yet, my apologies.
I'm getting introduced to Pub/Sub Lite. I think it can be used as a "cheap" way to get an event store in a GCP project.
We usually create GAE standard services, so we pay for what we use and at the same time get great scalability.
Reading samples about how to subscribe to Pub/Sub Lite, I observe that there's no option to supply an endpoint to receive new messages. The client connects to a subscription and waits for new messages to be streamed through the connection.
I have a few questions:
Can we receive messages from a Pub/Sub Lite topic in a Cloud Function or in an endpoint of a GAE standard service?
How can we scale to several clients for a topic subscription?
Thanks
Pub/Sub Lite subscriptions support only pull mode. So you need to create one or several clients, plug them into the subscription, and get the messages.
In serverless mode, a push subscription would be more suitable for scalability and integration. With a pull subscription, you need to perform micro-batches:
Create a Cloud Scheduler job
* * * * * as frequency
Call the serverless product that you want (Cloud Run, Cloud Functions, App Engine)
On the serverless product, when you receive a request, open a connection to the Pub/Sub Lite subscription and start to pull the messages.
If the pulling takes more than 1 minute, a new request will be received from Cloud Scheduler:
Cloud Functions will create a new instance automatically and start pulling.
Cloud Run can handle up to 80 requests concurrently. I recommend setting the concurrency parameter to 1 to have exactly the same behavior as Cloud Functions.
You can't play with the concurrency on App Engine.
Set the timeout to the max.
If there is no new message (for example for 500 ms), exit gracefully.
If the service timeout is close (15 s before it, for example), stop pulling and exit gracefully.
Like this, you can have several clients on the same subscription (scale +1 per minute
and per scheduler job, if the previous run is still active).
This workaround keeps the serverless mode. If there are no messages, the pulling stops after 500 ms, or as soon as there are no more new messages. You scale up with your traffic. A sketch of such a micro-batch puller is shown below.
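A rough sketch of such a micro-batch puller with the google-cloud-pubsublite client library; the project number, zone, subscription ID, and time budget are placeholders, and the exact client API should be checked against the current library version:

```python
from concurrent.futures import TimeoutError
from google.cloud.pubsublite.cloudsdk.cloud_pubsub import SubscriberClient
from google.cloud.pubsublite.types import (
    CloudRegion,
    CloudZone,
    FlowControlSettings,
    SubscriptionPath,
)

# Hypothetical project number, zone, and subscription ID.
location = CloudZone(CloudRegion("europe-west1"), "b")
subscription_path = SubscriptionPath(123456789, location, "my-lite-sub")

flow_control = FlowControlSettings(
    messages_outstanding=1000,
    bytes_outstanding=10 * 1024 * 1024,
)

def callback(message):
    print(f"Received: {message.data!r}")
    message.ack()

def pull_microbatch(budget_seconds: int = 45) -> None:
    """Pull for at most `budget_seconds`, then exit gracefully.
    Intended to be called from a Cloud Scheduler-triggered handler."""
    with SubscriberClient() as subscriber:
        future = subscriber.subscribe(
            subscription_path,
            callback=callback,
            per_partition_flow_control_settings=flow_control,
        )
        try:
            future.result(timeout=budget_seconds)
        except TimeoutError:
            # Budget exhausted: stop before the service timeout hits.
            future.cancel()
```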
However, I don't understand your concept of cheap event store.
Pub/Sub Lite is not a pay-as-you-go model but a flat model. You reserve capacity and you pay for it 24/7 even if it is not used.
Pub/Sub Lite is zonal, which is dangerous for HA.
You can keep the events until the partition is full. But wouldn't it be cheaper to store the events elsewhere? BigQuery? Firestore? Cloud SQL?

Specifics of using a push subscription as a load balancer

I am trying to send IoT commands using a push subscription. I have 2 reasons for this. Firstly, my devices are often on unstable connections, so going through Pub/Sub gives me retries and I don't have to wait for the QoS 1 timeout (I still need it because I log it for later use) at the time I send the message. The second reason is that the push subscription can act as a load balancer. To my understanding, if multiple consumers listen to the same push subscription, each will receive a subset of the messages, effectively balancing my workload. This balancing is a behavior I observed on pull subscriptions; my questions are:
Do push subscriptions act the same?
Is it a reliable way to balance a workload?
Am I guaranteed that these commands will be executed at most once if there are, let's say, 15 instances listening to that subscription?
Here's a diagram of what I'm trying to achieve:
The idea here is that I only interact with IoT Core when instances receive a subset of the devices to handle (when the push subscription triggers). Also note that I don't need perfect 1-instance-per-1-device balancing. I just need the workload to be split in a roughly equal manner.
EDIT: The question wasn't clear so I rewrote it.
I think you are a bit confused about the concepts behind Pub/Sub. In general, you publish messages to a topic for one or multiple subscribers. I like to compare Pub/Sub with a magazine published by a big publishing company. People who like the magazine can get a copy of it by means of a subscription. When a new edition of the magazine comes out, a copy is sent to every subscriber, with exactly the same content for all of them.
For Pub/Sub you can create multiple push subscriptions for a topic, up to a maximum of 10,000 subscriptions per topic (also per project). You can read more about those quotas in the documentation. Those push subscriptions can point to different endpoints, in your case representing your IoT devices. Referring back to the publishing company example, those push endpoints can be seen as the addresses of the subscribers.
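For reference, a push subscription and its HTTP endpoint can be created with the google-cloud-pubsub client roughly like this; the project, topic, subscription, and endpoint URL are placeholders:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Hypothetical resource names and endpoint URL.
topic_path = publisher.topic_path("my-project", "commands")
subscription_path = subscriber.subscription_path("my-project", "commands-push")

push_config = pubsub_v1.types.PushConfig(
    push_endpoint="https://my-worker-service.example.com/pubsub/push"
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "push_config": push_config,
        }
    )
    print(f"Created push subscription: {subscription.name}")
```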
Here is an example IoT Core architecture, which focuses on processing data from your devices into a store. The other way around can also work: send a message (including the device/registry ID) from your front end to a Cloud Function wrapped in API Gateway. This Cloud Function then publishes the message to a topic, which pushes the message to another Cloud Function that sends it to the device using the MQTT protocol. I worked out both flows for you; they are loosely coupled, so that if anything goes wrong with your device or processing, the data is not lost.
Device to storage:
Device
IoT Core
Pub/Sub
Cloud Function / Dataflow
Storage (BigQuery etc.)
Front-end to device:
Front-end (click a button)
API Gateway / Cloud Endpoints
Cloud Function (send command to pub/sub)
Pub/Sub
Cloud Function (send command to device with MQTT; see the sketch after this list)
Device (execute the command)
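The last Cloud Function in that flow could call Cloud IoT Core's send-command API roughly as sketched here (the project, region, registry, and device identifiers are placeholders); IoT Core then delivers the command to the device over its MQTT connection:

```python
from google.cloud import iot_v1

def send_command(payload: bytes) -> None:
    """Send a command to a single device through Cloud IoT Core."""
    client = iot_v1.DeviceManagerClient()

    # Hypothetical project/region/registry/device identifiers.
    device_path = client.device_path(
        "my-project", "europe-west1", "my-registry", "my-device"
    )

    # IoT Core forwards this to the device's MQTT commands topic.
    client.send_command_to_device(
        request={"name": device_path, "binary_data": payload}
    )

# Example usage, e.g. from a Pub/Sub-triggered Cloud Function:
# send_command(b'{"action": "reboot"}')
```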

Google Cloud IoT Core and Pubsub Pricing?

I am using Google IoT Core and Pub/Sub services for my IoT devices. I am publishing data via Pub/Sub to the database, but I think it's quite expensive to store every piece of data in the database. I have some data, like whether the device is on or off, and a configuration file with some parameters I need to process my IoT payload. I am not able to understand whether the configuration and state topics in IoT Core are expensive or not, how long the data is stored in the config topic, and whether it is feasible to publish to the config topic whenever a parameter is changed in the config file. Also, what if I publish the state of a device (whether it is online or not) every 3 seconds or more into the state topic?
You are mixing different things. There is Cloud IoT, where you have a device registry with metadata, configuration, and states. You also have a Pub/Sub topic in which you can publish messages with the IoT payload, which can contain configuration data (I assume that is what you mean by "it publish that data into config topic").
In the end, it's simple.
All the management operations on Cloud IoT are free (device registration, configuration, metadata, ...). There is no limitation and no duration limit. The only ones that exist are the quotas for rate limits and configuration size.
The inbound and outbound traffic to and from the IoT devices is billed as described here.
If you use Pub/Sub for pushing your messages, Cloud Functions (or Cloud Run, or another compute option), or a database (Cloud SQL or Datastore/Firestore), all these services are billed as usual; there is no relation with the Cloud IoT service and its billing. The constraints of each service apply as in regular usage. For example, a Pub/Sub message lives up to 7 days (by default) in a subscription, until it has been acknowledged.
EDIT
OK, got it, it took me some time to understand what you want to achieve.
The state is designed for getting the internal representation of the devices, but the current limitations don't allow you to update it automatically when you receive a message.
You have 2 solutions:
Either update your devices so that they send a state update message only when their state changes (it's for this kind of use case that the feature is designed!),
Or let the device publish the messages every 3 seconds, but to the event Pub/Sub topic. Receive the events in a function that gets the state list, takes the first one (the most recent), and compares its value with the Pub/Sub message. If they differ, update the stored state. This workflow also works with an external database like Datastore or Firestore. A sketch is shown below.
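A hedged sketch of that second option as a Pub/Sub-triggered Cloud Function, using the google-cloud-iot client to read the most recent reported state; the identifiers are placeholders, and where you persist the result (Firestore, Datastore, ...) is up to you:

```python
import base64
from google.cloud import iot_v1

client = iot_v1.DeviceManagerClient()

# Hypothetical project/region/registry/device identifiers.
DEVICE_PATH = client.device_path(
    "my-project", "europe-west1", "my-registry", "my-device"
)

def on_event(event, context):
    """Background Cloud Function triggered by the event Pub/Sub topic."""
    payload = base64.b64decode(event["data"])

    # Device states are returned most recent first; take the first one.
    states = client.list_device_states(request={"name": DEVICE_PATH})
    latest = next(iter(states.device_states), None)

    if latest is None or latest.binary_data != payload:
        # The value changed: persist it (e.g. in Firestore/Datastore)
        # or run whatever update logic you need.
        print(f"State changed: {payload!r}")
```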

Does Google Cloud (GCP) Pub/Sub supports feature similar to ConsumerGroups as in Kafka

Trying to decide between Google Cloud (GCP) Pub/Sub and a managed Kafka service.
In the latest update, Pub/Sub added support for replaying messages that were processed before, which is a welcome change.
One feature I am not able to find in their documentation is whether we can have something similar to Kafka's consumer groups, i.e. have groups of subscribers each processing data from the same topic, and be able to re-process the data from the beginning for one subscriber (consumer group) while others are not affected by it.
eg:
Let's say you have a topic StockTicks
And you have two consumer groups
CG1: with two consumers
CG2: With another two consumers
In Kafka I can read messages independently between these groups, but can I do the same thing with Pub/Sub?
Kafka also allows you to replay messages from the beginning. Can I do the same with Pub/Sub? I am OK if I can't replay the messages that were published before the CG was created, but can I replay the messages that were published after a CG/subscriber was created?
Cloud Pub/Sub's equivalent of a Kafka consumer group is a subscription. Subscribers are the equivalent of consumers. This answer spells out the relationship between subscriptions and subscribers in a little more detail.
Your example in Cloud Pub/Sub terms would have a single topic, StockTicks, with two subscriptions (call them CG1 and CG2). You would bring up four subscribers, two that fetch messages for the subscription CG1 and two that fetch messages for the CG2 subscription. Acking and replay would be independent on CG1 and CG2, so if you were to seek back on CG1, it would not affect the delivery of messages to subscribers for CG2 at all.
Keep in mind with Cloud Pub/Sub that only messages published after a subscription is successfully created will be delivered to subscribers on that subscription. Therefore, if you create a new subscription, you won't get all of the messages published since the beginning of time; you will only get messages published from that point on.
If you seek back on a subscription, you can only get up to 7 days of messages to replay (assuming the subscription was created at least 7 days ago) since that is the max retention time for messages in Cloud Pub/Sub.
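As a small illustration of the mapping, assuming the google-cloud-pubsub client library (the project name is a placeholder; the topic name comes from the question): two subscriptions on StockTicks behave like two consumer groups, and seeking one back does not affect the other.

```python
import datetime
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Hypothetical project; the topic is assumed to already exist.
topic_path = publisher.topic_path("my-project", "StockTicks")

with subscriber:
    # One subscription per "consumer group"; retain acked messages so
    # that seeking back in time can redeliver them.
    for group in ("CG1", "CG2"):
        subscriber.create_subscription(
            request={
                "name": subscriber.subscription_path("my-project", group),
                "topic": topic_path,
                "retain_acked_messages": True,
                "message_retention_duration": datetime.timedelta(days=7),
            }
        )

    # Replay the last 24 hours for CG1 only; CG2 is unaffected.
    seek_time = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=24)
    subscriber.seek(
        request={
            "subscription": subscriber.subscription_path("my-project", "CG1"),
            "time": seek_time,
        }
    )
```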