First of all, I wanted to tag this post to google-cloud-pubsub-lite, but it's not created yet, my apologizes
I'm trying to get introduced with pubsub lite. I think it can be used as a "cheap" way to get an event store in a GCP project.
We usually create GAE standard services so we pay for what we used and at the same time it offers a great scalability.
Reading samples about how to currently subscribe to pubsub lite I observe that there's no option to supply an endpoint to receive new messages. The client connects to a subscription and stays awaiting for new messages to be streamed throw the connection.
I'm wondering a few qustions:
Can we receive messages from a pubsub lite topic in a Cloud Function or in an endpoint of a GAE standard service?
How can we scale to several clients for a topic subscription
Thanks
PubSub lite subscription supports only the Pull mode. So, you need to create one or several clients, to plug them to the subscription and to get the messages.
In serverless mode, you should use the Push subscription more suitable for scalability and integration. In the pull subscription mode, you need to perform microbatches
Create a Cloud Scheduler
* * * * * as frequency
Call the serverless tool that you want (Cloud Run, Cloud Function, App Engine)
On the serverless product, when you receive a request, create a connection to the PubSub lite subscription and start to pull the messages.
If the pulling takes more than 1 minutes a new request will be received from Cloud Scheduler
Cloud Function will create a new instance automatically and start the pulling
Cloud Run can handle up to 80 requests concurrently. I recommend you to set the Concurrency paramater to 1 to have the exact same behavior as Cloud Function
You can't play with the concurrency on App Engine
Set the timeout to the max
If there is no new message (for example during 500ms) exit gracefully.
If the service timeout is close (15s before for example), stop the pulling and exit gracefully.
Like this, you could have several client to the same subscription (scale + 1 per minutes
and per scheduler, if the previous run is still active)
This workaround keep the serverless mode. If there is no messages, the pulling stopped after 500ms, or when there is no new messages. You scale up with your traffic.
However, I don't understand your concept of cheap event store.
PubSub lite is not a pay as you go model, but a flat model. You reserve capacity and you pay for it 24/7 even if it is not used
PubSub lite is zonal, and dangerous for HA
You can keep the event up to the partition is full. But will not be cheaper to store the event elsewhere? BigQuery? Firestore? Cloud SQL?
Related
I have a Google Cloud Storage Trigger set up on a Cloud Function with max instances of 5, to fire on the google.storage.object.finalize event of a Cloud Storage Bucket. The docs state that these events are "based on" the Cloud Pub/Sub.
Does anyone know:
Is there any way to see configuration of the topic or subscription in the console, or through the CLI?
Is there any way to get the queue depth (or equivalent?)
Is there any way to clear events?
No, No and No. When you plug Cloud Functions to Cloud Storage event, all the stuff are handle behind the scene by Google and you see nothing and you can't interact with anything.
However, you can change the notification mechanism. Instead of plugin directly your Cloud Functions on Cloud Storage Event, plug a PubSub on your Cloud Storage event.
From there, you have access to YOUR pubsub. Monitor the queue, purge it, create the subscription that you want,...
The recomended way to work with storage notifications is using Pubsub.
Legacy storage notifications still work, but with pubsub you can "peek" into the pubsub message queue and clear it if you need it.
Also, you can process pubsub events with cloud run - which is easier to develop and test (just web service), easier to deploy (just a container) and it can process several requests in parallel without having to pay more (great when you have a lot of requests together).
Where does pubsub storage notifications go?
You can see where gcloud notifications go with the gsutil command:
% gsutil notification list gs://__bucket_name__
projects/_/buckets/__bucket_name__/notificationConfigs/1
Cloud Pub/Sub topic: projects/__project_name__/topics/__topic_name__
Filters:
Event Types: OBJECT_FINALIZE
Is there any way to get the queue depth (or equivalent?)
In pubsub you can have many subsciptions to topics.
If there is no subsciption, messages get lost.
To send data to a cloud function or cloud run you setup a push subscription.
In my experience, you won't be able to see what happened because it faster that you can click: you'll find this empty 99.9999% of the time.
You can check the "queue" depht in the console (pubsub -> choose you topics -> choose the subscription).
If you need to troubleshoot this, set up a second subscription with a time to live low enough that it does not use a lot of space (you'll be billed for it).
Is there any way to clear events?
You can empty the messages from the pubsub subscription, but...
... if you're using a push notification agains a cloud function it will much faster than you can "click".
If you need it, it is on the web console (opent the pubsub subscription and click in the vertical "..." on the top right).
I am trying to send IoT commands using a push subscription. I have 2 reasons for this. Firstly, my devices are often on unstable connections so going through the pubsub let me have retries and I don't have to wait the QoS 1 timeout (I still need it because I log it for later use) at the time I send the message. The second reason is the push subscription can act as a load balancer. To my understanding, if multiple consumers listen to the same push subscription, each will receive a subset of the messages, effectively balancing my workload. Now my question is, this balancing is a behavior I observed on pull subscriptions, I want to know if:
Do push subscription act the same ?
Is it a reliable way to balance a workload ?
Am I garanteed that these commands will be executed at most once if there is, lets say, 15 instances listening to that subscription ?
Here's a diagram of what I'm trying to acheive:
Idea here is that I only interact with IoT Core when instances receive a subset of the devices to handle (when the push subscription triggers). Also to note that I don't need this perfect 1 instance for 1 device balancing. I just need the workload to be splitted in a semi equal manner.
EDIT: The question wasn't clear so I rewrote it.
I think you are a bit confused about the concepts behind Pub/Sub. In general, you publish messages to a topic for one or multiple subscribers. I prefer to compare Pub/Sub with a magazine that is being published by a big publishing company. People who like the magazine can get a copy of that magazine by means of a subscription. Then when a new edition of that magazine arrives, a copy is being sent to the magazine subscribers, having exactly the same content among all subscribers.
For Pub/Sub you can create multiple push subscriptions for a topic, up to the maximum of 10,000 subscriptions per topic (also per project). You can read more about those quotas in the documentation. Those push subscriptions can contain different endpoints, in your case, representing your IoT devices. Referring back to the publishing company example, those push endpoints can be seen as the addresses of the subscribers.
Here is an example IoT Core architecture, which focuses on the processing of data from your devices to a store. The other way around could also work. Sending a message (including device/registry ID) from your front-end to a Cloud Function wrapped in API gateway. This Cloud Function then publishes the message to a topic, which sends the message to a cloud Function that posts the message using the MQTT protocol. I worked out both flows for you that are loosely coupled so that if anything goes wrong with your device or processing, the data is not lost.
Device to storage:
Device
IoT Core
Pub/Sub
Cloud Function / Dataflow
Storage (BigQuery etc.)
Front-end to device:
Front-end (click a button)
API Gateway / Cloud Endpoints
Cloud Function (send command to pub/sub)
Pub/Sub
Cloud Function (send command to device with MQTT)
Device (execute the command)
we have a microservices architecture developed on Google cloud.
Actually the microservices are all running on cloud run and talk each other with rest (sync) or with pub/sub (async).
It is an event-driven pattern so when a service publish something happened (like "user_created") on the right pub/sub topic, many services receive that event with a push subscription on their http endpoint.
Now we are moving to kafka for message ordering and replaying features.
Unfortunately kafka consumers are pull based so we need to change the way services are receiving events.
Since cloud run is a serverless solutions that scale to zero, we cannot make it listen to kafka topic, because the service could shut down during the night because no request arrive.
We have different services which can safely be updated with a scheduled cron, so every one hour as example, we make a get request to service, which download all new kafka messages and update itself accordingly.
But many other services need a near real time update to accomplish their role.
So which product of google cloud platform is best suited to consume kafka topic in this architecture?
Thanks!
I want to know what type of subscription one should create in GCP pubsub in order to handle high-frequency data from pubsub topic.
I will be ingesting data in dataflow with 100 plus messages per second.
Will pull or push subscription really matters and how it will gonna affect the speed and all.
If you consume the PubSub subscription with Dataflow, only Pull subscription is available
either you create one and you give it in the parameter of your dataflow pipeline
or you specify only the topic in your dataflow pipeline and Dataflow will create by itself the pull subscription.
If both case, Dataflow will process the messages in streaming mode
The difference
If you create the subscription by yourselves, all the messages will be stored and kept (up to 7 days by default) and will be consumed when the dataflow pipeline will be started.
If you let Dataflow to create the subscription, only the message that arrives AFTER the subscription creation will be consumed by the dataflow pipeline. If you want to not loose a message, it's not the recommended solution. If you don't care about the old message, it's a good choice.
High frequency
Then, 100 messages per second is absolutely not high frequency. 1 pubsub topic can ingest up to 1 000 000 of messages per second. Don't worry about that!
Push VS Pull
The model is different.
With the push subscription, you have to specify an HTTP endpoint (on GCP or elsewhere) that consumes the message. It's a webhook pattern. If the platform endpoint scale automatically with the traffic (Cloud Run, Cloud Functions for example), the message rate can go very high!! And the HTTP return code stands for message acknowledgment.
With Pull subscription, the client needs to open a connection to the subscription and then pull the message. The client needs to explicitly acknowledge the messages. Several clients can be connected at the same time. With the client library, the message is consumed with gRPC protocol and it's more efficient (in terms of network bandwidth) to receive and consume the message
Security point of view
With push, it's the PubSub to be authenticated on the HTTP endpoint, if the endpoint required authentication
With pull, it's the client that needs to be authenticated on the PubSub subscription.
I am using google IoT core and pubsub services for my IoT devices. I am publishing data using pubsub to the database. but I think its quite expensive to store every data into the database. I have some data like if the device is on or off and a configuration file which has some parameter which I need to process my IoT payload. Now I am not able to understand if configuration and state topic in IoT is expensive or not? and how long the data is stored in the config topic and is it feasible that whenever the parameter is changed in the config file it publish that data into config topic? and what if I publish my state of a device that if it is online or not every 3 seconds or more into the state topic?
You are mixing different things. There is Cloud IoT, where you have a device registry, with metadata, configuration and states. You also have PubSub topic in which you can publish message about IoT payload that can contain configuration data (I assume that is that you means in this sentence: "it publish that data into config topic").
In definitive it's simple.
All the management operations on Cloud IoT are free (device registration, configuration, metadata,...). There is no limitation and no duration limit. The only one which exists in the quotas for rate limit and configuration size.
The inbound and outbound traffic from and to the IoT devices is billed as described here
If you use PubSub for pushing your messages, Cloud Functions (or Cloud Run, or other compute option), a database (Cloud SQL or Datastore/Firestore), all these services are billed as usual, there is no relation with Cloud IoT service & billing. The constraints of each services are applied as a regular usage. For example, a PubSub message live up to 7 days (by default) in a subscription and until it hasn't acknowledged.
EDIT
Ok, got it, I took time to understood what you wanted to achieve.
The state is designed for getting the internal representation of the devices, but the current limitation doesn't allow you to update it automatically when you received message.
You have 2 solutions:
Either you can update your devices and send an update message only when its state changes (it's for this kind of use case that the feature is designed!)
Or, let the device published the messages every 3 seconds, but in the event PubSub topic. Get the events in a function which get the state list, get the first one (the most recent) and compare the value with the PubSub message. If different, update the state. This workflow also work with external database like Datastore or Firestore.