I have seen examples of using log sinks to send Stackdriver logs to Pub/Sub, but I want to do the opposite.
Is it possible to configure Stackdriver to subscribe to a topic and simply write the Pub/Sub messages to Stackdriver as log entries? Or is there some way to have every event sent to a topic end up in Stackdriver logs?
This would be instead of having to write a custom application that reads the messages and writes them as logs.
No, you can't sink PubSub messages into Cloud Logging. You have to create a small custom app (a Cloud Function (use the 2nd gen runtime if you have a lot of messages, to reduce cost) or a Cloud Run service) that gets the messages, logs them, and acks them. It's less than 10 lines of code, but you have to write it.
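A minimal sketch of such a function in Python (2nd gen, CloudEvent style), assuming the functions-framework runtime; the function name is a placeholder:

import base64
import logging

import functions_framework

@functions_framework.cloud_event
def log_pubsub_message(cloud_event):
    # The Pub/Sub payload arrives base64-encoded inside the CloudEvent.
    data = cloud_event.data["message"].get("data", "")
    payload = base64.b64decode(data).decode("utf-8") if data else ""
    # Anything logged here ends up in Cloud Logging.
    logging.info("Pub/Sub message: %s", payload)
    # Returning normally acks the message; raising an exception makes it retried (if retry is enabled).

Deploy it with a Pub/Sub trigger on the topic and every message will show up as a log entry in Cloud Logging.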
We have a Dataflow pipeline consuming from Pub/Sub and writing into BigQuery in streaming mode. Due to a permissions issue the pipeline got stuck and the messages were not consumed. We restarted the pipeline, saved the unacked messages in a snapshot, and replayed the messages, but they are discarded.
We fixed the problem and re-deployed the pipeline with a new subscription to the topic, and all new events are consumed in streaming without a problem.
For all the unacked messages accumulated in the first subscription (20M), we created a snapshot.
This snapshot was then connected to the new subscription via the UI using the Replay messages dialog.
In the metrics dashboard we see the unacked messages spike to 20M and then get consumed.
(screenshot: subscription unacked-messages spike)
But the events are not sent to BigQuery. Checking the Dataflow job metrics, we can see a spike in the Duplicate message count within the "read from Pub/Sub" step.
(screenshot: Dataflow duplicate message counter)
The messages are < 3 days old. Does anybody know why this happens? Thanks in advance.
The pipeline is using Apache Beam SDK 2.39.0 and Python 3.9, with Streaming Engine and Runner v2 enabled.
How long does it take to process a Pub/Sub message? Is it a long-running process?
In that case, Pub/Sub may redeliver messages, according to the subscription configuration/acknowledgment delays. See Subscription retry policy.
Dataflow can work around that, because it acknowledges the source only after a successful shuffle. If you add a GroupByKey (or, artificially, a Reshuffle) transform, it may resolve the source duplications; see the sketch below.
More information at https://beam.apache.org/contribute/ptransform-style-guide/#performance
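A sketch of what that could look like in the pipeline (project, subscription, and step names are placeholders; the BigQuery write stays as you already have it):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    messages = (p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
              subscription="projects/my-project/subscriptions/my-sub")
        # Reshuffle forces a shuffle; Dataflow acknowledges the Pub/Sub source
        # after the shuffle, and redelivered duplicates are dropped.
        | "Reshuffle" >> beam.Reshuffle())
    # ... the rest of the pipeline (transforms + BigQuery write) stays unchanged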
We have a Cloud Function with a Pub/Sub trigger that invokes an application on an HTTP endpoint based on the message. When we need to update the backend application, we want the Cloud Function to pause and queue up the messages, then start again when the application is back up.
Currently we log all the failed messages in Stackdriver and resubmit them after the release. Is there a better way to do this?
Your system must be resilient to outages (whether managed by you when you perform an update, or unexpected). Therefore, I recommend you handle both in the same way.
Set the retry parameter when you deploy your Cloud Function (a background function bound to a Pub/Sub topic). When your Cloud Function can't process the message (because the application isn't available), raise an error. The message will be retried later.
That way, you have nothing to worry about: when the message can be delivered, it is; when it can't, an error is raised and the message is retried.
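For example, a sketch of a background function bound to the topic and deployed with retry enabled (the backend URL and names are placeholders):

import base64
import requests

BACKEND_URL = "https://backend.example.com/handle"  # placeholder

def handle_message(event, context):
    payload = base64.b64decode(event["data"]).decode("utf-8")
    # If the backend is down or returns an error status, raise so the invocation
    # fails and Pub/Sub redelivers the message later.
    response = requests.post(BACKEND_URL, data=payload, timeout=10)
    response.raise_for_status()

Deployed with something like gcloud functions deploy handle_message --runtime=python39 --trigger-topic=my-topic --retry (names are placeholders).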
I have a Google Cloud Storage trigger set up on a Cloud Function with max instances of 5, firing on the google.storage.object.finalize event of a Cloud Storage bucket. The docs state that these events are "based on" Cloud Pub/Sub.
Does anyone know:
Is there any way to see the configuration of the topic or subscription in the console, or through the CLI?
Is there any way to get the queue depth (or equivalent?)
Is there any way to clear events?
No, no, and no. When you plug a Cloud Function into a Cloud Storage event, everything is handled behind the scenes by Google: you see nothing and you can't interact with anything.
However, you can change the notification mechanism. Instead of plugging your Cloud Function directly into the Cloud Storage event, plug a Pub/Sub topic into your Cloud Storage event.
From there, you have access to YOUR Pub/Sub topic: monitor the queue, purge it, create the subscriptions that you want, and so on.
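For example (bucket, topic, and subscription names are placeholders):

gsutil notification create -t my-topic -f json -e OBJECT_FINALIZE gs://my-bucket
gcloud pubsub subscriptions create my-sub --topic=my-topic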
The recommended way to work with Storage notifications is using Pub/Sub.
Legacy Storage notifications still work, but with Pub/Sub you can "peek" into the message queue and clear it if you need to.
Also, you can process Pub/Sub events with Cloud Run, which is easier to develop and test (it's just a web service), easier to deploy (just a container), and can process several requests in parallel without having to pay more (great when you have a lot of requests arriving together).
Where do the Pub/Sub storage notifications go?
You can see where a bucket's notifications go with the gsutil command:
% gsutil notification list gs://__bucket_name__
projects/_/buckets/__bucket_name__/notificationConfigs/1
Cloud Pub/Sub topic: projects/__project_name__/topics/__topic_name__
Filters:
Event Types: OBJECT_FINALIZE
Is there any way to get the queue depth (or equivalent?)
In Pub/Sub you can have many subscriptions per topic.
If there is no subscription, messages get lost.
To send data to a Cloud Function or Cloud Run service you set up a push subscription.
In my experience, you won't be able to see what happened because it's faster than you can click: you'll find the queue empty 99.9999% of the time.
You can check the "queue" depth in the console (Pub/Sub -> choose your topic -> choose the subscription).
If you need to troubleshoot this, set up a second subscription with a time to live low enough that it does not use a lot of space (you'll be billed for it).
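For example (names are placeholders; the minimum message retention is 10 minutes):

gcloud pubsub subscriptions create debug-sub \
  --topic=my-topic \
  --message-retention-duration=10m \
  --expiration-period=1d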
Is there any way to clear events?
You can empty the messages from the Pub/Sub subscription, but...
... if you're using a push subscription against a Cloud Function, messages will be consumed much faster than you can "click".
If you need it, it's in the web console (open the Pub/Sub subscription and click the vertical "..." at the top right).
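The same purge can be done from the CLI by seeking the subscription to the current time, which acknowledges every message published before it (subscription name is a placeholder):

gcloud pubsub subscriptions seek my-sub --time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"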
I'd like to get notifications for all standard error logs sent to Google Cloud Logging. Preferably, I'd like to get the notifications through Google Cloud Error Reporting, so I can easily get notifications on my phone through the GCP mobile app.
I've deployed applications to Google Kubernetes Engine that are writing logs to standard error, and GKE is nicely forwarding all the stderr logs to Google Cloud Logging with logName: "projects/projectName/logs/stderr"
I see the logs show up in Google Cloud Logging, but Error Reporting does not pick up on them.
I've tried troubleshooting as described here: https://cloud.google.com/error-reporting/docs/troubleshooting. But the proposed solutions revolve around formatting the logs in a certain way. What if I've deployed applications for which I can't control the log messages?
A (totally ridiculous) option could be to create a "logs-based metric" based on any log sent to stderr, then get notified whenever that metric exceeds 1.
What's the recommended way to get notified for stderr logs?
If Error Reporting is not recognizing the stderr logs from your container, it means they are not in the format this API needs in order to detect them.
Take a look at this guide on how to set up Error Reporting for the GKE API.
There are other ways to do this with third-party products like gSlack, where basically you export the Stackdriver logs to Pub/Sub and then send them into a Slack channel with a Cloud Function.
You can also try to do it by using Cloud Build and integrating it with the GKE container logs.
Still, I think the best and easiest option is a Monitoring alert.
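For the monitoring route, a sketch of the logs-based metric you would alert on (the project and metric names are placeholders):

gcloud logging metrics create stderr-entries \
  --description="Count of container stderr log entries" \
  --log-filter='logName="projects/my-project/logs/stderr"'

You can then create a Cloud Monitoring alerting policy on that metric to get notified.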
You can force the error by setting the @type in the context as shown in the docs. For some reason, even though this is the Google library, and it has code to detect that an exception was thrown, with its stack trace, it won't recognize it as an error worth reporting.
I also added the service in serviceContext to be able to identify my service in Error Reporting.
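A sketch of emitting such an entry as a structured JSON log line (GKE parses JSON written to stdout/stderr as structured logs); the service name is a placeholder:

import json
import traceback

def report_error(exc, service="my-service"):
    entry = {
        # This @type forces Error Reporting to pick the entry up.
        "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
        "serviceContext": {"service": service},
        "severity": "ERROR",
        "message": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    }
    print(json.dumps(entry))

try:
    1 / 0
except Exception as e:
    report_error(e)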
I want to know what type of subscription one should create in GCP Pub/Sub in order to handle high-frequency data from a Pub/Sub topic.
I will be ingesting data in Dataflow at 100-plus messages per second.
Does the choice of pull or push subscription really matter, and how does it affect the speed?
If you consume the Pub/Sub subscription with Dataflow, only pull subscriptions are available:
either you create one yourself and pass it as a parameter of your Dataflow pipeline,
or you specify only the topic in your Dataflow pipeline and Dataflow creates the pull subscription by itself.
In both cases, Dataflow will process the messages in streaming mode.
The difference
If you create the subscription yourself, all the messages will be stored and kept (up to 7 days by default) and will be consumed when the Dataflow pipeline is started.
If you let Dataflow create the subscription, only the messages that arrive AFTER the subscription creation are consumed by the pipeline. If you can't afford to lose a message, this is not the recommended solution; if you don't care about the old messages, it's a good choice. Both options are sketched below.
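A sketch of the two ways to read in a Beam pipeline (names are placeholders, and pipeline is assumed to be a streaming Pipeline object):

import apache_beam as beam

# Option 1: use a subscription you created yourself; messages published
# before the pipeline starts are retained and will be consumed.
messages = pipeline | beam.io.ReadFromPubSub(
    subscription="projects/my-project/subscriptions/my-sub")

# Option 2: give only the topic; Dataflow creates its own subscription,
# so only messages published after the pipeline starts are read.
messages = pipeline | beam.io.ReadFromPubSub(
    topic="projects/my-project/topics/my-topic")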
High frequency
100 messages per second is absolutely not high frequency: a single Pub/Sub topic can ingest up to 1,000,000 messages per second. Don't worry about that!
Push VS Pull
The model is different.
With a push subscription, you have to specify an HTTP endpoint (on GCP or elsewhere) that consumes the messages. It's a webhook pattern. If the platform behind the endpoint scales automatically with the traffic (Cloud Run or Cloud Functions, for example), the message rate can go very high! The HTTP return code serves as the message acknowledgment.
With a pull subscription, the client opens a connection to the subscription and pulls the messages. The client needs to acknowledge the messages explicitly. Several clients can be connected at the same time. With the client library, the messages are consumed over gRPC, which is more efficient (in terms of network bandwidth) for receiving and consuming messages; see the sketch below.
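A minimal pull-subscriber sketch with the Python client library (project and subscription names are placeholders):

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

def callback(message):
    print(f"Received: {message.data}")
    message.ack()  # explicit acknowledgment

future = subscriber.subscribe(subscription_path, callback=callback)
try:
    future.result(timeout=60)  # pull for 60 seconds, then stop
except TimeoutError:
    future.cancel()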
Security point of view
With push, it's Pub/Sub that has to authenticate to the HTTP endpoint, if the endpoint requires authentication (see the example after this list).
With pull, it's the client that needs to be authenticated on the Pub/Sub subscription.
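For the push case, a sketch of creating an authenticated push subscription towards a Cloud Run service (the URL, service account, and names are placeholders):

gcloud pubsub subscriptions create my-push-sub \
  --topic=my-topic \
  --push-endpoint=https://my-service-xxxxx-uc.a.run.app/ \
  --push-auth-service-account=pubsub-invoker@my-project.iam.gserviceaccount.com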