Google Cloud PubSub message republishing across GCP projects - google-cloud-platform

Context
I am working on a project where we are getting realtime data on a PubSub topic in a particular GCP project STAGE-1.
We have other GCP projects (which we treat as lower-level environments), such as DEV-1, QA-1, etc., where we want these messages to be re-published, since the realtime data only hydrates the topic under the STAGE-1 GCP project.
Question
Is there a way to configure message republishing (a bridge) to other Pub/Sub topics across GCP projects?
What could be a possible approach to follow if this sort of setup is not natively supported by Cloud PubSub?
P.S. I am very new to PubSub.
Thanks in advance. Cheers :)

Here are some related questions about this =>
A way for cloud functions to trigger a pub/sub topic from a different project
Cloud Function triggered by pubsub message from an external project topic
There is at least a possible workaround.
You will need to create additional subscription(s) to the original topic in the source project. That subscription is to be used by some 'active' component (in any project, subject to IAM permissions to access the given subscription).
The 'active' component can be a Cloud Function, a Cloud Run service, a Dataflow job, an App Engine service, or something running on a Compute Engine VM or a Kubernetes cluster...
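For illustration, a minimal sketch of such an 'active' component as a plain Python worker using the google-cloud-pubsub client library could look like the following. The project, subscription and topic names are placeholders (not taken from the question), and the identity running it is assumed to have Subscriber rights on the source subscription and Publisher rights on the target topic:

```python
from google.cloud import pubsub_v1

# Placeholder resource names - adjust to your own projects.
SOURCE_SUBSCRIPTION = "projects/stage-1-project/subscriptions/bridge-sub"
TARGET_TOPIC = "projects/dev-1-project/topics/mirrored-topic"

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Re-publish the raw payload and attributes to the target project's topic,
    # and ack the source message only once the publish has succeeded.
    publisher.publish(TARGET_TOPIC, message.data, **message.attributes).result()
    message.ack()

streaming_pull = subscriber.subscribe(SOURCE_SUBSCRIPTION, callback=callback)
try:
    streaming_pull.result()  # block forever; suitable for a long-running worker
except KeyboardInterrupt:
    streaming_pull.cancel()
```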
From my point of view, one of the simplest solutions (though maybe not the cheapest, depending on your context) is to use a streaming Dataflow job, which reads from a source subscription and pushes messages into one or many target topics - a kind of 'bridge'.
If the flow of messages (number of messages per unit of time) is significant, or you need to serve many (dozens or hundreds of) source subscriptions, it can be quite a cost-effective solution (from my point of view).
A potential side bonus, in case you develop a bespoke template for the Dataflow job, is that you can implement additional message-handling logic inside the job.
If you need something 'yesterday', no additional transformation is required, and there is only one source subscription and one target topic, then there is a Google-provided template, Pub/Sub to Pub/Sub, which can be used 'immediately'.
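As an illustration of what a bespoke 'bridge' job could look like (roughly what the Pub/Sub to Pub/Sub template does, plus a place for your own handling logic), here is a rough Apache Beam (Python SDK) sketch. The project, region, bucket, subscription and topic names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder pipeline options - adjust project, region and temp bucket.
options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="stage-1-project",
    region="europe-west1",
    temp_location="gs://some-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    messages = p | "ReadSource" >> beam.io.ReadFromPubSub(
        subscription="projects/stage-1-project/subscriptions/bridge-sub")
    # Optional spot for bespoke message-handling logic (filtering, enrichment, ...).
    _ = messages | "WriteTarget" >> beam.io.WriteToPubSub(
        topic="projects/dev-1-project/topics/mirrored-topic")
```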

Related

Pros and Cons of Google Dataflow VS Cloud Run while pulling data from HTTP endpoint

This is a design-approach question where we are trying to pick the best option between Apache Beam / Google Dataflow and Cloud Run to pull data from HTTP endpoints (source) and push it downstream to Google BigQuery (sink).
Traditionally we have implemented similar functionality using Google Dataflow, where the sources are files in a Google Storage bucket or messages in Google Pub/Sub, etc. In those cases, the data arrived in a 'push' fashion, so it made much more sense to use a streaming Dataflow job.
However, in the new requirement, since the data is fetched periodically from an HTTP endpoint, it sounds reasonable to use a Cloud Run service spinning up on a schedule.
So I want to gather pros and cons of going with either of these approaches, so that we can make a sensible design for this.
I am not sure this question is appropriate for SO, as it opens a big discussion with different opinions, without clear context, scope, functional and non-functional requirements, time and finance restrictions (including CAPEX/OPEX), who is going to support the solution in BAU after commissioning and how, etc.
In my personal experience, I have developed a few dozen similar pipelines using various combinations of cloud functions, Pub/Sub topics, Cloud Storage, Firestore (for pipeline process state management) and so on. Sometimes with Dataflow as well (embedded into the pipelines), but I have never used Cloud Run. So my knowledge and experience may not be relevant in your case.
The only thing I might suggest: try to prioritize your requirements (in a whole solution lifecycle context) and then design the solution based on those priorities. I know it is a trivial idea, sorry to disappoint you.

GCP Best way to manage multiple cloud function flow

I'm studying GCP, and while reading about the different ways to communicate with and manage Cloud Functions, I end up wondering when to use each of the services that GCP offers.
So, I have been reading about GCP Composer, GCP Workflows and Cloud Pub/Sub, and I don't see clearly when to use each one, or when to just use simple HTTP calls.
I understand that it depends a lot on the application you are building, but for example, if I'm building a payment gateway and some functions should be fired after the payment is verified, like sending emails, running unrelated business logic, or adding the purchase to a sales platform, which one should I use to manage this flow, and in which cases would it be better to use the others? Should I use events to create an async flow with Pub/Sub, use more complex solutions like Composer and Workflows, or just simple HTTP calls?
As always, it depends!! Even in your use case, it depends! OK, after a payment you want to send an email, run business logic, add the order to your databases, ...
But can all these actions be done in parallel, or do you need to execute them in a certain order, stopping the process if a step fails?
In the first case, you can use Cloud Pub/Sub with one message published (payment OK) and then a fan-out to several functions in parallel. Otherwise, you can use Workflows to test the response of a function and then decide whether or not to call the following functions. With Composer you can perform many more checks and actions.
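To illustrate the first (fan-out) case, here is a minimal publisher of the 'payment OK' event; the project/topic names and the attribute are invented for the example. Each downstream function (email, sales platform, ...) gets its own subscription on the topic, so every one of them receives a copy and they all run in parallel:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "payment-events")  # placeholder names

event = {"order_id": "1234", "status": "PAYMENT_OK"}
future = publisher.publish(
    topic_path,
    json.dumps(event).encode("utf-8"),
    event_type="payment_verified",  # attribute subscribers can filter on
)
print(future.result())  # message ID once the publish is confirmed
```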
You can also imagine sending another email 24 hours later to thank the customer for their order, and using Cloud Tasks to delay an action.
You talked about Cloud Functions, but you also have other solutions to host code on GCP: App Engine and Cloud Run. A Cloud Function is, most of the time, single-purpose. Sending an email is a perfect fit for a function.
Now, if you have a "set of functions" to browse your stock, view object details, review prices, and book an object (validating an order "books" the order content in your warehouse), the "functions" are all single-purpose but related to the same domain: warehouse management. Thus you can create a web server that proposes different paths to manage the warehouse (a microservice for the warehouse, if you prefer) and host it on Cloud Run or App Engine.
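Purely as an illustration of grouping such related 'functions' behind one service, here is a tiny Flask sketch with separate paths, of the kind you could containerize for Cloud Run or deploy on App Engine; the routes and payloads are invented:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/stock", methods=["GET"])
def browse_stock():
    return jsonify({"items": []})      # would query the real inventory store

@app.route("/items/<item_id>", methods=["GET"])
def item_details(item_id):
    return jsonify({"id": item_id})    # object details, price review, ...

@app.route("/book", methods=["POST"])
def book_item():
    order = request.get_json()
    return jsonify({"booked": True, "order": order}), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # Cloud Run sends traffic to $PORT (8080 by default)
```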
Each product has its strengths and weaknesses. You will also see this when you learn about storage on GCP. Most of the time, you can achieve things with several products, but if you don't use the right one, it will be slower or cost much more.

PubSubPullSensor fails when trying to pull messages larger than 50KB

I have been trying to use PubSubPullSensor in Airflow to pull JSON messages and ingest them into BigQuery. Whenever the message size exceeds a threshold, the sensor fails to pull the message and push it via XCom. I understand XCom has a max size limit, but how do we overcome such a situation, and are there any workarounds for this scenario?
There is a dedicated solution in Google Cloud Platform for the scenario you have described: the Dataflow Pub/Sub Subscription to BigQuery template, which reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.
As stated in the official documentation, the maximum XCom size is 48 KB. XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. XComs don't seem to be made for passing large amounts of data.
If you wish to stay with an Airflow solution and the current PubSubPullSensor's functionality doesn't fit your needs, you may develop your own custom sensor using the PubSub Hook.
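As a rough sketch of that idea (not a drop-in implementation), such a sensor could pull with the PubSub Hook, stage the payloads in GCS, and push only the small GCS URI through XCom. It assumes the Airflow 2 Google provider hooks (PubSubHook.pull/acknowledge, GCSHook.upload); the bucket, subscription and batch size are placeholders:

```python
import json

from airflow.sensors.base import BaseSensorOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.hooks.pubsub import PubSubHook


class PubSubToGCSSensor(BaseSensorOperator):
    """Pulls Pub/Sub messages, stores them in GCS and XComs only the GCS URI."""

    def __init__(self, project_id, subscription, bucket, max_messages=50, **kwargs):
        super().__init__(**kwargs)
        self.project_id = project_id
        self.subscription = subscription
        self.bucket = bucket
        self.max_messages = max_messages

    def poke(self, context):
        hook = PubSubHook()
        received = hook.pull(
            project_id=self.project_id,
            subscription=self.subscription,
            max_messages=self.max_messages,
            return_immediately=True,
        )
        if not received:
            return False  # nothing yet, keep poking

        payloads = [m.message.data.decode("utf-8") for m in received]
        object_name = f"pubsub/{context['ds_nodash']}/{context['ti'].run_id}.json"
        GCSHook().upload(bucket_name=self.bucket, object_name=object_name,
                         data=json.dumps(payloads))
        hook.acknowledge(project_id=self.project_id, subscription=self.subscription,
                         ack_ids=[m.ack_id for m in received])

        # Only the small GCS URI goes through XCom, staying well under the 48 KB limit.
        context["ti"].xcom_push(key="gcs_uri", value=f"gs://{self.bucket}/{object_name}")
        return True
```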
Another solution is to use Cloud Functions, which can be triggered by messages published to Pub/Sub topics, together with Composer DAG triggering solution. When a message is published to the Pub/Sub topic, the Cloud Function triggers the Composer DAG.
I hope you find the above pieces of information useful.

Simple task queue using Google Cloud Platform : issue with Google PubSub

My task: I cannot speak openly about the specifics of my task, but here is an analogy: every two hours, I get a variable number of spoken audio files. Sometimes only 10, sometimes 800 or more. Let's say I have a costly Python task to perform on these files, for example Automatic Speech Recognition. I have a Google managed instance group that can deploy any number of VMs to execute this task.
The issue: right now, I'm using Google Pub/Sub. Every two hours, a topic is filled with audio IDs. Instances of the managed group can be deployed depending on the size of the queue. The problem is that only one worker gets all the messages from the Pub/Sub subscription while the others receive none, perhaps because the queue is not that long (maximum ~1000 messages). This issue is reported in a few cases on the Python Google Cloud GitHub, and it is not clear whether this is the intended behavior of Pub/Sub or just a bug.
How could I implement the equivalent of a simple serverless task queue in Python on Google Cloud that can spawn instances based on a given metric, for example the size of the queue? Is this the intended purpose of Pub/Sub?
Thanks in advance.
In App Engine you can create push queues and set rate/concurrency limits and Google will handle the rest for you. App Engine will scale as needed (e.g. increase Python instances).
If you're outside of App Engine (e.g. on GKE), the Pub/Sub Python client library may be pulling many messages at once. We had a hard time controlling this (with google-cloud-pubsub==0.34.0) as well, so we ended up writing a small adjustment on top of google-cloud-pubsub, calling SubscriberClient.pull with max_messages set. The server-side Pub/Sub API does adhere to max_messages.
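For illustration, with the current (v2+) google-cloud-pubsub client a bounded synchronous pull looks roughly like this; the project and subscription names are placeholders and process() is a stand-in for the costly per-file task:

```python
from google.cloud import pubsub_v1

def process(payload: bytes) -> None:
    """Placeholder for the costly per-file task (e.g. speech recognition)."""

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "audio-tasks")

# Each worker leases at most 10 messages per pull, so the backlog is shared
# across workers instead of being grabbed entirely by one streaming client.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 10}
)

for received in response.received_messages:
    process(received.message.data)

subscriber.acknowledge(
    request={
        "subscription": subscription_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    }
)
```

Newer versions of the streaming client also expose flow-control settings (pubsub_v1.types.FlowControl(max_messages=...)) passed to subscribe(), which cap how many messages a single worker leases at a time.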

Replay events with Google Pub/Sub

I'm looking into Google Cloud; it is very appealing, especially for data-intensive applications. I'm looking into Pub/Sub + Dataflow and I'm trying to figure out the best way to replay events that were sent via Pub/Sub in case the processing logic changes.
As far as I can tell, Pub/Sub retention has an upper bound of 7 days and is per subscription; the topic itself does not retain data. In my mind, it would be ideal to be able to disable log compaction, like in Kafka, so I can replay data from the very beginning.
Now, since Dataflow promises that you can run the same jobs in batch and streaming mode, how effective would it be to simulate this desired behavior by dumping all events into Google Storage and replaying from there?
I'm also open for any other ideas.
Thank you
As you said, Cloud Pub/Sub does not currently support replays, so you need to save events somewhere to replay later and Cloud Storage sounds like a good place to do that.
Cloud Pub/Sub now has the ability to replay previously acknowledged messages. Please see the quickstart and related blog post for information on how to use the feature.
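For reference, a short sketch of the seek-based replay with the Python client; the names are placeholders, and replaying already-acknowledged messages requires retain_acked_messages to be enabled on the subscription:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "events-sub")
snapshot_path = subscriber.snapshot_path("my-project", "before-deploy")

# Capture the subscription's current backlog so it can be replayed later.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": subscription_path}
)

# ... later, after the processing logic changes, rewind the subscription:
subscriber.seek(request={"subscription": subscription_path, "snapshot": snapshot_path})
```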