PubSubPullSensor fails when trying to pull messages larger than 50KB - google-cloud-platform

I have been trying to use PubSubPullSensor in Airflow to pull JSON messages and ingest into bigquery, whenever the message size exceeds beyond a threshold the sensor fails to pull the message and push via XCOM. I understand XCOM has a max size limit but how do we overcome such a situation and are there any workarounds for this scenario?

There is dedicated solution in the Google Cloud Platform for scenario you have described. It's Dataflow Pub/Sub Subscription to BigQuery template, which allows you to reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.
As it is said in the official documentation, maximum XCom size is 48 KB. XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. XComs don't seem like to be made for passing large amount of data.
If you wish to stay with Airflow solution and the current PubSubSensor’s functionality doesn’t fit your needs, you may develop your custom sensor using the PubSub Hook.
Another solution is to use Cloud Functions, which can be triggered by messages published to Pub/Sub topics, together with Composer DAG triggering solution. When a message is published to the Pub/Sub topic, the Cloud Function triggers the Composer DAG.
I hope you find the above pieces of information useful.

Related

Can Google Dataflow connect to API data source and insert data into Big Query

We are exploring few use cases where we might have to ingest data generated by the SCADA/PIMS devices.
For security reason, we are not allowed to directly connect to OT devices or datasources. Hence, this data has REST APIs which can be used to consume the data.
Please suggest if Dataflow or any other service from GCP can be used to capture this data and put it into Big Query or any other relevant target service.
If possible, please share any relevant documentation/link around such requirements.
Yes!
Here is what you need to know: when you write an Apache Beam pipeline, your processing logic lives in DoFn that you create. These functions can call any logic you want. If your data source is unbounded or just big, then you will author a "splittable DoFn" that can be read by multiple worker machines in parallel and checkpointed. You will need to figure out how to provide exactly-once ingestion from your REST API and how to not overwhelm your service; that is usually the hardest part.
That said, you may wish to use a different approach, such as pushing the data into Cloud Pubsub first. Then you would use Cloud Dataflow to read the data from Cloud Pubsub. This will provide a natural scalable queue between your devices and your data processing.
You can capture data with PubSub and direct it to be processed in Dataflow and then saved into BigQuery (or storage), with a specific IO connector.
Stream messages from Pub/Sub by using Dataflow:
https://cloud.google.com/pubsub/docs/stream-messages-dataflow
Google-provided streaming templates (for Dataflow): PubSub->Dataflow->BigQuery:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
Whole solution:
https://medium.com/codex/a-dataflow-journey-from-pubsub-to-bigquery-68eb3270c93

How do you ensure it does work with google cloud pub/sub?

I am currently working on a distributed crawling service. When making this, I have a few issues that need to be addressed.
First, let's explain how the crawler works and the problems that need to be solved.
The crawler needs to save all posts on each and every bulletin board on a particular site.
To do this, it automatically discovers crawling targets and publishes several messages to pub/sub. The message is:
{
"boardName": "test",
"targetDate": "2020-01-05"
}
When the corresponding message is issued, the cloud run function is triggered, and the data corresponding to the given json is crawled.
However, if the same duplicate message is published, duplicate data occurs because the same data is crawled. How can I ignore the rest when the same message comes in?
Also, are there pub/sub or other good features I can refer to for a stable implementation of a distributed crawler?
because PubSub is, by default, designed to deliver AT LEAST one time the messages, it's better to have idempotent processing. (Exact one delivery is coming)
Anyway, your issue is very similar: twice the same message or 2 different messages with the same content will cause the same issue. There is no magic feature in PubSub for that. You need an external tool, like a database, to store the already received information.
Firestore/datastore is a good and serverless place for that. If you need low latency, Memory store and it's in memory database is the fastest.

Google Cloud PubSub message republishing across GCP projects

Context
I am working on a project where we are getting realtime data on a PubSub topic in a particular GCP project STAGE-1.
We have other GCP projects (we are treating as lower level environments) such as DEV-1, QA-1 etc, where we want these message to be re-published to as data realtime data is only hydrating the topic under STAGE-1 GCP project.
Question
Is there a way to configure message republishing (a bridge) to other PubSub topics across GCP project?
What could be a possible approach to follow if this sort of setup is not natively supported by Cloud PubSub?
P.S. I am very new to PubSub.
Thanks in advance. Cheers :)
Here are issues about this =>
A way for cloud functions to trigger a pub/sub topic from a different project
Cloud Function triggered by pubsub message from an external project topic
There may be at least a possible work around solution.
You will need to create additional subscription(s) to the original topic in the source project. That subscription is to be used by some 'active' component (in any project, subject to IAM permissions to access the given subscription).
The 'active' component can be a cloud function, cloud run, dataflow job, app engine, or something running on a compute engine or running on a k8s cluster...
From my point of view, one of the simplest solutions (but may be not the cheapest, depends on your context) - is to use a streaming dataflow job, which reads from a source subscription and push messages into one or many target topics - kind of a 'bridge'.
If the flow of messages (number of messages per time) is significant, or you need to serve many (dozens or hundreds of) source subscriptions - it can be quite good cost effective solution (from my point of view).
A potential side bonus in case you are to develop bespoke template for the dataflow job - you can implement additional message handling logic inside the dataflow job.
If you need something 'yesterday', no additional transformation is OK, only one source subscription and one target topic, than there is a Google provided template: Pub/Sub to Pub/Sub which can be used 'immediately'.

Simple task queue using Google Cloud Platform : issue with Google PubSub

My task : I cannot speak openly about what the specifics of my task are, but here is an analogy : every two hours, I get a variable number of spoken audio files. Sometimes only 10, sometimes 800 or more. Let's say I have a costly python task to perform on these files, for example Automatic Speech Recognition. I have a Google Intance managed group that can deploy any number of VMs for executing this task.
The issue : right now, I'm using Google PubSub. Every two hours, a topic is filled with audio ids. Instances of the managed group can be deployed depending on the size of queue. The problem is, only one worker get all the messages from the PubSub subscription, while the others are not receiving any, perhaps because the queue is not that long (maximum ~1000 messages). This issue is reported in a few cases in the python Google Cloud github, and it is not clear if it is the intended purpose of PubSub, or just a bug.
How could I implement the equivalent of a simple serverless task queue in Python and Google Cloud, and can spawn instances based on a given metric, for example the size of the queue ? Is this the intended purpose of PubSub ?
Thanks in advance.
In App Engine you can create push queues and set rate/concurrency limits and Google will handle the rest for you. App Engine will scale as needed (e.g. increase Python instances).
If you're outside of App Engine (e.g. GKE), the pubsub Python client library may be pulling many messages at once. We had a hard time controlling this (for google-cloud-pubsub==0.34.0) as well so we ended up writing a small adjustment on top of google-cloud-pubsub calling SubscriberClient.pull with max_messages set). The server side pubsub API does adhere to max_messages.

Replay events with Google Pub/Sub

I'm looking into Google Cloud, it is very appealing, specially for data intensive applications. I'm looking into Pub/Sub + Dataflow and I'm trying to figure out the best way to replay events that were send via Pub/Sub in case the processing logic changes.
As far as I can tell, Pub/Sub retention has an upper bound of 7 days and it is per subscription, the topic itself does not retain data. In my mind, it would allow to disable the log compaction, like in Kafka, so I can replay data from the very beginning.
Now, since dataflow promises that you can run the same jobs in batch and streaming mode, how effective would it be to simulate this desired behavior by dumping all events into Google Storage and replying from there?
I'm also open for any other ideas.
Thank you
As you said, Cloud Pub/Sub does not currently support replays, so you need to save events somewhere to replay later and Cloud Storage sounds like a good place to do that.
Cloud Pub/Sub now has the ability to replay previously acknowledged messages. Please see the quickstart and related blog post for information on how to use the feature.