I'm looking into Google Cloud; it is very appealing, especially for data-intensive applications. I'm looking into Pub/Sub + Dataflow and I'm trying to figure out the best way to replay events that were sent via Pub/Sub in case the processing logic changes.
As far as I can tell, Pub/Sub retention has an upper bound of 7 days and it is per subscription; the topic itself does not retain data. Ideally, it would allow disabling log compaction, like in Kafka, so I can replay data from the very beginning.
Now, since Dataflow promises that you can run the same jobs in batch and streaming mode, how effective would it be to simulate this desired behavior by dumping all events into Google Cloud Storage and replaying from there?
I'm also open to any other ideas.
Thank you
As you said, Cloud Pub/Sub does not currently support replays, so you need to save events somewhere to replay later and Cloud Storage sounds like a good place to do that.
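To make that concrete, here is a minimal, hypothetical sketch of the "archive to Cloud Storage and replay" idea with Apache Beam: the processing logic is a plain function applied to whichever source you choose, so the same job can run in streaming mode against Pub/Sub or in batch mode against the archived files. The topic, bucket, and transform names are assumptions, not working configuration.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


    def process_events(events):
        # Shared processing logic used by both the live job and the replay job.
        return events | "Uppercase" >> beam.Map(lambda e: e.upper())


    def run(replay_from_gcs: bool):
        options = PipelineOptions()
        options.view_as(StandardOptions).streaming = not replay_from_gcs
        with beam.Pipeline(options=options) as p:
            if replay_from_gcs:
                # Batch replay: read events previously archived to Cloud Storage.
                events = p | beam.io.ReadFromText("gs://my-event-archive/events-*.json")
            else:
                # Streaming: read live events from the hypothetical topic.
                events = (p
                          | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
                          | beam.Map(lambda b: b.decode("utf-8")))
            processed = process_events(events)
            # ... write `processed` to your real sink; archiving the raw events to
            # GCS is a separate windowed write in the streaming job.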
Cloud Pub/Sub now has the ability to replay previously acknowledged messages. Please see the quickstart and related blog post for information on how to use the feature.
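For completeness, a small sketch of the replay (seek) feature, assuming a hypothetical project and subscription that retains acknowledged messages (or has a snapshot):

    import datetime

    from google.cloud import pubsub_v1
    from google.protobuf import timestamp_pb2

    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "my-subscription")

    # Rewind the subscription two hours: messages acknowledged after that point
    # become deliverable again (requires retain_acked_messages or a snapshot).
    target = timestamp_pb2.Timestamp()
    target.FromDatetime(datetime.datetime.utcnow() - datetime.timedelta(hours=2))
    subscriber.seek(request={"subscription": subscription, "time": target})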
Related
We are exploring a few use cases where we might have to ingest data generated by SCADA/PIMS devices.
For security reasons, we are not allowed to connect directly to the OT devices or data sources. Instead, the data is exposed through REST APIs that can be used to consume it.
Please suggest whether Dataflow or any other GCP service can be used to capture this data and put it into BigQuery or any other relevant target service.
If possible, please share any relevant documentation/link around such requirements.
Yes!
Here is what you need to know: when you write an Apache Beam pipeline, your processing logic lives in the DoFns that you create. These functions can call any logic you want. If your data source is unbounded or just big, then you will author a "splittable DoFn" that can be read by multiple worker machines in parallel and checkpointed. You will need to figure out how to provide exactly-once ingestion from your REST API and how to not overwhelm your service; that is usually the hardest part.
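As a rough illustration only (not a splittable DoFn, and the endpoint is hypothetical), a plain DoFn that pulls readings from a REST API for each input element could look like this:

    import json

    import apache_beam as beam
    import requests


    class FetchFromRestApi(beam.DoFn):
        def process(self, device_id):
            # Hypothetical endpoint; replace with your SCADA/PIMS gateway URL.
            resp = requests.get(f"https://example.com/api/readings/{device_id}", timeout=30)
            resp.raise_for_status()
            for reading in resp.json():
                yield json.dumps(reading)


    with beam.Pipeline() as p:
        (p
         | beam.Create(["device-1", "device-2"])   # the device IDs to poll
         | beam.ParDo(FetchFromRestApi())
         | beam.io.WriteToText("/tmp/readings"))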
That said, you may wish to use a different approach, such as pushing the data into Cloud Pub/Sub first. Then you would use Cloud Dataflow to read the data from Cloud Pub/Sub. This will provide a natural, scalable queue between your devices and your data processing.
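A minimal sketch of that second approach, with a hypothetical REST endpoint and topic name: a small poller consumes the API and publishes each reading to Pub/Sub, and a Dataflow job downstream takes it from there.

    import json
    import time

    import requests
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "scada-readings")

    while True:
        # Poll the (hypothetical) REST API sitting in front of the OT devices.
        readings = requests.get("https://example.com/api/readings", timeout=30).json()
        for reading in readings:
            publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
        time.sleep(60)  # tune the interval to your API's rate limits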
You can capture data with Pub/Sub, process it in Dataflow, and then save it into BigQuery (or Cloud Storage) using the corresponding I/O connector.
Stream messages from Pub/Sub by using Dataflow:
https://cloud.google.com/pubsub/docs/stream-messages-dataflow
Google-provided streaming templates (for Dataflow): PubSub->Dataflow->BigQuery:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
Whole solution:
https://medium.com/codex/a-dataflow-journey-from-pubsub-to-bigquery-68eb3270c93
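Putting the pieces from the links above together, a minimal streaming sketch of the Pub/Sub -> Dataflow -> BigQuery path might look like this (the subscription and table names are assumptions, and the messages are assumed to be JSON matching the table schema):

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/events-sub")
         | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
         | "Write" >> beam.io.WriteToBigQuery(
             "my-project:my_dataset.events",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))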
Context
I am working on a project where we receive real-time data on a Pub/Sub topic in a particular GCP project, STAGE-1.
We have other GCP projects (which we treat as lower-level environments), such as DEV-1, QA-1, etc., where we want these messages to be republished, since the real-time data only hydrates the topic under the STAGE-1 GCP project.
Question
Is there a way to configure message republishing (a bridge) to other Pub/Sub topics across GCP projects?
What could be a possible approach if this sort of setup is not natively supported by Cloud Pub/Sub?
P.S. I am very new to PubSub.
Thanks in advance. Cheers :)
Here are related questions about this:
A way for cloud functions to trigger a pub/sub topic from a different project
Cloud Function triggered by pubsub message from an external project topic
There is at least one possible workaround.
You will need to create an additional subscription (or subscriptions) to the original topic in the source project. That subscription is then used by some 'active' component (in any project, subject to IAM permissions on the given subscription).
The 'active' component can be a Cloud Function, a Cloud Run service, a Dataflow job, an App Engine service, or something running on a Compute Engine instance or a Kubernetes cluster...
From my point of view, one of the simplest solutions (though maybe not the cheapest, depending on your context) is a streaming Dataflow job that reads from a source subscription and pushes messages into one or more target topics, acting as a kind of 'bridge' (a sketch follows at the end of this answer).
If the message throughput is significant, or you need to serve many (dozens or hundreds of) source subscriptions, it can be quite a cost-effective solution, in my view.
A potential side bonus, if you develop a bespoke template for the Dataflow job, is that you can implement additional message-handling logic inside the job.
If you need something 'yesterday', no additional transformation is needed, and you have only one source subscription and one target topic, then there is a Google-provided template, Pub/Sub to Pub/Sub, which can be used immediately.
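For reference, a minimal sketch of such a 'bridge' as a custom Beam job; the subscription and topic names are hypothetical, and the job's service account needs subscriber permissions in STAGE-1 and publisher permissions in the target project.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadSource" >> beam.io.ReadFromPubSub(
             subscription="projects/stage-1/subscriptions/events-bridge-sub")
         # Any additional message-handling logic would go here.
         | "Republish" >> beam.io.WriteToPubSub(
             topic="projects/dev-1/topics/events"))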
We currently run an AWS Lambda function that primarily just redirects the user to a different URL. The function is invoked via API Gateway.
For tracking purposes, we would like to create a widget on our dashboard that provides real-time insights into how many redirects are performed each second. The creation of the widget itself is not the problem.
My main question is which AWS service is best suited for telling our other services that an invocation took place. We plan to register the invocation in our database.
Some additional things:
low latency (< 5 seconds), so the data stays near real-time
nearly no added wait time for the user; we aim to redirect the user as fast as possible
Many thanks in advance!
Best Regards
Martin
I understand that your goal is to simply persist the information that an invocation happened somewhere with minimal impact on the response time of the Lambda.
For that purpose I'd probably use an SQS standard queue and just send a message to the queue that the invocation happened.
You can then have an asynchronous process (Lambda, Docker, EC2) process the messages from the queue and update your Dashboard.
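A minimal sketch of that, assuming a hypothetical queue URL and redirect target; the send_message call happens before returning the redirect and typically adds only a few milliseconds:

    import json
    import time

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/redirect-events"


    def handler(event, context):
        # Record that a redirect happened; a separate consumer drains the queue
        # and updates the dashboard / database.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"timestamp": time.time(), "path": event.get("rawPath")}),
        )
        return {"statusCode": 302, "headers": {"Location": "https://example.com/target"}}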
Depending on your scalability requirements, looking into Kinesis Data Analytics might also be worth it.
It's a fully managed streaming data solution, and the analytics part allows you to do sliding-window analyses using SQL on data in the stream.
In that case you'd write the info that something happened to the stream, which also has low latency.
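Writing the event to a Kinesis data stream is similarly small; the stream name here is just an assumption:

    import json
    import time

    import boto3

    kinesis = boto3.client("kinesis")

    kinesis.put_record(
        StreamName="redirect-events",          # hypothetical stream
        Data=json.dumps({"timestamp": time.time()}).encode("utf-8"),
        PartitionKey="redirects",
    )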
I am using GCP BigQuery to store some data. I have created a Pub/Sub topic with a Dataflow job for the events. Currently, I am facing an issue with data loss. Sometimes, due to "no internet connection", the data is not uploaded to BigQuery and the data for that time period is lost. How can I overcome this situation?
Or what kind of database should I use to store data offline and then upload it whenever there is connectivity?
Thank You in advance!
What you need is either a retry mechanism or persistent storage. There are several ways to implement this.
You can use a message queue to store data before processing. The message queue can be cloud-based, like AWS SQS or Cloud Pub/Sub (GCP), or self-hosted, like Kafka or RabbitMQ.
Another, somewhat less optimized, way is to persist data locally until it is successfully uploaded to the cloud. Local storage can be a buffer, a database, etc. If the upload fails, you retry from that storage. This is similar to the producer-consumer problem.
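A minimal sketch of that local-buffer idea, using SQLite as the buffer and the BigQuery streaming insert API; the table name is hypothetical and error handling is reduced to the bare minimum:

    import json
    import sqlite3

    from google.cloud import bigquery

    db = sqlite3.connect("buffer.db")
    db.execute("CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, row TEXT)")
    client = bigquery.Client()


    def buffer_row(row: dict):
        # Always land the row locally first, so nothing is lost while offline.
        db.execute("INSERT INTO pending (row) VALUES (?)", (json.dumps(row),))
        db.commit()


    def flush_pending():
        rows = db.execute("SELECT id, row FROM pending ORDER BY id").fetchall()
        if not rows:
            return
        try:
            errors = client.insert_rows_json("my_dataset.events",
                                             [json.loads(r) for _, r in rows])
            if not errors:
                db.execute("DELETE FROM pending WHERE id <= ?", (rows[-1][0],))
                db.commit()
        except Exception:
            # Still offline (or BigQuery unreachable): keep the rows and try
            # again on the next flush.
            pass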
You can use a Google Compute Engine to store your data and always run your data loading job from there. In that case, if your internet connection is lost, data will still continue to load into BigQuery.
From what I understand, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery, is that right?
The options I suggest:
If your connection loss happens occasionally and only for short periods, a retry mechanism could be enough to solve this problem (a small sketch follows at the end of this answer).
If you have frequent connection loss, or connection loss for long periods, I suggest that you combine a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It is important to mention that for this case you could also try a retry mechanism alone, but it would be more complex, because you would need to determine whether the process failed, save the data somewhere (if it is not already saved), and trigger the process again later.
I suggest that you take a look at Apache NiFi. It is a very powerful data-flow automation tool that might help you solve this kind of issue. Apache NiFi has specific processors to push data directly to Pub/Sub.
As a last suggestion, you could create an automated process to run data quality analysis after the data ingestion. With this in place, you could more easily determine whether your process failed.
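For the retry mechanism from the first option, a small sketch with exponential backoff around the Pub/Sub publish call (note the client library also has built-in retries of its own; the topic name is hypothetical):

    import time

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "events")  # hypothetical topic


    def publish_with_retry(payload: bytes, max_attempts: int = 5):
        for attempt in range(max_attempts):
            try:
                # result() blocks until the message is accepted by the service.
                return publisher.publish(topic_path, payload).result(timeout=30)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up; the caller should persist the payload for later
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...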
I have been trying to use PubSubPullSensor in Airflow to pull JSON messages and ingest them into BigQuery. Whenever the message size exceeds a threshold, the sensor fails to pull the message and push it via XCom. I understand XCom has a maximum size limit, but how do we overcome such a situation, and are there any workarounds for this scenario?
There is a dedicated solution in Google Cloud Platform for the scenario you have described: the Dataflow Pub/Sub Subscription to BigQuery template, which reads JSON-formatted messages from Pub/Sub and converts them into BigQuery elements.
As stated in the official documentation, the maximum XCom size is 48 KB. XComs let tasks exchange messages, allowing more nuanced forms of control and shared state, but they are not meant for passing large amounts of data.
If you wish to stay with an Airflow solution and the current PubSubPullSensor's functionality doesn't fit your needs, you may develop a custom sensor using the PubSubHook (a sketch follows below).
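A hypothetical sketch of such a custom sensor: instead of pushing the raw messages through XCom, it loads them straight into BigQuery with the BigQuery hook and only acknowledges the messages on success. The class, dataset, and parameter names are assumptions, and the messages are assumed to be JSON rows matching the target table.

    import json

    from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
    from airflow.providers.google.cloud.hooks.pubsub import PubSubHook
    from airflow.sensors.base import BaseSensorOperator


    class PubSubToBigQuerySensor(BaseSensorOperator):
        def __init__(self, project_id, subscription, dataset_id, table_id,
                     max_messages=100, **kwargs):
            super().__init__(**kwargs)
            self.project_id = project_id
            self.subscription = subscription
            self.dataset_id = dataset_id
            self.table_id = table_id
            self.max_messages = max_messages

        def poke(self, context):
            pubsub = PubSubHook()
            received = pubsub.pull(project_id=self.project_id,
                                   subscription=self.subscription,
                                   max_messages=self.max_messages,
                                   return_immediately=True)
            if not received:
                return False  # nothing to do yet, keep poking
            rows = [json.loads(m.message.data.decode("utf-8")) for m in received]
            # Stream the rows into BigQuery instead of returning them via XCom.
            BigQueryHook().insert_all(project_id=self.project_id,
                                      dataset_id=self.dataset_id,
                                      table_id=self.table_id,
                                      rows=[{"json": row} for row in rows])
            # Only acknowledge once the rows have been handed over to BigQuery.
            pubsub.acknowledge(project_id=self.project_id,
                               subscription=self.subscription,
                               ack_ids=[m.ack_id for m in received])
            return True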
Another solution is to use Cloud Functions, which can be triggered by messages published to Pub/Sub topics, together with a Composer DAG-triggering solution. When a message is published to the Pub/Sub topic, the Cloud Function triggers the Composer DAG.
I hope you find the above pieces of information useful.