Bigquery to pubsub message push using dataflow - google-cloud-platform

We have a requirement to push data from BigQuery to Pub/Sub as events using Dataflow. Is there a template available for this? (As I understand it, a Pub/Sub to BigQuery Dataflow template exists.) If we run Dataflow with streaming set to true, do we still need a scheduler to invoke Dataflow to fetch and push data to Pub/Sub? Please guide me on this.

There isn't a template to push BigQuery rows to Pub/Sub. In addition, streaming mode works only when Pub/Sub is the source, not the sink. When the source is a database or a file, the pipeline always runs in batch mode.
For your use case, I recommend a simple Cloud Function or Cloud Run service triggered by Cloud Scheduler. The data volume is low, and a serverless product fits your use case perfectly; there is no need for a big, scalable product like Dataflow.
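A minimal sketch of that serverless approach, assuming an HTTP-triggered Cloud Function invoked by Cloud Scheduler; the project, topic, table, and query below are placeholders, not from the original post:

```python
# Hypothetical sketch: HTTP-triggered Cloud Function called by Cloud Scheduler.
# Project, topic, table, and query are placeholders.
import json
from google.cloud import bigquery, pubsub_v1

bq_client = bigquery.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "bq-events")  # placeholder topic

def push_rows_to_pubsub(request):
    """HTTP entry point invoked by Cloud Scheduler."""
    rows = bq_client.query(
        "SELECT * FROM `my-project.my_dataset.my_table` "             # placeholder table
        "WHERE event_ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)"
    ).result()

    futures = [
        publisher.publish(topic_path, json.dumps(dict(row), default=str).encode("utf-8"))
        for row in rows
    ]
    for f in futures:  # block until Pub/Sub has accepted every message
        f.result()
    return f"published {len(futures)} messages"
```

Cloud Scheduler then only needs an HTTP job pointing at the function's URL.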

There is an option to export a BigQuery table to JSON files in Cloud Storage.
(Or, if your writes to BigQuery are not heavy, you could also write JSON files into dated folders in Cloud Storage whenever you write to BigQuery.)
Then use the Text Files on Cloud Storage to Pub/Sub (Stream) Dataflow template.
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#text-files-on-cloud-storage-to-pubsub-stream
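As a rough illustration of the export step, here is a minimal sketch using the BigQuery Python client; the project, dataset, table, and bucket names are placeholders:

```python
# Hypothetical sketch: export a BigQuery table to newline-delimited JSON in GCS.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",                # placeholder source table
    "gs://my-bucket/exports/dt=2024-01-01/*.json",   # dated folder, placeholder bucket
    job_config=job_config)
extract_job.result()  # wait for the export to finish
```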

Related

How to Create job for De-identification template in GCP DLP?

I am new to DLP, so I may have limited visibility into the DLP services.
I have created a de-identify template for data masking. I want to apply that de-identify template to a few BigQuery tables, but I cannot find a direct way to do that, unlike the inspect template, which provides an inspect job in DLP itself.
Do I need to write a custom job for all the de-identify use cases? If yes, where should I run that custom logic: Dataflow, Dataproc, GKE, Cloud Functions, or something else?
I have checked the Dataflow templates, but there is no template for BigQuery only. Most of the templates are for data ingestion, such as from Cloud Storage to BigQuery, not for data already stored in BigQuery.
Is there anything I am missing? I am using the asia-south1 region for testing. Please help me understand how to use this.

GCP DataFlow vs CloudFunctions for small size and less frequent update

I have a CSV that is used to update entries in a SQL database. The file size is at most 50 KB and the update frequency is twice a week. I also have a requirement to do some automated sanity testing.
What should I use: Dataflow or Cloud Functions?
For this use case (two small files per week), a Cloud Function will be the best option.
Other options would be to create a Dataflow batch pipeline triggered by a Cloud Function, or to create a Dataflow streaming pipeline, but both options would be more expensive.
Here is some documentation about how to connect to Cloud SQL from a Cloud Function.
And here is some documentation about triggering a Cloud Function from Cloud Storage.
See ya.
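To make the Cloud Function option concrete, here is a minimal sketch assuming a Cloud Storage trigger and a Postgres Cloud SQL instance reached over the /cloudsql Unix socket; the connection name, credentials, table, and column names are placeholders:

```python
# Hypothetical sketch: a background Cloud Function triggered when the CSV lands
# in a bucket, updating rows in Cloud SQL (Postgres) over the /cloudsql socket.
# Connection name, credentials, table, and columns are placeholders.
import csv
import io

import sqlalchemy
from google.cloud import storage

engine = sqlalchemy.create_engine(
    sqlalchemy.engine.url.URL.create(
        drivername="postgresql+pg8000",
        username="my-user",        # placeholder
        password="my-password",    # placeholder; prefer Secret Manager in practice
        database="my-db",          # placeholder
        query={"unix_sock": "/cloudsql/my-project:my-region:my-instance/.s.PGSQL.5432"},
    )
)

def update_from_csv(event, context):
    """Entry point for the Cloud Storage 'finalize' trigger."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    reader = csv.DictReader(io.StringIO(blob.download_as_text()))

    with engine.begin() as conn:  # one transaction per file
        for row in reader:
            conn.execute(
                sqlalchemy.text("UPDATE my_table SET value = :value WHERE id = :id"),
                {"id": row["id"], "value": row["value"]},  # placeholder columns
            )
```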
If there are no aggregations, and each input "element" does not interact with the others (that is, the pipeline works 1:1), then you can use either Cloud Functions or Dataflow.
But if you need aggregations, filtering, or any complex calculation that involves more than a single element in isolation, you will not be able to implement that with Cloud Functions. You need to use Dataflow in that situation.
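To illustrate the distinction, here is a minimal Beam sketch of a pipeline that aggregates across elements (total amount per customer), the kind of cross-element logic that points to Dataflow rather than a per-file Cloud Function; the paths and column positions are placeholders:

```python
# Hypothetical sketch of a Beam pipeline that aggregates across elements
# (total amount per customer); paths and column positions are placeholders.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read CSV lines" >> beam.io.ReadFromText(
            "gs://my-bucket/input/*.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Key by customer" >> beam.Map(lambda cols: (cols[0], float(cols[1])))
        | "Sum per customer" >> beam.CombinePerKey(sum)  # needs data from many rows
        | "Format" >> beam.MapTuple(lambda customer, total: f"{customer},{total}")
        | "Write totals" >> beam.io.WriteToText("gs://my-bucket/output/totals")
    )
```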

Should I use pub/sub

I am trying to write an ingestion application using GCP services. There could be around 1 TB of data each day, which can arrive in a streaming way (i.e., 100 GB each hour, or even all at once at a specific time).
My first thought was to write a simple Python script in a cron job to read the files sequentially (or even with two or three threads) and publish them as messages to Pub/Sub, and then have a Dataflow job running continuously that reads the data from Pub/Sub and saves it to BigQuery.
But I really want to know if I need Pub/Sub at all here. I know Dataflow can be very flexible, so can I ingest 1 TB of data directly from GCS to BigQuery as a batch job, or is it better to do it with a streaming job (via Pub/Sub) as described above? What are the pros and cons of each approach in terms of cost?
It seems like you don't need Pub/Sub at all.
There is already a Dataflow template for direct transfer of text files from Cloud Storage to BigQuery (in beta, just like the Pub/Sub to BigQuery template), and in general, batch jobs are cheaper than streaming jobs (see Pricing Details).
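If the files need no transformation at all, a plain BigQuery batch load job is an even simpler alternative to the template; a minimal sketch with the Python client, assuming newline-delimited JSON and placeholder bucket and table names:

```python
# Hypothetical sketch of a direct GCS-to-BigQuery batch load; bucket, table,
# and file format (newline-delimited JSON) are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/incoming/*.json",   # placeholder input files
    "my-project.my_dataset.my_table",   # placeholder destination table
    job_config=job_config,
)
load_job.result()  # batch job: runs once and finishes, no always-on streaming cost
```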

Create Jobs and schedule them on Bigquery

I'm new to BigQuery and need to do some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage, create a function in Cloud Functions using JavaScript, and write the SQL query inside its body.
Can someone help me here? Is that true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers scheduled queries. See the docs here.
The BigQuery Data Transfer Service schedules recurring data loads from GCS. See the docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you are better off using a trigger on the observed resource, such as a new Cloud Storage file, a Pub/Sub message, or an HTTP request, all wired into a Cloud Function.
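As a rough sketch of that trigger-based approach, assuming a Cloud Function fired by a Cloud Storage finalize event that starts a BigQuery load job (dataset, table, and file format are placeholders):

```python
# Hypothetical sketch: a Cloud Function fired on each new file in a bucket that
# starts a BigQuery load job. Dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

def load_new_file(event, context):
    """Entry point for the Cloud Storage 'finalize' trigger."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # placeholder format
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        uri, "my_dataset.my_table", job_config=job_config)  # placeholder table
    load_job.result()  # wait so any load errors surface in the function logs
```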
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions

Data format in Cloud Storage while streaming Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow?

We are looking to stream Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow and then write them to Cloud Storage. I am wondering what the best data format would be when writing the data to Cloud Storage. My further use case might also involve using Dataflow to read from Cloud Storage again for further operations to persist to a data lake, based on need. A few of the options I was considering:
a) Use Dataflow to write the JSON strings directly to Cloud Storage. I assume every line in the file in Cloud Storage would then be treated as a single message when reading from Cloud Storage and processing further to the data lake, right?
b) Transform the JSON to a text file format using Dataflow and save it in Cloud Storage.
c) Any other options?
You could store your data in JSON format for further use in BigQuery if you need to analyze it later. The Dataflow solution that you mention in option a) would be a good way to handle your scenario. Additionally, you could use a Cloud Function with a Pub/Sub trigger and write the content to Cloud Storage. You could use the code shown in this tutorial as a base for this scenario, since it puts the information in a topic, gathers the message from the topic, and creates a Cloud Storage object with the message as its content.
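For option a), here is a minimal sketch of the streaming pipeline with the Apache Beam Python SDK, writing the raw JSON strings as newline-delimited files; the topic, bucket, and window size are placeholders:

```python
# Hypothetical sketch: read JSON strings from Pub/Sub and write them line by line
# to Cloud Storage, one set of files per window. Topic, bucket, and window size
# are placeholders.
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)  # Pub/Sub source requires streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")        # placeholder topic
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))  # keep the raw JSON string
        | "Window" >> beam.WindowInto(window.FixedWindows(60))   # one batch of files per minute
        | "Write to GCS" >> fileio.WriteToFiles(
            path="gs://my-bucket/pubsub-output/",                # placeholder bucket
            sink=lambda dest: fileio.TextSink())                 # newline-delimited JSON
    )
```

Keeping one JSON object per line makes the files directly loadable into BigQuery later as newline-delimited JSON.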