I am new to DLP, so I may have limited visibility into the DLP services.
I have created a de-identify template for data masking. I want to apply that template to a few BigQuery tables, but I can't find a direct way to do that, unlike an inspect template, which can be run as an inspection job within DLP itself.
Do I need to write a custom job for all the de-identify use cases? If so, where should I run that custom logic: Dataflow, Dataproc, GKE, Cloud Functions, or something else?
I have checked the Dataflow templates, but there is no template that works on BigQuery alone. Most of the templates are for data ingestion, such as Cloud Storage to BigQuery, not for data already stored in BigQuery.
Is there anything I am missing? I am using the asia-south1 region for testing. Please help me understand how this is meant to be done.
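For reference, the kind of custom logic I am imagining is roughly sketched below in Python (the project, dataset, table, and template names are placeholders, and error handling and batching are omitted): read rows from BigQuery, call the DLP API with the de-identify template, and then write the masked rows back.

```python
# Rough sketch only: read rows from BigQuery, de-identify them with an existing
# de-identify template, and print the masked rows. All names are placeholders.
from google.cloud import bigquery, dlp_v2

PROJECT_ID = "my-project"  # placeholder
TEMPLATE = (
    f"projects/{PROJECT_ID}/locations/asia-south1/deidentifyTemplates/my-template"
)  # placeholder

bq_client = bigquery.Client(project=PROJECT_ID)
dlp_client = dlp_v2.DlpServiceClient()

# Read a small batch of rows from the source table (placeholder table name).
rows = list(
    bq_client.query(
        "SELECT name, email FROM `my-project.my_dataset.my_table` LIMIT 100"
    ).result()
)

# Convert the rows into the DLP table format.
item = {
    "table": {
        "headers": [{"name": "name"}, {"name": "email"}],
        "rows": [
            {
                "values": [
                    {"string_value": str(row["name"])},
                    {"string_value": str(row["email"])},
                ]
            }
            for row in rows
        ],
    }
}

# Apply the de-identify template to the whole batch.
response = dlp_client.deidentify_content(
    request={
        "parent": f"projects/{PROJECT_ID}/locations/asia-south1",
        "deidentify_template_name": TEMPLATE,
        "item": item,
    }
)

# The masked values could then be written back to BigQuery
# (e.g. with bq_client.insert_rows_json on a destination table).
for row in response.item.table.rows:
    print([value.string_value for value in row.values])
```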
Related
I have a CSV that is used to update entries in a SQL database. The file size is at most 50 KB and the update frequency is twice a week. I also have a requirement to do some automated sanity testing.
What should I use: Dataflow or Cloud Functions?
For this use case (two small files per week), a Cloud Function will be the best option.
Other options would be to create a Dataflow batch pipeline triggered by a Cloud Function, or to create a Dataflow streaming pipeline, but both options will be more expensive.
Here is some documentation about how to connect to Cloud SQL from a Cloud Function.
And here is some documentation related to triggering a Cloud Function from Cloud Storage.
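As a rough illustration (not taken from those docs), a Cloud Function triggered by a new file in the bucket could look something like this; the instance connection name, credentials, and table are placeholders, and it assumes a Cloud SQL for MySQL instance reached over the /cloudsql unix socket:

```python
# Minimal sketch of a background Cloud Function (Python runtime) that fires
# when a CSV is uploaded to a bucket and updates rows in Cloud SQL.
# Instance connection name, credentials, table, and columns are placeholders.
import csv
import io
import os

import pymysql
from google.cloud import storage

def update_sql_from_csv(event, context):
    """Triggered by google.storage.object.finalize on the bucket."""
    bucket_name = event["bucket"]
    file_name = event["name"]

    # Download the CSV content (max ~50 KB, so in-memory is fine).
    blob = storage.Client().bucket(bucket_name).blob(file_name)
    content = blob.download_as_text()

    # Connect to Cloud SQL (MySQL) over the unix socket mounted by Cloud Functions.
    connection = pymysql.connect(
        unix_socket=f"/cloudsql/{os.environ['INSTANCE_CONNECTION_NAME']}",
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASS"],
        db=os.environ["DB_NAME"],
    )
    try:
        with connection.cursor() as cursor:
            for row in csv.DictReader(io.StringIO(content)):
                # Hypothetical table/columns; adapt to your schema.
                cursor.execute(
                    "UPDATE my_table SET value = %s WHERE id = %s",
                    (row["value"], row["id"]),
                )
        connection.commit()
    finally:
        connection.close()
```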
See ya.
If there are no aggregations and each input "element" does not interact with the others, that is, the pipeline works 1:1, then you can use either Cloud Functions or Dataflow.
But if you need to do aggregations, filtering, or any complex calculation that involves more than one single element in isolation, you will not be able to implement that with Cloud Functions. You need to use Dataflow in that situation.
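To make that concrete, here is a tiny Apache Beam (Dataflow) sketch of an aggregation that crosses element boundaries, which a single Cloud Functions invocation (seeing one element at a time) cannot express:

```python
# Minimal Apache Beam sketch of an aggregation that needs Dataflow:
# counting occurrences per key across the whole input, not element by element.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("a", 1), ("b", 1), ("a", 1)])
        | "CountPerKey" >> beam.CombinePerKey(sum)  # crosses element boundaries
        | "Print" >> beam.Map(print)                # e.g. ('a', 2), ('b', 1)
    )
```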
We have a requirement to push data from BigQuery to Pub/Sub as events using Dataflow. Is there any template available for this? (As I understand it, a Pub/Sub to BigQuery Dataflow template is available.) If we use Dataflow with streaming set to true, do we need a scheduler to invoke Dataflow to fetch the data and push it to Pub/Sub? Please guide me on this.
There isn't a template to push BigQuery rows to Pub/Sub. In addition, streaming mode works only when Pub/Sub is the source, not the sink; when the source is a database or a file, the pipeline always runs in batch mode.
For your use case, I recommend a simple Cloud Function or Cloud Run service triggered by Cloud Scheduler. The data volume is low, and a serverless product fits your use case perfectly; there is no need for a big, scalable product like Dataflow.
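For illustration, a minimal sketch of such a Cloud Function, HTTP-triggered so Cloud Scheduler can invoke it; the project, table, and topic names are placeholders:

```python
# Minimal sketch of an HTTP-triggered Cloud Function, invoked by Cloud Scheduler,
# that reads rows from BigQuery and publishes each one to Pub/Sub as JSON.
# Project, table, and topic names are placeholders.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                         # placeholder
TOPIC = f"projects/{PROJECT_ID}/topics/my-topic"  # placeholder

bq_client = bigquery.Client(project=PROJECT_ID)
publisher = pubsub_v1.PublisherClient()

def bigquery_to_pubsub(request):
    """HTTP entry point called by Cloud Scheduler."""
    rows = bq_client.query(
        "SELECT * FROM `my-project.my_dataset.my_table`"  # placeholder query
    ).result()

    futures = []
    for row in rows:
        payload = json.dumps(dict(row), default=str).encode("utf-8")
        futures.append(publisher.publish(TOPIC, payload))

    # Wait for all publishes to complete before returning.
    for future in futures:
        future.result()
    return "ok"
```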
There is an option to export a BigQuery table to JSON files in Cloud Storage.
(Or, if your writes to BigQuery are not heavy, write JSON files into dated folders in Cloud Storage at the same time as you write to BigQuery.)
Then use the Text Files on Cloud Storage to Pub/Sub Dataflow template:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#text-files-on-cloud-storage-to-pubsub-stream
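The export itself can be started with the BigQuery client; a minimal sketch, with placeholder project, table, and bucket names:

```python
# Minimal sketch: export a BigQuery table to newline-delimited JSON files in
# Cloud Storage, ready for the "Text Files on Cloud Storage to Pub/Sub" template.
# Project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.job.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/my_table-*.json",  # wildcard for sharded output
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish
```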
I can't find any information about limits for Cloud Data Fusion.
Does anyone know how many data pipelines I can create with Cloud Data Fusion by default? (link/source needed)
You can create as many pipelines as you like, as long as you are not hitting the quotas of the resources used in the pipeline. For example, if your pipeline uses BigQuery, Compute Engine, etc., and one of these hits a quota, then you will not be able to create a new pipeline. See Data Fusion Quotas and limits for reference.
I'm new to BigQuery and need to do some tests with it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage, create a function in Cloud Functions using JavaScript, and write the SQL query inside its body.
Can someone help me here? Is that true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedules recurring data loads from GCS). See docs here.
If you want to schedule other jobs (load, delete, copy jobs, etc.), you are better off doing this with a trigger on the observed resource, such as a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
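For example, here is a minimal sketch of a Cloud Function that starts a BigQuery load job whenever a new file lands in a bucket (the destination table and CSV format are assumptions):

```python
# Minimal sketch of a background Cloud Function triggered by a new Cloud Storage
# file that starts a BigQuery load job. Destination table and file format are assumptions.
from google.cloud import bigquery

bq_client = bigquery.Client()

def load_new_file(event, context):
    """Triggered by google.storage.object.finalize on the bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: CSV input
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = bq_client.load_table_from_uri(
        uri,
        "my-project.my_dataset.my_table",         # placeholder destination
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete
```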
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions
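And since scheduled queries are mentioned above: besides the console, they can also be created programmatically. A minimal sketch using the BigQuery Data Transfer Service client, with placeholder project, dataset, query, and schedule:

```python
# Minimal sketch: create a BigQuery scheduled query via the Data Transfer Service.
# Project, dataset, table, query, and schedule are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",            # placeholder
    display_name="Nightly aggregation",             # placeholder
    data_source_id="scheduled_query",
    params={
        "query": "SELECT CURRENT_TIMESTAMP() AS run_time",  # placeholder query
        "destination_table_name_template": "my_table",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),  # placeholder project
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")
```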
We are looking to stream Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow and then write them to Cloud Storage. I am wondering what the best data format would be when writing the data to Cloud Storage. My further use case might also involve using Dataflow to read from Cloud Storage again for further operations to persist to a data lake, as needed. A few of the options I was considering:
a) Use Dataflow to write the JSON strings directly to Cloud Storage? I assume every line in the file in Cloud Storage would be treated as a single message when reading from Cloud Storage and then processing for further operations to the data lake, right?
b) Transform the JSON to a text file format using Dataflow and save it in Cloud Storage.
c) Any other options?
You could store your data in JSON format for further use in BigQuery if you need to analyze your data later. The Dataflow solution you mention in option a) will be a good way to handle your scenario. Additionally, you could use a Cloud Function with a Pub/Sub trigger and write the content to Cloud Storage. You could use the code shown in this tutorial as a base for this scenario, as it puts the information in a topic, then gathers the message from the topic and creates a Cloud Storage object with the message as its content.
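As a rough sketch of that Cloud Functions alternative (the bucket name and object naming scheme are assumptions, not taken from the tutorial):

```python
# Minimal sketch of a background Cloud Function with a Pub/Sub trigger that
# writes each message's JSON payload to a Cloud Storage object.
# Bucket name and object naming scheme are assumptions.
import base64

from google.cloud import storage

storage_client = storage.Client()

def pubsub_to_gcs(event, context):
    """Triggered by a message published to the subscribed Pub/Sub topic."""
    payload = base64.b64decode(event["data"]).decode("utf-8")  # the JSON string

    bucket = storage_client.bucket("my-output-bucket")          # placeholder bucket
    blob = bucket.blob(f"messages/{context.event_id}.json")     # one object per message
    blob.upload_from_string(payload, content_type="application/json")
```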