Programmatically edit Dataprep recipe - google-cloud-platform

We have a Dataprep job that processes an input file and produces a cleaned file.
We call this Dataprep job remotely using Dataflow templates, and we use Python to run the job from those templates.
Since we need to do this for different files, we need to modify the recipe dynamically and then execute the job in Dataprep.
Is it possible to edit the recipe of a Dataprep job from Python code (remotely)? If yes, is it possible to trigger a Dataprep job from Python code?

It looks like there is no API for Dataprep so far, but there is in fact a Feature Request for one. You might want to give it a star in order to help prioritize it.
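For the triggering half of the question, the setup the question already describes (launching the exported Dataflow template from Python) can be sketched roughly as below using google-api-python-client; the recipe itself cannot be edited this way, and the project, region, template path, and parameters are placeholders rather than values from the original setup.

from googleapiclient.discovery import build

project = "my-project"                    # placeholder
region = "us-central1"                    # placeholder
template_path = "gs://my-bucket/templates/my-dataprep-template"  # placeholder

# Launch a classic Dataflow template via the Dataflow REST API (v1b3).
dataflow = build("dataflow", "v1b3", cache_discovery=False)
response = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    gcsPath=template_path,
    body={
        "jobName": "cleaning-job",        # placeholder job name
        "parameters": {},                 # whatever runtime parameters your template defines
    },
).execute()
print("Launched Dataflow job:", response["job"]["id"])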

Related

Is there a way to specify ignoreExisting on pipelineJob?

Is there a way to specify ignoreExisting on pipelineJob? I don't see it listed in plugin/job-dsl/api-viewer/index.html but maybe I'm missing a way to do it.
In my setup, all jobs are defined using Job DSL through the Configuration as Code module. All jobs defined by Job DSL are used to load pipelines where all of the job's configuration lives. Since all of the configuration is stored in the pipeline, I'd like to be able to define each job and have it not be modified by Job DSL again unless the job is removed.
The current behavior is that Job DSL overwrites any changes made to the job by the pipeline, which is not what I want. Is there any way around this? I thought ignoreExisting would do the trick, but it doesn't seem to be available in pipelineJob.

How to get the Dataflow template of a Dataprep job?

Good morning everyone. The client I work for is going to deprecate Dataprep in October; we currently do everything with Google Cloud Platform.
Dataprep is a "pretty" layer that runs on top of Dataflow.
Currently, one of the implemented solutions works like this: a file is received in a bucket, and I run the Dataprep job from Python.
https://www.trifacta.com/blog/automate-cloud-dataprep-pipeline-data-warehouse/
I need to know how I can obtain the template of those Dataprep jobs so that, when a file is received in the bucket, I can trigger the Dataflow job corresponding to that Dataprep flow, and whether I can eliminate Dataprep from the solution.
https://mbha-phoenix.medium.com/running-cloud-dataprep-jobs-on-cloud-dataflow-for-more-control-37ed84e73cf3
Dataprep previously allowed you to do this, but this option is no longer available.
Screenshot: "Export Result Window"
I would appreciate your help.
Thanks a lot
Cheers
The export option is no longer available in Dataprep. From Dataprep -> Job History you can view the previously executed Dataflow job, but you cannot export it.
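As a hedged aside (not an export workaround): given a job ID from Dataprep's Job History, the metadata of the corresponding Dataflow job can at least be inspected from Python via the Dataflow API. Project, region, and job ID below are placeholders.

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3", cache_discovery=False)
job = dataflow.projects().locations().jobs().get(
    projectId="my-project",    # placeholder
    location="us-central1",    # placeholder
    jobId="2019-01-01_00_00_00-1234567890123456789",  # placeholder: ID shown in Job History
    view="JOB_VIEW_ALL",
).execute()

# This returns the job's metadata (environment, SDK version, pipeline options),
# which helps to understand what Dataprep submitted, but it does not produce a
# re-launchable template.
print(job["name"], job["currentState"])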

Is it possible to re-run a job in Google Cloud Dataflow after it succeeded?

Maybe the question sounds stupid, but I was wondering: once a job has finished successfully and has an ID, is it possible to start the same job again?
Or is it necessary to create another one?
Otherwise I would end up with jobs that have the same name throughout the list.
I just want to know if there is a way to restart it without recreating it again.
It's not possible to run the exact same job again, but you can create a new job with the same name that runs the same code. It will just have a different job ID and show up as a separate entry in the job list.
If you want to make running repeated jobs easier, you can create a template. This will let you create jobs from that template via a gcloud command instead of having to run your pipeline code.
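As a rough sketch of the template suggestion (all project, bucket, and path names are placeholders, and the pipeline body is only an example): staging a Beam Python pipeline with template_location creates a classic template instead of running the job, and that template can then be launched repeatedly.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                      # placeholder
    region="us-central1",                                      # placeholder
    temp_location="gs://my-bucket/temp",                       # placeholder
    template_location="gs://my-bucket/templates/my-template",  # where the template is staged
)

# With template_location set, this run stages the template instead of
# executing the job; afterwards the template can be launched repeatedly,
# for example with `gcloud dataflow jobs run` or the templates.launch API.
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/cleaned")
    )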
Cloud Dataflow does have a re-start function. See SDK here. One suggested pattern (to help with deployment) is to create a template for the graph you want to repeatedly run AND execute the template.

Way to trigger Dataflow only after a BigQuery job finishes

Currently, the following happens to my data:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished, so that a Dataflow pipeline is triggered only after the job has completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so the function would have to be redeployed every time a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Despite what you mention about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink on this advanced filter (routed to Pub/Sub so it triggers your Cloud Function) and run your Dataflow job from there.
If this doesn't match your expectation, can you detail why?
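To make that concrete, here is a hedged sketch of the function end of the setup, assuming the sink on the filter above routes to a Pub/Sub topic that triggers a background Cloud Function, which then launches a Dataflow template; project, region, and template path are placeholders.

import base64
import json

from googleapiclient.discovery import build

PROJECT = "my-project"                                  # placeholder
REGION = "us-central1"                                  # placeholder
TEMPLATE_PATH = "gs://my-bucket/templates/my-pipeline"  # placeholder


def on_bq_job_done(event, context):
    # The sink delivers the matching LogEntry as the Pub/Sub message payload.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print("BigQuery job completion log entry received:", entry.get("insertId"))

    # Launch the Dataflow pipeline from its template.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={"jobName": "after-bq-load", "parameters": {}},  # template-specific parameters
    ).execute()
    print("Launched Dataflow job:", response["job"]["id"])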
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, Composer executes each node of the DAG, and it checks dependencies to ensure that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
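A minimal sketch of what that could look like as a Composer DAG (operator names are from the Airflow Google provider; every project, bucket, table, and template value is a placeholder):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="bq_load_then_dataflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,   # trigger externally, e.g. from a GCS event
    catchup=False,
) as dag:
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={
            "load": {
                "sourceUris": ["gs://my-bucket/input/*.csv"],   # placeholder
                "destinationTable": {
                    "projectId": "my-project",                  # placeholder
                    "datasetId": "my_dataset",                  # placeholder
                    "tableId": "my_table",                      # placeholder
                },
                "sourceFormat": "CSV",
            }
        },
    )

    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        job_name="post-load-pipeline",
        template="gs://my-bucket/templates/my-template",        # placeholder
        project_id="my-project",                                # placeholder
        location="us-central1",                                 # placeholder
    )

    # The Dataflow task only starts after the BigQuery load has completed.
    load_to_bq >> run_dataflow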

Google Dataprep: Scheduling with updated data source

Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point in scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.