Good morning everyone. The client I work for is going to deprecate Dataprep in October; we currently do everything on Google Cloud Platform.
Dataprep is a "pretty" layer on top of Dataflow (Dataflow runs underneath it).
Currently, one of the implemented solutions works like this: a file is received in a bucket, and from Python I execute the Dataprep job.
https://www.trifacta.com/blog/automate-cloud-dataprep-pipeline-data-warehouse/
I need to know how I can obtain the templates of those Dataprep jobs, so that when a file is received in the bucket I can trigger the Dataflow job corresponding to that Dataprep flow, and whether I can eliminate Dataprep from the solution.
https://mbha-phoenix.medium.com/running-cloud-dataprep-jobs-on-cloud-dataflow-for-more-control-37ed84e73cf3
Dataprep previously allowed you to do this, but this option is no longer available.
Screenshot: "Export Result Window"
I would appreciate your help.
Thanks a lot
Cheers
The export option is no longer available in Dataprep. From Dataprep -> Job History you can view the previously executed Dataflow job, but you cannot export it.
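That said, if you still have a Dataflow template path from an earlier export (or any other reusable Dataflow template), launching it from Python whenever a file lands in the bucket could look roughly like the sketch below, using the Dataflow REST API through the google-api-python-client. The project, region, template path, and parameter names here are placeholders, not values taken from Dataprep:

    # Sketch of a background Cloud Function bound to a GCS "finalize" event.
    # The project, region, template path, and parameter names are placeholders.
    from googleapiclient.discovery import build

    PROJECT = "my-project"                                       # placeholder
    REGION = "us-central1"                                       # placeholder
    TEMPLATE = "gs://my-bucket/templates/my-exported-template"   # placeholder

    def launch_dataflow(event, context):
        """Launch the Dataflow template for the file that just landed in the bucket."""
        input_file = "gs://{}/{}".format(event["bucket"], event["name"])

        dataflow = build("dataflow", "v1b3", cache_discovery=False)
        response = dataflow.projects().locations().templates().launch(
            projectId=PROJECT,
            location=REGION,
            gcsPath=TEMPLATE,
            body={
                "jobName": "triggered-from-gcs",
                # Parameter names depend entirely on the template you launch.
                "parameters": {"inputFile": input_file},
            },
        ).execute()
        print("Launched Dataflow job {}".format(response["job"]["id"]))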
I am facing a problem where Dataflow does not load even after waiting 30 minutes. How do I complete that lab?
The task is: make a chart on Dataflow by running a query on BigQuery.
I found a second option that works.
You can also open Dataflow in another tab; restarting it from that fresh page helps it load correctly. Then make a new chart using the BigQuery option with a query, which completes the task.
Currently my data goes through the following steps:
new objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution to know when this BigQuery job has finished, so that I trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know whether it is a good idea: from what I saw, this trigger uses the job ID, which cannot be fixed in advance, so apparently the function would have to be redeployed every time a job runs. And of course it is an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates that the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function that waits for the job to finish.
Despite what you say about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink (for example to Pub/Sub) on this advanced filter, trigger your Cloud Function from it, and run your Dataflow job.
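A minimal sketch of the Cloud Function behind that sink, assuming the sink publishes to a Pub/Sub topic that triggers the function (the actual Dataflow launch is left as a placeholder):

    # Sketch of a Cloud Function triggered by the Pub/Sub topic the logging
    # sink publishes to. It confirms the BigQuery job finished successfully
    # and then hands off to whatever starts the Dataflow pipeline.
    import base64
    import json

    def on_bigquery_job_done(event, context):
        # The log entry arrives base64-encoded in the Pub/Sub message payload.
        entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        job = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]

        if job["jobStatus"]["state"] != "DONE":
            return  # the sink filter should already guarantee this

        if job["jobStatus"].get("error"):
            print("BigQuery job finished with an error; not starting Dataflow.")
            return

        print("BigQuery job finished, starting the Dataflow pipeline.")
        start_dataflow_pipeline()  # placeholder: e.g. a templates.launch call

    def start_dataflow_pipeline():
        # Placeholder for the actual Dataflow launch (template or otherwise).
        pass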
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. In Composer you define a DAG; Composer executes each node of the DAG and checks dependencies, so that tasks run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
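As a rough illustration only, such a DAG could look like the sketch below; operator import paths differ between Airflow versions and google provider releases, and the table, bucket, and template names are placeholders:

    # Sketch of a Composer/Airflow DAG: load data into BigQuery, then start a
    # Dataflow template only after the load succeeds. All names and paths are
    # placeholders; operator import paths vary by Airflow/provider version.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowTemplatedJobStartOperator,
    )

    with DAG(
        dag_id="bq_load_then_dataflow",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        load_to_bq = BigQueryInsertJobOperator(
            task_id="load_to_bq",
            configuration={
                "load": {
                    "sourceUris": ["gs://my-bucket/incoming/*.csv"],   # placeholder
                    "destinationTable": {
                        "projectId": "my-project",
                        "datasetId": "my_dataset",
                        "tableId": "my_table",
                    },
                    "sourceFormat": "CSV",
                    "writeDisposition": "WRITE_APPEND",
                }
            },
        )

        run_dataflow = DataflowTemplatedJobStartOperator(
            task_id="run_dataflow",
            project_id="my-project",
            location="us-central1",
            template="gs://my-bucket/templates/my-template",           # placeholder
            parameters={},
        )

        # Dataflow starts only once the BigQuery load has succeeded.
        load_to_bq >> run_dataflow

The `>>` dependency is what ensures the Dataflow task only starts after the BigQuery load task has succeeded.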
I have several questions regarding the Google Spanner Export / Import tool. Apparently the tool creates a dataflow job.
Can an import/export Dataflow job be re-run after it has run successfully from the tool? If so, will it use the current timestamp?
How to schedule a daily backup (export) of Spanner DBs?
How to get notified of new enhancements within the GCP platform? I was browsing the web for something else and I noticed that the export / import tool for GCP Spanner had been released 4 days earlier.
I am still browsing through the documentation for Dataflow jobs, templates, etc. Any suggestions on the above would be greatly appreciated.
Thx
My response is based on limited experience with the Spanner Export tool.
I have not seen a way to do this. There is no option in the GCP console, though that does not mean it cannot be done.
There is no built-in scheduling capability. Perhaps this can be done via Google's managed Airflow service, Cloud Composer (https://console.cloud.google.com/composer)? I have yet to try this, but it is my next step, as I have similar needs; a rough sketch of launching the export template this way is shown below.
I've made this request to Google several times. I have yet to get a response. My best recommendation is to read the change logs when updating the gcloud CLI.
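On the scheduling point specifically, one option is to have whatever scheduler you end up with (Cloud Composer, or Cloud Scheduler plus a Cloud Function) launch the Google-provided Spanner-to-GCS-Avro export Dataflow template directly. A rough Python sketch, assuming that template lives at gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro and using placeholder instance, database, and bucket names (double-check the parameter names against the template documentation):

    # Sketch: start the Google-provided Spanner-to-GCS-Avro export template on
    # a schedule (for example from Cloud Scheduler + a Cloud Function, or from
    # a Composer task). Instance, database, bucket, and region are placeholders.
    from datetime import datetime

    from googleapiclient.discovery import build

    PROJECT = "my-project"
    REGION = "us-central1"
    TEMPLATE = "gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro"

    def export_spanner_database(event=None, context=None):
        """Start a Dataflow job that exports a Spanner database to GCS as Avro."""
        stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
        output_dir = "gs://my-backup-bucket/spanner/{}".format(stamp)

        dataflow = build("dataflow", "v1b3", cache_discovery=False)
        response = dataflow.projects().locations().templates().launch(
            projectId=PROJECT,
            location=REGION,
            gcsPath=TEMPLATE,
            body={
                "jobName": "spanner-export-{}".format(stamp),
                "parameters": {
                    "instanceId": "my-instance",      # placeholder
                    "databaseId": "my-database",      # placeholder
                    "outputDir": output_dir,
                },
            },
        ).execute()
        print("Started export job {}".format(response["job"]["id"]))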
Finally-- there is an outstanding issue with the Export tool that causes it to fail if you export a table with 0 rows. I have filed a case with Google (Case #16454353) and they confirmed this issue. Specifically:
After running into a similar error message during my reproduction of
the issue, I drilled down into the error message and discovered that
there is something odd with the file path for the Cloud Storage folder
[1]. There seems to be an issue with the Java File class viewing
‘gs://’ as having a redundant ‘/’ and that causes the ‘No such file or
directory’ error message.
Fortunately for us, there is an ongoing internal investigation on this
issue, and it seems like there is a fix being worked on. I have
indicated your interest in a fix as well, however, I do not have any
ETAs or guarantees of when a working fix will be rolled out.
We have a Dataprep job that processes an input file and produces a cleaned file.
We call this Dataprep job remotely using Dataflow templates, running the job from Python.
Since we need to do this for different files, we need to modify the recipe dynamically and then execute the job in Dataprep.
Is it possible to edit the recipe of a Dataprep job remotely from Python code? If so, is it possible to trigger a Dataprep job from Python code?
It looks like there is no API for Dataprep so far, but there is in fact a feature request. You might want to give it a star in order to help prioritize it.
Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point of scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.