Dataprep - Scheduling Jobs - google-cloud-platform

To anyone on the Dataprep beta: is it possible to schedule jobs to run? If so, is it done with the cron service via App Engine? I can't quite follow the cron-for-App-Engine instructions, but I want to make sure it's not a dead end before I try.
Thanks

It's in there now, with the option to add multiple schedules (e.g. daily, weekly, etc.).

Related

Way to trigger Dataflow only after a BigQuery job finishes

Currently my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished and to trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea; from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so the function would have to be redeployed every time a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Despite what you found about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink on this advanced filter (for example to a Pub/Sub topic that triggers a Cloud Function) and run your Dataflow job from there; a sketch follows below.
If this doesn't match your expectations, can you detail why?
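A minimal sketch of what that Cloud Function could look like, assuming the log sink exports to a Pub/Sub topic that triggers the function and that the downstream pipeline is packaged as a Dataflow template; the project, region, template path and function name are placeholders, not details from the question:
```python
# Pub/Sub-triggered Cloud Function (Python runtime): fires when the exported
# BigQuery "job DONE" log entry arrives, then launches a Dataflow template.
import base64
import json

from googleapiclient.discovery import build

PROJECT = "my-project"                               # placeholder
REGION = "us-central1"                               # placeholder
TEMPLATE = "gs://my-bucket/templates/my-template"    # placeholder

def trigger_dataflow(event, context):
    # The sink delivers the LogEntry as a base64-encoded JSON payload.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    job = (entry.get("protoPayload", {})
                .get("serviceData", {})
                .get("jobCompletedEvent", {})
                .get("job", {}))
    print("BigQuery job finished:", job.get("jobName"))

    # Launch the Dataflow template through the REST client library.
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={"jobName": "after-bq-load", "parameters": {}},
    ).execute()
```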
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, Composer executes each node of the DAG, and it also checks dependencies to ensure that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
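To make the pattern concrete, here is a hedged sketch of such a DAG, assuming the classic airflow.contrib operators and placeholder bucket, table and template names; it loads the GCS file into BigQuery and only then launches the Dataflow template:
```python
# Sketch of a Composer/Airflow DAG: GCS -> BigQuery load, then Dataflow.
# All names (bucket, objects, table, template path) are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

with DAG(
    dag_id="gcs_to_bq_then_dataflow",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
) as dag:

    load_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",                                   # placeholder
        source_objects=["incoming/*.csv"],                    # placeholder
        destination_project_dataset_table="my_dataset.my_table",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
    )

    run_dataflow = DataflowTemplateOperator(
        task_id="run_dataflow",
        template="gs://my-bucket/templates/my-template",      # placeholder
        parameters={},
    )

    # Airflow starts run_dataflow only after load_to_bq succeeds.
    load_to_bq >> run_dataflow
```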

Google Dataprep: Scheduling with updated data source

Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and pick up the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point of scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, what you describe is the pattern typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.

Do I need "RunAndBlock" for scheduled web jobs?

My intention is to run a 3-second WebJob every 5 minutes. What happens if I skip host.RunAndBlock?
If you just want a simple time-scheduled job, there is no need to use the WebJobs SDK at all, and hence no host at all. Just use a plain console app (it can be as simple as a one-line Main) and deploy it as a scheduled CRON WebJob. See https://learn.microsoft.com/en-us/azure/app-service/web-sites-create-web-jobs.
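As a minimal sketch of that idea (a WebJob can be any supported executable or script, so a Python file works as well as a console app; the schedule string and file names below are only examples), the job body and its CRON schedule could look like this:
```python
# run.py -- deployed as a scheduled WebJob; no WebJobs SDK host is involved.
# A settings.job file next to it holds the schedule, e.g. every 5 minutes
# in the seconds-first CRON format:  { "schedule": "0 */5 * * * *" }
import datetime

# The 3-second unit of work goes here; this placeholder just logs a timestamp.
print("Job ran at", datetime.datetime.utcnow().isoformat() + "Z")
```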

Scheduling Dataflow pipelines

I want to schedule a Google Dataflow job to run every hour.
I checked this URL https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
but I got many errors.
How can I achieve this?
From my perspective, using App Engine here is trying to repurpose a good tool for something different.
We opted to run our own CRON instance.
Please check whether such a case can be handled using Google Dataflow windowing with an unbounded source:
https://cloud.google.com/dataflow/model/windowing
https://cloud.google.com/dataflow/examples/gaming-example
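For illustration, a minimal streaming sketch of that idea using the Beam Python SDK (the Pub/Sub topic and the aggregation are placeholders): instead of re-running a batch job every hour, a single streaming pipeline groups an unbounded source into hourly windows.
```python
# Streaming sketch: read an unbounded source, window it hourly, count per window.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # unbounded source => streaming mode

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")  # placeholder
     | "HourlyWindows" >> beam.WindowInto(beam.window.FixedWindows(60 * 60))
     | "CountPerWindow" >> beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
     | "Log" >> beam.Map(print))  # replace with a real sink, e.g. BigQuery
```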
You can use Cloud Scheduler to run every hour and call a Cloud Function.
The Cloud Function then uses the Dataflow client API library to submit a Dataflow job.
Check this link: https://dzone.com/articles/triggering-dataflow-pipelines-with-cloud-functions
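A minimal sketch of that function, assuming an HTTP-triggered Cloud Function (Python runtime) that Cloud Scheduler calls every hour to launch a Dataflow template through the REST client library; the project, region, template and bucket names are placeholders:
```python
# HTTP-triggered Cloud Function: submits a Dataflow template job on each call.
from googleapiclient.discovery import build

PROJECT = "my-project"                                     # placeholder
REGION = "us-central1"                                     # placeholder
TEMPLATE = "gs://dataflow-templates/latest/Word_Count"     # example template

def launch_dataflow(request):
    dataflow = build("dataflow", "v1b3")
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": "hourly-job",
            "parameters": {
                "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
                "output": "gs://my-bucket/output/wordcount",  # placeholder bucket
            },
        },
    ).execute()
    return "Launched Dataflow job: " + response["job"]["id"]
```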

Job Scheduling in SAS Data Integration Studio

I want to schedule a job in SAS DIS. I tried the process using SAS Management Console, but an error pops up saying the scheduling server was not found.
Can anyone help me with how to set up a scheduling server? Or is it software that has to be installed?
Thanks
I think a scheduling server is an extra package that has to be purchased. Our BI setup lacks that option, and no matter what, we can't seem to get it approved. Check with your SAS server admin to see if job scheduling has been enabled; if so, he/she should be able to tell you the process for getting your job scheduled.
Alternatively, without a scheduling server you can still deploy your jobs and then use either
1. cron and crontab (on Unix or Linux), or
2. the Windows OS scheduler
to schedule jobs manually; this is the best option available if there is no scheduling server. I know this can be very tedious and cumbersome, but you can give it a try if you only have a small number of jobs to schedule.