GCP Dataflow Job Deployment - google-cloud-platform

I am trying to automate CI/CD of Classic Template.
Created and Staged template on GCS following the documentation
On code changes (Bug fixes etc), I intend to drain the existing job and create a new job with the same name.
To drain existing job, I need JOB_ID, but I have only JOB_NAME which I used during the creation of the job.
The only way i see is to use list command and fetch the active jobs, process the output to extract the job id to use it in the drain command. It seems to be quite a roundabout way. Isn't there a way to Drain a job with Job_Name or at least get JOB_ID from JOB_NAME.

When you use the gcloud dataflow jobs run command to create the job, the response from running this command should return the JOB_ID in the following way (e.g. if you create a batch job):
id: 2016-10-11_17_10_59-1234530157620696789
projectId: YOUR_PROJECT_ID
type: JOB_TYPE_BATCH
That and using the gcloud dataflow jobs list as you mention will be the straightforward way to associate a JOB_NAME and a JOB_ID using automation. The way to achieve this with a Python script is described within this other post in the community.

GCP provides REST API to update dataflow job. No need of explicitly draining the existing the job and creating a new job.
You can do it via Python Code too. Refer to my GIST for python code.

Related

How to get dataflow job id from inside that dataflow job - JAVA

In my current architecture, multiple dataflow jobs are triggered at various stages, as part of ABC framework, I need to capture the job id of those jobs as audit metrics inside the dataflow pipeline and update it in BigQuery.
How do I get the run id of dataflow job from the pipeline using JAVA?
Is there any existing method that I can use for that or do I need to use google cloud's client library inside the pipeline for that?
If you are submitting to dataflow, I believe this might work:
DataflowPipelineJob result = (DataflowPipelineJob)pipeline.run()
result.getJobId()
But you cannot access that within the pipeline itself afaik (DoFns etc).
The best way to ensure you know your job id/name, is to set it yourself. You can do this by setting --jobName and this is accessible via options.getJobName(), dataflow will use this. Note it must be unique.

Want to Trigger DataFlowpipeline with CloudFunctions

I have a scenario that we want to trigger a data-flow pipeline via cloud function, And in data Flow pipeline we have to transform some data and insert in big query
I had created our custom data_Flowpipeline and transformed the Data inserted in the big query (follow standard way of installing Apache beam and using Deployment command from cloud Shell) Pipeline ran successfully, log is showing in monitoring with DAG.
Now what I want do is to trigger the pipeline with cloud-function and for that I researched that
(i)we can create custom flex template of pipeline
(ii) stage it in google Bucket
(iii)Call it with REST-API from cloud function
is the mentioned step in second step is recommended way of doing it or should I try another approach? I don't get any other way apart from classic templates

It is possible to re-run a job in Google Cloud Dataflow after succeded

Maybe the question sounds stupid but I was wondering if once the job is successfully finished and having ID, is it possible to start the same job again?
Or is it necessary to create another one?
Because otherwise I would have the job with the same name throughout the list.
I just want to know if there is a way to restart it without recreating it again.
It's not possible to run the exact same job again, but you can create a new job with the same name that runs the same code. It will just have a different job ID and show up as a separate entry in the job list.
If you want to make running repeated jobs easier, you can create a template. This will let you create jobs from that template via a gcloud command instead of having to run your pipeline code.
Cloud Dataflow does have a re-start function. See SDK here. One suggested pattern (to help with deployment) is to create a template for the graph you want to repeatedly run AND execute the template.

Custom code containers for google cloud-ml for inference

I am aware that it is possible to deploy custom containers for training jobs on google cloud and I have been able to get the same running using command.
gcloud ai-platform jobs submit training infer name --region some_region --master-image-uri=path/to/docker/image --config config.yaml
The training job was completed successfully and the model was successfully obtained, Now I want to use this model for inference, but the issue is a part of my code has system level dependencies, so I have to make some modification into the architecture in order to get it running all the time. This was the reason to have a custom container for the training job in the first place.
The documentation is only available for the training part and the inference part, (if possible) with custom containers has not been explored to the best of my knowledge.
The training part documentation is available on this link
My question is, is it possible to deploy custom containers for inference purposes on google cloud-ml?
This response refers to using Vertex AI Prediction, the newest platform for ML on GCP.
Suppose you wrote the model artifacts out to cloud storage from your training job.
The next step is to create the custom container and push to a registry, by following something like what is described here:
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements
This section describes how you pass the model artifact directory to the custom container to be used for interence:
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#artifacts
You will also need to create an endpoint in order to deploy the model:
https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api#aiplatform_deploy_model_custom_trained_model_sample-gcloud
Finally, you would use gcloud ai endpoints deploy-model ... to deploy the model to the endpoint:
https://cloud.google.com/sdk/gcloud/reference/ai/endpoints/deploy-model

Way to trigger dataflow only after Big Query Job finished

actually the following steps to my data:
new objects in GCS bucket trigger a Google Cloud function that create a BigQuery Job to load this data to BigQuery.
I need low cost solution to know when this Big Query Job is finished and trigger a Dataflow Pipeline only after the job is completed.
Obs:
I know about BigQuery alpha trigger for Google Cloud Function but i
dont know if is a good idea,from what I saw this trigger uses the job
id, which from what I saw can not be fixed and whenever running a job
apparently would have to deploy the function again. And of course
it's an alpha solution.
I read about a Stackdriver Logging->Pub/Sub -> Google cloud function -> Dataflow solution, but i didn't find any log that
indicates that the job finished.
My files are large so isn't a good idea to use a Google Cloud Function to wait until the job finish.
Despite your mention about Stackdriver logging, you can use it with this filter
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add dataset filter in addition if needed.
Then create a sink into Function on this advanced filter and run your dataflow job.
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer which is managed Apache Airflow for orchestrating jobs in a sequential fashion. Composer creates a DAG and executes each node of the DAG and also checks for dependencies to ensure that things either run in parallel or sequentially based on the conditions that you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples