Run dataflow job from Compute Engine - google-cloud-platform

I am following the quickstart guide to run a Dataflow job:
https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven
It works fine when I run the mvn command from Google Cloud Shell.
mvn compile exec:java \
-Dexec.mainClass=com.example.WordCount \
-Dexec.args="--project=<my-cloud-project> \
--stagingLocation=gs://<my-wordcount-storage-bucket>/staging/ \
--output=gs://<my-wordcount-storage-bucket>/output \
--runner=DataflowRunner"
But when I try to launch a VM and run the command from it, I get a permission denied error.
If I give full API access to the VM, the command runs successfully.
What permissions should I give the VM to run a Dataflow job, or should I use a service account?
Can anybody tell me the best way to run Dataflow jobs in a production environment?
Regards,
pari

You will have to give the VM Dataflow Admin rights to run the Dataflow job. Additionally, if your Dataflow job involves BigQuery, you'll also have to provide the BigQuery Editor role, and so on.
You can also create a service account and grant it the required roles to run the job.
Hope this helps.

To provide granular access, you can take advantage of the Dataflow roles:
Developer. Executes and manipulates Dataflow jobs.
Viewer. Read-only access to Dataflow-related resources.
Worker. Provides the permissions for a service account to execute work units of a Dataflow pipeline.
When you have an automated app that needs to execute Dataflow jobs without user intervention, it is recommended to use a service account.
In fact, for a production environment my recommendation is to create an automated process that deploys and executes your pipeline. You can take advantage of Cloud Composer, which is based on Apache Airflow and can launch Dataflow jobs (Composer 1.0.0 or later, in supported Dataflow regions).

If you are using Airflow, create a service account with access to the components used by your Dataflow job and create a Connection in the Airflow UI with the required scopes. Once that is done, use DataflowJavaOperator/DataflowTemplateOperator to submit the job and orchestrate it via Airflow.
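For illustration, a DAG using DataflowTemplateOperator could look roughly like the sketch below. The template here is a Google-provided WordCount template, the project and bucket names reuse the placeholders from the question, and the import path is the Airflow 1.x contrib one used by older Composer environments (Airflow 2 provider packages expose an equivalent operator under airflow.providers.google.cloud.operators.dataflow):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

with DAG(
    dag_id="launch_dataflow_template",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    launch_wordcount = DataflowTemplateOperator(
        task_id="launch_wordcount",
        # Google-provided WordCount template; swap in your own template path.
        template="gs://dataflow-templates/latest/Word_Count",
        parameters={
            "inputFile": "gs://my-wordcount-storage-bucket/input.txt",
            "output": "gs://my-wordcount-storage-bucket/output",
        },
        dataflow_default_options={
            "project": "my-cloud-project",                       # placeholder project ID
            "tempLocation": "gs://my-wordcount-storage-bucket/tmp/",
        },
        gcp_conn_id="google_cloud_default",  # the Connection created in the Airflow UI
    )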
If you need further help, comment on this answer.

Related

Where to keep the Dataflow and Cloud composer python code?

It is probably a silly question. In my project we'll be using Dataflow and Cloud Composer. For that, I had asked for permission to create a VM instance in the GCP project to keep both the Dataflow and Cloud Composer Python programs. But the client asked me the reason for creating a VM instance and told me that you can execute Dataflow without a VM instance.
Is that possible? If yes, how do I achieve it? Can anyone please explain it? It would be really helpful to me.
You can run Dataflow pipelines or manage Composer environments from your own computer once your credentials are authenticated and you have both the Google Cloud SDK and the Dataflow Python library installed. However, this depends on how you want to manage your resources. I prefer to use a VM instance to have all the resources I use in the cloud, where it is easier to set up VPC networks that include different services. Also, saving data from a VM instance into GCS buckets is usually faster than from an on-premise computer/server.
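As a minimal sketch (assuming apache-beam[gcp] is installed locally and you have authenticated with Application Default Credentials; the project and bucket names are the placeholders from the question), running a pipeline on Dataflow from your own machine could look like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline options sent to the Dataflow service; placeholders throughout.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-cloud-project",
    region="us-central1",
    temp_location="gs://my-wordcount-storage-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "GroupAndSum" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
     | "Write" >> beam.io.WriteToText("gs://my-wordcount-storage-bucket/output/wordcount"))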

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of a cron job on DataProc or some documentation on DataProc working together with Scheduler.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster, while also handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is that Workflows are fire-and-forget and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it is to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is simply running the workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, called workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
Note that you can also copy the entire workflow content into the Body as a JSON payload; the last part of the URL would then become workflowTemplates:instantiateInline?alt=json
Check out this official doc that discusses other scheduling options.
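If you want to test the same HTTP call from code rather than from the Cloud Scheduler UI, a small sketch using Application Default Credentials could look like this (the project and template names are the ones from the steps above):

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Credentials picked up from gcloud auth application-default login or a service account.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    "https://dataproc.googleapis.com/v1/projects/example/regions/global/"
    "workflowTemplates/terasort-example:instantiate?alt=json"
)
response = session.post(url, json={})  # empty body, same as the Scheduler job
response.raise_for_status()
print(response.json())  # a long-running operation describing the workflow run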
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using a Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
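As a rough sketch of that approach (project, region, and template names are placeholders, and google-cloud-dataproc must be listed in the function's requirements.txt), a Pub/Sub-triggered Cloud Function could instantiate an existing workflow template like this:

from google.cloud import dataproc_v1

PROJECT = "my-project"        # placeholder
REGION = "us-central1"        # placeholder
TEMPLATE = "my-workflow"      # placeholder workflow template name


def start_workflow(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function."""
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    name = f"projects/{PROJECT}/regions/{REGION}/workflowTemplates/{TEMPLATE}"
    # Fire-and-forget: the returned operation tracks cluster create, job run, and delete.
    operation = client.instantiate_workflow_template(name=name)
    print(f"Started workflow instantiation: {operation.operation.name}")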

How to submit job on Dataproc cluster with specific service account?

I'm trying to execute jobs in the Dataproc cluster which access several resources of GCP like Google Cloud Storage.
My concern is that whatever file or object is created through my job is owned/created by the Dataproc default service account.
Example: 123456789-compute@developer.gserviceaccount.com.
Is there any way I can configure this user/service-account so that the object gets created by a given user/service-account instead of default one?
You can configure the service account used by a Dataproc cluster with the --service-account flag at cluster creation time.
The gcloud command would look like this:
gcloud dataproc clusters create cluster-name \
--service-account=your-service-account@project-id.iam.gserviceaccount.com
More details: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts
https://cloud.google.com/dataproc/docs/concepts/iam/iam
Note: it is better to have one Dataproc cluster per job so that each job gets an isolated environment, the jobs don't affect each other, and you can manage them better (in terms of security as well).
You can also look at Cloud Composer, which you can use to schedule and automate jobs.
Hope this helps.

How do I run a serverless batch job in Google Cloud

I have a batch job that takes a couple of hours to run. How can I run this in a serverless way on Google Cloud?
AppEngine, Cloud Functions, and Cloud Run are limited to 10-15 minutes. I don't want to rewrite my code in Apache Beam.
Is there an equivalent to AWS Batch on Google Cloud?
Note: Cloud Run and Cloud Functions can now last up to 60 minutes. The answer below remains a viable approach if you have a multi-hour job.
Vertex AI Training is serverless and long-lived. Wrap your batch processing code in a Docker container, push it to gcr.io, and then run:
gcloud ai custom-jobs create \
--region=LOCATION \
--display-name=JOB_NAME \
--worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
You can run any arbitrary Docker container — it doesn’t have to be a machine learning job. For details, see:
https://cloud.google.com/vertex-ai/docs/training/create-custom-job#create_custom_job-gcloud
Today you can also use Cloud Batch: https://cloud.google.com/batch/docs/get-started#create-basic-job
Google Cloud does not offer a comparable product to AWS Batch (see https://cloud.google.com/docs/compare/aws/#service_comparisons).
Instead, you'll need to use Cloud Tasks or Pub/Sub to delegate the work to another product, such as Compute Engine, but that approach is not truly "serverless".
Finally, Google released Cloud Batch (in Beta at the moment), which does exactly what you want.
You push jobs (containers or scripts) and it runs them. Simple as that.
https://cloud.google.com/batch/docs/get-started
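For illustration, submitting a simple script job with the google-cloud-batch Python client could look roughly like the sketch below; project, region, and job name are placeholders, and the field names follow Google's quickstart sample, so double-check them against the client version you install:

from google.cloud import batch_v1


def submit_batch_job(project_id: str, region: str, job_name: str):
    """Creates a minimal Cloud Batch job that runs a shell script."""
    client = batch_v1.BatchServiceClient()

    # The unit of work: a shell script (a container runnable works the same way).
    runnable = batch_v1.Runnable()
    runnable.script = batch_v1.Runnable.Script()
    runnable.script.text = "echo Hello from Cloud Batch && sleep 30"

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]
    task.max_retry_count = 2

    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    job = batch_v1.Job()
    job.task_groups = [group]
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    request = batch_v1.CreateJobRequest(
        parent=f"projects/{project_id}/locations/{region}",
        job_id=job_name,
        job=job,
    )
    return client.create_job(request)


# e.g. submit_batch_job("my-project", "us-central1", "hello-batch-job")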
This answer to How to make GCE instance stop when its deployed container finishes? will work for you as well:
In short:
First dockerize your batch process.
Then, create an instance:
Using a container-optimized image
And using a startup script that pulls your Docker image, runs it, and shuts down the machine at the end.
I have faced the same problem. In my case I went for:
Cloud Scheduler to start the job by publishing to Pub/Sub.
Pub/Sub triggers Cloud Functions.
Cloud Functions spins up a Compute Engine instance.
Compute Engine runs the batch workload and automatically kills the instance once it's done. You can read my post on Medium:
https://link.medium.com/1K3NsElGYZ
It might help you get started. There's also a follow up post showing how to use a Docker container inside the Compute Engine instance: https://medium.com/google-cloud/serverless-batch-workload-on-gcp-adding-docker-and-container-registry-to-the-mix-558f925e1de1
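As a rough sketch of the Cloud Functions step in that flow, in a simplified variant that starts a pre-created VM rather than creating one from scratch (project, zone, and instance name are placeholders; the VM is assumed to have a startup script that runs the workload and then powers the machine off), using the google-cloud-compute client:

from google.cloud import compute_v1


def start_batch_vm(event, context):
    """Pub/Sub-triggered Cloud Function entry point."""
    client = compute_v1.InstancesClient()
    operation = client.start(
        project="my-project",       # placeholder
        zone="us-central1-a",       # placeholder
        instance="batch-worker",    # placeholder, pre-created VM
    )
    print(f"Start requested: {operation.name}")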
You can use Cloud Run. At the time of writing this, the timeout of Cloud Run (fully managed) has been increased to 60 minutes, but this is still in beta.
https://cloud.google.com/run/docs/configuring/request-timeout
Important: Although Cloud Run (fully managed) has a maximum timeout of 60 minutes, only timeouts of 15 minutes or less are generally available: setting timeouts greater than 15 minutes is a Beta feature.
Another alternative for batch computing is using Google Cloud Life Sciences.
An example application using Cloud Life Sciences is dsub.
Or see the Cloud Life Sciences Quickstart documentation.
I found myself looking for a solution to this problem and built something similar to what mesmacosta has described in a different answer, in the form of a reusable tool called gcp-runbatch.
If you can package your workload into a Docker image then you can run it using gcp-runbatch. When triggered, it will do the following:
Create a new VM
On VM startup, docker run the specified image
When the docker run exits, the VM will be deleted
Some features that are supported:
Invoke batch workload from the command line, or deploy as a Cloud Function and invoke that way (e.g. to trigger batch workloads via Cloud Scheduler)
stdout and stderr will be piped to Cloud Logging
Environment variables can be specified by the invoker, or pulled from Secret Manager
Here's an example command line invocation:
$ gcp-runbatch \
--project-id=long-octane-350517 \
--zone=us-central1-a \
--service-account=1234567890-compute@developer.gserviceaccount.com \
hello-world
Successfully started instance runbatch-38408320. To tail batch logs run:
CLOUDSDK_PYTHON_SITEPACKAGES=1 gcloud beta --project=long-octane-350517
logging tail 'logName="projects/long-octane-350517/logs/runbatch" AND
resource.labels.instance_id="runbatch-38408320"' --format='get(text_payload)'
GCP launched its new "Batch" service in July '22. It's basically Compute Engine packaged with some utilities to easily productionize a batch job, including defining the required resources, the executables (script- or container-based), and a run schedule.
I haven't used it yet, but it seems like a great fit for batch jobs that take over 1 hour.

Running a code from an instance in Google Cloud Composer

I am new to Google Cloud Composer. I have some code on a Google Compute Engine instance,
for example: test.py
Currently I am using Jenkins as my scheduler, and I'm running the code like below:
echo "cd /home/user/src/digital_platform && /home/user/venvs/bdp/bin/python -m test.test.test" | ssh user@instance-dp
I want to run the same code from Google Cloud Composer.
How can I do that?
Basically I need to SSH to an instance in Google Cloud and run the code in an automated way using Cloud Composer.
It seems that SSHOperator might work for you. This operator is an Airflow feature, not a Cloud Composer feature per se.
The other operator that you might want to take a look at before making your final decision is BashOperator.
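For illustration, a DAG using SSHOperator could look roughly like this; ssh_to_dp_instance is a placeholder SSH Connection configured in the Airflow UI with the VM's host, user, and key, and the import path is the Airflow 1.x contrib one (Airflow 2 uses the SSH provider package instead):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG(
    dag_id="run_test_on_vm",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_test = SSHOperator(
        task_id="run_test",
        ssh_conn_id="ssh_to_dp_instance",  # placeholder Connection ID
        # The same command currently run from Jenkins.
        command=(
            "cd /home/user/src/digital_platform && "
            "/home/user/venvs/bdp/bin/python -m test.test.test"
        ),
    )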
You need to create a DAG (workflow); Cloud Composer schedules only the DAGs that are in the DAGs folder in the environment's Cloud Storage bucket. Each Cloud Composer environment has a web server that runs the Airflow web interface, which you can use to manage DAGs.
The BashOperator is useful for running command-line programs. I suggest you follow the Cloud Composer Quickstart, which shows you how to create a Cloud Composer environment in the Google Cloud Console and run a simple Apache Airflow DAG.