How to submit a job on a Dataproc cluster with a specific service account? - google-cloud-platform

I'm trying to execute jobs on a Dataproc cluster that access several GCP resources, such as Google Cloud Storage.
My concern is that whatever file or object my job creates is owned/created by the Dataproc default service account.
Example: 123456789-compute@developer.gserviceaccount.com.
Is there any way I can configure this so that objects get created by a given service account instead of the default one?

You can configure the service account used by a Dataproc cluster with the --service-account flag at cluster creation time.
The gcloud command would look like:
gcloud dataproc clusters create cluster-name \
--service-account=your-service-account@project-id.iam.gserviceaccount.com
More details: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts
https://cloud.google.com/dataproc/docs/concepts/iam/iam
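Once the cluster runs as that service account, any GCS objects your job writes are created by it. As a rough sketch (the bucket, job file, cluster name, and region below are placeholders):
# Submit a job to the cluster created above; objects the job writes to GCS
# are created by the cluster's service account, not the default one
gcloud dataproc jobs submit pyspark gs://my-example-bucket/my_job.py \
  --cluster=cluster-name \
  --region=us-central1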
Note: it is better to have one Dataproc cluster per job so that each job gets an isolated environment, jobs don't affect each other, and you can manage them better (in terms of security as well).
You can also look at Cloud Composer, which you can use to schedule and automate jobs.
Hope this helps.

Related

Dataproc cluster underlying VMs using default service account

I created a Dataproc cluster using a service account via a Terraform script. The cluster has 1 master and 2 workers, so three Compute Engine instances were created as part of this cluster creation. My question is:
Why do these VMs have the default service account? Shouldn't they use the same service account that I used to create the Dataproc cluster?
Edit: removed one question as suggested in a comment (the topic became too broad).
The service account used by the cluster VMs is the one you specify at cluster creation time (the --service-account flag shown in the answer above). If the VMs still use the default service account, it is probably a mistake in the Terraform script. You can test with gcloud, without Terraform, to confirm.
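In Terraform this typically corresponds to the service_account field under gce_cluster_config in the google_dataproc_cluster resource. A minimal gcloud check, with all names, the region, and the zone below being placeholders (Dataproc normally names the master VM <cluster-name>-m):
# Create the cluster with an explicit service account, without Terraform
gcloud dataproc clusters create test-cluster \
  --region=us-central1 \
  --service-account=my-sa@my-project.iam.gserviceaccount.com
# Confirm the service account recorded on the cluster
gcloud dataproc clusters describe test-cluster --region=us-central1 \
  --format="value(config.gceClusterConfig.serviceAccount)"
# Confirm the service account actually attached to one of the cluster VMs
gcloud compute instances describe test-cluster-m --zone=us-central1-a \
  --format="value(serviceAccounts[].email)"
If both commands report your service account, the cluster itself is fine and the issue is in the Terraform configuration.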

What is the difference between gcloud and gsutil?

I want to know the difference between gcloud and gsutil. Where do we use which? Why do certain commands begin with gsutil while others begin with gcloud?
The gsutil command is used only for Cloud Storage.
With the gcloud command, you can interact with other Google Cloud products like App Engine, Google Kubernetes Engine, etc. You can have a look at the gcloud and gsutil documentation for more info.
gsutil is a Python application that lets you access Google Cloud Storage from the command line. You can use gsutil to do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects.
Moving, copying, and renaming objects.
Editing object and bucket ACLs.
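For instance, assuming a hypothetical bucket and file (all names below are placeholders):
# Create a bucket, upload a file, list, rename, adjust its ACL, then clean up
gsutil mb gs://my-example-bucket
gsutil cp ./data.csv gs://my-example-bucket/data.csv
gsutil ls gs://my-example-bucket
gsutil mv gs://my-example-bucket/data.csv gs://my-example-bucket/data-renamed.csv
gsutil acl ch -u AllUsers:R gs://my-example-bucket/data-renamed.csv
gsutil rm gs://my-example-bucket/data-renamed.csv
gsutil rb gs://my-example-bucket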
The gcloud command-line interface is the primary CLI tool to create and manage Google Cloud resources. You can use this tool to perform many common platform tasks either from the command line or in scripts and other automations.
For example, you can use the gcloud CLI to create and manage:
Google Compute Engine virtual machine instances and other resources,
Google Cloud SQL instances,
Google Kubernetes Engine clusters,
Google Cloud Dataproc clusters and jobs,
Google Cloud DNS managed zones and record sets,
Google Cloud Deployment Manager deployments.
"gcloud" can create and manage Google Cloud resources while "gsutil" cannot do so.
"gsutil" can manipulate buckets, bucket's objects and bucket ACLs on GCS(Google Cloud Storage) while "gcloud" cannot do so.
With gcloud storage you can do now everything what you can do with gsutil. Look here: https://cloud.google.com/blog/products/storage-data-transfer/new-gcloud-storage-cli-for-your-data-transfers and also ACL on objects: https://cloud.google.com/sdk/gcloud/reference/storage/objects/update
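As a quick illustration of that overlap (the bucket name is a placeholder), these pairs do the same thing:
# List objects in a bucket
gsutil ls gs://my-example-bucket
gcloud storage ls gs://my-example-bucket
# Upload a local file
gsutil cp ./data.csv gs://my-example-bucket/
gcloud storage cp ./data.csv gs://my-example-bucket/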

How to build Google Cloud dataproc edge node?

We are moving from an on-premises environment to Google Cloud Dataproc for Spark jobs. I am able to build the cluster and SSH to the master node for job execution. I am not clear on how to build the edge node where we can allow users to log in and submit jobs. Is it going to be another GCE VM? Any thoughts or best practices?
A new VM instance is a good option to map the EdgeNode role from other architectures:
You can execute your job from the master node, which you can make accessible through SSH.
You will need to find a balance between simplicity (SSH to the master) and security (a dedicated edge node).
Please note that IAM can help you allow individual users to submit jobs by assigning them the Dataproc Editor role.
Don't forget that Dataproc offers the ability to create ephemeral clusters: you create a cluster, execute your job, and delete the cluster.
Using ephemeral clusters avoids unnecessary costs. The script you create for that can even be executed from any machine that has the Google Cloud SDK installed, e.g. on-prem servers or your PC.
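For illustration, a minimal sketch of that create/submit/delete flow, assuming a hypothetical PySpark job already staged in a bucket (all names and the region are placeholders):
# Create a short-lived cluster
gcloud dataproc clusters create ephemeral-cluster \
  --region=us-central1 --single-node
# Submit the job and wait for it to finish
gcloud dataproc jobs submit pyspark gs://my-example-bucket/job.py \
  --cluster=ephemeral-cluster --region=us-central1
# Delete the cluster once the job is done
gcloud dataproc clusters delete ephemeral-cluster \
  --region=us-central1 --quiet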

Run dataflow job from Compute Engine

I am following the quickstart link to run a Dataflow job:
https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven
It works fine when I run the mvn command from Google Cloud Shell.
mvn compile exec:java \
-Dexec.mainClass=com.example.WordCount \
-Dexec.args="--project=<my-cloud-project> \
--stagingLocation=gs://<my-wordcount-storage-bucket>/staging/ \
--output=gs://<my-wordcount-storage-bucket>/output \
--runner=DataflowRunner"
But when I launch a VM and run the command from it, I get a permission denied error.
If I give full API access to the VM, the command runs successfully.
What permissions should I give the VM to run a Dataflow job, or should I use a service account?
Can anybody tell me the best way to run Dataflow jobs in a production environment?
Regards,
pari
You will have to give the VM's service account the Dataflow Admin role to run the Dataflow job. Additionally, if your Dataflow job involves BigQuery, you'll have to grant the BigQuery Editor role, and so on.
You can also create a service account and provide the required roles to run the job.
Hope this helps.
To provide granular access, you can take advantage of Dataflow Roles:
Developer. Executes and manipulates Dataflow jobs.
Viewer. Read-only access to Dataflow-related resources.
Worker. Provides the permissions for a service account to execute work units for a Dataflow pipeline.
When you have an automated app that needs to execute Dataflow jobs without user intervention, it is recommended to use a service account.
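A rough sketch of that setup, assuming a hypothetical service account named dataflow-runner in a project my-project-id (adjust the roles to what your pipeline actually needs):
# Create the service account
gcloud iam service-accounts create dataflow-runner \
  --display-name="Dataflow runner"
# Let it create and execute Dataflow jobs
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:dataflow-runner@my-project-id.iam.gserviceaccount.com" \
  --role="roles/dataflow.developer"
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:dataflow-runner@my-project-id.iam.gserviceaccount.com" \
  --role="roles/dataflow.worker"
# Access to the staging and output buckets
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:dataflow-runner@my-project-id.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"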
In fact, for deploying in a production environment, the recommendation I can suggest is to create an automated process that deploys and executes your pipeline. You can take advantage of Cloud Composer, which is based on Apache Airflow and can launch Dataflow jobs (Composer 1.0.0 or later in supported Dataflow regions).
If you are using Airflow, create a service account with access to the components used by your Dataflow pipeline and create a Connection in the Airflow UI with the required scopes. Once done, use DataflowJavaOperator/DataflowTemplateOperator to submit the job and orchestrate it via Airflow.
If you need further help, comment on this answer.

Google Dataproc cluster created via HTTP but not listed in SDK or viewer

I am currently working on Google Cloud Platform to run Spark jobs in the cloud. To do so, I am planning to use Google Cloud Dataproc.
Here's the workflow I am automating:
Upload a CSV file to Google Cloud Storage, which will be the input of my Spark job
On upload, trigger a Google Cloud Function which should create the cluster, submit a job, and shut down the cluster through the HTTP API available for Dataproc
I am able to create a cluster from my Google Cloud Function using the Google APIs Node.js client (http://google.github.io/google-api-nodejs-client/latest/dataproc.html). But the problem is that I cannot see this cluster in the Dataproc cluster viewer or even by using the gcloud SDK: gcloud dataproc clusters list.
However, I am able to see my newly created cluster in the Google API Explorer: https://developers.google.com/apis-explorer/#p/dataproc/v1/dataproc.projects.regions.clusters.list.
Note that I am creating my cluster in the current project.
What could I possibly be doing wrong that I am not able to see that cluster when listing with the gcloud SDK?
Thank you in advance for your help.
Regards.
I bet it has to do with the "region" field. Out of the box, the Cloud SDK defaults to the "global" region [1]. Try using Dataproc Cloud SDK commands with the --region flag (e.g., gcloud dataproc clusters list --region=REGION).
[1] https://cloud.google.com/dataproc/docs/concepts/regional-endpoints
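For example, with a placeholder region:
# List clusters in the region where the cluster was actually created
gcloud dataproc clusters list --region=us-east1
# Optionally set a default Dataproc region so future commands don't need the flag
gcloud config set dataproc/region us-east1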