I have a Dataflow job that is written using Apache Beam. It looks similar to this template, but it saves data from JDBC to Cloud Storage:
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/JdbcToBigQuery.java
My problem was that anybody could see the database credentials in the Dataflow UI. So I found this article:
https://medium.com/google-cloud/using-google-cloud-key-management-service-with-dataflow-templates-71924f0f841f
which shows how to encrypt this data. I did everything as described in the article, but my Dataflow job fails to decrypt the credentials with the given KMS key (when I run it from a Cloud Function).
So I tried running it in Cloud Shell:
gcloud dataflow jobs run JOB_NAME \
--region=us-west1 \
--gcs-location=TEMPLATE_LOCATION \
--dataflow-kms-key=projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME \
--parameters=...,KMSEncryptionKey=projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME,...
But I get this error:
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Permission 'cloudkms.cryptoKeyVersions.useToDecrypt' denied on resource 'projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME' (or it may not exist).
I am completely stuck. Has anyone had the same problem and could help?
You need to make sure that you have assigned the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataflow service account and also the Compute Engine service account.
Refer to this document, Cloud Key Management Service (Cloud KMS) encryption key with Dataflow
If you are using Cloud Functions, you might also need to grant the Google Cloud Functions service agent permission to encrypt and decrypt with KMS.
Ensure the user that is calling the encrypt and decrypt methods has
the cloudkms.cryptoKeyVersions.useToEncrypt and
cloudkms.cryptoKeyVersions.useToDecrypt permissions on the key used to
encrypt or decrypt.
One way to permit a user to encrypt or decrypt is to add the user to
the roles/cloudkms.cryptoKeyEncrypter,
roles/cloudkms.cryptoKeyDecrypter, or
roles/cloudkms.cryptoKeyEncrypterDecrypter IAM roles for that key.
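As a rough sketch of the role grant described above, the following Python snippet only assembles the gcloud kms keys add-iam-policy-binding commands for the two accounts; it does not run them. The project number and the helper name kms_binding_commands are placeholders of mine, and the service account addresses assume the usual naming pattern for the Dataflow service agent and the Compute Engine default service account, so check them against your project's IAM page before using them.

```python
import shlex


def kms_binding_commands(project_id, project_number, location, key_ring, key_name):
    """Build (not run) gcloud commands granting the KMS role on one key."""
    members = [
        # Dataflow service agent (assumed default naming pattern)
        f"service-{project_number}@dataflow-service-producer-prod.iam.gserviceaccount.com",
        # Compute Engine default service account (assumed to be the worker account)
        f"{project_number}-compute@developer.gserviceaccount.com",
    ]
    commands = []
    for member in members:
        commands.append([
            "gcloud", "kms", "keys", "add-iam-policy-binding", key_name,
            "--project", project_id,
            "--location", location,
            "--keyring", key_ring,
            "--member", f"serviceAccount:{member}",
            "--role", "roles/cloudkms.cryptoKeyEncrypterDecrypter",
        ])
    return commands


# 123456789012 is a made-up project number used only for illustration.
commands = kms_binding_commands(
    "PROJECT_ID", "123456789012", "us-west1", "KEY_RING", "KEY_NAME"
)
for cmd in commands:
    print(shlex.join(cmd))
```

If the job runs with a custom worker service account, that account would take the place of the Compute Engine default one in the list.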
Also make sure the parameters passed are correct:
PYTHON
python -m apache_beam.examples.wordcount \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://STORAGE_BUCKET/counts \
--runner DataflowRunner \
--project PROJECT_ID \
--temp_location gs://STORAGE_BUCKET/tmp/ \
--dataflow_kms_key=KMS_KEY
JAVA
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=gs://dataflow-samples/shakespeare/kinglear.txt \
--output=gs://STORAGE_BUCKET/counts \
--runner=DataflowRunner --project=PROJECT_ID \
--gcpTempLocation=gs://STORAGE_BUCKET/tmp \
--dataflowKmsKey=KMS_KEY" \
-Pdataflow-runner
Specifying pipeline execution parameters
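One way to sanity-check the key parameter before launching is to parse the resource name and compare its location with the job region. This is only a sketch: parse_kms_key and check_key_for_region are hypothetical helpers, and the region comparison rests on my assumption that Dataflow expects a regional key in the same region as the job.

```python
import re

KEY_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<key_ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)$"
)


def parse_kms_key(name):
    """Split a full Cloud KMS key resource name into its parts."""
    match = KEY_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a full Cloud KMS key resource name: {name!r}")
    return match.groupdict()


def check_key_for_region(name, job_region):
    """Return a list of human-readable problems (empty if none were found)."""
    parts = parse_kms_key(name)
    problems = []
    # Assumption: the key location should equal the Dataflow job region.
    if parts["location"] != job_region:
        problems.append(
            f"key location {parts['location']!r} differs from job region {job_region!r}"
        )
    return problems


key = "projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME"
print(parse_kms_key(key))
print(check_key_for_region(key, "us-west1"))
```

A malformed name (for example just KEY_NAME) raises ValueError, which helps rule out the "or it may not exist" half of the error message.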
We developed a Cloud Function based on the given requirements and did initial verification with 1st gen, which went fine. However, a few modifications were needed that required additional processing time, so we had to switch to --gen2.
Below is the gcloud functions deploy command.
gcloud functions deploy gen2-function \
--entry-point gen2 --runtime python37 --trigger-http --allow-unauthenticated \
--service-account=<> --region=<> --project=<> --timeout=3600s --gen2
This command deploys the function, and internally the Cloud Run service, successfully, but it fails at the end with the error below:
[INFO] A new revision will be deployed serving with 100% traffic.
ERROR: (gcloud.functions.deploy) PERMISSION_DENIED: Permission 'run.services.setIamPolicy' denied on resource 'projects/<project>/locations/<region>/services/gen2-function' (or resource may not exist).
When we checked in Cloud Run, the service "gen2-function" does exist.
Can someone advise on this?
The account deploying the function is missing the permission run.services.setIamPolicy. That permission is required to specify the command argument --allow-unauthenticated.
That permission is in the role roles/run.admin.
Refer to this documentation on how to add a role to the account:
Grant a single role using the GUI
Grant or revoke multiple roles
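For the command-line route, a minimal sketch of the grant could look like this. It only builds the gcloud projects add-iam-policy-binding command; grant_run_admin_command, the project ID, and the deployer address are hypothetical names chosen for illustration.

```python
import shlex


def grant_run_admin_command(project_id, member):
    """Build (not run) the gcloud command that grants roles/run.admin."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        "--member", member,
        "--role", "roles/run.admin",
    ]


# Hypothetical deployer identity; use "user:..." for a human account.
cmd = grant_run_admin_command(
    "my-project", "serviceAccount:deployer@my-project.iam.gserviceaccount.com"
)
print(shlex.join(cmd))
```

roles/run.admin is broad, so a narrower custom role that includes run.services.setIamPolicy may be preferable where policy requires it.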
I am currently trying to provision a GCE instance that will execute a Docker container in order to retrieve some information from the web and push it to BigQuery.
Now, the newly created service account (screenshot below) doesn't affect the API scopes whatsoever. This obviously makes the container fail when authenticating to BQ. Funny thing is, when I use the GCE default service account and select auth scopes manually from the GUI, everything works like a charm.
I am failing to understand why the following service account doesn't open API auth scopes to the machine. I might be overlooking something really simple on this one.
Context
The virtual machine is created and run with the following gcloud command:
#!/bin/sh
gcloud compute instances create-with-container gcp-scrape \
--machine-type="e2-micro" \
--boot-disk-size=10 \
--container-image="gcr.io/my_project/gcp_scrape:latest" \
--container-restart-policy="on-failure" \
--zone="us-west1-a" \
--service-account gcp-scrape@my_project.iam.gserviceaccount.com \
--preemptible
This is how BigQuery errors out when using my custom service account:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform read scope.
You haven't specified a --scopes flag, so the instance gets the default scopes, which don't include BigQuery.
To let the instance access all services that the service account can access, add --scopes https://www.googleapis.com/auth/cloud-platform to your command line.
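To confirm which scopes the VM actually received, one option is to ask the metadata server from inside the instance. The sketch below is illustrative only: fetch_scopes and has_bigquery_scope are hypothetical helper names, the sample scope list is made up rather than captured from the machine in the question, and only two BigQuery-capable scopes are checked.

```python
import urllib.request

METADATA_SCOPES_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes"
)
# Two scopes that cover BigQuery; this set is not meant to be exhaustive.
BIGQUERY_SCOPES = {
    "https://www.googleapis.com/auth/bigquery",
    "https://www.googleapis.com/auth/cloud-platform",
}


def fetch_scopes():
    """Read the VM's granted scopes; only works from inside a GCE instance."""
    request = urllib.request.Request(
        METADATA_SCOPES_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


def has_bigquery_scope(scopes_text):
    """Check newline-separated scope URLs for a BigQuery-capable scope."""
    granted = {line.strip() for line in scopes_text.splitlines() if line.strip()}
    return bool(granted & BIGQUERY_SCOPES)


# Illustrative sample of what a VM created without --scopes might report.
sample = (
    "https://www.googleapis.com/auth/devstorage.read_only\n"
    "https://www.googleapis.com/auth/logging.write\n"
)
print(has_bigquery_scope(sample))
print(has_bigquery_scope(sample + "https://www.googleapis.com/auth/cloud-platform\n"))
```

On the instance itself, has_bigquery_scope(fetch_scopes()) would perform the same check against the real scope list.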
I am trying to create a Dataproc cluster with a service account via the Cloud SDK. It throws an error saying that compute.projects.get is denied. The service account has the Compute Viewer, Compute Instance Admin, and Dataproc Editor roles, so I can't understand why this error occurs. In the IAM policy troubleshooter, I confirmed that dataproc.clusters.create is assigned to the service account.
The command is:
gcloud dataproc clusters create cluster-dqm01 \
--region europe-west2 \
--zone europe-west2-b \
--subnet dataproc-standalone-paasonly-europe-west2 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.3-deb9 \
--project xxxxxx \
--service-account xxxx.iam.gserviceaccount.com
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Required 'compute.projects.get' permission for 'projects/xxxxxx'
The project is correct: I also tried creating the cluster from the console and got the same error, and I generated the gcloud command from the console to run it with a service account. This is the first time a Dataproc cluster is being created for the project.
If you had assigned the various permissions to the same service account you're specifying with --service-account, the issue is that you probably meant to specify --impersonate-service-account instead.
There are three identities that are relevant here:
1. The identity issuing the CreateCluster command - this is often a human identity, but if you're automating things, using --impersonate-service-account, or running the command from inside another GCE VM, it may be a service account itself.
2. The "Control plane" identity - this is what the Dataproc backend service uses to actually create VMs.
3. The "Data plane" identity - this is what the Dataproc workers behave as when processing data.
Typically, #1 and #2 need the various "compute" permissions and some minimal GCS permissions. #3 typically just needs GCS and optionally BigQuery, CloudSQL, Bigtable, etc. permissions depending on what you're actually processing.
See https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals for more in-depth explanation of these identities.
It also lists the pre-existing curated roles to make this all easy (and typically, "default" project settings will automatically have the correct roles already so that you don't have to worry about it). Basically, the "human identity" or the service account you use with --impersonate-service-account needs Dataproc Editor or Project Editor roles, the "control plane identity" needs Dataproc Service Agent, and the "data plane identity" needs Dataproc Worker.
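The mapping between the three identities and their curated roles can be written down as a small table in code. This is a sketch under assumptions: the project ID, project number, and caller address are invented, the service agent address follows the usual Dataproc naming pattern, and the data plane entry assumes the Compute Engine default service account (it would be whatever you pass to --service-account otherwise).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    label: str
    member: str  # placeholder address, not taken from the question
    role: str


def dataproc_identities(project_number, caller):
    """Map the three identities described above to their curated roles."""
    return [
        Identity("caller", caller, "roles/dataproc.editor"),
        Identity(
            "control plane",
            f"serviceAccount:service-{project_number}"
            "@dataproc-accounts.iam.gserviceaccount.com",
            "roles/dataproc.serviceAgent",
        ),
        Identity(
            "data plane",
            f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com",
            "roles/dataproc.worker",
        ),
    ]


PROJECT = "my-project"  # hypothetical project ID
identities = dataproc_identities("123456789012", "user:me@example.com")
for identity in identities:
    # Printed for review only; nothing is executed here.
    print(
        f"gcloud projects add-iam-policy-binding {PROJECT} "
        f"--member {identity.member} --role {identity.role}"
    )
```

Comparing this list with the project's actual IAM bindings is a quick way to spot which of the three identities is missing its role.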
I am unable to non-interactively activate my Google Cloud service account, even after reading several SO threads.
Creating a service account
gcloud iam service-accounts create my-awesome-acct ...
Creating a role for the service account
gcloud iam roles create AwesomeRole \
--permissions storage.objects.create,storage.objects.delete ....
Generating the keys
gcloud iam service-accounts keys create ~/awesome-key.json ...
Activating the service account
gcloud auth activate-service-account my-awesome-acct ~/awesome-key.json
My Issue
Even after following the above steps, when I run gsutil ... commands, I still get the error message:
$ gsutil cp my_file.tgz gs://my_bucket
Copying file://my_file.tgz [Content-Type=application/x-tar]...
Your credentials are invalid. Please run
$ gcloud auth login
The only way I could get this to work is to actually run gcloud auth login and allow the authentication in a web browser.
Am I doing something wrong? Or is this intended for every service account?
I'm going to answer my own question here.
My Solution
Instead of using gsutil, I decided to use the Google Cloud Client Libraries.
What I did:
gsutil cp my_file.tgz gs://my_bucket
What I am doing now:
import os
# the older "gcloud" package now ships as google-cloud-storage
from google.cloud import storage
# key file is located in my current directory; point the client library at it
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS', 'gcloud-auth.json')
client = storage.Client()
bucket = client.get_bucket("my_bucket")
blob = bucket.blob("my_file.tgz")
blob.upload_from_filename("my_file.tgz")
Hindsight 20/20
After getting the above solution working, it seems if I also set the environment variable, GOOGLE_APPLICATION_CREDENTIALS, my gsutil should've worked too. (untested)
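One more thing that may be worth checking in the activation step: gcloud auth activate-service-account takes the key through --key-file and identifies the account by its full email address, whereas the command in the question passes the key path positionally. A minimal sketch that derives the command from the key file itself; activation_command is a hypothetical helper, and the key contents written below are dummy values used only to exercise it.

```python
import json
import os
import shlex
import tempfile


def activation_command(key_path):
    """Build (not run) a gcloud command that activates a service account key."""
    with open(key_path, encoding="utf-8") as handle:
        key = json.load(handle)
    if key.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key file")
    return [
        "gcloud", "auth", "activate-service-account", key["client_email"],
        f"--key-file={key_path}",
    ]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "awesome-key.json")
    # Dummy key with only the two fields this sketch reads.
    dummy = {
        "type": "service_account",
        "client_email": "my-awesome-acct@my-project.iam.gserviceaccount.com",
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dummy, handle)
    command = activation_command(path)

print(shlex.join(command))
```

Whether this resolves the gsutil error also depends on which gsutil is on the PATH, since a standalone install keeps its own credentials.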
I'm running a PySpark application through Google's Dataproc on a cluster I created. In one stage, the application needs to access a directory in an Amazon S3 bucket. At that stage, I get the error:
AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I logged onto the headnode of the cluster and set the /etc/boto.cfg with my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY information, but that didn't solve the access issue.
(1) Any other suggestions for how to access AWS S3 from a dataproc cluster?
(2) Also, what is the name of the user that dataproc uses to access the cluster? If I knew that, I could set the ~/.aws directory on the cluster for that user.
Thanks.
Since you're using the Hadoop/Spark interfaces (like sc.textFile), everything should indeed be done through the fs.s3.* or fs.s3n.* or fs.s3a.* keys rather than trying to wire through any ~/.aws or /etc/boto.cfg settings. There are a few ways you can plumb those settings through to your Dataproc cluster:
At cluster creation time:
gcloud dataproc clusters create --properties \
core:fs.s3.awsAccessKeyId=<s3AccessKey>,core:fs.s3.awsSecretAccessKey=<s3SecretKey> \
--num-workers ...
The core prefix here indicates you want the settings to be placed in the core-site.xml file, as explained in the Cluster Properties documentation.
Alternatively, at job-submission time, if you're using Dataproc's APIs:
gcloud dataproc jobs submit pyspark --cluster <your-cluster> \
--properties spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey>,spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey> \
...
In this case, we're passing the properties through as Spark properties, and Spark provides a handy mechanism to define "hadoop" conf properties as a subset of Spark conf, simply using the spark.hadoop.* prefix. If you're submitting at the command line over SSH, this is equivalent to:
spark-submit --conf spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey> \
--conf spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey>
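The two prefix conventions differ only in how each key is decorated, which a few lines of Python can make explicit. This is a sketch for generating the --properties value; format_properties is a hypothetical helper, and the placeholder credentials are the same ones used in the commands above.

```python
def format_properties(prefix, separator, settings):
    """Join settings into the comma-separated value that --properties expects."""
    return ",".join(
        f"{prefix}{separator}{name}={value}" for name, value in settings.items()
    )


s3_settings = {
    "fs.s3.awsAccessKeyId": "<s3AccessKey>",
    "fs.s3.awsSecretAccessKey": "<s3SecretKey>",
}

# Cluster creation: the "core:" prefix routes the keys into core-site.xml.
cluster_properties = format_properties("core", ":", s3_settings)
# Job submission: "spark.hadoop." forwards them to the Hadoop conf via Spark.
job_properties = format_properties("spark.hadoop", ".", s3_settings)

print(cluster_properties)
print(job_properties)
```

Values containing commas would need gcloud's alternate delimiter syntax, which this helper does not attempt to handle.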
Finally, if you want to set it up at cluster creation time but prefer not to have your access keys explicitly set in your Dataproc metadata, you might opt to use an initialization action instead. There's a handy tool called bdconfig that should be present on the path with which you can modify XML settings easily:
#!/bin/bash
# Create this shell script, name it something like init-aws.sh
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsAccessKeyId' \
--value '<s3AccessKey>' \
--clobber
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsSecretAccessKey' \
--value '<s3SecretKey>' \
--clobber
Upload that to a GCS bucket somewhere, and use it at cluster creation time:
gsutil cp init-aws.sh gs://<your-bucket>/init-aws.sh
gcloud dataproc clusters create --initialization-actions \
gs://<your-bucket>/init-aws.sh
While Dataproc metadata is indeed encrypted at rest and heavily secured just like any other user data, using the init action instead helps prevent inadvertently showing your access key/secret for example to someone standing behind your screen when viewing your Dataproc cluster properties.
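If bdconfig is not available on an image, the same edit can be approximated with the Python standard library. The sketch below is a stand-in, not bdconfig itself: set_hadoop_property is a hypothetical helper that sets or overwrites one property in a Hadoop-style configuration file, and it is demonstrated on a temporary file rather than the real /etc/hadoop/conf/core-site.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def set_hadoop_property(config_path, name, value):
    """Set or overwrite one <property> in a Hadoop *-site.xml file."""
    tree = ET.parse(config_path)
    root = tree.getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Existing entry: overwrite its value (similar in spirit to --clobber).
            value_node = prop.find("value")
            if value_node is None:
                value_node = ET.SubElement(prop, "value")
            value_node.text = value
            break
    else:
        # No entry with that name yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(config_path, encoding="utf-8", xml_declaration=True)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "core-site.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<configuration></configuration>")
    set_hadoop_property(path, "fs.s3.awsAccessKeyId", "<s3AccessKey>")
    set_hadoop_property(path, "fs.s3.awsAccessKeyId", "<rotatedKey>")
    properties = {
        p.findtext("name"): p.findtext("value")
        for p in ET.parse(path).getroot().findall("property")
    }

print(properties)
```

Note that ElementTree drops XML comments when it rewrites the file, so any license header in the original would be lost.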
You can try setting the AWS config while initializing the SparkContext.
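from pyspark import SparkConf, SparkContext
conf = SparkConf()  # replace with your own SparkConf
sc = SparkContext(conf=conf)
# PySpark reaches the Hadoop configuration through the underlying Java context
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<s3AccessKey>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<s3SecretKey>")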