Reading S3 data from Google's dataproc - amazon-web-services

I'm running a PySpark application through Google's Dataproc on a cluster I created. In one stage, the application needs to access a directory in an Amazon S3 bucket. At that stage, I get the error:
AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I logged onto the head node of the cluster and populated /etc/boto.cfg with my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but that didn't solve the access issue.
(1) Any other suggestions for how to access AWS S3 from a dataproc cluster?
(2) Also, what is the name of the user that dataproc uses to access the cluster? If I knew that, I could set the ~/.aws directory on the cluster for that user.
Thanks.

Since you're using the Hadoop/Spark interfaces (like sc.textFile), everything should indeed be done through the fs.s3.* or fs.s3n.* or fs.s3a.* keys rather than trying to wire through any ~/.aws or /etc/boto.cfg settings. There are a few ways you can plumb those settings through to your Dataproc cluster:
At cluster creation time:
gcloud dataproc clusters create <your-cluster> --properties \
core:fs.s3.awsAccessKeyId=<s3AccessKey>,core:fs.s3.awsSecretAccessKey=<s3SecretKey> \
--num-workers ...
The core prefix here indicates you want the settings to be placed in the core-site.xml file, as explained in the Cluster Properties documentation.
Alternatively, at job-submission time, if you're using Dataproc's APIs:
gcloud dataproc jobs submit pyspark --cluster <your-cluster> \
--properties spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey>,spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey> \
...
In this case, we're passing the properties through as Spark properties, and Spark provides a handy mechanism to define "hadoop" conf properties as a subset of Spark conf, simply using the spark.hadoop.* prefix. If you're submitting at the command line over SSH, this is equivalent to:
spark-submit --conf spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey> \
--conf spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey>
Finally, if you want to set it up at cluster creation time but prefer not to have your access keys explicitly set in your Dataproc metadata, you might opt to use an initialization action instead. There's a handy tool called bdconfig that should be present on the path with which you can modify XML settings easily:
#!/bin/bash
# Create this shell script, name it something like init-aws.sh
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsAccessKeyId' \
--value '<s3AccessKey>' \
--clobber
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsSecretAccessKey' \
--value '<s3SecretKey>' \
--clobber
Upload that to a GCS bucket somewhere, and use it at cluster creation time:
gsutil cp init-aws.sh gs://<your-bucket>/init-aws.sh
gcloud dataproc clusters create <your-cluster> --initialization-actions \
gs://<your-bucket>/init-aws.sh
While Dataproc metadata is indeed encrypted at rest and secured just like any other user data, using the init action instead helps prevent inadvertently exposing your access key/secret, for example to someone looking over your shoulder while you view your Dataproc cluster properties.
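If you want to verify that a property actually landed in core-site.xml on the cluster, one option (a sketch, run after SSHing into the master node) is to query it with hdfs getconf:
# On the Dataproc master node, print the effective value of the key
hdfs getconf -confKey fs.s3.awsAccessKeyId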

You can try setting the AWS credentials when initializing the SparkContext:
conf = <your SparkConf()>
sc = SparkContext(conf=conf)
# In PySpark, the Hadoop configuration is reached through the JVM gateway:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<s3AccessKey>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<s3SecretKey>")

Related

GCP Cloud Logging Cost increasing with Dataproc img version 2.0.39-ubuntu18

I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging; this is increasing our costs a lot.
When creating the cluster I added the following properties - spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs -
to stop using Cloud Logging, however that is not working: logs are still being redirected to Cloud Logging as well.
Here is the command used to create the Dataproc cluster :
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a similar configuration, but it does not seem to be using Cloud Logging as much.
Attached is a screenshot of the properties of both clusters.
What do I need to change to ensure the Dataproc jobs (PySpark) do not use Cloud Logging?
tia!
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but it won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended.
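For example, the property route might look like this at creation time (the cluster name and region are placeholders):
gcloud dataproc clusters create my-cluster \
--region=us-east1 \
--properties=dataproc:dataproc.logging.stackdriver.enable=false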
See Dataproc cluster properties for the property.
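A minimal init action for the systemctl route could be something like the sketch below (the property approach above is still the recommended one):
#!/bin/bash
# Initialization action (sketch): stop the Cloud Logging agent on each cluster node
systemctl stop google-fluentd.service
# Optionally also disable it so it does not come back on reboot
systemctl disable google-fluentd.service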
Here is the update on this (based on discussions with GCP Support) :
In GCP Logging, we need to create a Log Router sink with an inclusion filter - this will write the logs to BigQuery or Cloud Storage depending on the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so the specific logs will NOT be redirected to GCP Logging.
Attached are screenshots of the _Default log sink and the Inclusion sink for Dataproc.
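As a hedged illustration of that setup (the sink name, destination bucket, and filter below are assumptions, not the exact configuration from the screenshots), it might look something like:
# Route Dataproc cluster logs to a Cloud Storage bucket via a dedicated sink
gcloud logging sinks create dataproc-logs-sink \
storage.googleapis.com/my-dataproc-log-bucket \
--log-filter='resource.type="cloud_dataproc_cluster"'
# Exclude the same logs from the _Default sink so they stop flowing into Cloud Logging
gcloud logging sinks update _Default \
--add-exclusion=name=exclude-dataproc,filter='resource.type="cloud_dataproc_cluster"'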

GCE instance ignoring service account roles

I am currently trying to provision a GCE instance that will execute a Docker container in order to retrieve some information from the web and push it to BigQuery.
Now, the newly created service account (screenshot below) doesn't affect the API scopes whatsoever. This obviously makes the container fail when authenticating to BQ. Funny thing is, when I use the GCE default service account and select auth scopes manually from the GUI, everything works like a charm.
I am failing to understand why the following service account doesn't open API auth scopes to the machine. I might be overlooking something really simple on this one.
Context
The virtual machine is created and run with the following gcloud command:
#!/bin/sh
gcloud compute instances create-with-container gcp-scrape \
--machine-type="e2-micro" \
--boot-disk-size=10 \
--container-image="gcr.io/my_project/gcp_scrape:latest" \
--container-restart-policy="on-failure" \
--zone="us-west1-a" \
--service-account gcp-scrape@my_project.iam.gserviceaccount.com \
--preemptible
This is how bigquery errors out when using my custom service account:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform read scope.
You haven't specified a --scopes flag, so the instance uses the default scopes, which don't include BigQuery.
To let the instance access all services that the service account can access, add --scopes https://www.googleapis.com/auth/cloud-platform to your command line.
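For example, reusing the command from the question with the scope added:
gcloud compute instances create-with-container gcp-scrape \
--machine-type="e2-micro" \
--boot-disk-size=10 \
--container-image="gcr.io/my_project/gcp_scrape:latest" \
--container-restart-policy="on-failure" \
--zone="us-west1-a" \
--service-account gcp-scrape@my_project.iam.gserviceaccount.com \
--scopes https://www.googleapis.com/auth/cloud-platform \
--preemptible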

Google Cloud Platform - Dataflow doesn't decrypt data with KMS key

I have a Dataflow job that is written using Apache Beam. It looks similar to this template, but it saves data from JDBC to Cloud Storage:
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/JdbcToBigQuery.java
My problem was that everybody could see the database credentials in the Dataflow UI. So I found this article:
https://medium.com/google-cloud/using-google-cloud-key-management-service-with-dataflow-templates-71924f0f841f
where the community shows how to encrypt this data. I did everything as described in the article, but my Dataflow job doesn't decrypt the credentials with the given KMS key (when I run it using a Cloud Function).
So I tried running it in Cloud Shell:
gcloud dataflow jobs run JOB_NAME \
--region=us-west1 \
--gcs-location=TEMPLATE_LOCATION \
--dataflow-kms-key=projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME \
--parameters=...,KMSEncryptionKey=projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME,...
But I have an error
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Permission 'cloudkms.cryptoKeyVersions.useToDecrypt' denied on resource 'projects/PROJECT_ID/locations/us-west1/keyRings/KEY_RING/cryptoKeys/KEY_NAME' (or it may not exist).
I am completely stuck. Has anyone had the same problem and could help?
You need to make sure that you have assigned the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataflow service account and also the Compute Engine service account.
Refer to this document, Cloud Key Management Service (Cloud KMS) encryption key with Dataflow
If using Cloud Functions, it might also be necessary to assign the permissions to encrypt and decrypt using KMS to the Google Cloud Functions service agent service account.
Ensure the user that is calling the encrypt and decrypt methods has the cloudkms.cryptoKeyVersions.useToEncrypt and cloudkms.cryptoKeyVersions.useToDecrypt permissions on the key used to encrypt or decrypt.
One way to permit a user to encrypt or decrypt is to add the user to the roles/cloudkms.cryptoKeyEncrypter, roles/cloudkms.cryptoKeyDecrypter, or roles/cloudkms.cryptoKeyEncrypterDecrypter roles.
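A hedged example of granting such a role on the key (the member email, key ring, and key name below are placeholders - adjust them to the accounts your Dataflow job and Cloud Function actually run as):
gcloud kms keys add-iam-policy-binding KEY_NAME \
--keyring=KEY_RING \
--location=us-west1 \
--member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
--role="roles/cloudkms.cryptoKeyEncrypterDecrypter"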
Also make sure the parameters passed are correct:
PYTHON
python -m apache_beam.examples.wordcount \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://STORAGE_BUCKET/counts \
--runner DataflowRunner \
--project PROJECT_ID \
--temp_location gs://STORAGE_BUCKET/tmp/ \
--dataflow_kms_key=KMS_KEY
JAVA
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=gs://dataflow-samples/shakespeare/kinglear.txt \
--output=gs://STORAGE_BUCKET/counts \
--runner=DataflowRunner --project=PROJECT_ID \
--gcpTempLocation=gs://STORAGE_BUCKET/tmp \
--dataflowKmsKey=KMS_KEY"
-Pdataflow-runner
Specifying pipeline execution parameters

Error when creating GCP Dataproc cluster: permission denied for 'compute.projects.get'

I am trying to create a Dataproc cluster with a service account via the Cloud SDK. It's throwing an error that compute.projects.get is denied. The service account has Compute Viewer, Compute Instance Admin, and Dataproc Editor access. I am unable to understand why this error occurs. In the IAM policy troubleshooter, I checked that dataproc.cluster.create is assigned to the service account.
The command is:
gcloud dataproc clusters create cluster-dqm01 \
--region europe-west-2 \
--zone europe-west2-b \
--subnet dataproc-standalone-paasonly-europe-west2 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.3-deb9 \
--project xxxxxx \
--service-account xxxx.iam.gserviceaccount.com
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Required 'compute.projects.get' permission for 'projects/xxxxxx'
The project is correct, as I have tried to create the cluster from the console and got the same error; I generated the gcloud command from the console to run it with a service account. This is the first time a Dataproc cluster is being created for this project.
If you had assigned the various permissions to the same service account you're specifying with --service-account, the issue is that you probably meant to specify --impersonate-service-account instead.
There are three identities that are relevant here:
The identity issuing the CreateCluster command - this is often a human identity, but if you're automating things, using --impersonate-service-account, or running the command from inside another GCE VM, it may be a service account itself.
The "Control plane" identity - this is what the Dataproc backend service uses to actually create VMs
The "Data plane" identity - this is what the Dataproc workers behave as when processing data.
Typically, #1 and #2 need the various "compute" permissions and some minimal GCS permissions. #3 typically just needs GCS and optionally BigQuery, CloudSQL, Bigtable, etc. permissions depending on what you're actually processing.
See https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals for more in-depth explanation of these identities.
It also lists the pre-existing curated roles to make this all easy (and typically, "default" project settings will automatically have the correct roles already so that you don't have to worry about it). Basically, the "human identity" or the service account you use with --impersonate-service-account needs Dataproc Editor or Project Editor roles, the "control plane identity" needs Dataproc Service Agent, and the "data plane identity" needs Dataproc Worker.
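For instance, granting the data plane identity the Dataproc Worker role at the project level might look like this (the service account email is a placeholder):
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:my-dataproc-worker-sa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/dataproc.worker"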

How to use gcloud to create Cloud Launcher product clusters

I'm new to Google Cloud and I'm trying to experiment with it.
I can see that preparing scripts is vital if I want to create and delete clusters every day.
For Dataproc clusters, it's easy:
gcloud dataproc clusters create spark-6-m \
--async \
--project=my-project-id \
--region=us-east1 \
--zone=us-east1-b \
--bucket=my-project-bucket \
--image-version=1.2 \
--num-masters=1 \
--master-boot-disk-size=10GB \
--master-machine-type=n1-standard-1 \
--worker-boot-disk-size=10GB \
--worker-machine-type=n1-standard-1 \
--num-workers=6 \
--initialization-actions=gs://dataproc-initialization-actions/jupyter2/jupyter2.sh
Now, I'd like to create a Cassandra cluster. I see that Cloud Launcher allows doing that easily too, but I can't find a gcloud command to automate it.
Is there a way to create Cloud Launcher product clusters via gcloud?
Thanks
Cloud Launcher deployments can be replicated from the Cloud Shell using Custom Deployments [1].
Once the Cloud Launcher deployment (in this case a Cassandra cluster) is finished the details of the deployment can be seen in the Deployment Manager [2].
The deployment details have an Overview section with the configuration and the imported files used for the deployment process. Download the "Expanded Config" file; this will be the .yaml file for the custom deployment [3]. Download the imported files to the same directory as the .yaml file to be able to deploy correctly [4].
These files and this configuration will create a deployment equivalent to the one from Cloud Launcher.
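Once the expanded .yaml and its imports are in one local directory, the deployment can then be replayed with Deployment Manager; for example (the deployment and file names here are hypothetical):
gcloud deployment-manager deployments create my-cassandra-cluster \
--config expanded-config.yaml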