How to build a Google Cloud Dataproc edge node? - google-cloud-platform

We are moving from an on-premises environment to Google Cloud Dataproc for Spark jobs. I am able to build the cluster and SSH to the master node for job execution. What I am not clear about is how to build the edge node where we can allow users to log in and submit jobs. Is it going to be another GCE VM? Any thoughts or best practices?

A new VM instance is a good option for mapping the edge-node role from other architectures.
Alternatively, you can execute your jobs from the master node, which you can make accessible through SSH.
You will need to find a balance between simplicity (SSH to the master) and security (a dedicated edge node).
Please note that IAM can also allow individual users to submit jobs by assigning them the Dataproc Editor role.
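For example, a minimal sketch of granting a single user job-submission rights through IAM (the project ID and user email below are placeholders):
gcloud projects add-iam-policy-binding my-project-id \
    --member="user:data-analyst@example.com" \
    --role="roles/dataproc.editor"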
Don't forget the ability that Dataproc offers of creating ephemeral clusters: you create a cluster, execute your job, and delete the cluster.
Using ephemeral clusters avoids unnecessary costs. Moreover, the script you create for this can be executed from any machine that has the Google Cloud SDK installed, e.g. on-premises servers or your PC.
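A minimal sketch of such a script, assuming a PySpark job already staged in a GCS bucket (cluster name, region, and bucket are placeholders):
# Create an ephemeral cluster, run the job, then tear the cluster down
gcloud dataproc clusters create ephemeral-cluster \
    --region=us-central1 \
    --num-workers=2
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster=ephemeral-cluster \
    --region=us-central1
gcloud dataproc clusters delete ephemeral-cluster \
    --region=us-central1 --quiet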

Related

Airflow connection for a single DAG

I am creating a connection with a Google service account in my Cloud Composer environment that is intended for a single DAG handling a specific use case with sensitive data. The point is that I want that connection to be exclusive to one DAG, so that no other DAG can see or use it.
Is there a way of doing it?
Currently this is not possible in Airflow, and you cannot implement it with a custom secret backend or any other solution either, because a connection is not a context variable: it is accessible from anywhere in Airflow, not only from a run context.
Unfortunately, the service account given to Cloud Composer at cluster creation time applies to all DAGs in that cluster.
It may be overkill, but you could create a second Cloud Composer 2 cluster (GKE Autopilot) with the minimum machine sizing, containing only the DAG that handles sensitive data.
Then you can give that cluster a service account with only the needed privileges.
The disadvantage of this solution is a higher cost, because you run a second cluster; it increases the cost even if the machine sizes are small.
It is worth noting that Composer 2 with GKE Autopilot is cheaper than a classic GKE cluster.
Another option, if the rework is not too significant, is to rewrite only the DAG handling sensitive data as a Cloud Workflows workflow.
Cloud Workflows is serverless and you can give each workflow a dedicated service account.
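Deploying a workflow with its own service account is then a single command; the workflow name, source file, location, and service account below are placeholders:
gcloud workflows deploy sensitive-data-workflow \
    --source=workflow.yaml \
    --location=us-central1 \
    --service-account=sensitive-data-sa@my-project.iam.gserviceaccount.com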

How can I prevent Google Cloud Dataproc cluster VM instances from auto-shutoff?

When running the cluster's VM instances (master plus worker nodes), the VMs shut off automatically after about 30 minutes, even while I am actively using and running things on the cluster/Dataproc. I cannot find this setting and would appreciate any help on how to disable it, or how to configure a new cluster in a way that prevents this from happening.
Thank you
Default Dataproc clusters do not have any kind of automatic shutdown.
If you are using the older Datalab initialization action, you are probably seeing Datalab's own non-Dataproc-aware shutdown functionality, which you can disable in one of the ways suggested here: How to keep Google Dataproc master running?
Otherwise, if you're using some kind of template or copy/paste arguments for creating your Dataproc cluster, perhaps you're accidentally setting "scheduled deletion": https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion
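As a quick check (a sketch; the exact field names depend on the API version), you can describe the cluster and look for a lifecycleConfig block, and make sure your template does not pass the flags that enable scheduled deletion:
# Inspect an existing cluster for scheduled-deletion settings
gcloud dataproc clusters describe my-cluster --region=us-central1 \
    --format="yaml(config.lifecycleConfig)"
# Creation flags that trigger automatic deletion if present in your template:
#   --max-idle=30m   delete after 30 minutes of inactivity
#   --max-age=2h     delete after 2 hours regardless of activity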
If neither of those settings explains your situation, you should visit your "activity logs" from the Cloud Logging interface, select Cloud Dataproc Cluster, and open up the activity_log type of logs to see an audit log of who was deleting your cluster. Alternatively, if the cluster still exists in Dataproc but the underlying VMs are being shut down, visit the "Compute Engine VM" log category and again look at the "activity logs" to see who was stopping your VMs. Sometimes, in a shared project, a project admin might be running a script that automatically shuts down VMs to save cost.
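One way to pull those audit entries from the command line (a hedged sketch; the filter fields can vary with the audit log format) is gcloud logging read:
# Who deleted Dataproc clusters in this project?
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND logName:"cloudaudit.googleapis.com"' \
    --limit=20 \
    --format="table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.methodName)"
# Who stopped the underlying VMs?
gcloud logging read \
    'resource.type="gce_instance" AND protoPayload.methodName:"instances.stop"' \
    --limit=20 \
    --format="table(timestamp, protoPayload.authenticationInfo.principalEmail)"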

How to submit job on Dataproc cluster with specific service account?

I'm trying to execute jobs in the Dataproc cluster which access several resources of GCP like Google Cloud Storage.
My concern is that whatever file or object is created through my job is owned/created by the Dataproc default service account.
Example - 123456789-compute@developer.gserviceaccount.com.
Is there any way I can configure this user/service account so that the objects get created by a given user/service account instead of the default one?
You can configure the service account to be used by a Dataproc cluster with the --service-account flag at cluster creation time.
The gcloud command would look like:
gcloud dataproc clusters create cluster-name \
    --service-account=your-service-account@project-id.iam.gserviceaccount.com
More details: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts
https://cloud.google.com/dataproc/docs/concepts/iam/iam
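Note that the custom service account itself needs the appropriate roles before the cluster is created; a hedged sketch (the exact roles depend on what your jobs access):
# The cluster's VMs must be able to act as Dataproc workers
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:your-service-account@project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"
# Plus whatever data access the jobs need, e.g. writing objects to GCS
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:your-service-account@project-id.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"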
Note: it is better to have one Dataproc cluster per job so that each job gets an isolated environment, jobs don't affect each other, and you can manage them better (in terms of security as well).
You can also look at Cloud Composer, which lets you schedule and automate these jobs.
Hope this helps.

Is VPC-native GKE cluster production ready?

This happens while trying to create a VPC-native GKE cluster. Per the documentation here, the command to do this is
gcloud container clusters create [CLUSTER_NAME] --enable-ip-alias
However, this command gives the error below.
ERROR: (gcloud.container.clusters.create) Only alpha clusters (--enable_kubernetes_alpha) can use --enable-ip-alias
The command does work when the option --enable_kubernetes_alpha is added, but it prints another message:
This will create a cluster with all Kubernetes Alpha features enabled.
- This cluster will not be covered by the Container Engine SLA and
should not be used for production workloads.
- You will not be able to upgrade the master or nodes.
- The cluster will be deleted after 30 days.
Edit: The test was done in zone asia-south1-c
My questions are:
Is VPC-Native cluster production ready?
If yes, what is the correct way to create a production ready cluster?
If VPC-Native cluster is not production ready, what is the way to connect privately from a GKE cluster to another GCP service (like Cloud SQL)?
Your command seems correct. It looks like something is going wrong during the creation of the cluster in your project. Are you using any flags other than the ones in the command you posted?
When I set my Google Cloud Shell to region europe-west1, the cluster deploys error free and uses 1.11.6-gke.2 (the default).
You could also try to create the cluster manually through the GUI instead of the gcloud command. While creating the cluster, check the "Enable VPC-native (using alias IP)" feature, and try the newest non-alpha version of GKE if one shows up for you.
The public documentation you posted on GKE IP aliasing and the GKE projects.locations.clusters API shows this to be in GA. All signs point to it being production ready. For whatever it's worth, the feature was announced last May on the Google Cloud blog.
What you can try is to update your version of the Google Cloud SDK. This brings everything up to the latest release and removes alpha messages for features that are now in GA.
$ gcloud components update
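After the update, a minimal sketch of creating and verifying a VPC-native cluster, assuming the same zone as in the question and letting GKE pick the secondary ranges:
gcloud container clusters create my-vpc-native-cluster \
    --zone=asia-south1-c \
    --enable-ip-alias
# Confirm alias IPs are enabled on the resulting cluster
gcloud container clusters describe my-vpc-native-cluster \
    --zone=asia-south1-c \
    --format="value(ipAllocationPolicy.useIpAliases)"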

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to take a snapshot of an existing EMR cluster, or at least its namenode, and then use it every time I want to create other clusters. But after some Google searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot, or is there any alternative way to handle my use case?
Appreciate any kind of help.
Thanks
It is not possible to create a snapshot of an EMR cluster node, and you cannot use a custom AMI when running a cluster. However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your own bootstrap scripts and use them every time you launch a new cluster. This way you can achieve functionality similar to what you are seeking.
For more information using bootstrap actions on EMR please visit: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
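A hedged sketch of launching a cluster with a custom bootstrap script staged in S3 (the cluster name, bucket, key pair, release label, and sizing are placeholders):
aws emr create-cluster \
    --name "on-demand-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair \
    --bootstrap-actions Path=s3://my-bucket/bootstrap/install-tools.sh,Name=InstallTools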
Let us know if you need any further assistance.