How to use gsutil with an authentication key - google-cloud-platform

I have an Apache Airflow DAG running on an on-prem server. In the DAG, I want to call the Google Cloud CLI command gsutil to copy a data file into a GCP Storage bucket. In order to do that, I have to call gcloud auth activate-service-account first, then gsutil cp. Is it possible to merge the two commands into just one? Or is it possible for me to set up default authentication for my GCP service account, so I can skip the first command? Thanks in advance for any help!
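For reference, here is roughly what the current two-step approach looks like in my DAG (the key-file path, source file, and bucket name are just placeholders):

# Sketch of the two-step approach from the question, chained with &&.
# The key-file path, source file, and bucket name are placeholders.
from airflow.operators.bash import BashOperator

upload_to_gcs = BashOperator(
    task_id="upload_to_gcs",
    bash_command=(
        "gcloud auth activate-service-account --key-file=/path/to/keyfile.json "
        "&& gsutil cp /data/export.csv gs://my-bucket/export.csv"
    ),
)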

Instead of using shell commands to call GCP operations, first set up a GCP connection in Airflow, then use the Airflow GCP operators such as LocalFilesystemToGCSOperator or GCSToLocalFilesystemOperator, or call the Google API from a PythonOperator.
That way, you won't need to run an extra command to authenticate. The gcp_conn_id that you prepared and specified
will already handle this step for you.
It is always better to use the official providers' operators/sensors/hooks instead of hand-rolled bash commands. You can discover more here.
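For example, a minimal sketch (file path, bucket, and connection id are placeholders) of uploading a local file with LocalFilesystemToGCSOperator and a pre-configured connection:

# Sketch: upload a local file to GCS using a pre-configured Airflow connection.
# The file path, bucket, and connection id are placeholders.
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

upload_file = LocalFilesystemToGCSOperator(
    task_id="upload_file",
    src="/data/export.csv",
    dst="exports/export.csv",
    bucket="my-bucket",
    gcp_conn_id="my_gcp_connection",  # connection configured in the Airflow UI
)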

Related

Can you run gsutil in Google Cloud Workflows?

Looking for an easy way to run gsutil commands in Google Cloud Workflows.
The specific problem I'm trying to solve is dynamically removing all objects in a Cloud Storage bucket so that the bucket can then be deleted, as part of a Workflows project. gsutil can remove the files with gsutil rm -a gs://bucket/**, and if I could run that in a Workflows step that would be great.
Cloud Workflows can only call APIs. So you have to call the API directly, just as gsutil does under the hood. I wrote an article that lists all the files and calls a "compose" operation on them; you can customize that code to call the delete API instead.
If you really want to use gsutil, you can use a Cloud Run job, for instance, and execute it with an API call from Workflows.
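If you go that route, here is a minimal sketch (the bucket name is a placeholder) of what the delete calls amount to, using the Cloud Storage Python client, for example inside a Cloud Run job, deleting every object version much like gsutil rm -a:

# Sketch: delete every object, including noncurrent versions (like gsutil rm -a),
# so the bucket can then be deleted. The bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
for blob in client.list_blobs("my-bucket", versions=True):
    bucket.delete_blob(blob.name, generation=blob.generation)
bucket.delete()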

How to include gcp_conn_id argument while using gcloud commands (in bash operator)

I am trying to use Composer to create a Dataproc cluster. When I use DataprocClusterCreateOperator or DataprocCreateClusterOperator in my DAG code, I am able to pass the gcp_conn_id argument to communicate with other projects from Composer. But some of the tasks require a BashOperator, and using a BashOperator involves gcloud commands. I didn't find any argument to pass gcp_conn_id while using gcloud commands.
Without this argument, I can't communicate with other projects, as the default connection of Composer's service account does not have the proper access to communicate with them.
I need either a way to include the custom GCP connection id that I set up in Airflow connections, or an alternate approach for my Composer's default service account to make a connection with other projects.
Any thoughts on this are highly appreciated.
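For reference, here is roughly what the two cases look like in my DAG (project, region, cluster name, and connection id are placeholders): the Dataproc operator accepts gcp_conn_id, while the BashOperator just runs gcloud with whatever credentials are active in the environment:

# Sketch of the two cases from the question; all names and ids are placeholders.
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="other-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=CLUSTER_CONFIG,
    gcp_conn_id="my_gcp_connection",  # custom connection works here
)

run_gcloud = BashOperator(
    task_id="run_gcloud",
    # no gcp_conn_id equivalent: gcloud uses whatever credentials are active on the worker
    bash_command="gcloud dataproc clusters list --project=other-project --region=us-central1",
)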

Is there a way to run gsutil as a regular Linux cronjob on GCP?

I have a script that does some stuff with gcloud utils like gsutil and bq, e.g.:
#!/usr/bin/env bash
bq query --format=csv "SELECT * FROM myproject.whatever WHERE date > '$x'" > res.csv
gsutil cp res.csv gs://my_storage/foo.csv
This works on my machine or VM, but I can't guarantee it will always be on, so I'd like to add this as a GCP cronjob/Lambda type of thing. From the docs here, it looks like the Cloud Scheduler can only do HTTP requests, Pub/Sub, or App Engine HTTP, none of which are exactly what I want.
So: is there any way in GCP to automate some gsutil / bq commands, like a cronjob, but without my supplying an always-on machine?
There are likely going to be multiple answers and this is but one.
For me, I would examine the concept of Google Cloud Run. The idea here is that you get to create a Docker image that is then instantiated, run and cleaned up when called by a REST request. What you put in your docker image is 100% up to you. It could be a simple image with tools like gcloud and gsutil installed with a script to run them with any desired parameters. Your contract with Cloud Run is only that you consume the incoming HTTP request.
When there are no requests to Cloud Run, there is no charge as nothing is running. You are only billed for the duration that your logic actually executes for.
I recommend Cloud Run over Cloud Functions because Cloud Run allows you to define the environment in which the commands run, for example the availability of gsutil.
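As an illustration only, here is a minimal sketch of such a container's entrypoint (the script path and port handling are assumptions); the image would have gcloud, gsutil, and bq installed and would simply run the existing script when an HTTP request arrives:

# Sketch of a Cloud Run entrypoint that runs the existing bash script on request.
# The script path is a placeholder; the image is assumed to contain gcloud/gsutil/bq.
import os
import subprocess

from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["POST"])
def run_job():
    result = subprocess.run(["/app/export_and_copy.sh"], capture_output=True, text=True)
    if result.returncode != 0:
        return result.stderr, 500
    return "done", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))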

Can service accounts be specified when using gsutil?

I'm using gsutil rsync in my Jenkins instance for deploying code to Composer, and I'd like to be able to deploy code to different projects (production, staging, dev...). When using gcloud, the only thing I need to do is provide the --account parameter to pick the service account that allows Jenkins to do that, but it seems like gsutil only works with config files, and that creates a race condition when several jobs run simultaneously because everything depends on the configuration present in gcloud config.
Is there a way to specify which account must be used by Google Cloud's gsutil?
First of all, note that if you're using an installation of gsutil bundled with gcloud, gcloud will pass its currently active credentials to gsutil. If you want to avoid this and use multiple different credentials/accounts for overlapping invocations, you should manage credentials via gsutil directly (using separate boto config files), not gcloud. You can disable gcloud's auto-credential-passing behavior via running gcloud config set pass_credentials_to_gsutil false.
Separate gsutil installations will all write to the same state directory by default ($HOME/.gsutil), as well as loading the same default boto config files. To avoid race conditions, you can (and should) use the same gsutil installation, but specify a different state_dir and/or boto config file for invocations that might overlap. This can be set either at the boto config file level, or with the -o option, e.g. gsutil -o "GSUtil:state_dir=$HOME/.gsutil2" cp src dst. You'll find more information about it here.
You can use gsutil config -e to configure service account credentials.
More details: https://cloud.google.com/storage/docs/gsutil/commands/config#configuring-service-account-credentials
Hope this helps.
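Putting the two answers together, a minimal sketch (paths, buckets, and account names are hypothetical) of running gsutil from concurrent jobs, each invocation pinned to its own boto config file, for example one generated per service account with gsutil config -e, and its own state directory:

# Sketch: each invocation gets its own boto config (e.g. generated per service
# account with `gsutil config -e`) and its own state dir, so overlapping jobs
# don't race on shared configuration. All paths and bucket names are hypothetical.
import os
import subprocess

def gsutil_rsync(src, dst, boto_config, state_dir):
    env = dict(os.environ, BOTO_CONFIG=boto_config)
    subprocess.run(
        ["gsutil", "-o", f"GSUtil:state_dir={state_dir}", "rsync", "-r", src, dst],
        env=env,
        check=True,
    )

gsutil_rsync("./dags", "gs://prod-composer-bucket/dags",
             boto_config="/etc/boto/prod.boto", state_dir="/tmp/gsutil-prod")
gsutil_rsync("./dags", "gs://staging-composer-bucket/dags",
             boto_config="/etc/boto/staging.boto", state_dir="/tmp/gsutil-staging")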

Run dataflow job from Compute Engine

I am following the quickstart link to run a Dataflow job:
https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven
It works fine when I run the mvn command from Google Cloud Shell.
mvn compile exec:java \
-Dexec.mainClass=com.example.WordCount \
-Dexec.args="--project=<my-cloud-project> \
--stagingLocation=gs://<my-wordcount-storage-bucket>/staging/ \
--output=gs://<my-wordcount-storage-bucket>/output \
--runner=DataflowRunner"
But when I try to launch a VM and run the command from it, I get a permission denied error.
If I give full API access to the VM, the command runs successfully.
What permissions should I give to the VM to run the Dataflow job, or should I use a service account?
Can anybody tell me the best way to run Dataflow jobs in a production environment?
Regards,
pari
You will have to give Dataflow Admin rights to the VM to run the Dataflow job. Additionally, if your Dataflow job involves BigQuery, you'll also have to provide the BigQuery Editor role, and so on.
You can also create a service account and provide the required roles to run the job.
Hope this helps.
To provide granular access, you can take advantage of Dataflow Roles:
Developer. Executes and manipulates Dataflow jobs.
Viewer. Read-only access to Dataflow-related resources.
Worker. Provides the permissions for a service account to execute work units for a Dataflow pipeline.
When you have an automated app that needs to execute Dataflow jobs automatically without user intervention, it is recommended to use a service account.
In fact, to deploy in a production environment, the recommendation I can suggest is to create an automated process that deploys and executes your pipeline. You can take advantage of Cloud Composer, which is based on Apache Airflow and can launch Dataflow jobs (Composer 1.0.0 or later in supported Dataflow regions).
If you are using Airflow, create a service account with access to the components being used in Dataflow and create a Connection in the Airflow UI with the required scopes. Once done, use DataflowJavaOperator/DataflowTemplateOperator to submit the job and orchestrate it via Airflow, as sketched below.
If you need further help, comment on this answer.
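For example, a minimal sketch (template path, bucket, project, region, and connection id are placeholders) using the Google provider's DataflowTemplatedJobStartOperator, the current name for DataflowTemplateOperator:

# Sketch: launch a Dataflow template from Airflow with a dedicated connection.
# Template path, bucket, project, region, and connection id are placeholders.
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

start_dataflow_job = DataflowTemplatedJobStartOperator(
    task_id="start_dataflow_job",
    template="gs://my-wordcount-storage-bucket/templates/wordcount-template",
    project_id="my-cloud-project",
    location="us-central1",
    parameters={"output": "gs://my-wordcount-storage-bucket/output"},
    gcp_conn_id="my_gcp_connection",  # connection backed by the service account
)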