Can service accounts be specified when using gsutil? - google-cloud-platform

I'm using gsutil rsync in my Jenkins instance to deploy code to Composer, and I'd like to be able to deploy code to different projects (production, staging, dev...). When using gcloud, the only thing I need to do is provide the --account parameter to pick the service account that allows Jenkins to do that, but it seems gsutil only works with config files, and that creates a race condition when several jobs run simultaneously because everything depends on the configuration present in the gcloud config.
Is there a way to specify which account must be used by Google Cloud's gsutil?

First of all, note that if you're using an installation of gsutil bundled with gcloud, gcloud will pass its currently active credentials to gsutil. If you want to avoid this and use multiple different credentials/accounts for overlapping invocations, you should manage credentials via gsutil directly (using separate boto config files), not gcloud. You can disable gcloud's auto-credential-passing behavior by running gcloud config set pass_credentials_to_gsutil false.
Separate gsutil installations will all write to the same state directory by default ($HOME/.gsutil) and load the same default boto config files. To avoid race conditions, you can (and should) use the same gsutil installation, but specify a different state_dir and/or boto config file for invocations that might overlap. This can be set either at the boto config file level or with the -o option, e.g. gsutil -o "GSUtil:state_dir=$HOME/.gsutil2" cp src dst. You'll find more information about it here.
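For example, a Jenkins pipeline could pin each deployment target to its own boto config and state directory so that overlapping builds never share credentials. This is only a sketch; the config paths, state directories, and bucket names are placeholders, and each boto config is assumed to have been created beforehand (e.g. with gsutil config -e) pointing at the right service account key.
# Placeholder paths/buckets; each boto config holds one service account's credentials.
BOTO_CONFIG=/etc/jenkins/boto-staging \
  gsutil -o "GSUtil:state_dir=$HOME/.gsutil-staging" rsync -r ./dags gs://staging-composer-bucket/dags &
BOTO_CONFIG=/etc/jenkins/boto-prod \
  gsutil -o "GSUtil:state_dir=$HOME/.gsutil-prod" rsync -r ./dags gs://prod-composer-bucket/dags &
wait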

You can use gsutil config -e to configure service account credentials.
More details: https://cloud.google.com/storage/docs/gsutil/commands/config#configuring-service-account-credentials
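As a rough sketch (the output path and bucket name below are placeholders), you would run the interactive config once per service account and then point gsutil at the resulting file:
# Prompts for the path to your service account key file and writes a boto config
# to the given path instead of ~/.boto.
gsutil config -e -o /etc/jenkins/boto-staging
# Subsequent commands select that config (and hence that service account) explicitly.
BOTO_CONFIG=/etc/jenkins/boto-staging gsutil ls gs://staging-composer-bucket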
Hope this helps.

Related

How to use gsutil with an authentication key

I have an Apache Airflow DAG running on an on-prem server. In the DAG, I want to call the Google Cloud CLI command gsutil to copy a data file into a GCP Storage bucket. In order to do that, I have to call gcloud auth activate-service-account first, then gsutil cp. Is it possible to merge the two commands into just one? Or is it possible to set up default authentication for my GCP service account, so I can skip the first command? Thanks in advance for any help!
Instead of using shell commands to call GCP operations, first set up a GCP connection in Airflow, then consider using the Airflow GCP operators such as LocalFilesystemToGCSOperator or GCSToLocalFilesystemOperator, or the Google API within a PythonOperator.
That way, you won't need to run an extra command to authenticate. The gcp_conn_id that you prepared and specified will already handle this step for you.
It is always better to use the official providers' operators/sensors/hooks instead of raw bash commands. You can discover more here.
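If it helps, here is a rough sketch of registering such a connection from the command line; the flags match the Airflow 2.x CLI, and the connection id, key path, and project id are placeholders (older 1.10.x releases and the web UI use slightly different field names).
# Register a GCP connection backed by a service account key file; reference it
# from operators via gcp_conn_id="google_cloud_default".
airflow connections add google_cloud_default \
    --conn-type google_cloud_platform \
    --conn-extra '{"extra__google_cloud_platform__key_path": "/opt/airflow/keys/sa.json", "extra__google_cloud_platform__project": "my-gcp-project"}'
With that in place, an operator such as LocalFilesystemToGCSOperator only needs src, dst, bucket and the gcp_conn_id; no gcloud auth activate-service-account call is required.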

Run java -jar inside AWS Glue job

I have a relatively simple task to do but struggle to find the best mix of AWS services to accomplish it:
I have a simple Java program (provided by a 3rd party; I can't modify it, just use it) that I can run anywhere with java -jar --target-location "path on local disc". The program, once executed, creates a csv file on the local disc in the path defined in --target-location.
Once file is created I need to upload it to S3
The way I'm doing it currently is with a dedicated EC2 instance with Java installed; the first point is covered by java -jar ... and the second with the aws s3 cp ... command.
I'm looking for a better way of doing this (preferably serverless). I'm wondering if the points above can be accomplished with an AWS Glue job of type Python Shell. The second point (copying a local file to S3) I can most likely cover with boto3, but the first (the java -jar execution) I'm not sure about.
Am I forced to use an EC2 instance, or do you see a smarter way with AWS Glue?
Or would it be most effective to build a docker image (containing these two instructions), register it in ECR, and run it with AWS Batch?
I'm looking for a better way of doing this (preferably serverless).
I cannot tell whether a serverless option is better; however, an EC2 instance will do the job just fine. Assuming that you have CentOS on your instance, you may do it through the following.
aaPanel GUI
Some useful web panels offer cron scheduled tasks, such as backing up some files from one directory to another S3 directory. I will use aaPanel as an example.
Install aaPanel
Install AWS S3 plugin
Configure the credentials in the plugin.
Cron
Add a scheduled task to back up files from "path on local disc" to AWS S3.
Rclone
A web panel goes beyond the scope of this question. Rclone is another useful tool I use to back up files from local disk to OneDrive, S3, etc.
Installation
curl https://rclone.org/install.sh | sudo bash
Sync
Sync a directory to the remote bucket, deleting any excess files in the bucket.
rclone sync -i /home/local/directory remote:bucket
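Before the sync can run, the remote has to be defined once. A minimal sketch, assuming the EC2 instance has an IAM role so rclone can pick up credentials from the environment; the remote name and region are placeholders:
# Define an S3 remote named "remote" that reuses the instance's IAM credentials.
rclone config create remote s3 provider AWS env_auth true region us-east-1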

copy files during GCP instance creation from python

I am using the googleapiclient in python to launch VM instances. As part of that I am using the facility to run start up scripts to install docker and other python packages.
Now, one thing I would like to do is copy files to this instance ideally during the instance creation stage through python code.
What might be the way to achieve this? Ideally what would work is to be able to detect that the instance has booted and then be able to copy these files.
If I am hearing you correctly, you want files to be present inside the container that is being executed by Docker in your Compute Engine VM. Your Startup Script for the Compute Engine is installing docker.
My recommendation is not to try and copy those files into the container but instead to have them available on the local file system of the Compute Engine. Configure your docker startup to then mount that directory from the Compute Engine into the docker container. Inside the docker container, you would then have access to the desired files.
As for bringing the files into the Compute Engine environment in the first place, we have a number of options. The core question, however, is where the files come from to begin with.
One common approach is to keep the files that you want copied into the VM in a Google Cloud Storage (GCS) bucket/folder. From there, your startup script can use GCS API or the gsutil command to copy the files from the GCS bucket to the local file system.
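A minimal startup-script sketch along those lines (the bucket, local path, and image name are placeholders, and the VM's service account is assumed to have read access to the bucket):
#!/bin/bash
# Pull the files the container needs from GCS onto the VM's local file system...
mkdir -p /opt/app/config
gsutil -m cp -r gs://my-config-bucket/app/* /opt/app/config/
# ...then mount that directory into the container so the files are visible inside it.
docker run -d -v /opt/app/config:/app/config:ro my-image:latest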
Another thought, and again, this depends on the nature of the files ... is that you can create a GCP disk that simply "contains" the files. When you now create a new Compute Engine instance, that instance could be defined to mount the disk which is shared read-only across all the VM instances.
First of all, I would suggest using a tool like Terraform or Google Deployment Manager to create your cloud infrastructure instead of writing custom Python code and handling all the edge cases yourself.
If for some reason you can't use the above tools and a Python program is your only option, you can do the following:
1. Create a GCS bucket using the Python API and put an appropriate bucket policy in place to protect the data.
2. Create a service account which has read permission on the above GCS bucket.
3. Launch the VM instance using the Python API and have your start-up script install packages and run the docker container. Attach the above service account, which has permission to read files from the GCS bucket.
4. Have a startup script in your docker container which runs the gsutil command to fetch files from the GCS bucket and put them in the right place.
Hope this helps.
Again, if you can use tools like Terraform, that will make things easy.

Limit Upload bandwidth for S3 sync with ansible

AWS provides a config to limit the upload bandwidth when copying files to S3 from EC2 instances. This can be configured with the AWS config below.
aws configure set default.s3.max_bandwidth
Once we set this config and run an AWS CLI command to copy files to S3, the bandwidth is limited.
But when I run the s3_sync Ansible module on the same EC2 instance, that limitation is not applied. Is there any possible workaround to apply the limitation to Ansible as well?
Not sure if this is possible because botocore may not support this.
Mostly it is up to Amazon to fix their Python API.
For example, the Docker module works fine by sharing configuration between the CLI and the Python API.
Obviously, I assumed you ran this command locally as the same user, because otherwise the AWS config you made would clearly not be used.
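For reference, the setting needs a value when you set it, and (as far as I know) it is an AWS CLI-level S3 setting, so boto3-based tools are generally not expected to honor it; the 50MB/s value below is just an example.
# Limits only the AWS CLI's own "aws s3 ..." transfers; boto3-based tools such
# as the s3_sync module do not read this setting.
aws configure set default.s3.max_bandwidth 50MB/s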

aws cli copy command halted

I used PuTTY to get into my AWS instance and ran a cp command to copy files into my S3 bucket.
aws s3 cp local s3://server_folder --recursive
Partway through, my internet dropped out and the copy halted even though the AWS instance was still running properly. Is there a way to make sure the cp command keeps running even if I lose my connection?
You can alternatively use the Minio Client, aka mc; it is open source and compatible with AWS S3. The Minio client is available for Windows as well as macOS and Linux.
The mc mirror command will help you copy local content to a remote AWS S3 bucket; if the upload fails because of a network issue, mc session resume will start uploading from where the connection was terminated (see the sketch after the command list below).
mc supports these commands.
COMMANDS:
ls List files and folders.
mb Make a bucket or folder.
cat Display contents of a file.
pipe Write contents of stdin to one target. When no target is specified, it writes to stdout.
share Generate URL for sharing.
cp Copy one or more objects to a target.
mirror Mirror folders recursively from a single source to single destination.
diff Compute differences between two folders.
rm Remove file or bucket [WARNING: Use with care].
access Set public access permissions on bucket or prefix.
session Manage saved sessions of cp and mirror operations.
config Manage configuration file.
update Check for a new software update.
version Print version.
You can check docs.minio.io for more details.
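A minimal sketch of the mirror-and-resume flow described above ("s3remote" is a placeholder alias, created beforehand with mc config host add on older releases or mc alias set on newer ones):
# Mirror a local folder to the bucket; interrupted transfers leave a resumable session.
mc mirror /path/to/local_folder s3remote/server_folder
# After a dropped connection, list saved sessions and resume the interrupted one.
mc session list
mc session resume <session-id>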
Hope it helps.
Disclaimer: I work for Minio.