Startup script doesn't run Airflow webserver VM GCP - google-cloud-platform

I'm trying to automatically run the Airflow webserver and scheduler in a VM on boot using a startup script, following the documentation here: https://cloud.google.com/compute/docs/instances/startup-scripts/linux . Here is my script:
export AIRFLOW_HOME=/home/name/airflow
cd /home/name/airflow
nohup airflow scheduler >> scheduler.log &
nohup airflow webserver -p 8080 >> webserver.log &
The .log files are created, which means the script is being executed, but the webserver and the scheduler don't start.
Any apparent reason?

I have tried replicating the Airflow webserver startup script on a GCP VM using that document.
Steps followed to run the Airflow webserver startup script on a GCP VM:
Create a Service Account. Give it minimum access to BigQuery with the BigQuery Job User role and to Dataflow with the Dataflow Worker role. Click Add Key / Create new key / Done. This will download a JSON key file.
Create a Compute Engine instance. Select the Service Account created.
Install Airflow libraries. Create a virtual environment using miniconda.
Initialize the metadata database and register at least one admin user using the commands:
airflow db init
airflow users create -r Admin -u username -p mypassword -e example@mail.com -f yourname -l lastname
Whitelist your IP for port 8080: create a firewall rule and apply it to the GCP VM instance (a hedged gcloud sketch follows below). Now go to a terminal and start the webserver using the command
airflow webserver -p 8080
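For the firewall step above, a hedged gcloud sketch could look like this (the rule name, source IP range, network tag, VM name, and zone are placeholders, not values from the original setup):
# Allow inbound TCP 8080 only from your own IP
gcloud compute firewall-rules create allow-airflow-8080 \
--allow=tcp:8080 \
--source-ranges=203.0.113.10/32 \
--target-tags=airflow-web
# Apply the matching network tag to the VM
gcloud compute instances add-tags my-airflow-vm \
--tags=airflow-web --zone=us-central1-a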
Open another terminal and start the Scheduler.
export AIRFLOW_HOME=/home/acachuan/airflow-medium
cd airflow-medium
conda activate airflow-medium
airflow db init
airflow scheduler
We want Airflow to start immediately after the Compute Engine instance starts, so create a Cloud Storage bucket, write the startup script, and upload the file there (the bucket also serves as a backup).
Now pass the Linux startup script from Cloud Storage to the VM. Refer to Passing a startup script that is stored in Cloud Storage to an existing VM; you can also pass a startup script directly to an existing VM.
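As a hedged sketch of such a startup script (the username, paths, conda environment, bucket, VM name, and zone below are assumptions based on the steps above, not confirmed values), the file uploaded to the bucket could look like:
#!/bin/bash
# Run Airflow as the user that owns the installation, not as root (assumed user: acachuan)
su - acachuan -c '
export AIRFLOW_HOME=/home/acachuan/airflow-medium
cd $AIRFLOW_HOME
source /home/acachuan/miniconda3/bin/activate airflow-medium
nohup airflow scheduler >> scheduler.log 2>&1 &
nohup airflow webserver -p 8080 >> webserver.log 2>&1 &
'
It can then be attached to an existing VM through the startup-script-url metadata key:
gcloud compute instances add-metadata my-airflow-vm \
--zone=us-central1-a \
--metadata=startup-script-url=gs://my-airflow-bucket/airflow-start.sh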
Note: PermissionDenied desc = The caller does not have permission means you don't have sufficient permissions; request access from your project, folder, or organization admin, depending on the assets you are trying to access. Also, to access files created by the root user you need the appropriate read, write, or execute permissions. Refer to File permissions.
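If the PermissionDenied error is about the VM's service account not being able to read the bucket that holds the startup script, one hedged way to grant read access (the service account and bucket names are placeholders) is:
# Give the VM's service account read access to the startup-script bucket
gsutil iam ch \
serviceAccount:airflow-vm-sa@my-project.iam.gserviceaccount.com:roles/storage.objectViewer \
gs://my-airflow-bucket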

Related

GCSFuse not finding default credentials when running a cloud run app docker locally

I am working on mounting a Cloud Storage Bucket to my Cloud Run App, using the example and code from the official tutorial https://cloud.google.com/run/docs/tutorials/network-filesystems-fuse
The application uses docker only (no cloudbuild.yaml)
The Dockerfile builds without issue using the command:
docker build --platform linux/amd64 -t fusemount .
I then start docker run with the following command
docker run --rm -p 8080:8080 -e PORT=8080 fusemount
and when it runs, gcsfuse is invoked with both the mount directory and the bucket URL:
gcsfuse --debug_gcs --debug_fuse gs://<my-bucket> /mnt/gs
But the connection fails:
2022/12/11 13:54:35.325717 Start gcsfuse/0.41.9 (Go version go1.18.4) for app "" using mount point: /mnt/gcs
2022/12/11 13:54:35.618704 Opening GCS connection...
2022/12/11 13:57:26.708666 Failed to open connection: GetTokenSource:
DefaultTokenSource: google: could not find default credentials. See
https://developers.google.com/accounts/docs/application-default-credentials
for more information.
I have already set up the application-default credentials with the following command:
gcloud auth application-default login
and I have a Python-based Cloud Function project, tested on the same local machine, which has no problem accessing the same storage bucket with the same default login credentials.
What am I missing?
Google client libraries look for ~/.config/gcloud when using the application-default credentials approach.
Your local Docker container doesn't contain this config when running locally.
So you might want to mount it when running the container:
$ docker run --rm -v /home/$USER/.config/gcloud:/root/.config/gcloud -p 8080:8080 -e PORT=8080 fusemount
Some notes:
I'm not sure which OS you are using, so replace /home/$USER with the real path to your home directory if needed.
Likewise, I'm not sure your image uses /root as the home directory, so make sure the path from the previous note is mounted to the right place.
Make sure your local user is authorized with the gcloud CLI, as you mentioned, using the command gcloud auth application-default login.
Let me know, if this helped.
If you are using Docker and not Google Compute Engine (GCE), did you try mounting a service account key when running the container and using that key when mounting GCSFuse?
If you are building and deploying to Cloud Run, did you grant the required permissions mentioned in https://cloud.google.com/run/docs/tutorials/network-filesystems-fuse#ship-code?
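If you go the service-account-key route instead, a hedged sketch (the key path and bucket name are placeholders) could be:
# Mount a service account key into the container and point application-default credentials at it
docker run --rm \
-v /path/to/sa-key.json:/tmp/sa-key.json:ro \
-e GOOGLE_APPLICATION_CREDENTIALS=/tmp/sa-key.json \
-p 8080:8080 -e PORT=8080 fusemount
# Or pass the key to gcsfuse explicitly inside the container
gcsfuse --key-file /tmp/sa-key.json my-bucket /mnt/gcs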

How do I ensure GCP start-script uses the correct service account?

I am creating a VM in GCP's Compute Engine with a service account that has permissions to read from a particular Cloud Storage bucket that contains some common configuration that may contain sensitive information, such as TLS certs. However when my startup script is executed, it is denied permission to access the bucket because it is using the Google Compute Engine default service account, not the service account I provisioned my VM to use. Can someone please help me figure out how to ensure that the startup script uses the right service account?
============= EDIT =============
Not sure how helpful this will be, but here is the Puppet code that is failing; I can't/won't provide all of the Puppet code. The actual startup script that is invoked when the instance starts is sudo puppet apply --verbose /opt/puppet/manifests/opensearch.pp >/var/log/puppetlabs/puppet/startup.log 2>&1. Note that I've already confirmed that Puppet is not doing anything special with the service accounts. However, Puppet always uses the default service account and fails to download the certs. If I SSH into the instance and run the same command by hand, it works every time.
exec { 'download_ssl_certs':
  command => "/snap/bin/gsutil cp -r gs://${opensearch::secrets_bucket}/${opensearch::cluster}/* ${opensearch::opensearch_path}/config/",
  notify  => Exec['ssl_certs_chown'],
}
exec { 'ssl_certs_chown':
  command     => "/bin/chown -R ${opensearch::service_user}:${opensearch::service_group} ${opensearch::opensearch_path}/config",
  onlyif      => "/bin/ls -lhR ${opensearch::opensearch_path}/config | /bin/grep -i root | grep -v ${opensearch::service_user}",
  refreshonly => true,
  notify      => Service['opensearch'],
}
Example:
gcloud compute instances create example-vm \
--service-account 123-my-sa@my-project-123.iam.gserviceaccount.com \
--scopes https://www.googleapis.com/auth/cloud-platform
Creating and enabling service accounts for instances
As the service account is part of the instance metadata, you can access it from startup scripts.
Accessing metadata from a Linux startup script
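To confirm which service account the startup script actually runs under, a hedged check from inside the script is to query the metadata server:
# Prints the email of the service account attached to the instance
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"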

Connect to Memorystore from Cloud Run

I want to run a service on Google Cloud Run that uses Cloud Memorystore as cache.
I created a Memorystore instance in the same region as Cloud Run and used the example code to connect: https://github.com/GoogleCloudPlatform/golang-samples/blob/master/memorystore/redis/main.go but this didn't work.
Next I created a Serverless VPC Access connector, which didn't help. I use Cloud Run without a GKE cluster, so I can't change any configuration.
Is there a way to connect from Cloud Run to Memorystore?
To connect Cloud Run (fully managed) to Memorystore you need to use the mechanism called "Serverless VPC Access" or a "VPC Connector".
As of May 2020, Cloud Run (fully managed) has Beta support for the Serverless VPC Access. See Connecting to a VPC Network for more information.
Alternatives to using this Beta include:
Use Cloud Run for Anthos, where GKE provides the capability to connect to Memorystore if the cluster is configured for it.
Stay within fully managed Serverless but use a GA version of the Serverless VPC Access feature by using App Engine with Memorystore.
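As a hedged sketch of the Serverless VPC Access route mentioned above (the connector name, region, IP range, service name, and image are placeholders; depending on the release stage you may need the gcloud beta track):
# Create a Serverless VPC Access connector in the same region as the Cloud Run service
gcloud compute networks vpc-access connectors create my-connector \
--region=us-central1 \
--network=default \
--range=10.8.0.0/28
# Deploy the Cloud Run service attached to that connector
gcloud run deploy my-service \
--image=gcr.io/my-project/my-image \
--region=us-central1 \
--vpc-connector=my-connector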
While waiting for serverless VPC connectors on Cloud Run - Google said yesterday that announcements would be made in the near term - you can connect to Memorystore from Cloud Run using an SSH tunnel via GCE.
The basic approach is the following.
First, create a forwarder instance on GCE
gcloud compute instances create vpc-forwarder --machine-type=f1-micro --zone=us-central1-a
Don't forget to open port 22 in your firewall policies (it's open by default).
Then install the gcloud CLI via your Dockerfile
Here is an example for a Rails app. The Dockerfile makes use of a script for the entrypoint.
# Use the official lightweight Ruby image.
# https://hub.docker.com/_/ruby
FROM ruby:2.5.5
# Install gcloud
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin
# Generate SSH key to be used by the SSH tunnel (see entrypoint.sh)
RUN mkdir -p /home/.ssh && ssh-keygen -b 2048 -t rsa -f /home/.ssh/google_compute_engine -q -N ""
# Install bundler
RUN gem update --system
RUN gem install bundler
# Install production dependencies.
WORKDIR /usr/src/app
COPY Gemfile Gemfile.lock ./
ENV BUNDLE_FROZEN=true
RUN bundle install
# Copy local code to the container image.
COPY . ./
# Run the web service on container startup.
CMD ["bash", "entrypoint.sh"]
Finally open an SSH tunnel to Redis in your entrypoint.sh script
#!/bin/bash
# Memorystore config
MEMORYSTORE_IP=10.0.0.5
MEMORYSTORE_REMOTE_PORT=6379
MEMORYSTORE_LOCAL_PORT=6379
# Forwarder config
FORWARDER_ID=vpc-forwarder
FORWARDER_ZONE=us-central1-a
# Start tunnel to Redis Memorystore in background
gcloud compute ssh \
--zone=${FORWARDER_ZONE} \
--ssh-flag="-N -L ${MEMORYSTORE_LOCAL_PORT}:${MEMORYSTORE_IP}:${MEMORYSTORE_REMOTE_PORT}" \
${FORWARDER_ID} &
# Run migrations and start Puma
bundle exec rake db:migrate && bundle exec puma -p 8080
With the solution above Memorystore will be available to your application on localhost:6379.
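As a quick hedged check (assuming redis-cli happens to be installed in the image, which it is not in the Dockerfile above), you could verify the tunnel from inside the container:
# Should print PONG if the SSH tunnel to Memorystore is up
redis-cli -h 127.0.0.1 -p 6379 ping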
There are a few caveats though
This approach requires the service account configured on your Cloud Run service to have the roles/compute.instanceAdmin role, which is quite powerful.
The SSH keys are baked into the image to speed up container boot time. That's not ideal.
There is no failover if your forwarder crashes.
I've written a longer and more elaborated approach in a blog post that improves the overall security and adds failover capabilities. The solution uses plain SSH instead of the gcloud CLI.
If you need something in your VPC, you can also spin up Redis on Compute Engine.
It's more costly (especially for a cluster) than Redis Cloud, but a temporary solution if you have to keep the data in your VPC.

Connect to particular GCP account

I have been using the GCP console to connect to a cloud instance and want to switch to using SSH through PowerShell, as that seems to maintain a longer persistence. Transferring my public key through Cloud Shell into the authorized_keys file seems to be temporary, since once Cloud Shell disconnects, the file doesn't persist. I've tried using OS Login, but that generates a completely different user from what I've been using through Cloud Shell (Cloud Shell creates the user myname while gcloud creates the user myname_domain_com). Is there a way to continue using the same profile created by Cloud Shell when logging in through gcloud? I am using the same email and account in both the console and gcloud, myname@domain.com. The alternative is to start all over from gcloud, and that would be a pain.
If you want to SSH to different instances of a Google Cloud project (from a Mac or Linux machine), do the following:
Step 1. Install SSH keys without a password
Use the following command to generate the keys on your Mac:
ssh-keygen -t rsa -f ~/.ssh/private-key-name -C username
For example, private-key-name can be bpa-ssh-key. It will create two files with the following names in the ~/.ssh directory:
bpa-ssh-key
bpa-ssh-key.pub
Step 2. Update the public key on your GCP project
Go to the Google Cloud Console, choose your project, then
VM Instances -> Metadata -> SSH Keys -> Edit -> Add Item.
Cut and paste the contents of bpa-ssh-key.pub (from your Mac) here and then save.
Reset the VM instance if it is running.
Step 3. Edit the config file under ~/.ssh on your Mac. Add the following lines to ~/.ssh/config if they are not present already:
Host *
  PubKeyAuthentication yes
  IdentityFile ~/.ssh/bpa-ssh-key
Step 4. SSHing to GCP Instance
ssh username@gcloud-externalip
It should open an SSH shell on the gcloud instance without asking for a password (since you created the RSA/SSH keys without a passphrase).
Since the metadata is common across all instances under the same project, you can seamlessly SSH into any of the instances by choosing the respective external IP of the gcloud instance.
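If you prefer to be explicit about the key and the IP, a hedged variant (instance name, zone, and username are placeholders):
# Look up the instance's external IP
gcloud compute instances describe my-instance --zone=us-central1-a \
--format='get(networkInterfaces[0].accessConfigs[0].natIP)'
# SSH with the key generated in Step 1
ssh -i ~/.ssh/bpa-ssh-key username@EXTERNAL_IP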

Cannot Transfer files from my mac to VM instance on GCP

I have managed to set up a VM instance on Google cloud platform using the following instructions:
https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52
I am then able to run a Jupyter notebook as per the instructions.
Now I want to be able to use my own data in the notebook....this is where I am really struggling. I downloaded the Cloud SDK onto my mac and ran this from the terminal (as per https://cloud.google.com/compute/docs/instances/transfer-files)
My-MacBook-Air:~ me$ gcloud compute scp /Users/me/Desktop/my_data.csv aml-test:~/amlfolder
where aml-test is the name of my instance and amlfolder a folder I created on the VM instance. I don't get any error messages and it seems to work (the terminal displays the following after I run it >> 100% 66MB 1.0MB/s 01:03 )
However when I connect to my VM instance via the SSH button on the google console and type
cd amlfolder
ls
I cannot see any files! (nor can I see them from the jupyter notebook homepage)
I cannot figure out how to use my own data in a python jupyter notebook on a GCP VM instance. I have been trying/googling for an entire day. As you might have guessed I'm a complete newbie to GCP (and cd, ls and mkdir is the extent of my linux command knowledge!)
I also tried using Google Cloud Storage - I uploaded the data into a google storage bucket (as per https://cloud.google.com/compute/docs/instances/transfer-files) but don't know how to complete the last step '4. On your instance, download files from the bucket.'
If anyone can figure out what i am doing wrong, or an easier method to get my own data running into a python jupyter notebook on GCP than using gcloud scp command please help!
Definitely try writing
pwd
to verify you're in the path you think you are; there's a chance that your scp command and the console SSH command log in as different users.
To copy data from a bucket to the instance, do
gsutil cp gs://bucket-name/your-file .
As you can see in the gcloud compute docs, gcloud compute scp /Users/me/Desktop/my_data.csv aml-test:~/amlfolder uses your local environment username, so the tilde in your command refers to the home directory of a remote user with the same name as your local one.
But when you SSH from the browser, as you can see from the docs, your Gmail username is used.
So, you should check the home directory of the user used by gcloud compute scp ... command.
The easiest way to check is to SSH to your VM and run
ls /home/ --recursive
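A hedged way to sidestep the user mismatch is to name the remote user explicitly in the scp command (your_browser_ssh_user is a placeholder for the username you see when you SSH from the browser):
# Copy into the home directory of the user you actually log in as from the browser SSH
gcloud compute scp /Users/me/Desktop/my_data.csv your_browser_ssh_user@aml-test:~/amlfolder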