How can I import Airflow variables using a GitLab CI/CD YAML file?
I have found Importing airflow variables in a json file using the command line, but it is not helping me out.
You can import a JSON file as Airflow variables.
variables.json file:
{
  "feature": {
    "param1": "param1",
    "param2": "param2",
    ...
  }
}
For example, this file can be put in the following structure:
my-project
  config
    dags
      variables
        dev
          variables.json
        prd
          variables.json
You can then create a shell script, deploy_dags_config.sh, to deploy this variables file to Cloud Composer:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
export FEATURE_NAME=my_feature
export ENV=dev
export COMPOSER_ENVIRONMENT=my-composer-env
export ENVIRONMENT_LOCATION=europe-west1
export GCP_PROJECT_ID=my-gcp-project
echo "### Deploying the data config variables of module ${FEATURE_NAME} to composer"
# deploy variables
gcloud composer environments storage data import \
--source config/dags/variables/${ENV}/variables.json \
--destination "${FEATURE_NAME}"/config \
--environment ${COMPOSER_ENVIRONMENT} \
--location ${ENVIRONMENT_LOCATION} \
--project ${GCP_PROJECT_ID}
gcloud beta composer environments run ${COMPOSER_ENVIRONMENT} \
--project ${GCP_PROJECT_ID} \
--location ${ENVIRONMENT_LOCATION} \
variables import \
-- /home/airflow/gcs/data/"${FEATURE_NAME}"/config/variables.json
echo "Variables of ${FEATURE_NAME} are well imported in environment ${COMPOSER_ENVIRONMENT} for project ${GCP_PROJECT_ID}"
This shell script is then used in the GitLab CI YAML file:
deploy_conf:
  image: google/cloud-sdk:416.0.0
  script:
    - . ./authentication.sh
    - . ./deploy_dags_config.sh
Your GitLab has to be authenticated to GCP.
In the Airflow DAG code, the variables can then be retrieved in a Dict as follows:
from typing import Dict
from airflow.models import Variable
variables: Dict = Variable.get("feature", deserialize_json=True)
This works because the root node of the variables.json file and object is feature (this name should be unique):
{
  "feature": {
    "param1": "param1",
    "param2": "param2",
    ...
  }
}
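For completeness, here is a minimal sketch (assuming the param1/param2 keys from the example above) of how the nested values might then be read inside a DAG file:

from typing import Dict

from airflow.models import Variable

# Deserialize the "feature" variable into a Dict; the keys below are the
# example keys from variables.json above, replace them with your own.
feature_config: Dict = Variable.get("feature", deserialize_json=True)

param1 = feature_config["param1"]
param2 = feature_config.get("param2", "some_default")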
I am looking for the best pattern to be able to execute and export a BigQuery query result to a cloud storage bucket. I would like this to be executed when the BigQuery table is written to or modified.
I think I would traditionally set up a Pub/Sub topic that would be written to when the table is modified, which would trigger a GCP function that is responsible for executing the query and writing the result to a GCP bucket. I am just not too confident that there isn't a better (more straightforward) approach to do this in GCP.
Thanks in advance.
I propose an approach based on Eventarc.
The goal is to launch a Cloud Function or Cloud Run action when data is inserted or updated in a BigQuery table. Example with Cloud Run:
SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed
gcloud eventarc triggers create ${SERVICE}-trigger \
--location ${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com
When a BigQuery job is executed, the Cloud Run action will be triggered.
Example of a Cloud Run action:
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Gets the payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        if ds == 'cloud_run_tmp' and \
           tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            # create_agg() runs the aggregation/export query (defined elsewhere)
            query = create_agg()
            return "table created", 200
    except:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200
You can apply logic based on the current dataset, table or other elements existing in the current payload.
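To cover the export-to-bucket part of the question, here is a minimal sketch (with hypothetical project, dataset, table, and bucket names) of what a function like create_agg() could do with the BigQuery Python client: run the query into a destination table, then extract that table to Cloud Storage:

from google.cloud import bigquery

def create_agg():
    # Hypothetical names; replace with your own project, dataset, table and bucket.
    client = bigquery.Client()
    destination_table = "my-project.cloud_run_tmp.agg_result"

    # Run the aggregation query into a destination table.
    job_config = bigquery.QueryJobConfig(
        destination=destination_table,
        write_disposition="WRITE_TRUNCATE",
    )
    sql = """
        SELECT col, COUNT(*) AS n
        FROM `my-project.cloud_run_tmp.cloud_run_trigger`
        GROUP BY col
    """
    client.query(sql, job_config=job_config).result()  # wait for the query

    # Export the result table to the Cloud Storage bucket as CSV.
    extract_job = client.extract_table(
        destination_table,
        "gs://my-bucket/exports/agg_result-*.csv",
    )
    extract_job.result()  # wait for the export
    return destination_table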
I am trying to run Python code which downloads and streams chunks of data from a source URL to a destination Cloud Storage blob.
It works fine on a standalone PC, as a local function, and so on.
But when I try the same with GCP Cloud Run, it throws a weird error:
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
Complete error:
Traceback (most recent call last):
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1683, in __del__
self.close()
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1661, in close
self.flush(force=True)
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1527, in flush
self._initiate_upload()
File "/home/<user>/.local/lib/python3.9/site-packages/gcsfs/core.py", line 1443, in _initiate_upload
self.gcsfs.loop,
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
It consumed my week; any help or direction is highly appreciated, thanks in advance.
The actual code which has been used:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

@app.route('/urltogcs')
def urltogcs():
    try:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "secret.json"
        gcp_file_system = gcsfs.GCSFileSystem(project='<project_id>')
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
Your code (with the proposed changes) works for me:
main.py:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

project = os.getenv("PROJECT")
port = os.getenv("PORT", 8080)

@app.route('/urltogcs')
def urltogcs():
    try:
        gcp_file_system = gcsfs.GCSFileSystem(project=project)
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(port))
Note: The code requires project from the environment which isn't ideal. It would be better if gcsfs.GCSFileSystem didn't require project. Alternatively project could be obtained from Google's metadata service. For convenience (!), I'm setting it using the environment.
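As a minimal sketch of that metadata-service alternative (the endpoint and header below are the standard Compute Engine/Cloud Run metadata-server conventions), the project ID could be fetched at startup like this:

import requests

def get_project_id() -> str:
    # Cloud Run (and GCE) expose the project ID via the internal metadata server.
    response = requests.get(
        "http://metadata.google.internal/computeMetadata/v1/project/project-id",
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    )
    response.raise_for_status()
    return response.text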
requirements.txt:
Flask==2.2.2
gcsfs==2022.7.1
gunicorn==20.1.0
Dockerfile:
FROM python:3.10-slim
ENV PYTHONUNBUFFERED True
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
RUN pip install --no-cache-dir -r requirements.txt
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
Bash script:
BILLING="[YOUR-BILLING]"
PROJECT="[YOUR-PROJECT]"
REGION="[YOUR-REGION]"
BUCKET="[YOUR-BUCKET]"
# Create Project
gcloud projects create ${PROJECT}
# Associate with Billing Account
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
# Enabled services
SERVICES=(
"artifactregistry"
"cloudbuild"
"run"
)
for SERVICE in ${SERVICES[@]}
do
gcloud services enable ${SERVICE}.googleapis.com \
--project=${PROJECT}
done
# Create Bucket
gsutil mb -p ${PROJECT} gs://${BUCKET}
# Service Account
ACCOUNT=tester
EMAIL=${ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
# Create Service Account
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}
# Create Service Account key
gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}
# Ensure Service Account can write to storage
gcloud projects add-iam-policy-binding ${PROJECT} \
--role=roles/storage.admin \
--member=serviceAccount:${EMAIL}
# Only needed for local testing
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json
# Deploy Cloud Run service
# Run service as Service Account
NAME="urltogcs"
gcloud run deploy ${NAME} \
--source=${PWD} \
--set-env-vars=PROJECT=${PROJECT} \
--no-allow-unauthenticated \
--service-account=${EMAIL} \
--region=${REGION} \
--project=${PROJECT}
# Grab the Cloud Run service's endpoint
ENDPOINT=$(gcloud run services describe ${NAME} \
--region=${REGION} \
--project=${PROJECT} \
--format="value(status.url)")
# Cloud Run service requires auth
TOKEN=$(gcloud auth print-identity-token)
# This page
SRC="https://stackoverflow.com/questions/73393808/"
# Generate a GCS Object name by epoch
DST="gs://${BUCKET}/$(date +%s)"
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--data-urlencode "source=${SRC}" \
--data-urlencode "destination=${DST}" \
--write-out '%{response_code}' \
--output /dev/null \
${ENDPOINT}/urltogcs
Yields OK:
200
And:
gsutil ls gs://${BUCKET}
gs://${BUCKET}/1660780270
I am using 'terraform apply' in a shell script to create multiple EC2 instances. I need to output the list of generated IPs to a script variable and use the list in another sub-script. I have defined an output variable for the IPs in a Terraform config file, 'instance_ips':
output "instance_ips" {
  value = [
    "${aws_instance.gocd_master.private_ip}",
    "${aws_instance.gocd_agent.*.private_ip}"
  ]
}
However, the terraform apply command is printing the entire EC2 creation output, not just the output variables.
terraform init \
-backend-config="region=$AWS_DEFAULT_REGION" \
-backend-config="bucket=$TERRAFORM_STATE_BUCKET_NAME" \
-backend-config="role_arn=$PROVISIONING_ROLE" \
-reconfigure \
"$TERRAFORM_DIR"
OUTPUT=$(terraform apply <input variables, e.g. -var="aws_region=$AWS_DEFAULT_REGION"> \
-auto-approve \
-input=false \
"$TERRAFORM_DIR"
)
terraform output instance_ips
So the 'OUTPUT' script variable content is
Terraform command: apply Initialising the backend... Successfully
configured the backend "s3"! Terraform will automatically use this
backend unless the backend configuration changes. Initialising provider
plugins... Terraform has been successfully initialised!
.
.
.
aws_route53_record.gocd_agent_dns_entry[2]: Creation complete after 52s
(ID:<zone ............................)
aws_route53_record.gocd_master_dns_entry: Creation complete after 52s
(ID:<zone ............................)
aws_route53_record.gocd_agent_dns_entry[1]: Creation complete after 53s
(ID:<zone ............................)
Apply complete! Resources: 9 added, 0 changed, 0 destroyed. Outputs:
instance_ips = [ 10.39.209.155, 10.39.208.44, 10.39.208.251,
10.39.209.227 ]
instead of just the EC2 IPs.
Firing 'terraform output instance_ips' throws an 'Initialisation Required' error, which I understand means 'terraform init' is required.
Is there any way to suppress the EC2 creation output and just print the output variables? If not, how can I retrieve the IPs using the 'terraform output' command without needing to do a terraform init?
If I understood the context correctly, you can actually create a file in that directory, and that file can be used by your sub-shell script. You can do it by using a null_resource or a local_file resource.
Here is how we can use it in a modularized structure.
Using null_resource:
resource "null_resource" "instance_ips" {
  triggers {
    ip_file = "${sha1(file("${path.module}/instance_ips.txt"))}"
  }

  provisioner "local-exec" {
    # join() flattens the list of IPs into a single comma-separated string
    command = "echo ${join(",", module.ec2.instance_ips)} >> instance_ips.txt"
  }
}
Using local_file:
resource "local_file" "instance_ips" {
  # join() flattens the list of IPs into a single comma-separated string
  content  = "${join(",", module.ec2.instance_ips)}"
  filename = "${path.module}/instance_ips.txt"
}
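As a minimal sketch of the consuming side (assuming your sub-script can be Python and the IPs were written comma-separated as above), the generated file can then be read back like this:

from pathlib import Path

# Read the comma-separated IPs written by the local_file / local-exec resource.
ips = Path("instance_ips.txt").read_text().strip().split(",")

for ip in ips:
    print(ip)  # e.g. hand each IP to the next provisioning step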
When creating a new cluster using boto3, I want to use the configuration from an existing (terminated) cluster and thus clone it.
As far as I know, emr_client.run_job_flow requires all of the configuration (Instances, InstanceFleets, etc.) to be provided as parameters.
Is there any way I can clone from an existing cluster, like I can do from the AWS console for EMR?
What I can recommend is using the AWS CLI to fire up your cluster.
It lets you version your cluster configuration, and you can easily load the steps configuration from a JSON file.
aws emr create-cluster --name "Cluster's name" --ec2-attributes KeyName=SSH_KEY --instance-type m3.xlarge --release-label emr-5.2.1 --log-uri s3://mybucket/logs/ --enable-debugging --instance-count 1 --use-default-roles --applications Name=Spark --steps file://step.json
Where step.json looks like:
[
  {
    "Name": "Step #1",
    "Type": "SPARK",
    "Jar": "command-runner.jar",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.your.data.set.class",
      "s3://path/to/your/spark-job.jar",
      "-c", "s3://path/to/your/config/or/not",
      "--aws-access-key", "ACCESS_KEY",
      "--aws-secret-key", "SECRET_KEY"
    ],
    "ActionOnFailure": "CANCEL_AND_WAIT"
  }
]
(Multiple steps are okay too.)
After that, you can always start up the same configured cluster.
And, for example, schedule the whole cluster and its steps from a single Airflow job.
But if you really want to use boto3, I suppose the describe_cluster() method can help you get all of the information, and you can use the returned object to fire up a new one.
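Here is a minimal sketch of that boto3 approach (the field mapping is an assumption on my side; describe_cluster() does not return a payload that run_job_flow() accepts verbatim, so you have to copy over the pieces you care about yourself):

import boto3

emr = boto3.client('emr')

# Describe the (terminated) cluster you want to clone.
source = emr.describe_cluster(ClusterId='j-XXXXXXXXXXXXX')['Cluster']

# Copy the configuration pieces you care about; the key names on the
# run_job_flow() side differ from describe_cluster(), so the mapping is manual.
response = emr.run_job_flow(
    Name=source['Name'] + '-clone',
    ReleaseLabel=source['ReleaseLabel'],
    Applications=[{'Name': app['Name']} for app in source['Applications']],
    ServiceRole=source['ServiceRole'],
    JobFlowRole=source['Ec2InstanceAttributes']['IamInstanceProfile'],
    Instances={
        'Ec2KeyName': source['Ec2InstanceAttributes'].get('Ec2KeyName', ''),
        'Ec2SubnetId': source['Ec2InstanceAttributes'].get('Ec2SubnetId', ''),
        'KeepJobFlowAliveWhenNoSteps': True,
        # Instance groups/fleets have to be rebuilt explicitly, e.g. from
        # emr.list_instance_groups(ClusterId=...).
        'InstanceGroups': [
            {'InstanceRole': 'MASTER', 'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
        ],
    },
)
print(response['JobFlowId'])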
There is no way to get an "EMR export CLI" through the command line.
You should parse the parameters you want to clone from "describe-cluster".
See the sample below:
https://github.com/awslabs/aws-support-tools/tree/master/EMR/Get_EMR_CLI_Export
import boto3
import json
import sys
cluster_id = sys.argv[1]
client = boto3.client('emr')
clst = client.describe_cluster(ClusterId=cluster_id)
...
awscli += ' --steps ' + '\'' + json.dumps(cli_steps) + '\''
...
awscli += ' --instance-groups ' + '\'' + json.dumps(cli_igroups) + '\''
print(awscli)
It works by parsing the parameters from "describe-cluster" first, and then building strings that fit "create-cluster" in the AWS CLI.
Is there a way to track what Google Cloud Platform operations were performed by a user? We want to audit our costs and track usage accordingly.
Edit: there's a Cloud SDK (gcloud) command:
compute operations list
that lists actions taken on Compute Engine instances. Is there a way to see what user performed these actions?
While you can't see a list of gcloud commands executed, you can see a list of API actions. The gcloud beta logging surface helps with listing/reading logs, but it's a bit harder to use than the console. Try checking the logs in the Cloud Console.
If you wish to only track Google Cloud Platform (GCP) Compute Engine (GCE) operations with the list command of the operations subgroup, you can use the --filter flag to see operations performed by a given user $GCE_USER_NAME:
gcloud compute operations list \
--filter="user=$GCE_USER_NAME" \
--limit=1 \
--sort-by="~endTime"
#=>
NAME TYPE TARGET HTTP_STATUS STATUS TIMESTAMP
$GCP_COMPUTE_OPERATION_NAME start $GCP_COMPUTE_INSTANCE_NAME 200 DONE 1970-01-01T00:00:00.001-00:00
Note: feeding the string "~endTime" into the --sort-by flag puts the most recent GCE operation first.
It might help to retrieve the entire log object in JSON:
gcloud compute operations list \
--filter="user=$GCE_USER_NAME" \
--format=json \
--limit=1 \
--sort-by="~endTime"
#=>
[
{
"endTime": "1970-01-01T00:00:00.001-00:00",
. . .
"user": "$GCP_COMPUTE_USER"
}
]
or YAML:
gcloud compute operations list \
--filter="user=$GCE_USER_NAME" \
--format=yaml \
--limit=1 \
--sort-by="~endTime"
#=>
---
endTime: '1970-01-01T00:00:00.001-00:00'
. . .
user: $GCP_COMPUTE_USER
You are also able to use the Cloud SDK (gcloud) to explore all audit logs, not just audit logs for GCE; it is incredibly clunky, as the other existing answer points out. However, for anyone who wants to use gcloud instead of the console:
gcloud logging read \
'logName : "projects/$GCP_PROJECT_NAME/logs/cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="$GCE_USER_NAME"
severity>=NOTICE' \
--freshness="1d" \
--limit=1 \
--order="desc" \
--project=$GCP_PROJECT_NAME
#=>
---
insertId: . . .
. . .
protoPayload:
'#type': type.googleapis.com/google.cloud.audit.AuditLog
authenticationInfo:
principalEmail: $GCP_COMPUTE_USER
. . .
. . .
The read command defaults to YAML format, but you can also get your audit logs in JSON:
gcloud logging read \
'logName : "projects/$GCP_PROJECT_NAME/logs/cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="$GCE_USER_NAME"
severity>=NOTICE' \
--format=json \
--freshness="1d" \
--limit=1 \
--order="desc" \
--project=$GCP_PROJECT_NAME
#=>
[
{
. . .
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"authenticationInfo": {
"principalEmail": "$GCE_USER_NAME"
},
. . .
},
. . .
}
]
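If you would rather run the same audit-log query programmatically instead of through gcloud, here is a minimal sketch using the google-cloud-logging client library (the filter mirrors the gcloud examples above; the project and user values are placeholders):

from google.cloud import logging

# Placeholders; substitute your own project ID and the user you want to audit.
PROJECT_ID = "my-gcp-project"
USER_EMAIL = "someone@example.com"

client = logging.Client(project=PROJECT_ID)

# Same filter as the gcloud examples above: Cloud Audit Logs entries
# for a single principal, at NOTICE severity or above.
log_filter = (
    f'logName:"projects/{PROJECT_ID}/logs/cloudaudit.googleapis.com" '
    f'protoPayload.authenticationInfo.principalEmail="{USER_EMAIL}" '
    'severity>=NOTICE'
)

# Most recent entries first.
for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=10
):
    print(entry.timestamp, entry.log_name)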