How to automatically back up and version BigQuery code such as stored procs? - google-cloud-platform

What are some of the options to back up BigQuery DDLs - particularly views, stored procedure and function code?
We have a significant amount of code in BigQuery and we want to automatically back this up and preferably version it as well. Wondering how others are doing this.
Appreciate any help.
Thanks!

To keep and track our BigQuery structure and code, we use Terraform to manage every resource in BigQuery.
More specifically to your question, we use the google_bigquery_routine resource, so changes are reviewed by other team members and we get every other benefit of working with a VCS.
Another important part of our Terraform code is that we version our BigQuery module (via GitHub releases/tags), which includes the table structures and routines, and use it across multiple environments.
Looks something like:
main.tf
module "bigquery" {
source = "github.com/sample-org/terraform-modules.git//bigquery?ref=0.0.2"
project_id = var.project_id
...
... other vars for the module
...
}
terraform-modules/bigquery/main.tf
resource "google_bigquery_dataset" "test" {
dataset_id = "dataset_id"
project_id = var.project_name
}
resource "google_bigquery_routine" "sproc" {
dataset_id = google_bigquery_dataset.test.dataset_id
routine_id = "routine_id"
routine_type = "PROCEDURE"
language = "SQL"
definition_body = "CREATE FUNCTION Add(x FLOAT64, y FLOAT64) RETURNS FLOAT64 AS (x + y);"
}
This lets us upgrade our infrastructure across all environments without additional code changes.

We finally ended up backing up DDLs and routines using INFORMATION_SCHEMA. A scheduled job extracts the relevant metadata and then uploads the content into GCS.
Example SQLs:
select * from <schema>.INFORMATION_SCHEMA.ROUTINES;
select * from <schema>.INFORMATION_SCHEMA.VIEWS;
select *, DDL from <schema>.INFORMATION_SCHEMA.TABLES;
You have to explicitly specify DDL in the column list for the table DDLs to show up.
Please check the documentation as these things evolve rapidly.
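A minimal sketch of what such a scheduled job could look like in Python, under stated assumptions: `client` is assumed to behave like a `google.cloud.bigquery.Client` (only `query(sql).result()` is used), and `write` is a hypothetical callback that stores text under an object name, for example a thin wrapper around a Cloud Storage `blob.upload_from_string`. The function names, the object-name layout, and the JSON-lines format are illustrative choices, not the setup described above.

```python
import json
from datetime import datetime, timezone

# INFORMATION_SCHEMA views to export. TABLES lists DDL explicitly,
# following the note above.
VIEW_COLUMNS = {
    "ROUTINES": "*",
    "VIEWS": "*",
    "TABLES": "*, DDL",
}


def build_queries(dataset):
    """Return {view_name: sql} for one dataset."""
    return {
        view: f"select {cols} from {dataset}.INFORMATION_SCHEMA.{view}"
        for view, cols in VIEW_COLUMNS.items()
    }


def backup_dataset(client, dataset, write, now=None):
    """Run each metadata query and hand the rows to `write` as JSON lines.

    client: assumed to behave like google.cloud.bigquery.Client.
    write:  hypothetical callable (object_name, text) -> None.
    """
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    written = []
    for view, sql in build_queries(dataset).items():
        rows = client.query(sql).result()
        # row.items() is available on BigQuery Row objects and plain dicts;
        # default=str covers timestamps and other non-JSON values.
        text = "\n".join(
            json.dumps(dict(row.items()), default=str) for row in rows
        )
        name = f"{dataset}/{view.lower()}-{stamp}.jsonl"
        write(name, text)
        written.append(name)
    return written
```

Putting the timestamp in the object name keeps every run, so earlier states stay available for comparison.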

Every night I write two definition files to Cloud Storage using Cloud Run: one for tables and views, and one for routines (stored procedures and functions). See this tutorial about setting it up. The Cloud Run service exposes an HTTP endpoint that is triggered by Cloud Scheduler. It essentially runs this script:
#!/usr/bin/env bash
set -eo pipefail
GCLOUD_REPORT_BUCKET="myproject-code/backups"
objects_report="gs://${GCLOUD_REPORT_BUCKET}/objects-backup-report-$(date +%s).txt"
routines_report="gs://${GCLOUD_REPORT_BUCKET}/routines-backup-report-$(date +%s).txt"
project_id="myproject-dw"
table_defs=""
routine_defs=""
# get list of datasets and table definitions
# (tail -n +3 skips the two header lines printed by bq ls)
datasets=$(bq ls --max_results=1000 | grep -v -e "fivetran*" | awk '{print $1}' | tail -n +3)
for dataset in $datasets
do
  echo "${project_id}:${dataset}"
  # collect table and view definitions
  tables=$(bq ls --max_results 1000 "${project_id}:${dataset}" | awk '{print $1}' | tail -n +3)
  for table in $tables
  do
    echo "${project_id}:${dataset}.${table}"
    table_defs+="$(bq show --format=prettyjson "${project_id}:${dataset}.${table}")"
  done
  # collect routine (stored proc and function) definitions
  routines=$(bq ls --max_results 1000 --routines=true "${project_id}:${dataset}" | awk '{print $1}' | tail -n +3)
  for routine in $routines
  do
    echo "${project_id}:${dataset}.${routine}"
    routine_defs+="$(bq show --format=prettyjson --routine=true "${project_id}:${dataset}.${routine}")"
  done
done
# jq re-emits the concatenated JSON documents before the upload
echo "$table_defs" | jq '.' | gsutil -q cp -J - "${objects_report}"
echo "$routine_defs" | jq '.' | gsutil -q cp -J - "${routines_report}"
# /dev/stderr is sent to Cloud Logging.
echo "objects-backup-report: wrote to ${objects_report}" >&2
echo "Wrote objects report to ${objects_report}"
echo "routines-backup-report: wrote to ${routines_report}" >&2
echo "Wrote routines report to ${routines_report}"
The output is essentially what you would get by running bq ls and bq show for every dataset and piping the results to a dated text file. I may add this to git, but the file name includes a timestamp, so you can see the state of BigQuery on a given date by reviewing that day's file.

Related

How to determine if a string is located in AWS S3 CSV file

I have a CSV file in AWS S3.
The file is very large (2.5 gigabytes).
It has a single column of strings, with over 120 million rows:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine if the string xyz.com is located in the file?
I only need to know if the string is there or not, I don't need to return the file.
It would also be great if I could pass multiple strings to search for and get back only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com']
Will return => ['xyz.com','ggg.com']
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
res = client.select_object_content(
Bucket="my-bucket",
Key="my.csv",
ExpressionType="SQL",
InputSerialization={"CSV": { "FileHeaderInfo": "NONE" }}, # or IGNORE, USE
OutputSerialization={"JSON": {}},
Expression="SELECT * FROM S3Object s WHERE _1 IN ['xyz.com', 'ggg.com']") # _1 refers to the first column
See this AWS blog post for an example with output parsing.
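For reference, a minimal sketch of the output parsing, assuming the JSON output serialization used in the call above, where each record is expected to arrive as one JSON object per line keyed by `_1`. The response's `Payload` is an event stream in which record data comes as bytes under `event["Records"]["Payload"]`. The function name `matched_values` is hypothetical.

```python
import json


def matched_values(events, column="_1"):
    """Collect one column's values from a SelectObjectContent event stream.

    events: the iterable found under res["Payload"]; each event is a dict,
    and record data arrives as bytes under event["Records"]["Payload"].
    """
    chunks = []
    for event in events:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # A record may be split across two events, so join before splitting lines.
    text = b"".join(chunks).decode("utf-8")
    return [
        json.loads(line)[column]
        for line in text.splitlines()
        if line.strip()
    ]
```

With the call above this would be used as `matched_values(res["Payload"])`. Other event types in the stream, such as the stats and end markers, are simply skipped.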
If you use the aws s3 cp command you can send the output to stdout:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The trailing dash (-) sends the output to stdout.
Here are two examples of grep checking for multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
To learn more about grep, please look at the manual: GNU Grep 3.7
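One caveat with the patterns above: `grep 'apc.com'` also matches lines that merely contain the pattern, and the unescaped dot matches any character (`grep -Fx` restricts matching to exact whole lines). A minimal Python sketch of the same streaming idea with exact-line matching follows. The function name `find_present` is hypothetical, and the lines could come from `sys.stdin` (piped from `aws s3 cp ... -`) or, as an assumption about your setup, from boto3's `get_object(...)["Body"].iter_lines()`.

```python
def find_present(lines, wanted):
    """Return the members of `wanted` that occur as a whole line in `lines`.

    lines:  any iterable of str or bytes lines (consumed lazily).
    wanted: the strings to look for.
    """
    remaining = set(wanted)
    found = set()
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line in remaining:
            remaining.discard(line)
            found.add(line)
            if not remaining:
                break  # everything was found; stop reading early
    # Preserve the caller's order in the result.
    return [w for w in wanted if w in found]
```

Because the input is read one line at a time, the 2.5 GB file never has to fit in memory, and the scan stops as soon as every requested string has been seen.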

PCF: How to find the list of all the spaces where you have Developer privileges

If you are not an admin in Pivotal Cloud Foundry, how will you find or list all the orgs/spaces where you have developer privileges? Is there a command or menu to get that, instead of going into each space and verifying it?
Here's a script that will dump the org & space names of which the currently logged in user is a part.
A quick explanation: it calls the /v2/spaces API, which already filters the results to the spaces that the currently logged-in user can see (if you run it as a user with admin access, it will list all orgs and spaces). We then iterate over the results, take each space's organization_url field, and cf curl it to get the organization name (there's a hashmap to cache results).
This script requires Bash 4+ for the hashmap support. If you don't have that, you can remove that part and it will just be a little slower. It also requires jq, and of course the cf cli.
#!/usr/bin/env bash
#
# List all spaces available to the current user
#
set -e
function load_all_pages {
URL="$1"
DATA=""
until [ "$URL" == "null" ]; do
RESP=$(cf curl "$URL")
DATA+=$(echo "$RESP" | jq .resources)
URL=$(echo "$RESP" | jq -r .next_url)
done
# dump the data
echo "$DATA" | jq .[] | jq -s
}
function load_all_spaces {
load_all_pages "/v2/spaces"
}
function main {
declare -A ORGS # cache org name lookups
# load all the spaces & properly paginate
SPACES=$(load_all_spaces)
# filter out the name & org_url
SPACES_DATA=$(echo "$SPACES" | jq -rc '.[].entity | {"name": .name, "org_url": .organization_url}')
printf "Org\tSpace\n"
for SPACE_JSON in $SPACES_DATA; do
SPACE_NAME=$(echo "$SPACE_JSON" | jq -r '.name')
# take the org_url and look up the org name, cache responses for speed
ORG_URL=$(echo "$SPACE_JSON" | jq -r '.org_url')
ORG_NAME="${ORGS[$ORG_URL]}"
if [ "$ORG_NAME" == "" ]; then
ORG_NAME=$(cf curl "$ORG_URL" | jq -r '.entity.name')
ORGS[$ORG_URL]="$ORG_NAME"
fi
printf "$ORG_NAME\t$SPACE_NAME\n"
done
}
main "$@"
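The pagination and caching logic of the script can also be sketched in Python. This is an illustration under assumptions, not a replacement for the script: `fetch` is a hypothetical callable that takes an API path and returns the decoded JSON response, for example a wrapper that runs `cf curl` and parses its output.

```python
def load_all_pages(fetch, url):
    """Follow next_url links and return every resource as one list.

    fetch: hypothetical callable (path) -> decoded JSON response.
    """
    resources = []
    while url:  # next_url is None (JSON null) on the last page
        page = fetch(url)
        resources.extend(page["resources"])
        url = page.get("next_url")
    return resources


def org_space_pairs(fetch):
    """Return sorted (org_name, space_name) pairs visible to the current user."""
    org_names = {}  # cache: organization_url -> organization name
    pairs = []
    for space in load_all_pages(fetch, "/v2/spaces"):
        entity = space["entity"]
        org_url = entity["organization_url"]
        if org_url not in org_names:
            org_names[org_url] = fetch(org_url)["entity"]["name"]
        pairs.append((org_names[org_url], entity["name"]))
    return sorted(pairs)
```

Working on parsed JSON rather than whitespace-split text also keeps space names that contain spaces intact.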

Environment Variables in newest AWS EC2 instance

I am trying to get environment variables into an EC2 instance (trying to run a Django app on Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type ami-0ff8a91507f77f867). How do you get them into the newest version of Amazon Linux, or enable logging so that the problem can be traced?
user-data text (modified from here):
#!/bin/bash
#trying to get a file made
touch /tmp/testfile.txt
cat 'This and that' > /tmp/testfile.txt
#trying to log
echo 'Woot!' > /home/ec2-user/user-script-output.txt
#Trying to get the output logged to see what is going wrong
exec > >(tee /var/log/user-data.log|logger -t user-data ) 2>&1
#trying to log
echo "XXXXXXXXXX STARTING USER DATA SCRIPT XXXXXXXXXXXXXX"
#trying to store the ENVIRONMENT VARIABLES
PARAMETER_PATH='/'
REGION='us-east-1'
# Functions
AWS="/usr/local/bin/aws"
get_parameter_store_tags() {
echo $($AWS ssm get-parameters-by-path --with-decryption --path ${PARAMETER_PATH} --region ${REGION})
}
params_to_env () {
params=$1
# If .Tags does not exist we assume ssm Parameters object.
SELECTOR="Name"
for key in $(echo $params | /usr/bin/jq -r ".[][].${SELECTOR}"); do
value=$(echo $params | /usr/bin/jq -r ".[][] | select(.${SELECTOR}==\"$key\") | .Value")
key=$(echo "${key##*/}" | /usr/bin/tr ':' '_' | /usr/bin/tr '-' '_' | /usr/bin/tr '[:lower:]' '[:upper:]')
export $key="$value"
echo "$key=$value"
done
}
# Get TAGS
if [ -z "$PARAMETER_PATH" ]
then
echo "Please provide a parameter store path. -p option"
exit 1
fi
TAGS=$(get_parameter_store_tags ${PARAMETER_PATH} ${REGION})
echo "Tags fetched via ssm from ${PARAMETER_PATH} ${REGION}"
echo "Adding new variables..."
params_to_env "$TAGS"
Notes -
What I think I know but am unsure of
the user-data script is only run when the instance is created, not when I stop and then start it, as mentioned here (although it also says [I think this is outdated] that the output is logged to /var/log/cloud-init-output.log)
I may not be starting the instance correctly
I don't know where to store the bash script so that it can be executed
What I have verified
the user-data text is on the instance by ssh-ing in and curl http://169.254.169.254/latest/user-data shows the current text (#!/bin/bash …)
What I've tried
editing rc.local directly to export AWS_ACCESS_KEY_ID='JEFEJEFEJEFEJEFE' … and the like
putting them in the AWS Parameter Store (and can see them via the correct call, I just can't trace getting them into the EC2 instance without logs or confirming if the user-data is getting run)
putting ENV variables in Tags and importing them as mentioned here:
tried outputting the logs to other files as suggested here (Not seeing any log files in the ssh instance or on the system log)
viewing the System Log on the aws webpage to see any errors/logs via selecting the instance -> 'Actions' -> 'Instance Settings' -> 'Get System Log' (not seeing any commands run or log statements [only 1 unrelated word of user])

How to retrieve the most recent file in cloud storage bucket?

Is this something that can be done with gsutil?
https://cloud.google.com/storage/docs/gsutil/commands/ls does not seem to mention any sorting functionality - only filtering by a date - which wouldn't work for my use case.
This still doesn't seem to exist, but there is a solution in this post.
The command used is this one:
gsutil ls -l gs://[bucket-name]/ | sort -k 2
Because this sorts the listing by date, you can extract the most recent entry by adding further pipes if you need to:
gsutil ls -l gs://<bucket-name> | sort -k 2 | tail -n 2 | head -1 | cut -d ' ' -f 7
It will not work well if there are fewer than two objects in the bucket, though.
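A minimal Python sketch of the same selection that does not depend on line positions, so it also copes with zero or one object. It assumes the `gsutil ls -l` output format shown in this thread (size, ISO 8601 timestamp, URL, plus a trailing TOTAL line). The function name `most_recent_object` is hypothetical.

```python
def most_recent_object(listing):
    """Return the URL of the newest object in `gsutil ls -l` output, or None.

    Only lines of the form "<size> <timestamp> <url>" are considered, so the
    trailing TOTAL line and prefix-only lines are skipped.
    """
    newest = None  # (timestamp, url)
    for line in listing.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        _size, stamp, url = parts
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        if newest is None or stamp > newest[0]:
            newest = (stamp, url.rstrip())
    return newest[1] if newest else None
```

The listing text could be captured with, for example, `subprocess.run(["gsutil", "ls", "-l", bucket_url], capture_output=True, text=True).stdout`.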
By using gsutil from a host machine this will populate the response array:
response=(`gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Or by gsutil from docker container:
response=(`docker run --name some-container-name --rm --volumes-from gcloud-config -it google/cloud-sdk:latest gsutil ls -l gs://some-bucket-name|sort -k 2|tail -2|head -1`)
Afterwards, to get the whole response, run:
echo ${response[@]}
will print for example:
33 2021-08-11T09:24:55Z gs://some-bucket-name/filename-37.txt
Or to get separate info from the response, (e.g. filename)
echo ${response[2]}
will print the filename only
gs://some-bucket-name/filename-37.txt
For my use case, I wanted to find the most recent directory in my bucket. I number them in ascending order (with leading zeros), so all I need to get the most recent one is this:
gsutil ls -l gs://[bucket-name] | sort | tail -n 1 | cut -d '/' -f 4
list the directory
sort alphabetically (probably unnecessary)
take the last line
tokenise it with "/" delimiter
get the 4th token, which is the directory name

Regex each line of stdout and push to array in shell/bash

I am using AWS CLI to ls an S3 bucket. The output is:
Austins-MacBook-Pro:~ austin$ aws s3 ls s3://obscured-bucket-name
PRE 2016-02-24-03-42/
PRE 2016-02-25-22-25/
PRE 2016-02-26-00-34/
PRE 2016-02-26-00-42/
PRE 2016-02-26-03-43/
Using either Bash or Shell script I need to take each line and remove the spaces or tabs and the PRE before the prefix names and put each prefix in an array so I can use it to subsequently rm the oldest folder.
TLDR;
I need to turn the output of aws s3 ls s3://obscured-bucket-name to an array of values like this: 2016-02-26-03-43/
Thanks for reading!
Under bash, you could:
mapfile myarray < <(aws s3 ls s3://obscured-bucket-name)
echo ${myarray[@]#*PRE }
2016-02-24-03-42/ 2016-02-25-22-25/ 2016-02-26-00-34/ 2016-02-26-00-42/ 2016-02-26-03-43/
or
mapfile -t myarray < <(aws s3 ls s3://obscured-bucket-name)
myarray=( "${myarray[@]#*PRE }" )
printf '<%s>\n' "${myarray[@]%/}"
<2016-02-24-03-42>
<2016-02-25-22-25>
<2016-02-26-00-34>
<2016-02-26-00-42>
<2016-02-26-03-43>
Note: the -t switch removes the trailing newline from each line read.
See help mapfile and/or man -Pless\ +/readarray bash
mapfile was introduced in 2009 with version 4 of bash.
try this:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//"
so if you want to get the oldest folder:
aws s3 ls s3://obscured-bucket-name | sed -e "s/[^0-9]*//" | sort | head -n 1
You could also use awk to the rescue
aws s3 ls <s3://obscured-bucket-name>/ | awk '/PRE/ { print $2 }' | tail -n+2
This prints the prefix name from each PRE line; tail -n+2 then drops the first line of the result. Capture the output in an array variable if you need one.
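The same parsing can be sketched in Python, assuming the `aws s3 ls` output format shown in the question (lines of the form `PRE <name>/`). The function names are hypothetical, and picking the oldest prefix relies on the date-based names sorting chronologically.

```python
def list_prefixes(listing):
    """Return prefix names (e.g. "2016-02-26-03-43/") from `aws s3 ls` output."""
    prefixes = []
    for line in listing.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "PRE":
            prefixes.append(parts[1].strip())
    return prefixes


def oldest_prefix(listing):
    """Return the oldest prefix, relying on the sortable date-based names."""
    prefixes = list_prefixes(listing)
    return min(prefixes) if prefixes else None
```

Lines that describe ordinary objects rather than prefixes do not start with PRE and are ignored.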