BigQuery Table Exports - google-cloud-platform

I am looking for the best pattern to execute a BigQuery query and export the result to a Cloud Storage bucket. I would like this to run whenever the BigQuery table is written to or modified.
Traditionally, I would set up a Pub/Sub topic that is written to when the table is modified, which would trigger a Cloud Function responsible for executing the query and writing the result to a bucket. I am just not confident that there isn't a better (more straightforward) approach in GCP.
Thanks in advance.

I suggest an approach based on Eventarc.
The goal is to launch a Cloud Function or Cloud Run service when data is inserted or updated in a BigQuery table. Example with Cloud Run:
SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
# REGION and SVC_ACCOUNT must be set to your region and trigger service account beforehand
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image ${CONTAINER} --platform managed
gcloud eventarc triggers create ${SERVICE}-trigger \
  --location ${REGION} --service-account ${SVC_ACCOUNT} \
  --destination-run-service ${SERVICE} \
  --event-filters type=google.cloud.audit.log.v1.written \
  --event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
  --event-filters serviceName=bigquery.googleapis.com
When a BigQuery job is executed, the Cloud Run service is triggered.
Example Cloud Run handler:
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Get the payload data from the audit log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        if ds == 'cloud_run_tmp' and \
           tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            # create_agg() is not shown here; it is assumed to run the query
            query = create_agg()
            return "table created", 200
    except (KeyError, TypeError, ValueError):
        # if these fields are not in the JSON, ignore the event
        pass
    return "ok", 200
You can apply logic based on the current dataset, table or other elements existing in the current payload.
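The payload check and the export step can be kept as two small, separately testable functions. The sketch below is only an illustration under stated assumptions: `inserted_rows` and `build_export_sql` are hypothetical helper names, the bucket URI is a placeholder, and using a BigQuery `EXPORT DATA` statement as the export mechanism is an assumption, since the answer does not show what `create_agg()` does.

```python
from typing import Any, Dict, Optional


def inserted_rows(payload: Dict[str, Any], dataset: str, table: str) -> Optional[int]:
    """Return the inserted row count if the audit-log event targets the table, else None."""
    try:
        ds = payload["resource"]["labels"]["dataset_id"]
        resource_name = payload["protoPayload"]["resourceName"]
        rows = int(
            payload["protoPayload"]["metadata"]["tableDataChange"]["insertedRowsCount"]
        )
    except (KeyError, TypeError, ValueError):
        # Not a table-data-change event (e.g. a plain query job): ignore it.
        return None
    if ds == dataset and resource_name.endswith(f"tables/{table}") and rows > 0:
        return rows
    return None


def build_export_sql(query: str, gcs_uri: str) -> str:
    """Wrap a SELECT in an EXPORT DATA statement that writes CSV files to gcs_uri.

    gcs_uri is a placeholder such as 'gs://my-bucket/exports/result-*.csv'.
    """
    if gcs_uri.count("*") != 1:
        raise ValueError("gcs_uri must contain exactly one '*' wildcard")
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{gcs_uri}',\n"
        "  format='CSV',\n"
        "  overwrite=true,\n"
        "  header=true\n"
        ") AS\n"
        f"{query}"
    )
```

The handler would call `inserted_rows(request.json, ...)` first and only submit the generated statement as a normal query job when it returns a count. The URI should come from trusted configuration, because it is interpolated into the SQL text.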

Related

GCP Dataflow extract JOB_ID

For a Dataflow job, I need to extract the JOB_ID from the job name. I have the command below and its output. Can you please advise how to extract the JOB_ID from this response?
$ gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job"
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2020-10-07_10_11_20-15879763245819496196 sample-job Streaming 2020-10-07 17:11:21 Running us-central1
A Python script would also be fine.
You can ask gcloud to print only the job ID column:
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" --format="value(JOB_ID)"
You can also use standard command-line tools to parse the output of the original command, for example:
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" | tail -n 1 | cut -f 1 -d " "
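The same column extraction can be done in Python on the captured text. This is a minimal sketch, assuming the default table output shown in the question (a header row followed by whitespace-separated columns); `extract_job_id` is a hypothetical helper name.

```python
from typing import Optional


def extract_job_id(table_output: str, job_name: str) -> Optional[str]:
    """Return the JOB_ID of the first row whose NAME column equals job_name."""
    lines = [line for line in table_output.splitlines() if line.strip()]
    if not lines:
        return None
    header = lines[0].split()
    # Raises ValueError if the expected header columns are missing.
    id_col, name_col = header.index("JOB_ID"), header.index("NAME")
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > max(id_col, name_col) and fields[name_col] == job_name:
            return fields[id_col]
    return None


sample = """JOB_ID NAME TYPE CREATION_TIME STATE REGION
2020-10-07_10_11_20-15879763245819496196 sample-job Streaming 2020-10-07 17:11:21 Running us-central1
"""
print(extract_job_id(sample, "sample-job"))
```

In a real script the text would come from running the gcloud command through `subprocess.run(..., capture_output=True, text=True)`.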
Alternatively, if this is already part of a Python program, you can use the Dataflow API directly instead of the gcloud tool, as in "How to list down all the dataflow jobs using python API".
With Python, you can retrieve the list of jobs with a REST request to the Dataflow method https://dataflow.googleapis.com/v1b3/projects/{projectId}/jobs
Then, the JSON response can be parsed to find the job name you are searching for with an if clause:
if job["name"] == 'sample-job':
I tested this approach and it worked:
import requests

base_url = 'https://dataflow.googleapis.com/v1b3/projects/'
project_id = '<MY_PROJECT_ID>'
location = '<REGION>'
response = requests.get(f'{base_url}{project_id}/locations/{location}/jobs', headers={'Authorization': 'Bearer <BEARER_TOKEN_HERE>'})
# <BEARER_TOKEN_HERE> can be retrieved with 'gcloud auth print-access-token' obtained with an account that has access to Dataflow jobs.
# Another authentication mechanism can be found in the link provided by danielm
response.raise_for_status()
jobslist = response.json()
# The jobs are listed under the "jobs" key; other keys (such as a page token) are not job lists
for job in jobslist.get('jobs', []):
    if job["name"] == 'beamapp-0907191546-413196':
        print(job["name"], " Found, job ID:", job["id"])
    else:
        print(job["name"], " Not matched")
# Output:
# windowedwordcount-0908012420-bd342f98 Not matched
# beamapp-0907200305-106040 Not matched
# beamapp-0907192915-394932 Not matched
# beamapp-0907191546-413196 Found, job ID: 2020-09-07...154989572
Created my GIST with Python script to achieve it.

Submit gcloud ai-platform training job programmatically (from python code)

To submit a training job with gcloud ai-platform (formerly gcloud ml-engine), you use the following command from the gcloud SDK:
gcloud ai-platform jobs submit COMMAND [GCLOUD_WIDE_FLAG …]
I want to do this programmatically, i.e. from python code (or any other language). Something like
import gcloud-ai-platform as gap
gap.submit_job(COMMAND)
Is there such a command? And if it does not exist, how can I build a workaround? (using gcloud sdk programmatically)
I found the docs a bit confusing with regard to custom images. The trick is to specify the equivalent of --master-image-uri via masterConfig/imageUri:
from googleapiclient import discovery

training_inputs = {
    'scaleTier': 'BASIC',
    'packageUris': [],
    'masterConfig': {
        'imageUri': settings["AI_SERVER_URI"]
    },
    'args': [
        "java", "-cp", "MY.jar:jars/*", "io.manycore.Test",
        "jar positional argument"
    ],
    'region': 'us-central1',
    'pythonVersion': '3.7',
    'scheduling': {
        'maxRunningTime': '3600s'
    },
}
job_spec = {'jobId': jobid, 'trainingInput': training_inputs}
project_name = settings["PROJECT_ID"]
project_id = 'projects/{}'.format(project_name)
# settings, jobid and self.credentials come from the surrounding application code
cloudml = discovery.build('ml', 'v1', credentials=self.credentials)
request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
response = request.execute()  # the job is only submitted once the request is executed
For submitting a training job, there is an example that you can follow.
It shows both methods: using gcloud and the equivalent Python code.
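If the goal is simply to drive the gcloud SDK from code, one possible workaround is to build the command line in Python and run it with `subprocess`. This is a sketch under assumptions, not an official client: `build_submit_command` and `submit` are hypothetical helpers, the job ID and image URI are placeholders, and gcloud must be installed and authenticated on the machine.

```python
import subprocess
from typing import List, Optional, Sequence


def build_submit_command(job_id: str, region: str, image_uri: str,
                         user_args: Sequence[str] = ()) -> List[str]:
    """Build the argv list for 'gcloud ai-platform jobs submit training'."""
    cmd = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_id,
        "--region", region,
        "--master-image-uri", image_uri,
    ]
    if user_args:
        # Everything after '--' is passed to the training container unchanged.
        cmd.append("--")
        cmd.extend(user_args)
    return cmd


def submit(cmd: List[str], dry_run: bool = True) -> Optional[subprocess.CompletedProcess]:
    """Run the command; with dry_run=True nothing is executed."""
    if dry_run:
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems; `check=True` raises `CalledProcessError` if gcloud exits with a non-zero status.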

Does CMLE provide a REST API endpoint for prediction?

Is there a way I can access a REST API endpoint for a Model created by Cloud ML Engine? I only see:
gcloud ml-engine jobs submit prediction $JOB_NAME \
--model census \
--version v1 \
--data-format TEXT \
--region $REGION \
--runtime-version 1.10 \
--input-paths gs://cloud-samples-data/ml-engine/testdata/prediction/census.json \
--output-path $GCS_JOB_DIR/predictions
Yes, in fact there are two APIs available to do this.
The projects.predict call is the simplest method. You pass in a request as described here, and it returns the result. Unlike your gcloud command, it cannot take input from GCS.
The projects.jobs.create call with the predictionInput field allows batch prediction, with input from GCS.
The equivalent of your command is:
POST https://ml.googleapis.com/v1/projects/$PROJECT_ID/jobs
{
  "jobId": "$JOB_NAME",
  "predictionInput": {
    "dataFormat": "TEXT",
    "inputPaths": ["gs://cloud-samples-data/ml-engine/testdata/prediction/census.json"],
    "outputPath": "$GCS_JOB_DIR/predictions",
    "region": "$REGION",
    "runtimeVersion": "1.10",
    "versionName": "projects/$PROJECT_ID/models/census/versions/v1"
  }
}
This returns immediately. Use projects.jobs.get to check for success or failure.
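Because the create call returns before the job finishes, the caller has to poll for the final state. The sketch below assumes a caller-supplied `get_job` function (hypothetical) that performs the projects.jobs.get request and returns the job as a dict with a `state` field; the interval and poll limit are arbitrary defaults.

```python
import time
from typing import Any, Callable, Dict

# States after which the job will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_job: Callable[[], Dict[str, Any]],
                 interval_s: float = 30.0,
                 max_polls: int = 120,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll get_job() until the job reaches a terminal state, then return it."""
    for attempt in range(max_polls):
        job = get_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

The `sleep` parameter exists only so the loop can be exercised without real waiting; in normal use the default `time.sleep` is fine.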

How to run a Google Cloud Build trigger via cli / rest api / cloud functions?

Is there such an option? My use case would be running a trigger for a production build (deploys to production). Ideally, that trigger doesn't need to listen to any change since it is invoked manually via chatbot.
I saw this video CI/CD for Hybrid and Multi-Cloud Customers (Cloud Next '18) announcing there's an API trigger support, I'm not sure if that's what I need.
I did the same thing a few days ago.
You can submit your builds using gcloud or the REST API.
gcloud:
gcloud builds submit --no-source --config=cloudbuild.yaml --async --format=json
REST API:
Send your cloudbuild.yaml as JSON with an auth token to this URL: https://cloudbuild.googleapis.com/v1/projects/standf-188123/builds?alt=json
example cloudbuild.yaml:
steps:
- name: 'gcr.io/cloud-builders/docker'
  id: Docker Version
  args: ["version"]
- name: 'alpine'
  id: Hello Cloud Build
  args: ["echo", "Hello Cloud Build"]
example rest_json_body:
{"steps": [{"args": ["version"], "id": "Docker Version", "name": "gcr.io/cloud-builders/docker"}, {"args": ["echo", "Hello Cloud Build"], "id": "Hello Cloud Build", "name": "alpine"}]}
This now seems to be possible via API:
https://cloud.google.com/cloud-build/docs/api/reference/rest/v1/projects.triggers/run
request.json:
{
"projectId": "*****",
"commitSha": "************"
}
curl request (using a gcloud command to obtain the access token):
PROJECT_ID="********" TRIGGER_ID="*******************"; curl -X POST -T request.json -H "Authorization: Bearer $(gcloud config config-helper \
--format='value(credential.access_token)')" \
https://cloudbuild.googleapis.com/v1/projects/"$PROJECT_ID"/triggers/"$TRIGGER_ID":run
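The same call can be prepared from Python with only the standard library, which may be convenient inside a chatbot or Cloud Function. This is a sketch under assumptions: `build_run_trigger_request` is a hypothetical helper, the access token must be obtained separately (for example with the gcloud command above), and the body mirrors the request.json shown earlier.

```python
import json
import urllib.request


def build_run_trigger_request(project_id: str, trigger_id: str,
                              access_token: str, commit_sha: str) -> urllib.request.Request:
    """Prepare the POST request for the projects.triggers.run endpoint."""
    url = (
        "https://cloudbuild.googleapis.com/v1/"
        f"projects/{project_id}/triggers/{trigger_id}:run"
    )
    body = json.dumps({"projectId": project_id, "commitSha": commit_sha}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it is a matter of `urllib.request.urlopen(request)`, which raises `urllib.error.HTTPError` on a non-2xx response.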
You can use the Google API client to create build jobs with Python:
import operator
from functools import reduce
from typing import Dict, List, Optional, Union

from google.oauth2 import service_account
from googleapiclient import discovery


class GcloudService():
    def __init__(self, service_token_path, project_id: Union[str, None]):
        self.project_id = project_id
        self.service_token_path = service_token_path
        self.credentials = service_account.Credentials.from_service_account_file(self.service_token_path)


class CloudBuildApiService(GcloudService):
    def __init__(self, *args, **kwargs):
        super(CloudBuildApiService, self).__init__(*args, **kwargs)
        scoped_credentials = self.credentials.with_scopes(['https://www.googleapis.com/auth/cloud-platform'])
        self.service = discovery.build('cloudbuild', 'v1', credentials=scoped_credentials, cache_discovery=False)

    def get(self, build_id: str) -> Dict:
        return self.service.projects().builds().get(projectId=self.project_id, id=build_id).execute()

    def create(self, image_name: str, gcs_name: str, gcs_path: str, env: Optional[Dict] = None):
        # Optional --build-arg flags, followed by the tag, Dockerfile and build context
        args: List[str] = self._get_env(env) if env else []
        opt_params: List[str] = [
            '-t', f'gcr.io/{self.project_id}/{image_name}',
            '-f', f'./{image_name}/Dockerfile',
            f'./{image_name}'
        ]
        build_cmd: List[str] = ['build'] + args + opt_params
        body = {
            "projectId": self.project_id,
            "source": {
                'storageSource': {
                    'bucket': gcs_name,
                    'object': gcs_path,
                }
            },
            "steps": [
                {
                    "name": "gcr.io/cloud-builders/docker",
                    "args": build_cmd,
                },
            ],
            # "images" is a flat list of image names
            "images": [
                f'gcr.io/{self.project_id}/{image_name}'
            ],
        }
        return self.service.projects().builds().create(projectId=self.project_id, body=body).execute()

    def _get_env(self, env: Dict) -> List[str]:
        pairs = [['--build-arg', f'{key}={value}'] for key, value in env.items()]
        # Flatten the list of pairs into a single argument list
        return reduce(operator.iconcat, pairs, [])
Here is the documentation so that you can implement more functionality: https://cloud.google.com/cloud-build/docs/api
Hope this helps.
If you just want to create a function that you can invoke directly, you have two choices:
An HTTP trigger with a standard API endpoint
A pubsub trigger that you invoke by sending a message to a pubsub topic
The first is the more common approach, as you are effectively creating a web API that any client can call with an HTTP library of their choice.
You should be able to manually trigger a build using curl and a json payload.
For details see: https://cloud.google.com/cloud-build/docs/running-builds/start-build-manually#running_builds.
Given that, you could write a Python Cloud function to replicate the curl call via the requests module.
I was looking for the same thing (Fall 2022), and while I haven't tested it yet, I wanted to answer before I forget. It appears to be available now via gcloud beta builds triggers run TRIGGER
You can trigger a function via
gcloud functions call NAME --data 'THING'
Inside your function you can do pretty much anything possible within Google's public APIs.
If you just want to trigger Cloud Build directly from git, then it's probably advisable to use release version tags: your chatbot might add a release tag to your release branch in git, at which point Cloud Build will start the build.
More info here: https://cloud.google.com/cloud-build/docs/running-builds/automate-builds

How to know RDS free storage

I've created an RDS Postgres instance with an initial size of 65 GB.
Is it possible to get the available free space using a Postgres query?
If not, how can I achieve the same?
Thank you in advance.
A couple of ways to do it:
Using the AWS Console
Go to the RDS console and select the region your database is in. Click on the Show Monitoring button and pick your database instance. There will be a graph that shows Free Storage Space.
This is documented in the AWS RDS documentation.
Using the API via AWS CLI
Alternatively, you can use the AWS API to get the information from CloudWatch.
I will show how to do this with the AWS CLI.
This assumes you have set up the AWS CLI credentials. I export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in my environment variables, but there are multiple ways to configure the CLI (or SDKs).
REGION="eu-west-1"
START="$(date -u -d '5 minutes ago' '+%Y-%m-%dT%T')"
END="$(date -u '+%Y-%m-%dT%T')"
INSTANCE_NAME="tstirldbopgs001"
AWS_DEFAULT_REGION="$REGION" aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name FreeStorageSpace \
  --start-time "$START" --end-time "$END" --period 300 \
  --statistics Average \
  --dimensions "Name=DBInstanceIdentifier,Value=${INSTANCE_NAME}"
{
"Label": "FreeStorageSpace",
"Datapoints": [
{
"Timestamp": "2017-11-16T14:01:00Z",
"Average": 95406264320.0,
"Unit": "Bytes"
}
]
}
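The value is reported in bytes, so a small conversion helps when reading it. The sketch below parses the JSON printed by the CLI call and converts the most recent datapoint to gigabytes (1024-based, as in the Java example further down); `latest_free_space_gib` is a hypothetical helper name.

```python
import json
from typing import Optional

BYTES_PER_GIB = 1024 ** 3


def latest_free_space_gib(response_json: str) -> Optional[float]:
    """Return the most recent Average datapoint in GiB, or None if there is none."""
    datapoints = json.loads(response_json).get("Datapoints", [])
    if not datapoints:
        return None
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    latest = max(datapoints, key=lambda dp: dp["Timestamp"])
    return latest["Average"] / BYTES_PER_GIB


sample = """
{
  "Label": "FreeStorageSpace",
  "Datapoints": [
    {"Timestamp": "2017-11-16T14:01:00Z", "Average": 95406264320.0, "Unit": "Bytes"}
  ]
}
"""
print(f"Free space: {latest_free_space_gib(sample):.2f} GB")  # Free space: 88.85 GB
```

An empty `Datapoints` list usually means the time window did not contain a sample yet, so the helper returns `None` instead of raising.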
Using the API via Java SDK
Here's a rudimentary example of how to get the same data via the AWS SDK for Java, using the CloudWatch API.
build.gradle contents
apply plugin: 'java'
apply plugin: 'application'
sourceCompatibility = 1.8
repositories {
    jcenter()
}
dependencies {
    compile 'com.amazonaws:aws-java-sdk-cloudwatch:1.11.232'
}
mainClassName = 'GetRDSInfo'
Java class
Again, I rely on the credential chain to get AWS API credentials (I set them in my environment). You can change the call to the builder to change this behavior (see Working with AWS Credentials documentation).
import java.util.Calendar;
import java.util.Date;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;
import com.amazonaws.services.cloudwatch.model.StandardUnit;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.Datapoint;

public class GetRDSInfo {
    public static void main(String[] args) {
        final long GIGABYTE = 1024L * 1024L * 1024L;
        // calculate our endTime as now and startTime as 5 minutes ago.
        Calendar cal = Calendar.getInstance();
        Date endTime = cal.getTime();
        cal.add(Calendar.MINUTE, -5);
        Date startTime = cal.getTime();
        String dbIdentifier = "tstirldbopgs001";
        Regions region = Regions.EU_WEST_1;
        Dimension dim = new Dimension()
            .withName("DBInstanceIdentifier")
            .withValue(dbIdentifier);
        final AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.standard()
            .withRegion(region)
            .build();
        GetMetricStatisticsRequest req = new GetMetricStatisticsRequest()
            .withNamespace("AWS/RDS")
            .withMetricName("FreeStorageSpace")
            .withStatistics("Average")
            .withStartTime(startTime)
            .withEndTime(endTime)
            .withDimensions(dim)
            .withPeriod(300);
        GetMetricStatisticsResult res = cw.getMetricStatistics(req);
        for (Datapoint dp : res.getDatapoints()) {
            // We requested only the average free space over the last 5 minutes
            // so we only have one datapoint
            double freespaceGigs = dp.getAverage() / GIGABYTE;
            System.out.println(String.format("Free Space: %.2f GB", freespaceGigs));
        }
    }
}
Example Java Code Execution
> gradle run
> Task :run
Free Space: 88.85 GB
BUILD SUCCESSFUL in 7s
The method using the AWS Management Console has changed.
Now you have to go to:
RDS > Databases > [your_db_instance]
From there, scroll down and click on "Monitoring".
There you should be able to see your database's "Free Storage Space" (in MB).