Is there a way to get the parameters that were passed to a GCP Dataflow job from the CLI/API - google-cloud-platform

I have tried the describe command listed here and I don't see the parameters. Is there another command that I should use to get this information, or some other API that would provide it?

TL;DR - You're missing the --full argument to the gcloud dataflow jobs describe command.
FLAGS
--full
Retrieve the full Job rather than the summary view
View full job info
If you're using gcloud to view information about a GCP Dataflow job, this command will show the full info (which is actually quite a lot) about the job, including any parameters that were passed to it:
gcloud dataflow jobs describe JOB_ID --full
All the options are under the hierarchy environment.sdkPipelineOptions.options
View all options as JSON
To view all the options passed to the job as JSON (which actually prints more than just the command-line arguments, BTW), you can do:
$ gcloud dataflow jobs describe JOB_ID --full --format='json(environment.sdkPipelineOptions.options)'
{
  "environment": {
    "sdkPipelineOptions": {
      "options": {
        "apiRootUrl": "https://dataflow.googleapis.com/",
        "appName": "WordCount",
        "credentialFactoryClass": "com.google.cloud.dataflow.sdk.util.GcpCredentialFactory",
        "dataflowEndpoint": "",
        "enableCloudDebugger": false,
        "enableProfilingAgent": false,
        "firstArg": "foo",
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "jobName": "wordcount-tuxdude-12345678",
        "numberOfWorkerHarnessThreads": 0,
        "output": "gs://BUCKET_NAME/dataflow/output",
        "pathValidatorClass": "com.google.cloud.dataflow.sdk.util.DataflowPathValidator",
        "project": "PROJECT_NAME",
        "runner": "com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner",
        "secondArg": "bar",
        "stableUniqueNames": "WARNING",
        "stagerClass": "com.google.cloud.dataflow.sdk.util.GcsStager",
        "stagingLocation": "gs://BUCKET_NAME/dataflow/staging/",
        "streaming": false,
        "tempLocation": "gs://BUCKET_NAME/dataflow/staging/"
      }
    }
  }
}
View all options as a table
$ gcloud dataflow jobs describe JOB_ID --full --format='flattened(environment.sdkPipelineOptions.options)'
environment.sdkPipelineOptions.options.apiRootUrl: https://dataflow.googleapis.com/
environment.sdkPipelineOptions.options.appName: WordCount
environment.sdkPipelineOptions.options.credentialFactoryClass: com.google.cloud.dataflow.sdk.util.GcpCredentialFactory
environment.sdkPipelineOptions.options.dataflowEndpoint:
environment.sdkPipelineOptions.options.enableCloudDebugger: False
environment.sdkPipelineOptions.options.enableProfilingAgent: False
environment.sdkPipelineOptions.options.firstArg: foo
environment.sdkPipelineOptions.options.inputFile: gs://dataflow-samples/shakespeare/kinglear.txt
environment.sdkPipelineOptions.options.jobName: wordcount-tuxdude-12345678
environment.sdkPipelineOptions.options.numberOfWorkerHarnessThreads: 0
environment.sdkPipelineOptions.options.output: gs://BUCKET_NAME/dataflow/output
environment.sdkPipelineOptions.options.pathValidatorClass: com.google.cloud.dataflow.sdk.util.DataflowPathValidator
environment.sdkPipelineOptions.options.project: PROJECT_NAME
environment.sdkPipelineOptions.options.runner: com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner
environment.sdkPipelineOptions.options.secondArg: bar
environment.sdkPipelineOptions.options.stableUniqueNames: WARNING
environment.sdkPipelineOptions.options.stagerClass: com.google.cloud.dataflow.sdk.util.GcsStager
environment.sdkPipelineOptions.options.stagingLocation: gs://BUCKET_NAME/dataflow/staging/
environment.sdkPipelineOptions.options.streaming: False
environment.sdkPipelineOptions.options.tempLocation: gs://BUCKET_NAME/dataflow/staging/
Get the value of just a single option
To get the value of just a single option named --argName (whose value BTW is MY_ARG_VALUE), you can do:
$ gcloud dataflow jobs describe JOB_ID --full --format='value(environment.sdkPipelineOptions.options.argName)'
MY_ARG_VALUE
gcloud formatting
gcloud in general supports a wide range of output formatting options, which apply to most gcloud commands that pull info from the server. You can read about them here.
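If you need the same information programmatically, here is a rough Python sketch (not from the original answer; the job ID is a placeholder) that shells out to gcloud with the JSON format shown above and parses the options:

import json
import subprocess

JOB_ID = "JOB_ID"  # placeholder; take it from `gcloud dataflow jobs list`

# Ask gcloud for only the pipeline options, already serialized as JSON.
raw = subprocess.check_output([
    "gcloud", "dataflow", "jobs", "describe", JOB_ID, "--full",
    "--format=json(environment.sdkPipelineOptions.options)",
])

options = json.loads(raw)["environment"]["sdkPipelineOptions"]["options"]
for name, value in sorted(options.items()):
    print(f"{name} = {value}")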

Related

BigQuery Table Exports

I am looking for the best pattern to execute a BigQuery query and export its result to a Cloud Storage bucket. I would like this to happen whenever the BigQuery table is written to or modified.
I think I would traditionally set up a Pub/Sub topic that is written to when the table is modified, which would trigger a GCP function responsible for executing the query and writing the result to a GCS bucket. I'm just not confident that there isn't a better (more straightforward) approach to do this in GCP.
Thanks in advance.
I propose an approach based on Eventarc.
The goal is to launch a Cloud Function or Cloud Run action when data is inserted or updated in a BigQuery table. Example with Cloud Run:
SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed
gcloud eventarc triggers create ${SERVICE}-trigger \
--location ${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com
When a BigQuery job is executed, the Cloud Run action will be triggered.
Example of a Cloud Run action:
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Gets the payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        if ds == 'cloud_run_tmp' and \
                tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            query = create_agg()  # create_agg() runs the export query (defined elsewhere)
            return "table created", 200
    except:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200
You can apply logic based on the current dataset, table, or other elements present in the payload.
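The create_agg() helper is not shown in the answer above; as a rough sketch of what it could do (the dataset, table, bucket, and column names below are hypothetical), it can run an aggregation query and export the result to Cloud Storage with BigQuery's EXPORT DATA statement:

from google.cloud import bigquery

def create_agg():
    # Hypothetical example: aggregate the trigger table and write the result
    # to a GCS bucket as CSV. Adjust the dataset, table, and bucket names.
    client = bigquery.Client()
    sql = """
    EXPORT DATA OPTIONS(
      uri='gs://MY_BUCKET/exports/cloud_run_trigger-*.csv',
      format='CSV',
      overwrite=true,
      header=true
    ) AS
    SELECT some_field, COUNT(*) AS row_count
    FROM `cloud_run_tmp.cloud_run_trigger`
    GROUP BY some_field
    """
    query_job = client.query(sql)  # starts the export as a query job
    query_job.result()             # waits for it to finish
    return query_job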

GCP Dataflow extract JOB_ID

For a Dataflow job, I need to extract the JOB_ID from the JOB_NAME. I have the command below and the corresponding output. Can you please guide me on how to extract the JOB_ID from the response below?
$ gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job"
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2020-10-07_10_11_20-15879763245819496196 sample-job Streaming 2020-10-07 17:11:21 Running us-central1
If we can use a Python script to achieve it, even that will be fine.
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" --format="value(id)"
You can use standard command line tools to parse the response of that command, for example
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" | tail -n 1 | cut -f 1 -d " "
Alternatively, if this is called from a Python program already, you can use the Dataflow API directly instead of the gcloud tool, as in How to list down all the dataflow jobs using python API.
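As a rough sketch of that approach (assuming Application Default Credentials are set up and google-api-python-client is installed; the project, region, and job name below are placeholders):

from googleapiclient.discovery import build

project_id = 'MY_PROJECT_ID'  # placeholder
region = 'us-central1'
job_name = 'sample-job'

# Uses Application Default Credentials (e.g. `gcloud auth application-default login`).
dataflow = build('dataflow', 'v1b3')
result = dataflow.projects().locations().jobs().list(
    projectId=project_id, location=region).execute()

for job in result.get('jobs', []):
    if job['name'] == job_name:
        print(job['id'])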
With Python, you can retrieve the list of jobs with a REST request to Dataflow's method https://dataflow.googleapis.com/v1b3/projects/{projectId}/jobs
Then, the JSON response can be parsed to filter for the job name you are searching for, using an if clause:
if job["name"] == 'sample-job'
I tested this approach and it worked:
import requests
import json

base_url = 'https://dataflow.googleapis.com/v1b3/projects/'
project_id = '<MY_PROJECT_ID>'
location = '<REGION>'

response = requests.get(f'{base_url}{project_id}/locations/{location}/jobs',
                        headers={'Authorization': 'Bearer <BEARER_TOKEN_HERE>'})
# <BEARER_TOKEN_HERE> can be retrieved with 'gcloud auth print-access-token',
# obtained with an account that has access to Dataflow jobs.
# Another authentication mechanism can be found in the link provided by danielm.

jobslist = response.json()

for key, jobs in jobslist.items():
    for job in jobs:
        if job["name"] == 'beamapp-0907191546-413196':
            print(job["name"], " Found, job ID:", job["id"])
        else:
            print(job["name"], " Not matched")

# Output:
# windowedwordcount-0908012420-bd342f98  Not matched
# beamapp-0907200305-106040  Not matched
# beamapp-0907192915-394932  Not matched
# beamapp-0907191546-413196  Found, job ID: 2020-09-07...154989572
I created a Gist with a Python script to achieve it.

Submit gcloud ai-platform training job programmatically (from python code)

To submit a training job from gcloud ai-platform (formerly gcloud ml-engine), you use the following command from the gcloud SDK:
gcloud ai-platform jobs submit COMMAND [GCLOUD_WIDE_FLAG …]
I want to do this programmatically, i.e. from python code (or any other language). Something like
import gcloud-ai-platform as gap
gap.submit_job(COMMAND)
Is there such a command? And if it does not exist, how can I build a workaround? (using gcloud sdk programmatically)
I found the docs a bit confusing with regards to custom images. The trick is to specify --master-image-uri via masterConfig/imageUri:
from googleapiclient import discovery

# settings, jobid and self.credentials come from the surrounding application code.
training_inputs = {
    'scaleTier': 'BASIC',
    'packageUris': [],
    'masterConfig': {
        'imageUri': settings["AI_SERVER_URI"]
    },
    'args': [
        "java", "-cp", "MY.jar:jars/*", "io.manycore.Test",
        "jar positional argument"
    ],
    'region': 'us-central1',
    'pythonVersion': '3.7',
    'scheduling': {
        'maxRunningTime': '3600s'
    },
}

job_spec = {'jobId': jobid, 'trainingInput': training_inputs}
project_name = settings["PROJECT_ID"]
project_id = 'projects/{}'.format(project_name)

cloudml = discovery.build('ml', 'v1', credentials=self.credentials)
request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
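Note that building the request does not call the API by itself; as a usage sketch, you still need to execute it and handle any HTTP errors:

from googleapiclient import errors

try:
    response = request.execute()
    print('Job submitted:', response.get('jobId'))
except errors.HttpError as err:
    print('There was an error submitting the job:', err)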
For submitting a training job, here is an example that you can follow.
It shows both methods: using gcloud and the equivalent Python code.

Does CMLE provides a REST API endpoint for Prediction?

Is there a way I can access a REST API endpoint for a Model created by Cloud ML Engine? I only see:
gcloud ml-engine jobs submit prediction $JOB_NAME \
--model census \
--version v1 \
--data-format TEXT \
--region $REGION \
--runtime-version 1.10 \
--input-paths gs://cloud-samples-data/ml-engine/testdata/prediction/census.json \
--output-path $GCS_JOB_DIR/predictions
Yes, in fact there are two APIs available to do this.
The projects.predict call is the simplest method. You pass in a request as described here, and it returns with the result. This cannot take input from GCS like your gcloud command does.
The projects.jobs.create call with the predictionInput and predictionOutput fields allows batch prediction, with input from GCS.
The equivalent for your command is:
POST https://ml.googleapis.com/v1/projects/$PROJECT_ID/jobs
{
  "jobId": "$JOB_NAME",
  "predictionInput": {
    "dataFormat": "TEXT",
    "inputPaths": "gs://cloud-samples-data/ml-engine/testdata/prediction/census.json",
    "region": "$REGION",
    "runtimeVersion": "1.10",
    "modelName": "projects/$PROJECT_ID/models/census"
  },
  "predictionOutput": {
    "outputPath": "$GCS_JOB_DIR/predictions"
  }
}
This returns immediately. Use projects.jobs.get to check for success/failure.
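If you want to call projects.predict from Python rather than issue the raw REST request yourself, a minimal sketch with google-api-python-client looks like this (the project ID and the instance payload are placeholders; the instances must match what your deployed census model expects):

from googleapiclient import discovery

project = 'MY_PROJECT_ID'  # placeholder
name = 'projects/{}/models/census/versions/v1'.format(project)

ml = discovery.build('ml', 'v1')

# 'instances' must match the input format the deployed model expects.
response = ml.projects().predict(
    name=name,
    body={'instances': [{'age': 25, 'workclass': 'Private'}]}
).execute()

print(response.get('predictions', response))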

How to use packer export_opts?

I build a VirtualBox VM using Packer and I would like to set some VM metadata (e.g. description, version) using the export_opts parameter. The docs say:
export_opts (array of strings) - Additional options to pass to the VBoxManage export. This can be useful for passing product information to include in the resulting appliance file.
I am trying to do this in a bash script calling packer:
desc=' ... some ...'
desc+=' ... multiline ...'
desc+=' ... description ...'
# this is actually done using printf, shortened for clarity
export_opts='[ "version", "0.2.0", "description", "${desc}" ]'
# the assembled string looks OK
echo "export_opts: ${export_opts}"
packer build \
... (more options) ...
-var "export_opts=${export_opts}" \
... (more options) ...
<packer configuration file>
I also tried --version instead of version and putting version and the value into the same string, but none of this works; once exported and re-imported, the VM description is empty.
Does anyone have some working sample code, or can you help me figure out what I'm doing wrong?
Thank you very much.
Update:
Following Anthony Staunton's approach, I figured out that adding
"export_opts": [ "--vsys", "0", "--version", "0.2.0", "--description", "some test description" ],
to the Packer JSON file does work; passing the same string as --var to Packer does not work.
I fixed the problem at long last and updated the Packer documentation with the example below; pull requests are pending:
Packer JSON configuration file example:
{
  "type": "virtualbox-ovf",
  "export_opts":
  [
    "--manifest",
    "--vsys", "0",
    "--description", "{{user `vm_description`}}",
    "--version", "{{user `vm_version`}}"
  ],
  "format": "ova"
}
A VirtualBox VM description may contain arbitrary strings; the GUI interprets HTML formatting. However, the JSON format does not allow arbitrary newlines within a value. Add a multi-line description by preparing the string in the shell before the packer call, like this (the shell's > continuation prompt has been snipped for easier copy & paste):
vm_description='some
multiline
description'
vm_version='0.2.0'
packer build \
-var "vm_description=${vm_description}" \
-var "vm_version=${vm_version}" \
"packer_conf.json"
You may have to specify the data as follows in your Packer JSON file:
"export_opts": [ "--vsys 0 --version \"0.2.0\"", "{{.Name}} --description \"${desc}\" " ],