GCP Dataflow extract JOB_ID - google-cloud-platform

For a Dataflow job, I need to extract the JOB_ID given the job name. I have the command below and its output. How can I extract JOB_ID from this response?
$ gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job"
JOB_ID                                    NAME        TYPE       CREATION_TIME        STATE    REGION
2020-10-07_10_11_20-15879763245819496196  sample-job  Streaming  2020-10-07 17:11:21  Running  us-central1
If we can use Python script to achieve it, even that will be fine.

gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" --format="value(JOB_ID)"

You can use standard command line tools to parse the response of that command, for example
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" | tail -n 1 | cut -f 1 -d " "
Alternatively, if you are already working in a Python program, you can call the Dataflow API directly instead of using the gcloud tool, as in "How to list down all the dataflow jobs using python API".
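Since a Python solution is also acceptable, the tail/cut idea can be sketched as a small function that parses the table text printed by gcloud. The name extract_job_id is illustrative, and the sketch assumes the default column layout shown in the question (JOB_ID first, NAME second).

```python
def extract_job_id(table_text, job_name):
    """Return the JOB_ID of the first row whose NAME column equals job_name, else None."""
    # Drop empty lines, then skip the header row.
    lines = [line for line in table_text.splitlines() if line.strip()]
    for line in lines[1:]:
        columns = line.split()
        # Assumption: column 0 is JOB_ID and column 1 is NAME, as in the question's output.
        if len(columns) >= 2 and columns[1] == job_name:
            return columns[0]
    return None


# Sample text copied from the output in the question.
sample = """JOB_ID NAME TYPE CREATION_TIME STATE REGION
2020-10-07_10_11_20-15879763245819496196 sample-job Streaming 2020-10-07 17:11:21 Running us-central1
"""

print(extract_job_id(sample, "sample-job"))
```

In a real script the table text would come from running the gcloud command, for example through subprocess.run with capture_output=True.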

With Python, you can retrieve the list of jobs with a REST request to the Dataflow endpoint https://dataflow.googleapis.com/v1b3/projects/{projectId}/jobs
The JSON response can then be filtered by the job name you are looking for with an if clause:
if job["name"] == 'sample-job':
I tested this approach and it worked:
import requests
import json

base_url = 'https://dataflow.googleapis.com/v1b3/projects/'
project_id = '<MY_PROJECT_ID>'
location = '<REGION>'
response = requests.get(f'{base_url}{project_id}/locations/{location}/jobs', headers = {'Authorization':'Bearer <BEARER_TOKEN_HERE>'})
# <BEARER_TOKEN_HERE> can be retrieved with 'gcloud auth print-access-token' obtained with an account that has access to Dataflow jobs.
# Another authentication mechanism can be found in the link provided by danielm
jobslist = response.json()
for key,jobs in jobslist.items():
    for job in jobs:
        if job["name"] == 'beamapp-0907191546-413196':
            print(job["name"]," Found, job ID:",job["id"])
        else:
            print(job["name"]," Not matched")
# Output:
# windowedwordcount-0908012420-bd342f98 Not matched
# beamapp-0907200305-106040 Not matched
# beamapp-0907192915-394932 Not matched
# beamapp-0907191546-413196 Found, job ID: 2020-09-07...154989572
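The loop above walks every top-level key of the response. A slightly more defensive sketch reads only the "jobs" list, which helps if the response also carries non-list keys (for example a page token). The function name find_job_id and the sample IDs below are made up for illustration; only the "name" and "id" fields come from the answer.

```python
def find_job_id(jobs_response, job_name):
    """Return the id of the first job named job_name in a jobs.list response dict."""
    # Read only the "jobs" list; a missing key is treated as "no jobs".
    for job in jobs_response.get("jobs", []):
        if job.get("name") == job_name:
            return job.get("id")
    return None


# Hypothetical response body, shaped like the one parsed in the answer.
example_response = {
    "jobs": [
        {"id": "job-id-1", "name": "beamapp-0907200305-106040"},
        {"id": "job-id-2", "name": "beamapp-0907191546-413196"},
    ],
    "nextPageToken": "hypothetical-token",
}

print(find_job_id(example_response, "beamapp-0907191546-413196"))
```

With the requests call from the answer, this would be used as find_job_id(response.json(), 'sample-job').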

Created my GIST with Python script to achieve it.

Related

BigQuery Table Exports

I am looking for the best pattern to be able to execute and export a BigQuery query result to a cloud storage bucket. I would like this to be executed when the BigQuery table is written to or modified.
Traditionally, I would set up a Pub/Sub topic that is written to when the table is modified, which would trigger a Cloud Function responsible for executing the query and writing the result to a Cloud Storage bucket. I am just not confident that there isn't a better (more straightforward) approach in GCP.
Thanks in advance.
I suggest an approach based on Eventarc.
The goal is to launch a Cloud Function or Cloud Run action when data is inserted or updated in a BigQuery table. Here is an example with Cloud Run:
SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
# REGION and SVC_ACCOUNT are not set in this snippet; define them before creating the trigger.
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed
gcloud eventarc triggers create ${SERVICE}-trigger \
  --location ${REGION} --service-account ${SVC_ACCOUNT} \
  --destination-run-service ${SERVICE} \
  --event-filters type=google.cloud.audit.log.v1.written \
  --event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
  --event-filters serviceName=bigquery.googleapis.com
Whenever a BigQuery job is executed, the Cloud Run action is triggered.
Example of a Cloud Run action:
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Gets the Payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        if ds == 'cloud_run_tmp' and \
           tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            # create_agg() is not shown in this snippet; it must be defined elsewhere
            query = create_agg()
            return "table created", 200
    except Exception:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200
You can apply logic based on the current dataset, table or other elements existing in the current payload.
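To make that payload logic easier to reuse and test, the field checks can be pulled into a plain function. This is a minimal sketch: the function name inserted_rows_for and the sample payload are hypothetical, and only the field paths already used in the handler above are assumed to exist.

```python
def inserted_rows_for(payload, dataset, table):
    """Return the inserted row count if the audit-log payload targets dataset.table, else 0."""
    try:
        ds = payload["resource"]["labels"]["dataset_id"]
        resource_name = payload["protoPayload"]["resourceName"]
        rows = int(payload["protoPayload"]["metadata"]
                   ["tableDataChange"]["insertedRowsCount"])
    except (KeyError, TypeError, ValueError):
        # Fields missing or malformed: treat the event as not relevant.
        return 0
    if ds == dataset and resource_name.endswith(f"tables/{table}"):
        return rows
    return 0


# Hypothetical payload containing only the fields the handler reads.
sample_payload = {
    "resource": {"labels": {"dataset_id": "cloud_run_tmp", "project_id": "my-project"}},
    "protoPayload": {
        "resourceName": "projects/my-project/datasets/cloud_run_tmp/tables/cloud_run_trigger",
        "metadata": {"tableDataChange": {"insertedRowsCount": "3"}},
    },
}

print(inserted_rows_for(sample_payload, "cloud_run_tmp", "cloud_run_trigger"))
```

The Flask handler could then run the export query only when this function returns a value greater than zero.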

How to determine if a string is located in AWS S3 CSV file

I have a CSV file in AWS S3.
The file is very large (2.5 GB).
The file has a single column of strings, over 120 million:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine if the string xyz.com is located in the file?
I only need to know if the string is there or not, I don't need to return the file.
It would also be great if I could pass multiple strings to search for and get back only the ones found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com']
Will return => ['xyz.com','ggg.com']
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")
res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": { "FileHeaderInfo": "NONE" }}, # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    # _1 refers to the first column
    Expression="SELECT * FROM S3Object s WHERE _1 IN ['xyz.com', 'ggg.com']")
See this AWS blog post for an example with output parsing.
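As a rough sketch of that output parsing: the call returns an event stream under res["Payload"], and the matching rows arrive as bytes inside "Records" events. The helper name values_from_select_events is made up, and the fake events below only imitate the stream so the sketch runs without AWS access; it assumes JSON output with one record per line and the column key _1.

```python
import json


def values_from_select_events(events, column="_1"):
    """Collect one column's values from an S3 Select event stream (JSON output)."""
    buffer = b""
    for event in events:
        # Only "Records" events carry data; other events (stats, end) are skipped.
        if "Records" in event:
            buffer += event["Records"]["Payload"]
    values = []
    # Records are joined first because one record may be split across two events.
    for line in buffer.decode("utf-8").splitlines():
        if line.strip():
            values.append(json.loads(line)[column])
    return values


# Fake events standing in for res["Payload"]; the second record is split on purpose.
fake_events = [
    {"Records": {"Payload": b'{"_1":"xyz.com"}\n{"_1":"gg'}},
    {"Records": {"Payload": b'g.com"}\n'}},
    {"Stats": {}},
    {"End": {}},
]

print(values_from_select_events(fake_events))
```

Against the real API this would be called as values_from_select_events(res["Payload"]).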
If you use the aws s3 cp command, you can stream the object to stdout by passing a dash (-) as the destination, and pipe it into grep:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
These are two examples of grep checking for multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
To learn more about grep, please look at the manual: GNU Grep 3.7
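Note that grep 'apc.com' treats the dot as "any character" and also matches substrings, so a line such as xapc.com would match. If exact, whole-line matches are wanted, the same streaming idea can be sketched in Python. The function name find_present is illustrative; in practice the lines could come from sys.stdin when the script is placed at the end of the aws s3 cp pipe.

```python
def find_present(lines, wanted):
    """Return the entries of wanted that appear as a whole line in lines, in query order."""
    remaining = set(wanted)
    found = set()
    for raw in lines:
        value = raw.strip()
        if value in remaining:
            found.add(value)
            remaining.discard(value)
            # Stop reading as soon as every query string has been seen.
            if not remaining:
                break
    return [w for w in wanted if w in found]


# In-memory stand-in for the streamed file, using the sample rows from the question.
sample_lines = ["apc.com\n", "xyz.com\n", "ggg.com\n", "dddd.com\n"]

print(find_present(sample_lines, ["xyz.com", "fds.com", "ggg.com"]))
```

Because the lines are consumed one at a time, the file never has to fit in memory, although the whole object is still downloaded when a query string is absent.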

How to pass and access arguments to pyspark job submit from console?

Currently we have a sample.py file on Google Cloud Storage and we need to pass arguments to this script from the console.
#sample.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import sys
reg = (sys.argv[1])
month = (sys.argv[2])
current_date = (sys.argv[3])
And we are trying to submit job using the following command:-
gcloud dataproc jobs submit pyspark --project=my_project --cluster=my_cluster --region=region_1 gs://shashi/python-scripts/sample.py abc 11 2019-12-05
And it gives the following error:-
ERROR: (gcloud.dataproc.jobs.submit.pyspark) argument --properties: Bad syntax for dict arg: [spark.driver.memory]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]
optional flags may be --archives | --async | --bucket | --driver-log-levels |
--files | --help | --jars | --labels |
--max-failures-per-hour | --properties | --py-files |
--region
You forgot to include -- before the job arguments. As the usage line in the error shows, everything after -- is passed to the script as JOB_ARGS:
gcloud dataproc jobs submit pyspark --project=my_project --cluster=my_cluster --region=region_1 gs://shashi/python-scripts/sample.py -- abc 11 2019-12-05
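On the script side, the values after -- arrive in sys.argv as in the question. A small sketch with argparse makes the expected arguments explicit; the argument names mirror the variables in sample.py, while the function name parse_job_args and the int conversion of month are assumptions for illustration.

```python
import argparse


def parse_job_args(argv):
    """Parse the three positional job arguments expected by sample.py."""
    parser = argparse.ArgumentParser(description="Arguments for sample.py")
    parser.add_argument("reg")
    parser.add_argument("month", type=int)  # assumption: month is numeric
    parser.add_argument("current_date")
    return parser.parse_args(argv)


# Same values as in the gcloud command above; in the script use sys.argv[1:].
args = parse_job_args(["abc", "11", "2019-12-05"])
print(args.reg, args.month, args.current_date)
```

A missing or extra argument then produces a usage error instead of an IndexError.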

Submit gcloud ai-platform training job programmatically (from python code)

To submit a training job from gcloud ai-platform (ex gcloud ml-engine) you use the following command from the gcloud SDK:
gcloud ai-platform jobs submit COMMAND [GCLOUD_WIDE_FLAG …]
I want to do this programmatically, i.e. from python code (or any other language). Something like
import gcloud-ai-platform as gap
gap.submit_job(COMMAND)
Is there such a command? And if it does not exist, how can I build a workaround? (using gcloud sdk programmatically)
I found the docs a bit confusing with regard to custom images. The trick is to specify the equivalent of --master-image-uri via masterConfig/imageUri:
from googleapiclient import discovery

# settings, jobid and self.credentials come from the surrounding application (not shown)
training_inputs = {
    'scaleTier': 'BASIC',
    'packageUris': [
    ],
    'masterConfig': {
        'imageUri': settings["AI_SERVER_URI"]
    },
    'args': [
        "java", "-cp", "MY.jar:jars/*", "io.manycore.Test",
        "jar positional argument"
    ],
    'region': 'us-central1',
    'pythonVersion': '3.7',
    'scheduling': {
        'maxRunningTime': '3600s'
    },
}
job_spec = {'jobId': jobid, 'trainingInput': training_inputs}
project_name = settings["PROJECT_ID"]
project_id = 'projects/{}'.format(project_name)
cloudml = discovery.build('ml', 'v1', credentials=self.credentials)
request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
response = request.execute()  # the job is only submitted once the request is executed
For submitting a training job, here you have an example that you can follow.
It has both methods, using gcloud and the equivalent python code.
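To keep the request body separate from the API call, the job spec from the first answer can be built by a small helper. This is only a sketch: the function name build_training_job_spec, its parameters, and the sample image URI are hypothetical, and only fields that appear in the answer above are used.

```python
def build_training_job_spec(job_id, image_uri, args, region="us-central1",
                            max_running_time="3600s"):
    """Build the body passed to projects().jobs().create() for a custom-image job."""
    training_input = {
        "scaleTier": "BASIC",
        "masterConfig": {"imageUri": image_uri},
        "args": list(args),
        "region": region,
        "scheduling": {"maxRunningTime": max_running_time},
    }
    return {"jobId": job_id, "trainingInput": training_input}


# Hypothetical values for illustration only.
spec = build_training_job_spec(
    job_id="my_job_001",
    image_uri="gcr.io/my-project/my-image:latest",
    args=["java", "-cp", "MY.jar:jars/*", "io.manycore.Test"],
)
print(spec["trainingInput"]["masterConfig"]["imageUri"])
```

The returned dictionary is what would be passed as body= to cloudml.projects().jobs().create(...).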

Is there a way to get the parameters that were passed to a GCP Dataflow job from the CLI/API

I have tried the describe command listed here and I don't see the parameters. Is there another command that I should use to get this information, or some other API that would provide it?
TL;DR - You're missing the --full argument to the gcloud dataflow jobs describe command.
FLAGS
--full
Retrieve the full Job rather than the summary view
View full job info
If you're using gcloud to view the information about the GCP Dataflow job, this command will show the full info (which is actually quite a lot of info) about the job including any parameters that were passed to the job:
gcloud dataflow jobs describe JOB_ID --full
All the options are under the hierarchy environment.sdkPipelineOptions.options
View all options as JSON
To view all the options passed to the job (which prints actually more than just the command line arguments BTW) as a JSON, you can do:
$ gcloud dataflow jobs describe JOB_ID --full --format='json(environment.sdkPipelineOptions.options)'
{
  "environment": {
    "sdkPipelineOptions": {
      "options": {
        "apiRootUrl": "https://dataflow.googleapis.com/",
        "appName": "WordCount",
        "credentialFactoryClass": "com.google.cloud.dataflow.sdk.util.GcpCredentialFactory",
        "dataflowEndpoint": "",
        "enableCloudDebugger": false,
        "enableProfilingAgent": false,
        "firstArg": "foo",
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "jobName": "wordcount-tuxdude-12345678",
        "numberOfWorkerHarnessThreads": 0,
        "output": "gs://BUCKET_NAME/dataflow/output",
        "pathValidatorClass": "com.google.cloud.dataflow.sdk.util.DataflowPathValidator",
        "project": "PROJECT_NAME",
        "runner": "com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner",
        "secondArg": "bar",
        "stableUniqueNames": "WARNING",
        "stagerClass": "com.google.cloud.dataflow.sdk.util.GcsStager",
        "stagingLocation": "gs://BUCKET_NAME/dataflow/staging/",
        "streaming": false,
        "tempLocation": "gs://BUCKET_NAME/dataflow/staging/"
      }
    }
  }
}
View all options as a table
$ gcloud dataflow jobs describe JOB_ID --full --format='flattened(environment.sdkPipelineOptions.options)'
environment.sdkPipelineOptions.options.apiRootUrl: https://dataflow.googleapis.com/
environment.sdkPipelineOptions.options.appName: WordCount
environment.sdkPipelineOptions.options.credentialFactoryClass: com.google.cloud.dataflow.sdk.util.GcpCredentialFactory
environment.sdkPipelineOptions.options.dataflowEndpoint:
environment.sdkPipelineOptions.options.enableCloudDebugger: False
environment.sdkPipelineOptions.options.enableProfilingAgent: False
environment.sdkPipelineOptions.options.firstArg: foo
environment.sdkPipelineOptions.options.inputFile: gs://dataflow-samples/shakespeare/kinglear.txt
environment.sdkPipelineOptions.options.jobName: wordcount-tuxdude-12345678
environment.sdkPipelineOptions.options.numberOfWorkerHarnessThreads: 0
environment.sdkPipelineOptions.options.output: gs://BUCKET_NAME/dataflow/output
environment.sdkPipelineOptions.options.pathValidatorClass: com.google.cloud.dataflow.sdk.util.DataflowPathValidator
environment.sdkPipelineOptions.options.project: PROJECT_NAME
environment.sdkPipelineOptions.options.runner: com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner
environment.sdkPipelineOptions.options.secondArg: bar
environment.sdkPipelineOptions.options.stableUniqueNames: WARNING
environment.sdkPipelineOptions.options.stagerClass: com.google.cloud.dataflow.sdk.util.GcsStager
environment.sdkPipelineOptions.options.stagingLocation: gs://BUCKET_NAME/dataflow/staging/
environment.sdkPipelineOptions.options.streaming: False
environment.sdkPipelineOptions.options.tempLocation: gs://BUCKET_NAME/dataflow/staging/
Get the value of just a single option
To get the value of just a single option named --argName (whose value BTW is MY_ARG_VALUE), you can do:
$ gcloud dataflow jobs describe JOB_ID --full --format='value(environment.sdkPipelineOptions.options.argName)'
MY_ARG_VALUE
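If the options are needed inside a script rather than on the terminal, the JSON form shown earlier can be parsed directly. This is a minimal sketch: the function name pipeline_options is illustrative, and the sample text is a shortened copy of the JSON output above.

```python
import json


def pipeline_options(describe_json):
    """Return the options dict from `gcloud dataflow jobs describe --full --format=json` output."""
    job = json.loads(describe_json)
    # Walk environment.sdkPipelineOptions.options, tolerating missing levels.
    return job.get("environment", {}).get("sdkPipelineOptions", {}).get("options", {})


# Shortened copy of the JSON output shown above.
sample_output = """
{
  "environment": {
    "sdkPipelineOptions": {
      "options": {
        "appName": "WordCount",
        "firstArg": "foo",
        "secondArg": "bar",
        "streaming": false
      }
    }
  }
}
"""

options = pipeline_options(sample_output)
print(options["firstArg"], options["secondArg"])
```

The JSON text itself could be captured by running the describe command through subprocess.run with capture_output=True.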
gcloud formatting
gcloud supports a wide range of output formatting options, which apply to most gcloud commands that pull information from the server. You can read about them here.