How to pass and access arguments in a PySpark job submitted from the console? - google-cloud-platform

Currently we have a sample.py file on Google Cloud Storage and we need to pass arguments to this script from the console.
#sample.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import sys
reg = sys.argv[1]           # first job argument
month = sys.argv[2]         # second job argument
current_date = sys.argv[3]  # third job argument
We are trying to submit the job using the following command:
gcloud dataproc jobs submit pyspark --project=my_project --cluster=my_cluster --region=region_1 gs://shashi/python-scripts/sample.py abc 11 2019-12-05
It gives the following error:
ERROR: (gcloud.dataproc.jobs.submit.pyspark) argument --properties: Bad syntax for dict arg: [spark.driver.memory]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]
optional flags may be --archives | --async | --bucket | --driver-log-levels |
--files | --help | --jars | --labels |
--max-failures-per-hour | --properties | --py-files |
--region

You have forgotten to include -- before the arguments; everything after -- is passed to the script as job arguments:
gcloud dataproc jobs submit pyspark --project=my_project --cluster=my_cluster --region=region_1 gs://shashi/python-scripts/sample.py -- abc 11 2019-12-05
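If your real command also passes Spark properties (the --properties error in the question suggests one was involved), the gcloud flags go before the script path and the script arguments go after --. A minimal sketch, where the spark.driver.memory value is only an illustrative assumption:
# gcloud flags (including --properties) come before the script; everything after -- reaches sys.argv
gcloud dataproc jobs submit pyspark \
    --project=my_project \
    --cluster=my_cluster \
    --region=region_1 \
    --properties=spark.driver.memory=4g \
    gs://shashi/python-scripts/sample.py \
    -- abc 11 2019-12-05
# inside sample.py: sys.argv[1] == 'abc', sys.argv[2] == '11', sys.argv[3] == '2019-12-05'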

Related

gcloud command - set variable for project to run gcloud command on multiple project ids

I am very new to the gcloud command line and new to scripting altogether. I'm cleaning up a GCP org with multiple stray projects. I am trying to run a gcloud command to find the creator of each of my projects so I can reach out to each project creator and ask them to clean up a few things.
I found a command to search logs for a project and find the original project creator, provided the project isn't older than 400 days.
gcloud logging read --project [PROJECT] \
--order=asc --limit=1 \
--format='table(protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)'
My problem is this: I currently have over 300 projects in my org. I have a .csv of all project names and IDs (from gcloud projects list).
Using the above command, how can I make [PROJECT] a variable and import the project ID field from my .csv as that variable?
What I hope to accomplish is this: the gcloud command produces the output for each project in the .csv file and writes it all to another .csv file. I hope this all makes sense.
Thanks.
I haven't tried anything yet. I don't want to run the same command for each of the 300 projects manually.
I have put together this bash script; however, I've been unable to test it properly since I don't currently have access to a GCP project, but hopefully it will work.
Input:
This is how the CSV file should look:
| ids |
|------|
| 1234 |
| 4567 |
| 7890 |
| 0987 |
Output: what the script will generate
| project_id | owner |
|------------|-------|
| 1234 | john |
| 4567 | doe |
| 7890 | test |
| 0987 | user |
#!/bin/bash
touch output.csv
# write the CSV header
echo "project_id,owner" >> output.csv
# read project IDs from the first column of input.csv, skipping the header row
while IFS="," read -r data
do
    echo "Fetching project creator for: $data"
    creator=$(gcloud logging read --project "${data}" --order=asc --limit=1 --format='table(protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)')
    echo "${data},${creator}" >> output.csv
done < <(cut -d "," -f1 input.csv | tail -n +2)
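If you only want the creator's email in the owner column (rather than the whole table that --format='table(...)' prints), a possible variant of the loop body uses gcloud's value() formatter; the field path is the same one from the question, and output.csv is the file created above:
# sketch: capture only the principalEmail of the earliest log entry
creator=$(gcloud logging read --project "${data}" \
    --order=asc --limit=1 \
    --format='value(protoPayload.authenticationInfo.principalEmail)')
echo "${data},${creator}" >> output.csv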

GCP vertex - a direct way to get deployed model ID

Is there a way to directly acquire the model ID from the gcloud ai models upload command?
Whether I use JSON output or value output, I still need to manipulate the result by splitting and extracting. If there is a way to get the model ID directly without manipulation, please advise.
output = !gcloud ai models upload \
--region=$REGION \
--display-name=$JOB_NAME \
--container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest \
--artifact-uri=$GCS_URL_FOR_SAVED_MODEL \
--format="value(model)"
output
-----
['Using endpoint [https://us-central1-aiplatform.googleapis.com/]',
'projects/xxxxxxxx/locations/us-central1/models/1961937762277916672',
'Waiting for operation [8951184153827606528]...',
'...................................done.']
Since you already have values for $REGION and $JOB_NAME, you can execute gcloud ai models list after you have uploaded the model to get the model ID with minimal manipulation.
See command below:
export REGION=us-central1
export JOB_NAME=test_training
export PROJECT_ID=your-project-name
gcloud ai models list --region=$REGION --filter="DISPLAY_NAME: $JOB_NAME" | grep "MODEL_ID" | cut -f2 -d: | sed 's/\s//'
If you want to form the actual string returned by gcloud ai models upload you can just concatenate your variables.
MODEL_ID=$(gcloud ai models list --region=$REGION --filter="DISPLAY_NAME: $JOB_NAME" | grep "MODEL_ID" | cut -f2 -d: | sed 's/\s//')
echo projects/${PROJECT_ID}/locations/${REGION}/models/${MODEL_ID}
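If you want to avoid the grep/cut/sed step entirely, gcloud's value() formatter should be able to return the model's full resource name directly, since the name field already has the projects/.../models/MODEL_ID form. Treat this as a sketch that assumes the display-name filter matches exactly one model:
# prints e.g. projects/xxxxxxxx/locations/us-central1/models/1961937762277916672
gcloud ai models list \
    --region=$REGION \
    --filter="DISPLAY_NAME: $JOB_NAME" \
    --format="value(name)"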

How to automatically back up and version BigQuery code such as stored procs?

What are some of the options to back up BigQuery DDLs - particularly views, stored procedure and function code?
We have a significant amount of code in BigQuery and we want to automatically back this up and preferably version it as well. Wondering how others are doing this.
Appreciate any help.
Thanks!
In order to keep and track our BigQuery structure and code, we're using Terraform to manage every resource in BigQuery.
More specific to your question, we use the google_bigquery_routine resource to make sure changes are reviewed by other team members, along with every other benefit you get from working with a VCS.
Another important part of our Terraform code is that we version our BigQuery module (via GitHub releases/tags), which includes the table structures and routines, and use it across multiple environments.
Looks something like:
main.tf
module "bigquery" {
source = "github.com/sample-org/terraform-modules.git?ref=0.0.2/bigquery"
project_id = var.project_id
...
... other vars for the module
...
}
terraform-modules/bigquery/main.tf
resource "google_bigquery_dataset" "test" {
dataset_id = "dataset_id"
project_id = var.project_name
}
resource "google_bigquery_routine" "sproc" {
dataset_id = google_bigquery_dataset.test.dataset_id
routine_id = "routine_id"
routine_type = "PROCEDURE"
language = "SQL"
definition_body = "CREATE FUNCTION Add(x FLOAT64, y FLOAT64) RETURNS FLOAT64 AS (x + y);"
}
This helps us upgrade our infrastructure across all environments without additional code changes.
We finally ended up backing up DDLs and routines using INFORMATION_SCHEMA. A scheduled job extracts the relevant metadata and then uploads the content into GCS.
Example SQLs:
select * from <schema>.INFORMATION_SCHEMA.ROUTINES;
select * from <schema>.INFORMATION_SCHEMA.VIEWS;
select *, DDL from <schema>.INFORMATION_SCHEMA.TABLES;
You have to explicitly specify DDL in the column list for the table DDLs to show up.
Please check the documentation as these things evolve rapidly.
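As a rough sketch of that scheduled extraction step (assuming the bq and gsutil CLIs and illustrative dataset/bucket names), you can select the ddl column explicitly and stream the result straight to GCS:
# dump table/view and routine DDLs for one dataset and upload to GCS (names are illustrative)
bq query --nouse_legacy_sql --format=csv \
    'SELECT table_name, ddl FROM my_dataset.INFORMATION_SCHEMA.TABLES' \
    | gsutil cp - gs://my-backup-bucket/ddl/tables_$(date +%F).csv
bq query --nouse_legacy_sql --format=csv \
    'SELECT routine_name, ddl FROM my_dataset.INFORMATION_SCHEMA.ROUTINES' \
    | gsutil cp - gs://my-backup-bucket/ddl/routines_$(date +%F).csv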
I write a tables/views definition file and a routines (stored procedures and functions) definition file nightly to Cloud Storage using Cloud Run. See this tutorial about setting it up. Cloud Run exposes an HTTP endpoint that is triggered on a schedule by Cloud Scheduler. It essentially runs this script:
#!/usr/bin/env bash
set -eo pipefail
GCLOUD_REPORT_BUCKET="myproject-code/backups"
objects_report="gs://${GCLOUD_REPORT_BUCKET}/objects-backup-report-$(date +%s).txt"
routines_report="gs://${GCLOUD_REPORT_BUCKET}/routines-backup-report-$(date +%s).txt"
project_id="myproject-dw"
table_defs=()
routine_defs=()
# get list of datasets and table definitions
datasets=$(bq ls --max_results=1000 | grep -v -e "fivetran*" | awk '{print $1}' | tail -n +3)
for dataset in $datasets
do
echo ${project_id}:${dataset}
# write tables and views to file
tables=$(bq ls --max_results 1000 ${project_id}:${dataset} | awk '{print $1}' | tail -n +3)
for table in $tables
do
echo ${project_id}:${dataset}.${table}
table_defs+="$(bq show --format=prettyjson ${project_id}:${dataset}.${table})"
done
# write routines (stored procs and functions) to file
routines=$(bq ls --max_results 1000 --routines=true ${project_id}:${dataset} | awk '{print $1}' | tail -n +3)
for routine in $routines
do
echo ${project_id}:${dataset}.${routine}
routine_defs+="$(bq show --format=prettyjson --routine=true ${project_id}:${dataset}.${routine})"
done
done
echo $table_defs | jq '.' | gsutil -q cp -J - "${objects_report}"
echo $routine_defs | jq '.' | gsutil -q cp -J - "${routines_report}"
# /dev/stderr is sent to Cloud Logging.
echo "objects-backup-report: wrote to ${objects_report}" >&2
echo "Wrote objects report to ${objects_report}"
echo "routines-backup-report: wrote to ${routines_report}" >&2
echo "Wrote routines report to ${routines_report}"
The output is essentially the same as running bq ls and bq show commands for all datasets, with the results piped to a text file with a date. I may add this to git, but the file includes a timestamp, so you know the state of BigQuery by reviewing the file for a certain date.

GCP Dataflow extract JOB_ID

For a Dataflow job, I need to extract the JOB_ID given a JOB_NAME. I have the command below and the corresponding output. Can you please guide me on how to extract the JOB_ID from the response below?
$ gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job"
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2020-10-07_10_11_20-15879763245819496196 sample-job Streaming 2020-10-07 17:11:21 Running us-central1
If we can use a Python script to achieve it, even that will be fine.
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" --format="value(JOB_ID)"
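If you need the value in a script, you can capture the output of that same command in a shell variable:
# store the job ID for later use
JOB_ID=$(gcloud dataflow jobs list --region=us-central1 --status=active \
    --filter="name=sample-job" --format="value(JOB_ID)")
echo "${JOB_ID}"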
You can use standard command line tools to parse the response of that command, for example
gcloud dataflow jobs list --region=us-central1 --status=active --filter="name=sample-job" | tail -n 1 | cut -f 1 -d " "
Alternatively, if this is from a Python program already, you can use the Dataflow API directly instead of using the gcloud tool, like in How to list down all the dataflow jobs using python API
With Python, you can retrieve the list of jobs with a REST request to the Dataflow method https://dataflow.googleapis.com/v1b3/projects/{projectId}/jobs
Then the JSON response can be parsed to filter for the job name you are searching for by using an if clause:
if job["name"] == 'sample-job'
I tested this approach and it worked:
import requests
import json

base_url = 'https://dataflow.googleapis.com/v1b3/projects/'
project_id = '<MY_PROJECT_ID>'
location = '<REGION>'

# <BEARER_TOKEN_HERE> can be retrieved with 'gcloud auth print-access-token' run by an account that has access to Dataflow jobs.
# Another authentication mechanism can be found in the link provided by danielm.
response = requests.get(f'{base_url}{project_id}/locations/{location}/jobs',
                        headers={'Authorization': 'Bearer <BEARER_TOKEN_HERE>'})
jobslist = response.json()

for key, jobs in jobslist.items():
    for job in jobs:
        if job["name"] == 'beamapp-0907191546-413196':
            print(job["name"], " Found, job ID:", job["id"])
        else:
            print(job["name"], " Not matched")

# Output:
# windowedwordcount-0908012420-bd342f98 Not matched
# beamapp-0907200305-106040 Not matched
# beamapp-0907192915-394932 Not matched
# beamapp-0907191546-413196 Found, job ID: 2020-09-07...154989572
I created a GIST with a Python script to achieve it.

detect-custom-labels missing from AWS CLI (Windows)

I am following the instructions both here - https://docs.aws.amazon.com/cli/latest/reference/rekognition/detect-custom-labels.html - and on the AWS Console itself in order to test recognition against a model/dataset I have built using Custom Labels. The console advises using the AWS CLI to make requests against your model; however, when I try the suggested commands, specifically
PS C:\Users\james> aws rekognition start-project-version
And
PS C:\Users\james> aws rekognition detect-custom-labels
I get the error:
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:
aws help
aws <command> help
aws <command> <subcommand> help
aws.exe: error: argument operation: Invalid choice, valid choices are:
compare-faces | create-collection
create-stream-processor | delete-collection
delete-faces | delete-stream-processor
describe-stream-processor | detect-faces
detect-labels | detect-moderation-labels
detect-text | get-celebrity-info
get-celebrity-recognition | get-content-moderation
get-face-detection | get-face-search
get-label-detection | get-person-tracking
index-faces | list-collections
list-faces | list-stream-processors
recognize-celebrities | search-faces
search-faces-by-image | start-celebrity-recognition
start-content-moderation | start-face-detection
start-face-search | start-label-detection
start-person-tracking | start-stream-processor
stop-stream-processor | help
My first thought was that my CLI was out of date. I updated it, and the version is now:
PS C:\Users\james> aws --version
aws-cli/1.14.53 Python/2.7.9 Windows/8 botocore/1.9.6
PS C:\Users\james>
Yet these commands for Rekognition Custom Labels / projects still do not appear. Where am I going wrong here? :/
EDIT: I updated the CLI, which lets me run the command, but now I get this error:
Command:
aws rekognition detect-custom-labels --project-version-arn "arn:aws:rekognition:us-west-2:xxxxxxxxxxxxxxx:project/api-dev-rtest/version/api-dev-rtest.2019-12-07T16.35.53/xxxxxxxxxxxxxx" --image "{"S3Object": {"Bucket": "xxxxxxxxxxxxx","Name": "James/yes.JPG"}}" --endpoint-url https://rekognition.us-west-2.amazonaws.com --region us-west-2
Error:
Unknown options: S3Object: {Bucket: xxxxxxxxxxxxx,Name: James/yes.JPG}}
Try putting the --image parameter into single quotes:
... --image '{"S3Object": {"Bucket": "xxxxxxxxxxxxx","Name": "James/yes.JPG"}}'
aws rekognition detect-custom-labels --project-version-arn "arn:aws:rekognition:us-west-2:xxxxxxxxxxxxxxx:project/api-dev-rtest/version/api-dev-rtest.2019-12-07T16.35.53/xxxxxxxxxxxxxx" --image '{"S3Object": {"Bucket": "xxxxxxxxxxxxx","Name": "James/yes.JPG"}}' --region us-west-2
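If quoting JSON inline remains awkward in PowerShell, the AWS CLI can also read a parameter value from a file using the file:// prefix; a sketch, where image.json is a hypothetical file holding the same S3Object JSON:
# image.json contains: {"S3Object": {"Bucket": "xxxxxxxxxxxxx", "Name": "James/yes.JPG"}}
aws rekognition detect-custom-labels --project-version-arn "arn:aws:rekognition:us-west-2:xxxxxxxxxxxxxxx:project/api-dev-rtest/version/api-dev-rtest.2019-12-07T16.35.53/xxxxxxxxxxxxxx" --image file://image.json --region us-west-2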
You need to update your boto3 version to 1.10.34. Try the command:
sudo pip install --upgrade boto3