How to Get the "templateLocation" parameter value of an existing Job in google dataflow - google-cloud-platform

I have a list of existing jobs running in Google Dataflow. I would like to list the jobs that have been running for the last x days and recycle them programmatically. To do this, I need the name of the template used for a particular job. This information is easy to get from the console in the Job Info view; however, I would like to know if there is a way to get it from a gcloud command or from an API.
Your early response will be appreciated.
Thanks
Sarang

- Solution 1 :
You can use the gcloud SDK and a shell script to achieve this:
https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/list
Filter jobs with the given name:
gcloud dataflow jobs list --filter="name=my-wordcount"
List jobs from a given region:
gcloud dataflow jobs list --region="europe-west1"
List jobs created this year:
gcloud dataflow jobs list --created-after=2018-01-01
List jobs created more than a week ago:
gcloud dataflow jobs list --created-before=-P1W
Many other filters and parameters are available to match your use case.
- Solution 2
You can use the REST API for Dataflow jobs:
https://cloud.google.com/dataflow/docs/reference/rest
Example:
GET /v1b3/projects/{projectId}/locations/{location}/jobs
List the jobs of a project.
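For example, here is a rough sketch against that endpoint with the Google API Python client (the project, region and 7-day window below are placeholders); note that the listing does not return the template location directly, which is what motivates the workaround described below:

from datetime import datetime, timedelta
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")  # uses application default credentials

project = "my-project"      # placeholder project
location = "europe-west1"   # placeholder region
cutoff = datetime.utcnow() - timedelta(days=7)

request = dataflow.projects().locations().jobs().list(
    projectId=project, location=location)
while request is not None:
    response = request.execute()
    for job in response.get("jobs", []):
        # createTime is RFC 3339; keep only the seconds for a simple comparison
        created = datetime.strptime(job["createTime"][:19], "%Y-%m-%dT%H:%M:%S")
        if created >= cutoff:
            print(job["id"], job["name"], job["currentState"], job["createTime"])
    request = dataflow.projects().locations().jobs().list_next(
        previous_request=request, previous_response=response)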

There is no direct way to get the template location and name. To meet the above requirement, I defined a naming pattern for the template names and job names; when recycling a job, the template name is computed from that pattern and passed on to the API call.
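As an illustration only, with a hypothetical naming convention in which a job named like orders-ingest-20240101-120000 was launched from gs://<bucket>/templates/orders-ingest (the bucket, path and suffix format are all assumptions, not part of the original setup), the computation can be as simple as:

def template_path_for_job(job_name: str, bucket: str) -> str:
    # Hypothetical convention: drop the trailing date/time suffix from the
    # job name to recover the template name, then build the GCS path.
    template_name = "-".join(job_name.split("-")[:-2])
    return f"gs://{bucket}/templates/{template_name}"

# template_path_for_job("orders-ingest-20240101-120000", "my-bucket")
#   -> "gs://my-bucket/templates/orders-ingest"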

Related

Getting details of a BigQuery job using gcloud CLI on local machine

I am trying to process the billed bytes of each BigQuery job run by all users. I was able to find the details in the BigQuery UI under Project History. Running bq --location=europe-west3 show --job=true --format=prettyjson JOB_ID in Google Cloud Shell also gives exactly the information that I want (the BQ SQL query, billed bytes, and run time for each BigQuery job).
For the next step, I want to access the JSON returned by the above command on my local machine. I have already configured the gcloud CLI properly, and I am able to find BigQuery jobs using gcloud alpha bq jobs list --show-all-users --limit=10.
I select a job id and run the following: gcloud alpha bq jobs describe JOB_ID --project=PROJECT_ID,
and I get (gcloud.alpha.bq.jobs.describe) NOT_FOUND: Not found: Job PROJECT_ID:JOB_ID--toyFH. It is possibly because of creation and end times,
as shown here.
What am I doing wrong? Is there another way to get the details of a BigQuery job using the gcloud CLI (maybe there is a way to get billed bytes together with the query details using the Python SDK)?
You can get job details with different APIs, or the way you are already doing it, but first: why are you using the alpha version of bq?
To do it in Python, you can try something like this:

from google.cloud import bigquery

def get_job(
    client: bigquery.Client,
    location: str = "us",
    job_id: str = "<< JOB_ID >>",  # replace with your job ID
) -> None:
    job = client.get_job(job_id, location=location)
    print(f"{job.location}:{job.job_id}")
    print(f"Type: {job.job_type}")
    print(f"State: {job.state}")
    print(f"Created: {job.created.isoformat()}")
There are more properties you can read from the job object in the same way. Also check the status of the job in the console first, so you can compare the two.
You can find more details here: https://cloud.google.com/bigquery/docs/managing-jobs#python
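If billed bytes are specifically what you are after, here is a minimal sketch of reading them from a query job with the same client library (the project, job ID and location below are placeholders taken from your examples):

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_ID")  # placeholder project

# get_job() returns a QueryJob for query jobs; once the job has finished,
# the billed bytes are exposed as total_bytes_billed.
job = client.get_job("JOB_ID", location="europe-west3")  # placeholder job ID
if job.job_type == "query":
    print(f"Query: {job.query}")
    print(f"Billed bytes: {job.total_bytes_billed}")
    print(f"Processed bytes: {job.total_bytes_processed}")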

Error with gcloud beta command for streaming assets to bigquery

This might be a bit bleeding edge but hopefully someone can help. The problem is a catch 22.
So what we're trying to do is create a continuous stream of inventory changes in each GCP project to BigQuery dataset tables that we can create reports from, and get a better idea of what we're paying for, what's turned on, what's in use, what isn't, etc.
Error: Error running command 'gcloud beta asset feeds create asset_change_feed --project=project_id --pubsub-topic=asset_change_feed': exit status 2. Output: ERROR: (gcloud.beta.asset.feeds.create) argument (--asset-names --asset-types): Must be specified.
Usage: gcloud beta asset feeds create FEED_ID --pubsub-topic=PUBSUB_TOPIC (--asset-names=[ASSET_NAMES,...] --asset-types=[ASSET_TYPES,...]) (--folder=FOLDER_ID | --organization=ORGANIZATION_ID | --project=PROJECT_ID) [optional flags]
optional flags may be --asset-names | --asset-types | --content-type |
--folder | --help | --organization | --project
For detailed information on this command and its flags, run:
gcloud beta asset feeds create --help
Using Terraform we tried creating a Dataflow job and a Pub/Sub topic called asset_change_feed.
We get an error from the gcloud beta asset feeds create command because it wants a parameter that includes all the asset names to monitor...
Well... this kind of defeats the purpose. The whole point is to monitor all the asset names that change, appear and disappear. It's like creating a feed that monitors all the new baby names that appear over the next year but the feed command requires that we know them in advance somehow. WTF? What's the point then? Are we re-inventing the wheel here?
We were going by this documentation here:
https://cloud.google.com/asset-inventory/docs/monitoring-asset-changes#creating_a_feed
As per the gcloud beta asset feeds create documentation, it is required to specify at least one of --asset-names and --asset-types:
At least one of these must be specified:
--asset-names=[ASSET_NAMES,…] A comma-separated list of the full names of the assets to receive updates. For example:
//compute.googleapis.com/projects/my_project_123/zones/zone1/instances/instance1.
See
https://cloud.google.com/apis/design/resource_names#full_resource_name
for more information.
--asset-types=[ASSET_TYPES,…] A comma-separated list of types of the assets to receive updates. For example:
compute.googleapis.com/Disk,compute.googleapis.com/Network. See
https://cloud.google.com/resource-manager/docs/cloud-asset-inventory/overview
for all supported asset types.
Therefore, when we don't know the names a priori we can monitor all resources of the desired types by only passing --asset-types. You can see the list of supported asset types here or use the exportAssets API method (gcloud asset export) to retrieve the types used at an organization, folder or project level.
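For instance, here is a minimal sketch with the google-cloud-asset Python client (the project, topic and asset types below are placeholders), creating a feed that watches every resource of the listed types without naming any assets up front:

from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()

feed = asset_v1.Feed()
# Watch all resources of these types; no asset names are required.
feed.asset_types.extend([
    "compute.googleapis.com/Instance",
    "storage.googleapis.com/Bucket",
])
feed.content_type = asset_v1.ContentType.RESOURCE
feed.feed_output_config.pubsub_destination.topic = (
    "projects/my-project/topics/asset_change_feed")  # placeholder topic

response = client.create_feed(
    request={
        "parent": "projects/my-project",  # placeholder project
        "feed_id": "asset_change_feed",
        "feed": feed,
    })
print(response.name)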

DataFlow gcloud CLI - "Template metadata was too large"

I've honed my transformations in DataPrep, and am now trying to run the DataFlow job directly using gcloud CLI.
I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run and passing in the input & output locations as parameters.
I'm getting the error:
Template metadata regex '[ \t\n\x0B\f\r]*\{[ \t\n\x0B\f\r]*((.|\r|\n)*".*"[ \t\n\x0B\f\r]*:[ \t\n\x0B\f\r]*".*"(.|\r|\n)*){17}[ \t\n\x0B\f\r]*\}[ \t\n\x0B\f\r]*' was too large. Max size is 1000 but was 1187.
I've not specified this at the command line, so I know it's getting it from the metadata file - which is straight from DataPrep, unedited by me.
I have 17 input locations - one containing source data, all the others are lookups. There is a regex for each one, plus one extra.
If it's running when prompted by DataPrep, but won't run via CLI, am I missing something?
I'd suspect the root cause is a limitation in gcloud that is not present in the Dataflow API or Dataprep. The best thing to do is to open a new Cloud Dataflow issue in the public tracker and provide the details there.

What is the best way to activate sequentially 2 or more data pipeline on AWS?

I have two distinct pipelines (A and B). When A has terminated I would like to kick off immediately the second one (B).
So far, to accomplish that I have added a ShellCommandActivity with the following command:
aws datapipeline activate-pipeline --pipeline-id <my pipeline id>
Are there other better ways to do that?
You can use a combination of indicator files (zero-byte files) and Lambda to loosely couple the two data pipelines. You need to make the following changes:
Data Pipeline - using a shell command, touch a zero-byte indicator file as the last step of the first data pipeline, in any given S3 path
Create a Lambda function to watch for the indicator file and activate the second data pipeline (see the sketch after this note)
Note - this may not be very helpful if you are looking at the simple scenario of just executing two data pipelines sequentially. However, it is helpful when you want to create intricate dependencies between pipelines, e.g. you have a set of staging jobs (each corresponding to one pipeline) and you want to trigger your data-mart or derived-table jobs only after all the staging jobs have completed.
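A minimal sketch of such a Lambda handler, assuming the indicator files follow a known suffix and the ID of the pipeline to activate is supplied via an environment variable (both of these are assumptions, not part of the original setup):

import os
import boto3

datapipeline = boto3.client("datapipeline")

# Hypothetical convention: the S3 event notification is filtered to the
# indicator files, which end with _SUCCESS.
INDICATOR_SUFFIX = "_SUCCESS"

def lambda_handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if key.endswith(INDICATOR_SUFFIX):
            # TARGET_PIPELINE_ID holds the ID of the second data pipeline
            datapipeline.activate_pipeline(
                pipelineId=os.environ["TARGET_PIPELINE_ID"])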

Dataproc client: googleapiclient: method to get a list of all jobs (running, stopped, etc.) in a cluster

We are using Google Cloud Dataproc to run Spark jobs.
We have a requirement to get a list of all jobs and their states for a given cluster.
I can get the status of a job if I know the job_id, as below:
res = dpclient.dataproc.projects().regions().jobs().get(
    projectId=project,
    region=region,
    jobId="ab4f5d05-e890-4ff5-96ef-017df2b5c0bc").execute()
But what if I don't know the job_id and want to know the status of all the jobs?
To list jobs in a cluster, you can use the list() method:
clusterName = 'cluster-1'
res = dpclient.dataproc.projects().regions().jobs().list(
    projectId=project,
    region=region,
    clusterName=clusterName).execute()
However, note that this currently only supports listing jobs for clusters which still exist; even though you pass in a clusterName, it is resolved to a unique cluster_uuid under the hood. This also means that if you create multiple clusters with the same name, each incarnation is still considered a different cluster, so the job listing is only performed against the currently running instance of that clusterName. This is by design, since cluster names are often reused for different purposes (especially the default generated names created in cloud.google.com/console), and logically the jobs submitted to different actual cluster instances may not be related to each other.
In the future there will be more filter options for job listings.
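To walk through all jobs for the cluster (the results are paginated), a sketch along these lines, reusing the same dpclient handle as above, should work; the state of each job is reported under status.state:

jobs = []
request = dpclient.dataproc.projects().regions().jobs().list(
    projectId=project,
    region=region,
    clusterName=clusterName)
while request is not None:
    response = request.execute()
    for job in response.get('jobs', []):
        # status.state is e.g. PENDING, RUNNING, DONE, ERROR or CANCELLED
        print(job['reference']['jobId'], job['status']['state'])
        jobs.append(job)
    request = dpclient.dataproc.projects().regions().jobs().list_next(
        previous_request=request, previous_response=response)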