Dataflow gcloud CLI - "Template metadata was too large"

I've honed my transformations in Dataprep, and am now trying to run the Dataflow job directly using the gcloud CLI.
I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run, passing in the input and output locations as parameters.
I'm getting the error:
Template metadata regex '[ \t\n\x0B\f\r]*\{[ \t\n\x0B\f\r]*((.|\r|\n)*".*"[ \t\n\x0B\f\r]*:[ \t\n\x0B\f\r]*".*"(.|\r|\n)*){17}[ \t\n\x0B\f\r]*\}[ \t\n\x0B\f\r]*' was too large. Max size is 1000 but was 1187.
I've not specified this at the command line, so I know it's getting it from the metadata file - which is straight from DataPrep, unedited by me.
I have 17 input locations - one containing source data, all the others are lookups. There is a regex for each one, plus one extra.
If it runs when triggered from Dataprep but won't run via the CLI, am I missing something?

I'd suspect the root cause is a limitation in gcloud that is not present in the Dataflow API or Dataprep. The best thing to do is to open a new Cloud Dataflow issue in the public tracker and provide details there.
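If gcloud itself is the blocker, a possible workaround is to launch the exported template through the Dataflow REST API directly, which the answer implies does not enforce this metadata-size check. Below is a minimal Python sketch using google-api-python-client; the project, region, GCS path and parameter names are placeholders, and the real parameter names must come from your exported template metadata.

from googleapiclient.discovery import build

# Build a Dataflow API client using Application Default Credentials.
dataflow = build("dataflow", "v1b3")

# Launch the exported template; gcsPath points at the template file in GCS.
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",                           # placeholder project ID
    location="europe-west1",                          # placeholder region
    gcsPath="gs://my-bucket/templates/my-template",   # placeholder template path
    body={
        "jobName": "dataprep-export-run",
        "parameters": {
            # Placeholder parameter names; use the ones from your metadata file.
            "inputLocations": "gs://my-bucket/input",
            "outputLocations": "gs://my-bucket/output",
        },
    },
)
response = request.execute()
print(response["job"]["id"])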

How to Get the "templateLocation" parameter value of an existing Job in google dataflow

I have a list of existing jobs running in Google Dataflow. I would like to list the jobs running for the last x days and recycle them programmatically. To achieve this I need the name of the template used for a particular job. We can easily get this information from the console in the Job Info view. However, I would like to know if there is any way to get this info from a gcloud command or from an API.
- Solution 1:
You can use the gcloud SDK and a shell script to achieve this:
https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/list
Filter jobs with the given name:
gcloud dataflow jobs list --filter="name=my-wordcount"
List jobs from a given region:
gcloud dataflow jobs list --region="europe-west1"
List jobs created this year:
gcloud dataflow jobs list --created-after=2018-01-01
List jobs created more than a week ago:
gcloud dataflow jobs list --created-before=-P1W
Many filters and parameters are available to match your use case.
- Solution 2:
You can use the REST API for Dataflow jobs:
https://cloud.google.com/dataflow/docs/reference/rest
Example:
GET /v1b3/projects/{projectId}/locations/{location}/jobs
List the jobs of a project.
There is no direct way of getting the template location and name. To meet the requirement above, I defined a naming pattern for the template name and job name; when recycling a job, the template name is computed from that pattern and passed on to the API call.
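For the first part of the question (listing jobs from the last x days programmatically), a rough sketch against the Dataflow REST API with google-api-python-client might look like the following; the project ID and region are placeholders, and the cutoff of 7 days is just an example.

from datetime import datetime, timedelta, timezone
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)   # "last x days"

request = dataflow.projects().locations().jobs().list(
    projectId="my-project",        # placeholder project ID
    location="europe-west1",       # placeholder region
)
while request is not None:
    response = request.execute()
    for job in response.get("jobs", []):
        # createTime is RFC 3339, e.g. 2021-06-14T10:00:00.123456Z
        created = datetime.strptime(job["createTime"][:19], "%Y-%m-%dT%H:%M:%S")
        created = created.replace(tzinfo=timezone.utc)
        if created >= cutoff:
            print(job["id"], job["name"], job["currentState"])
    request = dataflow.projects().locations().jobs().list_next(request, response)

Note that, as stated above, the response does not expose the template location; the naming-pattern workaround is still needed for that.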

GCP Vertex AI Training Custom Job: User does not have bigquery.jobs.create permission

I'm struggling to execute a query with the BigQuery Python client from inside a Vertex AI custom training job on Google Cloud Platform.
I have built a Docker image that contains this Python code, then pushed it to Container Registry (eu.gcr.io).
I am using this command to deploy:
gcloud beta ai custom-jobs create --region=europe-west1 --display-name="$job_name" \
--config=config_custom_container.yaml \
--worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri="$docker_img_path" \
--args="${model_type},${env},${now}"
I have even tried to use the --service-account option to specify a service account with the BigQuery Admin role, but it did not work.
According to this link
https://cloud.google.com/vertex-ai/docs/general/access-control?hl=th#granting_service_agents_access_to_other_resources
the Google-managed service account for the AI Platform Custom Code Service Agent (Vertex AI) already has the right to access BigQuery, so I do not understand why my job fails with this error:
google.api_core.exceptions.Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/*******/jobs?prettyPrint=false:
Access Denied: Project *******:
User does not have bigquery.jobs.create permission in project *******.
I have replaced the project ID with *******.
Edit:
I have tried several configurations; my last config YAML file only contains this:
baseOutputDirectory:
  outputUriPrefix:
Using the serviceAccount field does not seem to change the actual configuration, unlike the --service-account option.
Edit 14-06-2021: Quick fix
As @Ricco.D said:
try explicitly defining the project_id in your bigquery code if you have not done this yet.
bigquery.Client(project=[your-project])
This fixed my problem, although I still do not know the cause.
To fix the issue, you need to explicitly specify the project ID in the BigQuery code.
Example:
bigquery.Client(project=[your-project], credentials=credentials)
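For reference, a minimal sketch of what the query code inside the training container could look like with the project passed explicitly; the project ID, dataset and table are placeholders.

from google.cloud import bigquery

# Passing the project explicitly prevents the client from falling back to a
# default project inside the Vertex AI training container.
client = bigquery.Client(project="my-project")          # placeholder project ID
query = "SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`"  # placeholder query
for row in client.query(query).result():
    print(row.n)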

Log Buckets from Google

Is it possible to download a Log Storage (log bucket) from Google Cloud Platform, specifically the one created by default? If someone knows how, could they explain how to do it?
One possible solution is to choose the required logs and then retrieve them for a time period of one day, so they can be downloaded in JSON or CSV format.
Step 1: From the Logging console, go to advanced filtering mode.
Step 2: To choose the log type, use a filtering query, for example:
resource.type="audited_resource"
logName="projects/xxxxxxxx/logs/cloudaudit.googleapis.com%2Fdata_access"
resource.type="audited_resource"
logName="organizations/xxxxxxxx/logs/cloudaudit.googleapis.com%2Fpolicy"
Step 3: You can download the results in JSON or CSV format.
If a huge number of audit logs is generated per day, the approach above will not work. In that case you need to export the logs to Cloud Storage or BigQuery for further analysis. Note that Cloud Logging doesn't charge to export logs, but destination charges might apply.
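One way to set up such an export is a log sink. A hedged sketch with the google-cloud-logging Python client, where the sink name, filter and bucket are placeholders:

from google.cloud import logging

client = logging.Client(project="my-project")                 # placeholder project ID
sink = client.sink(
    "audit-logs-to-gcs",                                      # hypothetical sink name
    filter_='logName:"cloudaudit.googleapis.com"',            # adjust to the logs you need
    destination="storage.googleapis.com/my-log-bucket",       # placeholder bucket
)
sink.create()
print("Created sink", sink.name)

Remember that the sink's writer identity still needs write access to the destination bucket or dataset.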
Another option: you can use the following gcloud command to download the logs.
gcloud logging read "logName : projects/Your_Project/logs/cloudaudit.googleapis.com%2Factivity" --project=Project_ID --freshness=1d >> test.txt
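If you prefer to do the same from Python, a rough equivalent of that command using the google-cloud-logging client could look like this; the project ID, log name and timestamp are placeholders (the timestamp filter stands in for --freshness=1d).

from google.cloud import logging

client = logging.Client(project="my-project")        # placeholder project ID
log_filter = (
    'logName:"projects/my-project/logs/cloudaudit.googleapis.com%2Factivity" '
    'AND timestamp>="2021-06-13T00:00:00Z"'          # placeholder cutoff, roughly the last day
)
with open("test.txt", "w") as out:
    for entry in client.list_entries(filter_=log_filter):
        out.write(f"{entry.timestamp} {entry.payload}\n")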

Error with gcloud beta command for streaming assets to BigQuery

This might be a bit bleeding edge, but hopefully someone can help. The problem is a catch-22.
What we're trying to do is create a continuous stream of inventory changes in each GCP project into BigQuery dataset tables, so that we can build reports and get a better idea of what we're paying for, what's turned on, what's in use and what isn't, etc.
Error: Error running command 'gcloud beta asset feeds create asset_change_feed --project=project_id --pubsub-topic=asset_change_feed': exit status 2. Output: ERROR: (gcloud.beta.asset.feeds.create) argument (--asset-names --asset-types): Must be specified.
Usage: gcloud beta asset feeds create FEED_ID --pubsub-topic=PUBSUB_TOPIC (--asset-names=[ASSET_NAMES,...] --asset-types=[ASSET_TYPES,...]) (--folder=FOLDER_ID | --organization=ORGANIZATION_ID | --project=PROJECT_ID) [optional flags]
optional flags may be --asset-names | --asset-types | --content-type |
--folder | --help | --organization | --project
For detailed information on this command and its flags, run:
gcloud beta asset feeds create --help
Using Terraform, we tried creating a Dataflow job and a Pub/Sub topic called asset_change_feed.
We get an error when creating the feed because the gcloud beta asset feeds create command wants a parameter that includes all the asset names to monitor...
Well, this kind of defeats the purpose. The whole point is to monitor all the asset names that change, appear and disappear. It's like creating a feed that monitors all the new baby names that appear over the next year, but the feed command requires that we know them in advance somehow. What's the point then? Are we reinventing the wheel here?
We were going by this documentation here:
https://cloud.google.com/asset-inventory/docs/monitoring-asset-changes#creating_a_feed
As per the gcloud beta asset feeds create documentation, it is required to specify at least one of --asset-names and --asset-types:
At least one of these must be specified:
--asset-names=[ASSET_NAMES,…] A comma-separated list of the full names of the assets to receive updates. For example:
//compute.googleapis.com/projects/my_project_123/zones/zone1/instances/instance1.
See
https://cloud.google.com/apis/design/resource_names#full_resource_name
for more information.
--asset-types=[ASSET_TYPES,…] A comma-separated list of the types of the assets to receive updates. For example:
compute.googleapis.com/Disk,compute.googleapis.com/Network See
https://cloud.google.com/resource-manager/docs/cloud-asset-inventory/overview
for all supported asset types.
Therefore, when we don't know the names a priori, we can monitor all resources of the desired types by passing only --asset-types. You can find the list of supported asset types in the Cloud Asset Inventory documentation, or use the exportAssets API method (gcloud asset export) to retrieve the types used at an organization, folder or project level.
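In practice this means keeping --pubsub-topic and passing only --asset-types to gcloud, omitting --asset-names entirely. The same can be done with the Cloud Asset Python client; in this hedged sketch the project ID, topic and asset types are placeholders.

from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()
feed = asset_v1.Feed(
    # Watch whole asset types, so individual asset names need not be known up front.
    asset_types=[
        "compute.googleapis.com/Instance",      # placeholder types
        "compute.googleapis.com/Disk",
    ],
    feed_output_config=asset_v1.FeedOutputConfig(
        pubsub_destination=asset_v1.PubsubDestination(
            topic="projects/my-project/topics/asset_change_feed"   # placeholder topic
        )
    ),
)
response = client.create_feed(
    request={
        "parent": "projects/my-project",        # can also be a folder or organization
        "feed_id": "asset_change_feed",
        "feed": feed,
    }
)
print("Created feed:", response.name)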

How to process only delta files in AWS Data Pipeline and EMR

How can I process only new files using AWS Data Pipeline and EMR? I may get a different number of files in my source directory. I want to process them using AWS Data Pipeline and EMR, one file after another. I'm not sure how the "exists" precondition or the "Shell Command activity" can solve this issue. Please suggest a way to process a delta list of files by adding EMR steps or creating EMR clusters for each file.
The way this is usually done in Data Pipeline is to use schedule expressions when referring to the source directory. For example, if your pipeline is scheduled to run hourly and you specify "s3://bucket/#{format(minusMinutes(#scheduledStartTime,60),'YYYY-MM-dd-HH')}" as the input directory, Data Pipeline will resolve that to "s3://bucket/2016-10-23-16" when it runs at hour 17. So the job will only read data corresponding to hour 16. If you can structure your input to produce data in this manner, this approach can be used. See http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html for more examples of expressions.
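To make the resolution concrete, here is a small Python sketch of what that expression does; the bucket name is a placeholder.

from datetime import datetime, timedelta

# Suppose the pipeline's scheduled start time is 17:00 on 2016-10-23.
scheduled_start = datetime(2016, 10, 23, 17, 0)
previous_hour = scheduled_start - timedelta(minutes=60)    # minusMinutes(#scheduledStartTime, 60)
input_dir = "s3://bucket/" + previous_hour.strftime("%Y-%m-%d-%H")  # format(..., 'YYYY-MM-dd-HH')
print(input_dir)   # s3://bucket/2016-10-23-16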
Unfortunately, there is no built-in support for "get data since last processed".