How do I specify multiple shell scripts as initialization actions for Dataproc cluster creation? - google-cloud-platform

Google's documentation says that --initialization-actions takes a list of GCS URLs. If I specify one:
--initialization-actions 'gs://my-project/myscript.sh'
This works fine.
--initialization-actions 'gs://my-project/myscript.sh', 'gs://my-project/myscript2.sh'
Gives the following error:
INVALID_ARGUMENT: Google Cloud Storage object does not exist 'gs://my-project/myscript.sh gs://my-project/myscript2.sh'
Same without quotes, and with or without a space after the comma.
I tried encapsulating in square brackets:
--initialization-actions ['gs://my-project/myscript.sh', 'gs://my-project/myscript2.sh']
And the error this time is:
Executable '['gs://my-project/myscript.sh', 'gs://my-project/myscript2.sh']' URI must begin with 'gs://'
I can confirm one million percent that the paths I am using are valid, and that both objects are valid shell scripts. Is there something obvious I am missing?

You should remove the space between the scripts:
--initialization-actions gs://my-project/myscript.sh,gs://my-project/myscript2.sh

Just figured it out: the format needs to be:
--initialization-actions 'gs://my-project/myscript.sh, gs://my-project/myscript2.sh'
i.e. both scripts in a single set of quotes, separated by a comma.
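Putting it together, a complete cluster-creation command would look roughly like this (the cluster name and region are placeholders; the script URIs are the ones from the question):
gcloud dataproc clusters create my-cluster --region=us-central1 --initialization-actions='gs://my-project/myscript.sh,gs://my-project/myscript2.sh'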

SDK Gcloud logging timestamp filter

I am having a problem trying to filter my logs down to a specific period of time. Everything besides timestamps works fine. At the moment, this command works as expected:
gcloud logging read "resource.type=XXX logName=projects/YYY/logs/appengine.googleapis.com%2Fnginx.health_check" > test.txt
Other flags like --limit or --freshness work without problems, but as soon as I try to restrict the output to a period of time, the command stops working and I get:
The file name, directory name, or volume label syntax is incorrect.
I've tried many things; this is the command that at least gives me an error:
gcloud logging read "resource.type=XXX logName=projects/YYY/logs/appengine.googleapis.com%2Fnginx.health_check timestamp='2020-01-22T14:02:41.41Z'"
Please help me with the correct syntax for specifying timestamps so that I can get a specific period of time as a result.
I got it!
gcloud logging read "resource.type=XXX logName=projects/YYY/logs/appengine.googleapis.com%2Fnginx.health_check timestamp^>=""2020-01-21T11:00:00Z"" timestamp^<=""2020-01-22T11:00:00Z""" >t.txt
I found it here: Find logs between timestamps using stackdriver CLI
Thank you Braulio Baron for your help!
The syntax error is in the timestamp: it needs a \ before the quotes around the date:
gcloud logging read "resource.type=gae_app AND logName=projects/xxxx/logs/syslog AND timestamp=\"2020-01-22T14:02:41.41Z\""
For more detail, have a look at the gcloud logging read documentation.
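In a bash-style shell (where the cmd.exe-style ^ and doubled-quote escaping used above is not needed), the same date-range filter can be written roughly like this, reusing the resource type and log name from the question:
gcloud logging read 'resource.type=XXX AND logName=projects/YYY/logs/appengine.googleapis.com%2Fnginx.health_check AND timestamp>="2020-01-21T11:00:00Z" AND timestamp<="2020-01-22T11:00:00Z"' > test.txt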

How to download folder containing brackets in name?

I have many folders in Google Cloud Storage that contain square brackets in the name. gsutil treats square brackets as wildcards, and I am unable to download the project. Can I download the folders another way?
I tried using the escape character and quotes. These do not work.
gsutil cp gs://myBucket/[Projects]Number1 /Volumes/DriveName/Desktop
The desired result is to download the files from Google Cloud Storage to my local computer.
gsutil doesn't have a way to escape wildcard characters in file / object names. There's an open issue about this: https://github.com/GoogleCloudPlatform/gsutil/issues/220
Basically, you'll have to use a different tool (or write some code) to handle such files/objects.
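One possible workaround (an untested sketch, and it assumes no other object names happen to match the pattern) is to replace each bracket with gsutil's single-character ? wildcard so that the pattern itself contains no brackets:
gsutil cp -r 'gs://myBucket/?Projects?Number1' /Volumes/DriveName/Desktop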

GCP Dataflow Error: Path "gs://..." is not a valid filepattern. The pattern must be of the form "gs://<bucket>/path/to/file"

I am trying to create a Dataflow job from Pub/Sub to BigQuery in the GCP console.
In the "Create job from template" screen, I am having trouble with what to enter in the "Temporary Location" box. It says "Path and filename prefix for writing temporary files. ex: gs://MyBucket/tmp".
So I specified something like this: "gs://${GOOGLE_CLOUD_PROJECT}-test/dataflow/tmp"
But I am getting this error (the dataflow folder is there, BTW):
Path "gs://${GOOGLE_CLOUD_PROJECT}-test/dataflow/tmp" is not a valid filepattern. The pattern must be of the form "gs://<bucket>/path/to/file".
I tried different patterns but to no avail. Any idea how to resolve this?
It complains that it wants an actual bucket name:
The pattern must be of the form "gs://<bucket>/path/to/file".
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export BUCKET_NAME="${PROJECT_ID}-test"
gsutil ls "gs://${BUCKET_NAME}/dataflow/tmp"
I wondered about the -test suffix, and I've tried to reflect that in the code above.
You can list all valid bucket names with gsutil ls.
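Note that the "Create job from template" form in the console does not expand shell variables; the error message echoing the literal ${GOOGLE_CLOUD_PROJECT} suggests exactly that. One way to get the literal path to paste into the "Temporary Location" box (assuming the bucket really is named <project>-test) is:
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
echo "gs://${PROJECT_ID}-test/dataflow/tmp"
gsutil ls -b "gs://${PROJECT_ID}-test"    # confirm the bucket exists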

DataFlow gcloud CLI - "Template metadata was too large"

I've honed my transformations in DataPrep, and am now trying to run the DataFlow job directly using the gcloud CLI.
I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run and passing in the input & output locations as parameters.
I'm getting the error:
Template metadata regex '[ \t\n\x0B\f\r]*\{[ \t\n\x0B\f\r]*((.|\r|\n)*".*"[ \t\n\x0B\f\r]*:[ \t\n\x0B\f\r]*".*"(.|\r|\n)*){17}[ \t\n\x0B\f\r]*\}[ \t\n\x0B\f\r]*' was too large. Max size is 1000 but was 1187.
I've not specified this at the command line, so I know it's getting it from the metadata file - which is straight from DataPrep, unedited by me.
I have 17 input locations - one containing source data, all the others are lookups. There is a regex for each one, plus one extra.
If it's running when prompted by DataPrep, but won't run via CLI, am I missing something?
I'd suspect the root cause is a limitation in gcloud that is not present in the Dataflow API or Dataprep. The best thing to do in this case is to open a new Cloud Dataflow issue in the public tracker and provide the details there.
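For reference, the kind of invocation being attempted looks roughly like this; the job name, template path, and parameter names below are placeholders, not the exact ones exported by DataPrep:
gcloud dataflow jobs run my-dataprep-job --gcs-location gs://my-bucket/templates/my-template --parameters inputLocation=gs://my-bucket/input,outputLocation=gs://my-bucket/output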

AWS Data Pipeline Escaping Comma in emr activity step section

I am creating an AWS Data Pipeline using the Architect provided in the AWS web console.
Everything is set up OK: my EMR cluster is configured and starts successfully.
But when I try to submit an EMR activity, I come across the following problem:
In the step section of the EMR activity, I need to provide a --packages argument with 3 packages.
As far as I understand, the step field in EmrActivity is a comma-separated value, and commas (,) are replaced with spaces in the resulting step arguments.
On the other hand, the --packages argument is itself comma-separated when there are multiple packages.
So when I pass this as an argument, the commas get replaced with spaces, which makes the step invalid.
This is the argument I need to appear verbatim in the resulting EMR step:
--packages com.amazonaws:aws-java-sdk-s3:1.11.228,org.apache.hadoop:hadoop-aws:2.6.0,org.postgresql:postgresql:42.1.4
Is there any way to escape the comma?
So far I have tried the \\\\ approach mentioned in http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emractivity.html
It did not work.
When you use \\\\, it escapes the backslashes themselves and the comma still gets replaced.
Try using three backslashes instead (\\\,); that is what worked for me.
I hope that works.
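Applied to the --packages value from the question, the escaped step argument would then look roughly like this, with each comma inside the value replaced by \\\, :
--packages com.amazonaws:aws-java-sdk-s3:1.11.228\\\,org.apache.hadoop:hadoop-aws:2.6.0\\\,org.postgresql:postgresql:42.1.4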