Executing a Dataflow job with multiple inputs/outputs using gcloud cli - google-cloud-platform

I've designed a data transformation in Dataprep and am now attempting to run it via the template in Dataflow. My flow has several inputs and outputs; the Dataflow template provides them as a JSON object with a key/value pair for each input and its location. They look like this (line breaks added for readability):
{
  "location1": "project:bq_dataset.bq_table1",
  #...
  "location10": "project:bq_dataset.bq_table10",
  "location17": "project:bq_dataset.bq_table17"
}
I have 17 inputs (mostly lookups) and 2 outputs (one csv, one bigquery). I'm passing these to the gcloud CLI like this:
gcloud dataflow jobs run job-201807301630 \
--gcs-location=gs://bucketname/dataprep/dataprep_template \
--parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}
But I'm getting an error:
ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv
From the error message, it looks like the comma-separated values are being split apart and interleaved: since I have two outputs, each input ends up paired with one of the two outputs in turn:
input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...
I've tried quoting the input/output objects (single and double, plus removing the quotes inside the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?

I finally found a solution for this via a huge process of trial and error. There are several steps involved.
Format of --parameters
The --parameters argument is a dictionary-type argument. There are details on these in a document you can read by typing gcloud topic escaping in the CLI, but in short it means you'll need an = between --parameters and the arguments, and then the format is key=value pairs with the value enclosed in quote marks ("):
--parameters=inputLocations="object",outputLocations="object"
Escape the objects
Then, the double quotes inside the JSON objects need escaping so they don't end the value prematurely, so
{"location1":"gcs://bucket/whatever"...
Becomes
{\"location1\":\"gcs://bucket/whatever\"...
Choose a different separator
Next, the CLI gets confused: the key=value pairs are separated by commas, but the values themselves also contain commas inside the objects. So you can define a different separator by putting it between carets (^) at the start of the argument and between the key=value pairs:
--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"
I used * because ; didn't work - most likely because the shell treats an unquoted ; as the end of the command.
Note also that the gcloud topic escaping info says:
In cmd.exe and PowerShell on Windows, ^ is a special character and
you must escape it by repeating it. In the following examples, every time
you see ^, replace it with ^^^^.
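If I've read that correctly, on Windows the separator definition from above would become:
--parameters=^^^^*^^^^inputLocations=...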
Don't forget customGcsTempLocation
After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:
...}*customGcsTempLocation="gs://bucket/whatever"
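Putting all of that together, the final command ends up looking something like this (the bucket, dataset and table names are just the placeholders from above, and the \ line breaks are only for readability):
gcloud dataflow jobs run job-201807301630 \
--gcs-location=gs://bucketname/dataprep/dataprep_template \
--parameters=^*^inputLocations="{\"location1\":\"project:bq_dataset.bq_table1\",\"location2\":\"project:bq_dataset.bq_table2\"}"*outputLocations="{\"location1\":\"gs://bucketname/output.csv\"}"*customGcsTempLocation="gs://bucketname/temp"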
Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.

Related

gcloud zsh `=` notation vs. string notation

Why does this work: gcloud config list project --format 'value(core.project)', but not this: gcloud config list project --format=value(core.project)? The documentation uses the = notation, yet I get the error number expected when using it. My guess is that it's trying to evaluate the projection value(core.project) as a number, and the quotes tell it to evaluate it as a string.
I'm unfamiliar with zsh, but bash returns a syntax error (unexpected token '(') if you do this.
This is not an issue with gcloud per se.
The issue is that the zsh (and bash) shells have their own interpretation of (...); in bash, it is the syntax for running a subshell.
The solution is to ensure that the flag values are passed to the command as-is rather than be evaluated by the shell.
As @hobbs points out correctly, you can use '...' or "..." to wrap the flag value without issue. My preference is --flag=value, with '...' (in bash) when no variable expansion is ever desired and "..." when it is.
My personal preference is to always use = and default to "..."
gcloud config list project \
--format="value(core.project)"

Environment variables in Google Cloud Build

We want to migrate from Bitbucket Pipelines to Google Cloud Build to test, build and push Docker images.
How can we use environment variables without a CryptoKey? For example:
- printf "https://registry.npmjs.org/:_authToken=${NPM_TOKEN}\nregistry=https://registry.npmjs.org" > ~/.npmrc
To use environment variables in the args portion of your build steps you need:
"a shell to resolve environment variables with $$" (as mentioned in the example code here)
and you also need to be careful with your usage of quotes (use single quotes)
See below the break for a more detailed explanation of these two points.
While the Using encrypted resources docs that David Bendory also linked to (and which you probably based your assumption on) show how to do this using an encrypted environment variable specified via secretEnv, that is not a requirement; it works with normal environment variables too.
In your specific case you'll need to modify your build step to look something like this:
# you didn't show us which builder you're using - this is just one example of
# how you can get a shell using one of the supported builder images
- name: 'gcr.io/cloud-builders/docker'
entrypoint: 'bash'
args: ['-c', 'printf "https://registry.npmjs.org/:_authToken=%s\nregistry=https://registry.npmjs.org" $$NPM_TOKEN > ~/.npmrc']
Note the usage of %s in the string to be formatted and how the environment variable is passed as an argument to printf. I'm not aware of a way that you can include an environment variable value directly in the format string.
Alternatively you could use echo as follows:
args: ['-c', 'echo -e "https://registry.npmjs.org/:_authToken=$${NPM_TOKEN}\nregistry=https://registry.npmjs.org" > ~/.npmrc']
(Note the -e flag, which bash's echo needs in order to interpret the \n escape.)
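For completeness, here's a minimal sketch of how the variable itself can reach the step, assuming you map a user-defined substitution _NPM_TOKEN into a step-level env entry (the names here are assumptions, not from the question):
steps:
- name: 'gcr.io/cloud-builders/docker'
  entrypoint: 'bash'
  # _NPM_TOKEN is a user-defined substitution; env makes it visible to bash
  env: ['NPM_TOKEN=${_NPM_TOKEN}']
  args: ['-c', 'printf "https://registry.npmjs.org/:_authToken=%s\nregistry=https://registry.npmjs.org" $$NPM_TOKEN > ~/.npmrc']
You would then supply the value at submit time, e.g. gcloud builds submit --substitutions=_NPM_TOKEN=... (or keep it out of the command line entirely via secretEnv, as the docs describe).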
Detailed explanation:
My first point at the top can actually be split in two:
you need a shell to resolve environment variables, and
you need to escape the $ character so that Cloud Build doesn't try to perform a substitution here
If you don't do 2. your build will fail with an error like: Error merging substitutions and validating build: Error validating build: key in the template "NPM_TOKEN" is not a valid built-in substitution
You should read through the Substituting variable values docs and make sure that you understand how that works. Then you need to realise that you are not performing a substitution here, at least not a Cloud Build substitution. You're asking the shell to perform a substitution.
In that context, 2. is actually the only useful piece of information that you'll get from the Substituting variable values docs (that $$ evaluates to the literal character $).
My second point at the top may be obvious if you're used to working with the shell a lot. The reason for needing to use single quotes is well explained by these two questions. Basically: "You need to use single quotes to prevent interpolation happening in your calling shell."
That sounds like you want to use Encrypted Secrets: https://cloud.google.com/cloud-build/docs/securing-builds/use-encrypted-secrets-credentials

Detecting semicolon as command line argument in linux

I am trying to run a C++ application where I am passing some command line arguments to it as follows:
./startServer -ip 10.78.242.4 tcpip{ldap=no;port=2435}
The application crashes because it is not able to get the correct port. Searching the web, I found that ";" is treated as an end-of-command character (Semicolon on command line in linux), so everything after it is ignored. I also understand that putting the argument inside quotes will work fine. However, I do not want to force the restriction of quoting the arguments onto users. So I want to know: is there a way I can receive the ";" character in the argv array?
The semicolon separates two commands, so your command line is equivalent to:
./startServer -ip 10.78.242.4 tcpip{ldap=no
port=2435}
Your application will never know anything about either the semicolon or the second command; these are handled entirely by the shell. You need to escape the semicolon with a backslash or enclose it in quotes. Other characters which may cause similar issues include: $,\-#`'":*?()&|
Complex strings are much easier to pass either from a file or through stdin.
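If you want to see for yourself what actually reaches the program, a throwaway script makes it obvious (argdump is just a hypothetical name):
cat > argdump <<'EOF'
#!/bin/sh
printf 'arg: %s\n' "$@"
EOF
chmod +x argdump
./argdump -ip 10.78.242.4 tcpip{ldap=no;port=2435}   # prints args only up to tcpip{ldap=no
./argdump -ip 10.78.242.4 'tcpip{ldap=no;port=2435}' # prints the whole argument intact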
You need to quote not only the ; but in the general case also the { and }:
./startServer -ip 10.78.242.4 'tcpip{ldap=no;port=2435}'
If your users are required to type in that complicated last argument, then they can also be made to quote it.
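For users who would rather not quote the whole thing, escaping just the semicolon also works; the braces stay literal here anyway, since there is no comma or range inside them to trigger brace expansion:
./startServer -ip 10.78.242.4 tcpip{ldap=no\;port=2435}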

Amazon Redshift - COPY from CSV - single Double Quote in row - Invalid quote formatting for CSV Error

I'm loading a CSV file from S3 into Redshift. This CSV file is analytics data which contains the PageUrl (which may contain user search info inside a query string for example).
It chokes on rows containing a single double-quote character; for example, if there is a page for a 14" toy then the PageUrl would contain:
http://www.mywebsite.com/a-14"-toy/1234.html
Redshift understandably can't handle this as it is expecting a closing double quote character.
The way I see it my options are:
Pre-process the input and remove these characters
Configure the COPY command in Redshift to ignore these characters but still load the row
Set MAXERRORS to a high value and sweep up the errors using a separate process
Option 2 would be ideal, but I can't find it!
Any other suggestions if I'm just not looking hard enough?
Thanks
Duncan
It's 2017 and I ran into the same problem; happy to report there is now a way to get Redshift to load CSV files with the odd " in the data.
The trick is to use the ESCAPE keyword, and also to NOT use the CSV keyword.
I don't know why, but having the CSV and ESCAPE keywords together in a COPY command resulted in failure with the error message "CSV is not compatible with ESCAPE;"
However, with no change to the data being loaded, I was able to load successfully once I removed the CSV keyword from the COPY command.
You can also refer to this documentation for help:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape
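As a concrete sketch of that COPY (the table name, bucket and IAM role are placeholders I've assumed, not from the question):
COPY analytics_pages
FROM 's3://my-bucket/pageviews.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER ','
ESCAPE;
The point is what's absent: no CSV keyword, so the stray " is treated as ordinary data rather than the start of a quoted field.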
Unfortunately, there is no way to fix this. You will need to pre-process the file before loading it into Amazon Redshift.
The closest options you have are CSV [ QUOTE [AS] 'quote_character' ] to wrap fields in an alternative quote character, and ESCAPE, which treats a quote character preceded by a backslash as literal data. Alas, both require the file to be in a particular format before loading.
See:
Redshift COPY Data Conversion Parameters
Redshift COPY Data Format Parameters
I have done this using DELIMITER ',' IGNOREHEADER 1 as the replacement for 'CSV' at the end of the COPY command. It's working really well.

Use regex with grep to filter data from the output of a verbose command

I am working with a cloud environment, and there is a command that will display all available information about running VMs. Here is an example of some of the lines that pertain to one VM:
RESERVATION r-6D0F464B 170506678332 GroupD
INSTANCE i-E9B444A9 emi-376642D8 999.99.999.999 88.888.88.888 running lock_key 0 c1.xlarge 2013-06-17T18:40:56.270Z cluster01 eki-E7E242A3 monitoring-disabled 999.99.999.999 88.888.88.888 ebs
I need to be able to pull the i-********, the emi-********, both IP addresses, the status, the lock_key, the c1.xlarge, and the monitoring-disabled/enabled value.
I have been able to pull the whole line with some super simple regex, but all of this is well beyond me. If there is an easier method of grabbing this data, any suggestions are welcome.
Let's go by parts. The best way I can think of is redirecting the output to a file; in Unix-like environments you do it like this:
your-command > filename.txt
Second, you need to read the file line by line. I would use a Python or Perl script if you know either of those, or whatever language fits you.
Third, you can get values two different ways:
Read columns by position; you can match a column with a regex like [^\s]+
Write regular expressions for each specific column, so for an IP you could have something like ([0-9]{1,3}\.){3}[0-9]{1,3}, for monitoring monitoring-([^\s]+), and so on (see the grep sketch below).
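Since the question mentions grep, here is the second approach as a one-liner sketch (your-command stands in for whatever produces the listing; \b is a GNU grep word boundary, used so that i- doesn't also match inside eki-; extend the alternation for the remaining fields):
your-command | grep '^INSTANCE' | grep -oE '\bi-[0-9A-F]+|\bemi-[0-9A-F]+|([0-9]{1,3}\.){3}[0-9]{1,3}|\bmonitoring-[a-z]+'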
As long as the fields are always in the same order, all you need to do is split on whitespace.
Pseudocode (well, it's ruby, but hopefully you get the idea):
vms = {}
File.open('vm-info').readlines.each do |line|
  fields = line.split            # default split is on runs of whitespace
  field_map = {}
  vm_name = fields[<index_of_vm_name>]
  field_map['emi'] = fields[<index_of_emi>]
  field_map['ip_address'] = fields[<index_of_ip_address>]
  .
  .
  .
  vms[vm_name] = field_map
end
After this, vms will be initialized to contain information about each vm. You can simply print them all out at this point, or continue running data manipulation on them.