Sed | Variable containing regex causes invalid reference error

I'm having problems with sed and back-referencing when using variables containing regexes.
It is a parser written in bash. At a very early stage, I want to use sed to reduce every line to the needed data: the indentation, a key and a value (colon separated). The data is similar to YAML but uses an equals sign.
A basic example of the data:
overview = peparing 2016-10-22
license= sorted 2015-11-01
The function I'm having problems with does the logic in a while loop:
function prepare_parsing () {
    local file=$1
    # regex components:
    local s='[[:space:]]*' \
          w='[a-zA-Z0-9_]*' \
          fs=':'
    # regexes(NoQuotes, SingleQuotes, DoubleQuotes):
    local searchNQ='^('$s')('$w')'$s'='$s'(.*)'$s'$' \
          searchSQ='^('$s')('$w')'$s'='$s\''(.*)'\'$s'\$' \
          searchDQ='^('$s')('$w')'$s'='$s'"(.*)"'$s'\$' \
          replace="\1$fs\2$fs\3"
    while IFS="$fs" read -r indentation key value; do
        ...
        SOME CUSTOM LOGIC
        ...
    done < <(sed -n "s/${searchNQ}/${replace}/p" $file)
}
When I call the function, I get the well-known invalid reference error for \3: invalid reference \3 on `s' command's RHS
To debug this, I printed the variables' values right after their definitions, using printf with the %q format:
printf "%q\n" $searchNQ $searchSQ $searchDQ $replace
Getting these values:
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\(.\*\)\[\[:space:\]\]\*\$
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\'\(.\*\)\'\[\[:space:\]\]\*\\\$
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\"\(.\*\)\"\[\[:space:\]\]\*\\\$
$'\\1\034\\2\034\\3'
Maybe this is the problem: the excessive escape sequences that appear when the shell (bash) expands the variables (for example, it seems to be escaping the *, the [], ...).
If I pass the -r option to sed, it works perfectly, but I have to avoid this since the system that will execute the script won't have this sed implementation: I have to use basic sed.
Do you have any idea on how to store the regex into variables and make them usable for the backreferencing on the RHS?
It works in these two cases:
When using a plain regex string:
sed -n "s/^\([[:space:]]*\)\([a-zA-Z0-9_]*\)[[:space:]]*=[[:space:]]*\(.*\)[[:space:]]*\$/\1:\2:\3/p" $file
And when I use just the vars s, w and fs:
sed -n "s/^\($s\)\($w\)$s=$s\(.*\)$s\$/\1$fs\2$fs\3/p" $file
Many thanks for the help!

perl, which supports extended regular expressions, can be used instead of sed, for example:
perl -n -e "s/${searchNQ}/${replace}/; print"
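The error can also be avoided in plain sed. In a basic regular expression (the default without -r), a group only becomes a back-reference target when it is written \( ... \); an unescaped ( is a literal character, so the pattern built in the question defines no groups and \3 has nothing to refer to. The printf %q output only shows shell quoting for display; it is not what sed receives. A minimal sketch, assuming the same s, w and fs components as in the question (the function name parse_lines_bre is hypothetical, and it reads standard input rather than a file):

```shell
parse_lines_bre() {
    # regex components, as in the question
    local s='[[:space:]]*'
    local w='[a-zA-Z0-9_]*'
    local fs=':'
    # Basic regular expression: groups are written \( ... \).
    # Inside double quotes the shell leaves \( and \) untouched and turns \$ into $.
    local searchNQ="^\($s\)\($w\)$s=$s\(.*\)$s\$"
    local replace="\1$fs\2$fs\3"
    sed -n "s/${searchNQ}/${replace}/p"
}

# Usage with the sample data from the question:
printf '%s\n' 'overview = peparing 2016-10-22' 'license= sorted 2015-11-01' | parse_lines_bre
```

Each output line then has the form indentation:key:value, which is what the while loop in the question expects to read.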

Related

How to use sed in shell script to replace all environment value occurrences with their current values

I would like to have a shell script to iterate over all the occurrences of environment variable names in a file and replace them with their current values. I am not sure how this can be done by using sed command.
The file content:
values:
value1:
name: "something"
value: "$ENV_VAR1" # this could be any variable name
value2:
name: "something"
value: "$ENV_VAR2"
...
First, I need to find all occurrences of any variable (Using regex "\$(.*?)" ). Then, somehow, I need to replace it with the variable value from the shell. I am not sure how I can use the sed command to achieve the second part as the variable name is specified in the file itself.
Something like the following command:
sed -i "s/\"\$(.*?)\"/${Some_How_Get_Var_Name}/g" file.yaml
This is a problem that comes up often. envsubst is commonly given as a solution, but I find it's easier to just stick with perl and do something like:
perl -pe 'while (my ($k, $v) = each %ENV) { s/\$$k/$v/g }'
This is almost certainly not a robust solution (it will replace $FOO, but it won't do replacements of the form ${FOO}), but I find I'm always disappointed that envsubst doesn't do ${FOO-bar}, and envsubst seems less ubiquitous than perl.
Or, rather than doing the replacement for everything in the environment, you might prefer something like:
perl -pe 's/\$([[:alpha:]_][_[:alnum:]]+)/$ENV{$1}/g'
or
perl -pe 's/\$([[:alpha:]_][\w]+)/$ENV{$1}/g'
These last two will replace '$FOO' with the empty string if FOO is not defined, while the first leaves it unreplaced. Which behavior you desire may drive the decision as to which to use.
I won't claim these are completely correct, but they are a reasonable approximation.
If you are using bash and the envsubst command is available, you can do:
envsubst < inputfile
E.g. (creating a temporary input to demonstrate it):
$ env | tail -2 | sed 's_^_$_'
$MANPATH=/home/linuxbrew/.linuxbrew/share/man:
$INFOPATH=/home/linuxbrew/.linuxbrew/share/info:
Then running this through envsubst:
$ env | tail -2 | sed 's_^_$_' | envsubst
/home/linuxbrew/.linuxbrew/share/man:=/home/linuxbrew/.linuxbrew/share/man:
/home/linuxbrew/.linuxbrew/share/info:=/home/linuxbrew/.linuxbrew/share/info:
This might work for you (GNU sed):
sed '/value:/{y/"/\n/;s/^.*/printf "&"/e;y/\n/"/}' file
On any line containing the string value: convert any "'s to newlines, use printf to convert the environmental variables to their real values and reconvert the introduced newlines back to "'s.
N.B. If the environmental variable can contain "'s, these will need to be quoted following the printf command, i.e. insert s/"/\\"/g before the last y command.
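Where neither perl nor envsubst is at hand, the same idea can be sketched with awk, which exposes exported variables through its ENVIRON array. This is only a rough approximation under stated assumptions: the function name expand_env_refs is hypothetical, only the $NAME form is handled (not ${NAME}), undefined variables become the empty string, and only exported variables are visible.

```shell
expand_env_refs() {
    # Reads text on standard input and replaces every $NAME reference
    # with the value of the exported variable NAME.
    awk '{
        out = ""
        rest = $0
        # Repeatedly find the leftmost $NAME in the unprocessed remainder.
        while (match(rest, /\$[A-Za-z_][A-Za-z0-9_]*/)) {
            name = substr(rest, RSTART + 1, RLENGTH - 1)   # NAME without the $
            out = out substr(rest, 1, RSTART - 1) ENVIRON[name]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

# Usage:
#   export ENV_VAR1=alpha
#   expand_env_refs < file.yaml > file.expanded.yaml
```

Because substituted values are appended to the output and never rescanned, a value that itself contains a $ is not expanded a second time.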

sed regular expression does not work as expected. Differs on pipe and file

I have a string in a text file where I want to replace the version number. The quotation marks can be either ' or ". Also, the spaces around = may or may not be present:
$data['MODULEXXX_VERSION'] = "1.0.0";
For testing I use
echo "_VERSION'] = \"1.1.1\"" | sed "s/\(_VERSION.*\)[1-9]\.[1-9]\.[1-9]/\11.1.2/"
which works perfectly.
When I change it to search in the file (the file has the same string):
sed "s/\(_VERSION.*\)[1-9]\.[1-9]\.[1-9]/\11.1.2/" -i test.php
, it does not find anything.
After playing with the search part of the regex, I found one more odd thing:
sed "s/\(_VERSION.*\)[1-9]\./\1***/" -i test.php
works and changes the string to $data['MODULEXXX_VERSION'] = "***0.0";, but
sed "s/\(_VERSION.*\)[1-9]\.[1-9]/\1***/" -i test.php
does not find anything anymore. Why?
I am using Ubuntu 17.04 desktop.
Can anyone explain what I am doing wrong? What would be the best command for replacing version numbers in the file for the string $data['MODULEXXX_VERSION'] = "***0.0";?
The main problem is that [1-9] doesn't match the 0s in the version number. You need to use [0-9].
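A minimal sketch of the corrected substitution, assuming a three-part numeric version and no digits between _VERSION and the version number (the function name set_version is hypothetical). It uses [^0-9]* instead of .* for the prefix so that a greedy match cannot swallow the leading digits of a multi-digit version such as 10.0.3:

```shell
set_version() {
    # $1: new version string, assumed to contain only digits and dots
    # \1 keeps everything from _VERSION up to the first digit;
    # sed back-references are a single digit, so \11.1.2 means \1 followed by 1.1.2.
    sed "s/\(_VERSION[^0-9]*\)[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/\1$1/"
}

# Usage:
printf '%s\n' "\$data['MODULEXXX_VERSION'] = \"1.0.0\";" | set_version 1.1.2
```

The same expression can be passed to sed -i to edit the file in place once the output looks right.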
Besides that, you may use the following sed command:
sed -r 's/(.*_VERSION['\''"]]\s*=\s*).*/\1"1.0.1";/' conf.php
This doesn't look at the current value, it simply replaces everything after the =.
I've used -r which enables extended posix regular expressions which makes it a bit simpler to formulate the pattern.
Another, probably cleaner attempt is to store the conf.php as a template like conf.php.tpl and then use a template engine to render the file. Or if you really want to use sed, the file may look like:
$data['FOO_VERSION'] = "FOO_VERSION_TPL";
Then just use:
sed 's/FOO_VERSION_TPL/1.0.1/' conf.php.tpl > conf.php
If there are multiple values to replace:
sed \
-e 's/FOO/BAR/' \
-e 's/HELLO/WORLD/' \
conf.php.tpl > conf.php
But I recommend a template engine instead of sed. That becomes more important when the content of the variables to replace may contain characters special to regular expressions.

How does bash expand escaped characters when dereferencing variables

If I stop using variables and just write the regexes directly into the last sed command,
everything works. But as written here, no substitutions are done. Why?
#!/bin/bash
#html substitutions
ampP="\&"
ampR="&"
ltP="\<"
ltR="<"
gtP="\>"
gtR=">"
quotP="\""
quotP2='\“'
quotP3="\”"
quotR="\""
tripDotP="\&#8230"
tripDotR="..."
tickP="\’"
tickR="\´"
#get a random page, and filter out the quotes
#pick a random quote
#translate weird html symbols
curl "www.yodaquotes.net/page/$((RANDOM % 9 +1))/" -Ls | sed -nr 's/.*data-text=\"([^\"]+)\".*/\1/p' \
| sort -R | head -n1 \
| sed 's/"$ampP"/"$ampR"/g; s/$ltP/$ltR/g; s/$gtP/$gtR/g; s/$quotP/$quotR/g; s/"$quotP2"/"$quotR"/g; s/$quotP3/$quotR/g; s/$tripDotP/$tripDotR/g; s/$stickP/$stickR/g'
This sed isn't going to work:
sed 's/"$ampP"/"$ampR"/g'
because of wrong shell quoting. Your shell variables won't be expanded at all in single quotes. Try using this form:
sed "s~$ampP~$ampR~g"
Debugging 101: Let's just echo what sed receives:
echo 's/"$ampP"/"$ampR"/g; s/$ltP/$ltR/g; s/$gtP/$gtR/g; s/$quotP/$quotR/g; s/"$quotP2"/"$quotR"/g; s/$quotP3/$quotR/g; s/$tripDotP/$tripDotR/g; s/$stickP/$stickR/g'
s/"$ampP"/"$ampR"/g; s/$ltP/$ltR/g; s/$gtP/$gtR/g; s/$quotP/$quotR/g; s/"$quotP2"/"$quotR"/g; s/$quotP3/$quotR/g; s/$tripDotP/$tripDotR/g; s/$stickP/$stickR/g
That doesn't look right now, does it?
There's no variable substitution within single quotes in bash. That's why we have two different quotes, so you can decide which one is more appropriate for the task.
For readability I would suggest putting each sed command within double quotes.
Like this: "s/$ampP/$ampR/g"
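A small sketch of that suggestion, with each sed command in its own double-quoted -e argument so the shell expands the variables before sed sees them. The entity spellings (&lt;, &gt;, &amp;) and the function name decode_entities are assumptions for illustration. Note that & in the replacement part of s/// means "the whole match", so a literal ampersand has to be written as \&:

```shell
decode_entities() {
    # Pattern/replacement pairs (entity spellings are assumptions).
    local ltP='&lt;'
    local ltR='<'
    local gtP='&gt;'
    local gtR='>'
    local ampP='&amp;'
    local ampR='\&'    # \& yields a literal & in the replacement
    # Double quotes let the shell expand the variables.
    # &amp; is handled last so that "&amp;lt;" is not decoded twice.
    sed -e "s/$ltP/$ltR/g" \
        -e "s/$gtP/$gtR/g" \
        -e "s/$ampP/$ampR/g"
}

# Usage:
printf '%s\n' 'a &lt; b &amp;&amp; c &gt; d' | decode_entities
```

If a pattern or replacement may contain a slash, a different delimiter such as ~ avoids having to escape it.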

GREP: variable in regular expression

If I want to look whether a string is alphanumeric and shorter than a certain value, say 10, I would do like this (in BASH+GREP):
if grep '^[0-9a-zA-Z]\{1,10\}$' <<<$1 ; then ...
(BTW: I'm checking for $1, i.e. the first argument)
What if I want the value 10 to be written on a variable, e.g.
UUID_LEN=10
if grep '^[0-9a-zA-Z]\{1,$UUID_LEN\}$' <<<$1 ; then ...
I tried all sort of escapes, braces and so on, but could not avoid the error message
grep: Invalid content of \{\}
After googling and reading bash and grep tutorials I'm pretty convinced it can't be done. Am I wrong? Any way to go around this?
You need to use double quotes so that the shell expands the parameter before passing the resulting argument to grep:
if grep "^[0-9a-zA-Z]\{1,$UUID_LEN\}$" <<<$1 ; then ...
bash can perform regular expression matching itself, without having to start another process to run grep:
if [[ $1 =~ ^[0-9a-zA-Z]{1,$UUID_LEN}$ ]]; then
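Wrapped in a function, the grep variant might look like the following sketch. The function name is_short_alnum is hypothetical, and it uses grep -E (extended syntax, so the braces need no backslashes) with -q to suppress output. The pattern sits in double quotes so that the length is expanded by the shell, and the trailing $ is escaped to keep it literal:

```shell
is_short_alnum() {
    # $1: string to check, $2: maximum length (assumed to be a positive integer)
    # Succeeds if $1 consists of 1 to $2 alphanumeric characters.
    printf '%s\n' "$1" | grep -Eq "^[0-9a-zA-Z]{1,$2}\$"
}

# Usage:
UUID_LEN=10
if is_short_alnum "abc123" "$UUID_LEN"; then
    echo "valid"
fi
```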

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
    local pattern=$1
    local string=$2
    local result=${string/${pattern}*/}
    [ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
i=0
# $regex is left unquoted: a quoted right-hand side is matched as a literal string
if [[ $var =~ $regex ]]; then
    n=${#BASH_REMATCH[*]}
    while [[ $i -lt $n ]]
    do
        echo "capture[$i]: ${BASH_REMATCH[$i]}"
        let i++
    done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
    awk '{
        where = match($0, '"$regex_to_find"')
        if (where)
            print where
        else
            print -1
    }')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
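The -v approach mentioned above can be sketched as follows. The function name regex_index is hypothetical. awk's match() reports a 1-based position, and -v applies backslash-escape processing to the value, so a regex containing backslashes would need them doubled. On Solaris 10 the default /usr/bin/awk may be too old for match() and -v, in which case nawk or /usr/xpg4/bin/awk is probably the one to call; treat that as an assumption to verify on the target system.

```shell
regex_index() {
    # $1: extended regular expression, $2: string to search
    # Prints the 1-based index of the first match, or -1 if there is none.
    printf '%s\n' "$2" |
        awk -v re="$1" '{
            where = match($0, re)        # 1-based position, 0 if no match
            print (where ? where : -1)
            exit                         # only the first line is examined
        }'
}

# Usage:
regex_index 'a[0-9]+' 'This is a a123 test'
```

Passing the regex with -v keeps it out of the awk program text, which avoids the quoting gymnastics of splicing a shell variable into the middle of a single-quoted script.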
If you use bash 4.x you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill