How do I perform a regex test in bash that starts with spaces and includes quotation marks? - regex

I'm trying to write a bash script that will change the fill color of certain elements within SVG files. I'm inexperienced with shell scripting, but I'm good with regexes (...in JS).
Here's the SVG tag I want to modify:
<!-- is the target because its ID is exactly "the.target" -->
<path id="the.target" d="..." style="fill:#000000" />
Here's the bash code I've got so far:
local newSvg="" # will hold newly-written SVG file content
while IFS="<$IFS" read tag
do
if [[ "${tag}" =~ +id *= *"the\.target" ]]; then
tag=$(echo "${tag}" | sed 's/fill:[^;];/fill:${color};/')
fi
newSvg="${newSvg}${tag}"
done < ${iconSvgPath} # is an argument to the script
Explained: I'm using read (splitting the file on < via custom IFS) to read the SVG content tag by tag. For each tag, I test to see if it includes an id property with the exact value I want. If it doesn't, I add this tag as-is to a newSvg string that I will later write to a file. If the tag does have the desired ID, I'll use sed to replace fill:STUFF; with fill:${myColor};. (Note that my sed is also failing, but that's not what I'm asking about here.)
It fails to find the right line with the test [[ "${tag}" =~ +id *= *"the\.target" ]].
It succeeds if I change the test to [[ "${tag}" =~ \"the\.target\" ]].
I'm not happy with the working version because it's too brittle. While I don't intend to support all the flexibility of XML, I would like to be tolerant of semantically irrelevant whitespace, as well as the id property being anywhere within the tag. Ideally, the regex I'd like to write would express:
id (preceded by at least one whitespace)
followed by zero or more whitespaces
followed by =
followed by zero or more whitespaces
followed by "the.target"
I think I'm not delimiting the regex properly inside the [[ ... =~ REGEX ]] construction, but none of the answers I've seen online use any delimiters whatsoever. In javascript, regex literals are bounded (e.g. / +id *= *"the\.target"/), so it's straightforward beginning a regex with a whitespace character that you care about. Also, JS doesn't have any magic re: *, whereas bash is 50% magic-handling-of-asterisks.
Any help is appreciated. My backup plan is maybe to try to use awk instead (which I'm no better at).
EDIT: My sed was really close. I forgot to add + after the [^;] set. Oof.

It would be much easier if you defined the regular expression pattern in a variable:
tag=' id = "the.target"'
pattern=' +id *= *"the\.target"'
if [[ $tag =~ $pattern ]]; then
echo matched.
fi
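To make the whitespace tolerance from the question concrete, here is a minimal sketch along the same lines. The function name matches_target and the sample tags are made up for illustration, and [[:space:]] is swapped in for the literal space so that tabs are tolerated too.

```bash
#!/usr/bin/env bash
# Keep the regex in a single-quoted variable so the spaces and double
# quotes reach the regex engine untouched.
pattern='[[:space:]]+id[[:space:]]*=[[:space:]]*"the\.target"'

# matches_target is a hypothetical helper name, not part of the question.
matches_target() {
  # $pattern is deliberately unquoted: quoting it would force a literal
  # string comparison instead of a regex match.
  [[ $1 =~ $pattern ]]
}

matches_target 'path id="the.target" d="..."' && echo "compact: match"
matches_target 'path d="..."   id = "the.target"' && echo "spaced: match"
matches_target 'path id="theXtarget"' || echo "wrong id: no match"
matches_target 'path grid="the.target"' || echo "no whitespace before id: no match"
```

The last case shows why the leading whitespace requirement matters: without it, an attribute such as grid="the.target" would be picked up as well.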

Thank you for giving us such a clear example that regex is not the way to solve this problem.
An SVG file is an XML file, and one possible tool for modifying XML is xmlstarlet.
Try this script I called modifycolor:
#!/bin/bash
# invoke as: modifycolor <svg.file> <target_id> <new_color>
xmlstarlet edit \
--update "//path[@id = '$2']/@style" --value "fill:#$3" \
"$1"
Assuming the svg file is test.svg, invoke it as:
./modifycolor test.svg the.target ff0000
You will be astonished by the result.
If you want to paste a piece of code inside your bash script, try this:
target="the.target"
newSvg=$(xmlstarlet edit \
--update "//path[@id = '${target}']/@style" --value "fill:#${myColor}" \
"${iconSvgPath}")

Thanks to folks for pointing out the mistakes in my bash-fu, I came up with this code which does what I said I wanted. I will not be marking this as the accepted answer because, as folks have observed, regex is a bad way to operate on XML. Sharing this for posterity.
local newSvg="" # will hold newly-written SVG code
while IFS="<$IFS" read tag
do
if [[ "${tag}" =~ \ +id\ *=\ *\"the\.target\" ]]; then
tag=$(echo "${tag}" | sed -E 's/fill:[^;]+;/fill:'"${color}"';/')
fi
newSvg="${newSvg}${tag}"
done < ${iconSvgPath}
Fixes:
escape the whitespace in the regex: =~ \ +id\ *=\ *
for sed, switch to double-quotes for the variable in the pattern
also for sed, I added the -E extended regex flag in order to support the + quantifier (the negated set [^;] works without it)
Re: XML, I'll be comparing the list of available CLI-friendly XML parsers to the set of tools commonly available on my users' machines.
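One caveat with the sed line above: the sample tag in the question has style="fill:#000000" with no trailing semicolon, so fill:[^;]+; would find nothing to replace there. A possible variation, sketched below, stops at either a semicolon or the closing quote. set_fill is a made-up helper name, and the sketch assumes the color is a plain hex value with no /, & or \ in it.

```bash
#!/usr/bin/env bash
# set_fill TAG COLOR: replace the fill value inside TAG (hypothetical helper).
# Assumes COLOR contains no '/', '&' or '\', which sed would treat specially.
set_fill() {
  # [^;"]+ stops at a semicolon or at the closing quote of the attribute,
  # so the last (or only) style property is handled as well.
  printf '%s\n' "$1" | sed -E 's/fill:[^;"]+/fill:'"$2"'/'
}

set_fill 'path id="the.target" d="..." style="fill:#000000" /' '#ff0000'
set_fill 'path id="the.target" style="fill:#000000;stroke:none" /' '#ff0000'
```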

Related

"sed" special characters handling

we have an sed command in our script to replace the file content with values from variables
for example..
export value="dba01upc\Fusion_test"
sed -i "s%{"sara_ftp_username"}%$value%g" /home_ldap/user1/placeholder/Sara.xml
The sed command does not keep special characters like '\': it writes the string "dba01upcFusion_test", without the '\'.
It works if I do the export like export value='dba01upc\Fusion_test' (with the value surrounded by ‘’), but unfortunately our client wants to export the original text dba01upc\Fusion_test with single/double quotes and doesn't want to add any extra characters to the text.
Can anyone let me know how to make sed insert text that contains special characters?
Before Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account >
<ser:description/>
<ser:static-account>
<con:username>{sara_ftp_username}</con:username>
</ser:static-account>
</ser:service-account>
After Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account>
<ser:description/>
<ser:static-account>
<con:username>dba01upcFusion_test</con:username>
</ser:static-account>
</ser:service-account>
Thanks in advance
You cannot robustly solve this problem with sed. Just use awk instead:
awk -v old="string1" -v new="string2" '
idx = index($0,old) {
$0 = substr($0,1,idx-1) new substr($0,idx+length(old))
}
1' file
Ah, @mklement0 has a good point - to stop escapes from being interpreted you need to pass in the values in the arg list along with the file names and then assign the variables from that, rather than assigning values to the variables with -v (see the summary I wrote a LONG time ago for the comp.unix.shell FAQ at http://cfajohnson.com/shell/cus-faq-2.html#Q24 but apparently had forgotten!).
The following will robustly make the desired substitution (a\ta -> e\tf) on every search string found on every line:
$ cat tst.awk
BEGIN {
old=ARGV[1]; delete ARGV[1]
new=ARGV[2]; delete ARGV[2]
lgthOld = length(old)
}
{
head = ""; tail = $0
while ( idx = index(tail,old) ) {
head = head substr(tail,1,idx-1) new
tail = substr(tail,idx+lgthOld)
}
print head tail
}
$ cat file
a\ta a a a\ta
$ awk -f tst.awk 'a\ta' 'e\tf' file
e\tf a a e\tf
The white space in file is tabs. You can shift ARGV[3] down and adjust ARGC if you like but it's not necessary in most cases.
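To see the difference between -v and the argument list for yourself, here is a small check. The variable names are only for illustration.

```bash
#!/usr/bin/env bash
literal='a\tb'   # a, backslash, t, b: four characters

# With -v, awk processes escape sequences, so \t becomes a real tab.
via_v=$(awk -v s="$literal" 'BEGIN { print s }')

# Read from ARGV, the string is taken as-is. With only a BEGIN block,
# awk never tries to open the argument as a file.
via_argv=$(awk 'BEGIN { print ARGV[1] }' "$literal")

printf '%s\n' "via -v:   $via_v" "via ARGV: $via_argv"
```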
Update with the benefit of hindsight, to present options:
Update 2: If you're intent on using sed, see the - somewhat cumbersome, but now robust and generic - solution below.
If you want a robust, self-contained awk solution that also properly handles both arbitrary search and replacement strings (but cannot incorporate regex features such as word-boundary assertions), see Ed Morton's answer.
If you want a pure bash solution and your input files are small and preserving multiple trailing newlines is not important, see Charles Duffy's answer.
If you want a full-fledged third-party templating solution, consider, for instance, j2cli, a templating CLI for Jinja2 - if you have Python and pip, install with sudo pip install j2cli.
Simple example (note that since the replacement string is provided via a file, this may not be appropriate for sensitive data; note the double braces ({{...}})):
value='dba01upc\Fusion_test'
echo "sara_ftp_username=$value" >data.env
echo '<con:username>{{sara_ftp_username}}</con:username>' >tmpl.xml
j2 tmpl.xml data.env # -> <con:username>dba01upc\Fusion_test</con:username>
If you use sed, careful escaping of both the search and the replacement string is required, because:
As Ed Morton points out in a comment elsewhere, sed doesn't support use of literal strings as replacement strings - it invariably interprets special characters/sequences in the replacement string.
Similarly, the search string literal must be escaped in a way that its characters aren't mistaken for special regular-expression characters.
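As a quick illustration of the replacement-side problem, the sketch below escapes only the three characters that are special in a single-line replacement string (&, / and \). escape_repl is a hypothetical name; it does not handle newlines, and it assumes / is the delimiter of the s command.

```bash
#!/usr/bin/env bash
# escape_repl TEXT: backslash-escape &, / and \ for use as the replacement
# part of s/old/new/ (single-line text only; hypothetical helper).
escape_repl() {
  printf '%s\n' "$1" | sed -e 's/[&/\]/\\&/g'
}

value='dba01upc\Fusion_test'
template='<con:username>{sara_ftp_username}</con:username>'

# The search string here has no special meaning in a basic regex,
# so only the replacement needs escaping in this particular case.
printf '%s\n' "$template" | sed "s/{sara_ftp_username}/$(escape_repl "$value")/g"
```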
The following uses two generic helper functions that perform this escaping (quoting) that apply techniques explained at "Is it possible to escape regex characters reliably with sed?":
#!/usr/bin/env bash
# SYNOPSIS
# quoteRe <text>
# DESCRIPTION
# Quotes (escapes) the specified literal text for use in a regular expression,
# whether basic or extended - should work with all common flavors.
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# '
# SYNOPSIS
# quoteSubst <text>
# DESCRIPTION
# Quotes (escapes) the specified literal string for safe use as the substitution string (the 'new' in `s/old/new/`).
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
# The search string.
search='{sara_ftp_username}'
# The replacement string; a demo value with characters that need escaping.
value='&\1%"'\'';<>/|dba01upc\Fusion_test'
# Use the appropriately escaped versions of both strings.
sed "s/$(quoteRe "$search")/$(quoteSubst "$value")/g" <<<'<el>{sara_ftp_username}</el>'
# -> <el>&\1%"';<>/|dba01upc\Fusion_test</el>
Both quoteRe() and quoteSubst() correctly handle multi-line strings.
Note, however, given that sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
quoteRe() is always safe to use with a command substitution ($(...)), because it always returns a single-line string (newlines in the input are encoded as '\n').
By contrast, if you use quoteSubst() with a string that has trailing newlines, you mustn't use $(...), because the latter will remove the last trailing newline and therefore break the encoding (since quoteSubst() \-escapes actual newlines, the string returned would end in a dangling \).
Thus, for strings with trailing newlines, use IFS= read -d '' -r escapedValue < <(quoteSubst "$value") to read the escaped value into a separate variable first, then use that variable in the sed command.
This can be done with bash builtins alone -- no sed, no awk, etc.
orig='{sara_ftp_username}' # put the original value into a variable
new='dba01upc\Fusion_test' # ...no need to 'export'!
contents=$(<Sara.xml) # read the file's content into a variable
new_contents=${contents//"$orig"/$new} # use parameter expansion to replace
printf '%s' "$new_contents" >Sara.xml # write new content to disk
See the relevant part of BashFAQ #100 for information on using parameter expansion for string substitution.
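The same substitution can be tried on a string in memory, which makes it easy to check that the backslash survives. This is only a sketch of the technique above; the sample line is taken from the question's Sara.xml.

```bash
#!/usr/bin/env bash
orig='{sara_ftp_username}'
new='dba01upc\Fusion_test'
line='<con:username>{sara_ftp_username}</con:username>'

# Quoting "$orig" makes bash treat it as a literal string rather than a
# glob pattern; // replaces every occurrence, not just the first.
result=${line//"$orig"/$new}

printf '%s\n' "$result"
```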

change ampersands in href

I know s/&amp;/\&/g replaces all escaped ampersands with ampersands. I want to be more picky. I want to only replace those escaped ampersands if they are in an href. I can't figure it out.
I was trying the following but it wasn't working:
echo "Link" | sed -E 's/^href="(.*)&amp;/\1&/g'
It didn't work. I also see another problem being it would only do the first instance of an escaped ampersand and not all. Anyone know what the solution might be?
Not sure how to do it with sed, but here's Ruby:
echo 'Link' | ruby -pe '$_.gsub!(/href="([^"]*)"/) { |h| h.gsub("&amp;", "&") }'
However, I fully support @muistooshort's comment: unless you're doing something weird, you should want the &amp; in there.
perl -e '$url=$ARGV[0]; while ( $url =~ s/(Link'
Easily amended to run through a file

Regular Expression to strip comments from Bash script

This is deceptively complex. I need a regular expression to strip comments from Bash shell scripts.
Bear in mind that $#, ${#foo}, string="this # string", string='that # string', ${foo#bar}, ${foo##baar}, and
string="really complex args=$# ${applejack##"jack"} $(echo "$#, again")"; `echo this is a ${#nasty[*]} example`
are all valid shell expressions that should not be stripped.
Edit:
Note that:
# This is a comment in bash
# But so is this
echo "foo bar" # This is also a comment
Edit:
Note that lines that might be misconstrued as comments may be tucked inside HEREDOCs but since it is multi-line I can live without handling/accounting for it:
cat<<EOF>>out.txt
This is just a heredoc
# This line looks like a comment, but it isn't
EOF
You cannot do that with regular expressions.
echo ${baz/${foo/${foo/#bar/foo}/bar}/qux}
You need to match nested braces. Regular expressions can't do that, unless you're willing to consider PCREs "regular expressions", in which case it would be simpler to just write the parser in Perl.
Just for fun ...
I don't believe you can do this without using/implementing a parser but it's fun seeing how far you can get without doing that.
The closest I've gotten is to use a simple regex with sed. It preserves the hash bang, which is a definite must, but can't cope with the HEREDOC. You could go further but then it might not be fun anymore.
Sample bash script (called doit)
#!/bin/bash
#This
# is a
echo $1 #comment
Running that ...
cat doit | sed -e 's/#[^!].*$//'
#!/bin/bash
echo $1
But obviously there are blank lines produced which you don't want AND it doesn't handle HERE docs.
Again, not a serious suggestion but please play around with it.
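To see where this breaks, you can feed the same sed command a couple of the expressions from the question. strip_naive is a made-up wrapper name for the command above.

```bash
#!/usr/bin/env bash
# Wrapper around the sed command shown above (the name is hypothetical).
strip_naive() {
  sed -e 's/#[^!].*$//'
}

# Neither line contains a comment, yet both get truncated at the '#'.
printf '%s\n' 'echo "${#foo}"' 'string="this # string"' | strip_naive
```

The first line comes out as echo "${ and the second loses everything from the # onward, which is the kind of damage the question was worried about.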
EDITED: I admit it! sed won't work for the reasons given in comments - sed doesn't handle lookaheads/lookbehinds. Thanks for pointing that out!
I thought a comment in bash was a line that started with a #. If so, here's your regex:
^#
And here's the sed command that will strip them:
sed -i '' -e 's/^\s*#(?!!).*$//' myfile.sh
EDITED to factor in downvoter's comments: ie
allow whitespace before the # using \s*
exclude lines that have a ! following the # using negative lookahead (?!!)
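Since sed has no lookahead, one way to get a similar effect is to skip shebang lines first and delete the remaining full-line comments afterwards. This is a rough sketch under the assumption that only whole-line comments matter; it leaves trailing comments alone and knows nothing about heredocs. The function name is made up.

```bash
#!/usr/bin/env bash
# strip_line_comments: delete lines that contain only a comment, but keep
# lines starting with '#!' (hypothetical helper; reads stdin).
strip_line_comments() {
  # 'b' without a label jumps to the end of the script, so '#!' lines are
  # printed untouched before the delete rule can see them.
  sed -e '/^[[:space:]]*#!/b' -e '/^[[:space:]]*#/d'
}

printf '%s\n' '#!/bin/bash' '# a comment' '   # indented comment' \
  'echo "foo bar" # trailing comment stays' | strip_line_comments
```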

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything whereas I simply want to print the match in parenthesis -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
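For the restricted case described above (one list element per line, closed on the same line, no nested markup), a small awk program might look like this. It is a sketch under exactly those assumptions, and extract_li is a made-up name.

```bash
#!/usr/bin/env bash
# extract_li [FILE...]: print the text between <li ...> and </li> when both
# are on the same line (hypothetical helper; reads stdin without a FILE).
extract_li() {
  awk '
    # match() sets RSTART and RLENGTH to the position of the opening tag.
    match($0, /<li( [^>]*)?>/) {
      rest = substr($0, RSTART + RLENGTH)   # everything after the opening tag
      stop = index(rest, "</li>")           # first closing tag, if any
      if (stop > 0) print substr(rest, 1, stop - 1)
    }' "$@"
}

printf '%s\n' '<ul>' '<li class="city">Paris</li>' '<li>Rome</li>' '</ul>' | extract_li
```

match(), substr() and index() are all in standard awk, so this does not depend on gawk extensions.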
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Starting from your script, this gets what you want as long as the <li> and <a> tags are on the same line:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
First one is for every awk, second one is for gnu awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
Don't really know awk, how about Perl instead?
tr -d '\012' the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.

How can I make this Perl one-liner to toggle character in line in a file?

I am attempting to write a one-line Perl script that will toggle a line in a configuration file from "commented" to not and back. I have the following so far:
perl -pi -e 's/^(#?)(\tDefaultServerLayout)/ ... /e' xorg.conf
I am trying to figure out what code to put in the replacement (...) section. I would like the replacement to insert a '#' if one was not matched on, and remove it if it was matched on.
pseudo code:
if ( $1 == '#' ) then
print $2
else
print "#$2"
My Perl is very rusty, and I don't know how to fit that into a s///e replacement.
My reason for this is to create a single script that will change (toggle) my display settings between two layouts. I would prefer to have this done in only one script.
I am open to suggestions for alternate methods, but I would like to keep this a one-liner that I can just include in a shell script that is doing other things I want to happen when I change layouts.
perl -pi -e 's/^(#?)(?=\tDefaultServerLayout)/ ! $1 && "#" /e' foo
Note the addition of ?= to simplify the replacement string by using a look-ahead assertion.
Some might prefer s/.../ $1 ? "" : "#" /e.
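If Perl is not a hard requirement, roughly the same toggle can be sketched with sed, which also fits in a shell script. This is an alternative technique, not what the answer above uses; toggle_layout is a made-up name, and the literal tab is built with printf because \t inside a sed pattern is not portable.

```bash
#!/usr/bin/env bash
# toggle_layout: comment or uncomment the DefaultServerLayout line
# (hypothetical helper; reads stdin, writes stdout).
toggle_layout() {
  tab=$(printf '\t')
  # First try to remove a leading '#'. 't' jumps to the end of the script
  # if that succeeded, so the line is not commented out again right away.
  sed -e "s/^#\(${tab}DefaultServerLayout\)/\1/" \
      -e t \
      -e "s/^${tab}DefaultServerLayout/#&/"
}

printf '\tDefaultServerLayout "dual"\n' | toggle_layout
```

To edit xorg.conf itself you would redirect the output to a temporary file and move it into place, since in-place editing flags differ between sed implementations.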