Regular Expression to strip comments from Bash script

This is deceptively complex. I need a regular expression to strip comments from Bash shell scripts.
Bear in mind that $#, ${#foo}, string="this # string", string='that # string', ${foo#bar}, ${foo##baar}, and
string="really complex args=$# ${applejack##"jack"} $(echo "$#, again")"; `echo this is a ${#nasty[*]} example`
are all valid shell expressions that should not be stripped.
Edit:
Note that:
# This is a comment in bash
# But so is this
echo "foo bar" # This is also a comment
Edit:
Note that lines that might be misconstrued as comments may be tucked inside HEREDOCs but since it is multi-line I can live without handling/accounting for it:
cat<<EOF>>out.txt
This is just a heredoc
# This line looks like a comment, but it isn't
EOF

You cannot do that with regular expressions.
echo ${baz/${foo/${foo/#bar/foo}/bar}/qux}
You need to match nested braces. Regular expressions can't do that, unless you're willing to consider PCREs "regular expressions", in which case it would be simpler to just write the parser in Perl.

Just for fun ...
I don't believe you can do this without using/implementing a parser but it's fun seeing how far you can get without doing that.
The closest I've gotten is a simple regex with sed. It preserves the shebang line, which is a definite must, but it can't cope with heredocs. You could go further, but then it might not be fun anymore.
Sample bash script (called doit)
#!/bin/bash
#This
# is a
echo $1 #comment
Running that ...
cat doit | sed -e 's/#[^!].*$//'
#!/bin/bash
echo $1
But obviously there are blank lines produced which you don't want AND it doesn't handle HERE docs.
Again, not a serious suggestion but please play around with it.
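For what it's worth, here is a more conservative sketch that avoids the blank-line problem by deleting only lines that consist entirely of a comment. It uses awk rather than sed, and the function name strip_full_line_comments is made up for illustration. It deliberately leaves trailing comments alone (so $#, ${#foo} and friends are never touched), and it will still wrongly delete a comment-looking line inside a heredoc.

```shell
# Sketch: drop only full-line comments and keep a shebang on line 1.
# Trailing comments and heredoc bodies are deliberately not handled.
strip_full_line_comments() {
  awk '
    NR == 1 && /^#!/ { print; next }   # keep the shebang untouched
    /^[ \t]*#/       { next }          # skip lines that start with optional blanks and a hash
                     { print }         # everything else passes through unchanged
  ' "$@"
}

# Feed it the sample script from above on stdin.
printf '%s\n' '#!/bin/bash' '#This' '# is a' 'echo $1 #comment' | strip_full_line_comments
```

On the sample script this keeps the #!/bin/bash line and the echo line (with its trailing comment) and drops the two comment-only lines.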

EDITED: I admit it! sed won't work for the reasons given in comments - sed doesn't handle lookaheads/lookbehinds. Thanks for pointing that out!
I thought a comment in bash was a line that started with a #. If so, here's your regex:
^#
And here's the sed command that will strip them:
sed -i '' -e 's/^\s*#(?!!).*$//' myfile.sh
EDITED to factor in downvoter's comments: ie
allow whitespace before the # using \s*
exclude lines that have a ! following the # using negative lookahead (?!!)

Related

How do I perform a regex test in bash that starts with spaces and includes quotation marks?

I'm trying to write a bash script that will change the fill color of certain elements within SVG files. I'm inexperienced with shell scripting, but I'm good with regexes (...in JS).
Here's the SVG tag I want to modify:
<!-- is the target because its ID is exactly "the.target" -->
<path id="the.target" d="..." style="fill:#000000" />
Here's the bash code I've got so far:
local newSvg="" # will hold newly-written SVG file content
while IFS="<$IFS" read tag
do
if [[ "${tag}" =~ +id *= *"the\.target" ]]; then
tag=$(echo "${tag}" | sed 's/fill:[^;];/fill:${color};/')
fi
newSvg="${newSvg}${tag}"
done < ${iconSvgPath} # is an argument to the script
Explained: I'm using read (splitting the file on < via a custom IFS) to read the SVG content tag by tag. For each tag, I test whether it includes an id property with the exact value I want. If it doesn't, I add the tag as-is to a newSvg string that I will later write to a file. If the tag does have the desired ID, I use sed to replace fill:STUFF; with fill:${myColor};. (Note that my sed is also failing, but that's not what I'm asking about here.)
It fails to find the right line with the test [[ "${tag}" =~ +id *= *"the\.target" ]].
It succeeds if I change the test to [[ "${tag}" =~ \"the\.target\" ]].
I'm not happy with the working version because it's too brittle. While I don't intend to support all the flexibility of XML, I would like to be tolerant of semantically irrelevant whitespace, as well as the id property being anywhere within the tag. Ideally, the regex I'd like to write would express:
id (preceded by at least one whitespace)
followed by zero or more whitespaces
followed by =
followed by zero or more whitespaces
followed by "the.target"
I think I'm not delimiting the regex properly inside the [[ ... =~ REGEX ]] construction, but none of the answers I've seen online use any delimiters whatsoever. In JavaScript, regex literals are bounded (e.g. / +id *= *"the\.target"/), so it's straightforward to begin a regex with a whitespace character that you care about. Also, JS doesn't have any magic re: *, whereas bash is 50% magic-handling-of-asterisks.
Any help is appreciated. My backup plan is maybe to try to use awk instead (which I'm no better at).
EDIT: My sed was really close. I forgot to add + after the [^;] set. Oof.
It is much easier if you define the regular expression pattern in a variable:
tag=' id = "the.target"'
pattern=' +id *= *"the\.target"'
if [[ $tag =~ $pattern ]]; then
echo matched.
fi
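A slightly fuller sketch of the same idea, assuming bash 3.2 or later. The helper name has_target_id is made up for illustration, and I've swapped the literal spaces for [[:space:]] so tabs and newlines are tolerated too. The last line shows the trap: if you quote the variable on the right of =~, bash matches it as a plain string instead of a regex.

```shell
#!/usr/bin/env bash
# Sketch: keep the regex in a variable and leave it unquoted after =~.
pattern='[[:space:]]+id[[:space:]]*=[[:space:]]*"the\.target"'

# has_target_id is a hypothetical helper: succeeds if the tag text matches.
has_target_id() {
  [[ $1 =~ $pattern ]]
}

has_target_id 'path id = "the.target" d="..."' && echo "matched"
has_target_id 'path id="theXtarget"' || echo "no match"

# Quoting the variable makes bash treat the pattern as a literal string,
# so this test fails even though the tag would match the regex:
[[ ' id="the.target"' =~ "$pattern" ]] || echo "no match when quoted"
```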
Thank you for giving us such a clear example that regex is not the way to solve this problem.
A SVG file is an XML file, and a possible tool to modify these is xmlstarlet.
Try this script I called modifycolor:
#!/bin/bash
# invoke as: modifycolor <svg.file> <target_id> <new_color>
xmlstarlet edit \
--update "//path[@id = '$2']/@style" --value "fill:#$3" \
"$1"
Assuming the svg file is test.svg, invoke it as:
./modifycolor test.svg the.target ff0000
You will be astonished by the result.
If you want to paste a piece of code inside your bash script, try this:
target="the.target"
newSvg=$(xmlstarlet edit \
--update "//path[@id = '${target}']/@style" --value "fill:#${myColor}" \
"${iconSvgPath}")
Thanks to folks for pointing out the mistakes in my bash-fu, I came up with this code which does what I said I wanted. I will not be marking this as the accepted answer because, as folks have observed, regex is a bad way to operate on XML. Sharing this for posterity.
local newSvg="" # will hold newly-written SVG code
while IFS="<$IFS" read tag
do
if [[ "${tag}" =~ \ +id\ *=\ *\"the\.target\" ]]; then
tag=$(echo "${tag}" | sed -E 's/fill:[^;]+;/fill:'"${color}"';/')
fi
newSvg="${newSvg}${tag}"
done < ${iconSvgPath}
Fixes:
escape the whitespace in the regex: =~ \ +id\ *=\ *
for sed, switch to double-quotes for the variable in the pattern
also for sed, I added the -E extended regex flag in order to support the + quantifier after the negated set [^;]
Re: XML, I'll be comparing the list of available CLI-friendly XML parsers to the set of tools commonly available on my users' machines.

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in perl in command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but it's just returning:
![0-9]+\.[0-9]+\.: event not found
I've tried it in regex101 (PCRE) and it works fine, so I'm not sure why it doesn't work here.
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilà:
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
Please see the demo here.
If you want to "increment" a number then you can't hard-code the new value but need to capture what is there and increment that
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here the /e modifier makes the replacement side be evaluated as code, so we can add 1 to the captured number, and the result is then substituted. The \K drops everything matched before it from the match, so we don't need to put it back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.
As an alternate to lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

Using ampersand in sed

I have a csv file full of lines like the following:
Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam
I want to create a sed pattern for in an automated batch script that changes the info in this kind of formatting into the following formatting:
Aity Chel Jenni,Hendaland 30,2591 TE, Amsterdam
With a bit of research, I found out that I had to create a regex, then use an ampersand (&) character to have it change things around using the & to define the location of the regex.
I have tried the following:
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
And have been trying variants of that trying to get the regexes down, but it doesn't seem to change anything.
Am I making a mistake in the usage of the ampersand or is my regex wrong?
Reading through the internet I can't seem to wrap my head around this function, can someone give me any examples/explain to me how to properly do this?
You are saying
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
^
But you don't have to capture with () to use &. Instead, just say:
sed 's/[1-9] [A-Z]\{2\}/&,/' file
Note that you need to escape the braces of the { } quantifier, unless you use -r:
sed -r 's/[1-9] [A-Z]{2}/&,/' file
Try the following:
sed -r 's:[0-9] [A-Z]{2}\b:&,:' file > out
About your own pattern: you're missing the closing parenthesis. And, iirc, in sed's default (basic) syntax an unescaped ( matches a literal parenthesis, so you have to escape it to make it a group.
The -r option enables extended regex in sed, which provides the unescaped {2} quantifier.
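To see what & does, here is a minimal sketch run against the sample line from the question. The function name append_comma_after_postcode is made up, and it uses the basic-regex form with escaped braces so it doesn't depend on -r.

```shell
# Sketch: & in the replacement stands for the whole text the pattern matched.
# append_comma_after_postcode is a hypothetical helper name.
append_comma_after_postcode() {
  # Basic regex: the interval braces are written as \{2\}.
  sed 's/[0-9] [A-Z]\{2\}/&,/'
}

echo 'Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam' | append_comma_after_postcode
# prints: Aity Chel Jenni,Hendaland 229,2591 TE, Amsterdam
```

The pattern matches "1 TE", and & puts that same text back with a comma appended, so nothing has to be captured with parentheses.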

change ampersands in href

I know s/&amp;/\&/g replaces all escaped ampersands with plain ampersands. I want to be more picky. I want to only replace those escaped ampersands if they are in an href. I can't figure it out.
I was trying the following but it wasn't working:
echo "Link" | sed -E 's/^href="(.*)&/\1&/g'
It didn't work. I also see another problem being it would only do the first instance of an escaped ampersand and not all. Anyone know what the solution might be?
Not sure how to do it with sed, but here's Ruby:
echo 'Link' | ruby -pe '$_.gsub!(/href="([^"]*)"/) { |h| h.gsub("&amp;", "&") }'
However, I fully support @muistooshort's comment: unless you're doing something weird, you should want the &amp; in there.
perl -e '$url=$ARGV[0]; while ( $url =~ s/(Link'
Easily amended to run through a file

Using sed to remove all console.log from javascript file

I'm trying to remove all my console.log, console.dir etc. from my JS file before minifying it with YUI (on osx).
The regex I got for the console statements looks like this:
console.(log|debug|info|warn|error|assert|dir|dirxml|trace|group|groupEnd|time|timeEnd|profile|profileEnd|count)\((.*)\);?
and it works if I test it with the RegExr.
But it won't work with sed.
What do I have to change to get this working?
sed 's/___???___//g' <$RESULT >$RESULT_STRIPPED
update
After getting the first answer I tried
sed 's/console.log(.*)\;//g' <test.js >result.js
and this works, but when I add an OR
sed 's/console.\(log\|dir\)(.*)\;//g' <test.js >result.js
it doesn't replace the "logs":
Your original expression looks fine. You just need to pass the -E flag to sed, for extended regular expressions:
sed -E 's/console.(log|debug|info|...|count)\((.*)\);?//g'
The difference between these types of regular expressions is explained in man re_format.
To be honest I have never read that page, but instead simply tack on an -E when things don't work as expected. =)
You must escape ( (for grouping) and | (for oring) in sed's regex syntax. E.g.:
sed 's/console.\(log\|debug\|info\|warn\|error\|assert\|dir\|dirxml\|trace\|group\|groupEnd\|time\|timeEnd\|profile\|profileEnd\|count\)(.*);\?//g'
UPDATE example:
$ sed 's/console.\(log\|debug\|info\|warn\|error\|assert\|dir\|dirxml\|trace\|group\|groupEnd\|time\|timeEnd\|profile\|profileEnd\|count\)(.*);\?//g'
console.log # <- input line, no match, so it is printed unchanged on the next line
console.log
console.log() # <- input line, matches, no printing
console.log(blabla); # <- input line, matches, no printing
console.log(blabla) # <- input line, matches, no printing
console.debug(); # <- input line, matches, no printing
console.debug(BAZINGA) # <- input line, matches, no printing
DATA console.info(ditto); DATA2 # <- input line, matches, printing of expected data
DATA DATA2
HTH
I am also looking for a way to remove all the console.log calls.
I am trying to do this with Python, but my regex does not work.
I wrote it like this:
var re=/^console.log(.*);?$/;
but it matches the whole of the following string:
'console.log(23);alert(234dsf);'
Does it work with the following?
"s/console.(log|debug|info|...|count)((.*));?//g"
I tried this:
sed -E 's/console.(log|debug|info)( ?| +)\([^;]*\);//g'
See the test:
Regex Tester
Here's my implementation
for i in $(find ./dir -name "*.js")
do
sed -E 's/console\.(log|warn|error|assert..timeEnd)\((.*)\);?//g' "$i" > "${i}.copy" && mv "${i}.copy" "$i"
done
took the sed thing from github
I was feeling lazy and hoping to find a script to copy & paste. Alas there wasn't one, so for the lazy like me, here is mine. It goes in a file named something like 'minify.sh' in the same directory as the files to minify. It will overwrite the original file and it needs to be executable.
#!/bin/bash
for f in *.js
do
sed -Ei 's/console.(log|debug|info)\((.*)\);?//g' "$f"
yui-compressor "$f" -o "$f"
done
I'd just like to add here that I was running into issues with namespaced console.logs such as window.console.log. Also Tweenmax.js has some interesting uses of console.log in some parts such as
window.console&&console.log(t)
So I used this
sed -i.bak s/[^\&a-zA-Z0-9\.]console.log\(/\\/\\//g js/combined.js
The regex effectively says replace all console.logs that don't start with &, alphanumerics, and . with a '//' comment, which uglify later takes out.
Rodrigocorsi's works with nested parentheses. I added a ? after the ; because yuicompressor was omitting some semicolons.
The probable reason this is not working is that you are not limiting the regex
so that it excludes a closing parenthesis ()) from the method parameters.
Try this regular expression:
console\.(log|trace|error)\(([^)]+)\);
Remember to include the rest of your method names in the capture group.
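Here is a rough side-by-side of the greedy and the bounded pattern, as a sketch only. The function names are made up, the list of console methods is shortened, and I've used [^)]* instead of [^)]+ so that an empty console.log() is also removed.

```shell
# Greedy version: .* runs on to the last ) on the line.
strip_console_greedy() {
  sed -E 's/console\.(log|debug|info|warn|error)\(.*\);?//g'
}

# Bounded version: [^)]* stops at the first ).
strip_console_bounded() {
  sed -E 's/console\.(log|debug|info|warn|error)\([^)]*\);?//g'
}

line='console.log(23);alert(234);'
echo "$line" | strip_console_greedy    # prints an empty line, the alert is gone too
echo "$line" | strip_console_bounded   # prints alert(234);
```

The bounded version has the opposite weakness: on nested parentheses such as console.log(f(x)); it stops at the first ) and leaves a stray ); behind, so pick whichever failure mode suits your code base.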