AWK regex for gsubs pattern - regex

I am trying to define a gsub awk statement to find all non escaped $ chars and escape them.
so following input -> results should be handled:
$ -> \$
$a -> \$a
\$ -> \$
$$$ -> \$\$\$
So basically I am looking for the correct pattern to put in this statement:
gsub(pattern,"\\\$", input_string);

With $ as the field separator, awk splits the input on every $. The loop then appends a backslash to each field except the last, but only if the field is empty or ends in a character other than a backslash.
$ cat file
$
$a$$$
\$
$$$\$$
$ awk -F$ -v OFS="$" '{for(i=1;i<NF;i++){if($i == "" || $i ~/[^\\]$/) $i=$i"\\"}}1' file
\$
\$a\$\$\$
\$
\$\$\$\$\$
You could also try the below perl solution.
perl -pe 's/(?<!\\)\$/\\\$/g' file
The substitution matches all $ that are not preceded by backslash and adds a backslash before them. The backslashes themselves all need escaping, as does the $, as it has a special meaning in regular expressions.
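awk and sed have no lookbehind, so one possible workaround is to make two passes: first drop the backslash from every already-escaped \$, then escape every $. This is only a sketch, not taken from the answers above. The function name escape_dollars is made up for illustration, and it assumes the input never contains an escaped backslash directly before a $ (such as \\$).

```shell
# Hypothetical helper: escape every $ that is not already escaped.
# Pass 1 turns each \$ back into $, pass 2 turns every $ into \$.
# Assumption: the input has no escaped backslash right before a $.
escape_dollars() {
  sed -e 's/\\\$/$/g' -e 's/\$/\\$/g'
}

# The four cases from the question, one per line.
printf '%s\n' '$' '$a' '\$' '$$$' | escape_dollars
```

The same two-pass idea should carry over to two consecutive gsub calls in awk, though how backslashes in the replacement string are handled can differ between awk implementations.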

Related

Replace all non-alphanumeric characters in a string with an underscore

I want to replace special characters (regex \W) with _ (underscore)
But I don't want to replace whitespace with underscore
Also replace multiple consecutive special characters with single underscore
Example
String: The/Sun is red#
Output: The_Sun is red_
String: .//hack Moon
Output: _hack Moon
I have tried echo 'string' | sed 's/\W/_/g'
But it's not accurate
Use tr for that:
echo -n "The/Sun is red#" | tr -c -s '[:alnum:][:blank:]' '_'
[:alnum:][:blank:] represents alphanumeric characters plus horizontal whitespace (space and tab)
-c (or --complement) means "use the opposite of that"
Use -s (or --squeeze-repeats) to squeeze duplicate underscores into one
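A small sketch of the tr command wrapped in a function (the name squash_specials is hypothetical). It uses printf '%s' rather than echo -n so that no trailing newline reaches tr: a newline is neither alphanumeric nor blank, so it would itself be turned into an underscore.

```shell
# Hypothetical helper: replace each run of characters that are neither
# alphanumeric nor blank with a single underscore.
squash_specials() {
  printf '%s' "$1" | tr -c -s '[:alnum:][:blank:]' '_'
}

# The two examples from the question; echo adds the newline back.
squash_specials 'The/Sun is red#'; echo
squash_specials './/hack Moon'; echo
```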
sed approach:
s="The/Sun is red# .//hack Moon"
sed -E 's/[^[:alnum:][:space:]]+/_/g' <<<"$s"
The_Sun is red_ _hack Moon
[^[:alnum:][:space:]]+ - match any character sequence except alphanumeric and whitespace
Just with bash parameter expansion, similar pattern to other answers:
shopt -s extglob
for str in "The/Sun is red#" ".//hack Moon"; do
echo "${str//+([^[:alnum:][:blank:]])/_}"
# .........^^........................^ replace all
# ...........^^.....................^ one or more
# .............^^^^^^^^^^^^^^^^^^^^^ non-alnum, non-space character
done
The_Sun is red_
_hack Moon

line anchor behavior with perl regex

I recently wrote a little Perl script to trim whitespace from the end of lines and ran into unexpected behavior. I decided that Perl must include line-end characters when breaking up lines, so I tested that theory and got even more unexpected behavior. "I do not" should match either \s+$ or t$, not both. Very confused. Can anyone enlighten me?
£ cat example
I have a space after me
I do not
£ perl -ne 'print if /\s+$/' example
I have a space after me
I do not
£ perl -ne 'print if /t$/' example
I do not
£
PCRE tester gives expected results. I've also tried the /m suffix with no change in behavior.
edit. for completeness:
£ perl -ne 'print if /e$/' example
£
Expected behavior from perl -ne 'print if...' was the same as grep -P:
£ grep -P '\s+$' example
I have a space after me
£
Can repro under Ubuntu 16.04 perl v5.22.1 (both 60 and 68 patch version) and MINGW perl v5.26.1.
You see this behavior because the second line of the example file ends with a \n character, and \n is whitespace, so \s matches it.
perlretut
no modifiers: Default behavior. ... '$' matches only at the end or before a newline at the end.
In your regex, \s matches a whitespace character, the set [\ \t\v\r\n\f]. In other words, it matches spaces as well as the \n character. Then $ matches the end of the line (no characters, just the position itself), just as the word anchor \b matches a word boundary and ^ matches the beginning of the line rather than its first character.
You could rewrite your regex like this:
/[\t ]+$/
The content of example would look like this if second line didn't end with a \n character:
£ cat example
I have a space after me
I do not£
Notice that the shell prompt £ is not on the next line.
The results are different because grep abstracts out line endings like Perl's -l flag. (grep -P '\n' will return no results on a text file where grep -Pz '\n' will.)
Your problems stem from the -n option and the use of \s. The -n flag feeds the input to Perl line by line into $_, then it calls the print if match statement.
In your match you use the $ anchor to match the end of the line. The anchor is purely positional and does not consume the newline or any other character.
You can check this yourself with \s+ in a regex tester: whether you add a $ or not, the regex matches the same number of characters.
This is because \s is equal to [\r\n\t\f\v ] and matches any whitespace character, and you have added the + quantifier. So it matches between one and unlimited times, as many times as possible (greedy).
If you search just for trailing space characters instead, you are fine: [ ]+$ (the space is written here as a character class):
£ perl -ne 'print if /[ ]+$/' example
That way it does not match the \n like \s does.
Bonus:
Here are some common Perl one-liners to trim spaces:
# Strip leading whitespace (spaces, tabs) from the beginning of each line
perl -ple 's/^[ \t]+//'
perl -ple 's/^\s+//'
# Strip trailing whitespace (space, tabs) from the end of each line
perl -ple 's/[ \t]+$//'
# Strip whitespace from the beginning and end of each line
perl -ple 's/^[ \t]+|[ \t]+$//g'

Extract strings that lie outside the brackets using sed or awk

I have a string of the format abc(something that should be removed) is bad(another thing to remove): basically a string that has some words which are not in parentheses and some that are. I want to extract the words that are not parenthesized. For example, for the string above the output should be abc is bad.
You could try the below sed command,
sed 's/([^()]*)//g' file
Example:
$ cat file
abc(something that should be removed) is bad(another thing to remove)
$ sed 's/([^()]*)//g' file
abc is bad
By default sed uses BRE (Basic Regular Expressions), so you don't need to escape ( or ) to match literal parentheses.
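To illustrate the BRE point, here is a sketch comparing the command above with an extended-regex version. It assumes a sed that accepts -E (GNU and BSD sed do); the function names are hypothetical.

```shell
# BRE (the default): bare ( and ) are literal characters.
strip_parens_bre() { sed 's/([^()]*)//g'; }

# ERE (-E): ( and ) form groups, so the literal ones must be backslashed.
strip_parens_ere() { sed -E 's/\([^()]*\)//g'; }

line='abc(something that should be removed) is bad(another thing to remove)'
printf '%s\n' "$line" | strip_parens_bre
printf '%s\n' "$line" | strip_parens_ere
```

Both calls should print abc is bad.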

UNIX grep with $

I have a quick question:
Suppose I have a file contains:
abc$
$
$abc
and then I use grep "c\$" filename, I get $abc only. But if I use grep "c\\$" filename, I get abc$.
I am pretty confused: doesn't the backslash already turn off the special meaning of $? So shouldn't grep "c\$" filename return the line abc$?
I really hope someone can kindly give me some suggestions.
Many thanks in advance.
The double quotes are throwing you off. That allows the shell to expand meta-characters. On my Linux box using single quotes only:
$ grep 'abc$' <<<'abc$'
$ grep 'abc\$' <<<'abc$'
abc$
$ grep 'abc\$' <<<"abc$"
abc$
$ grep 'abc$' <<<'abc$'
$ grep 'abc\\$' <<<'abc$'
$
Note that the only pattern in the five commands above that matched (and printed the line) was abc\$. If I didn't escape the $, grep assumed I was looking for the string abc anchored to the end of the line. When I put a single backslash before the $, it recognized the $ as a literal character and not as an end-of-line anchor.
Note that the $ as an end of line anchor has some intelligence. If I put the $ in the middle of a regular expression, it's a regular character:
$ grep 'a$bc' <<<'a$bc'
a$bc
$ grep 'a\$bc' <<<'a$bc'
a$bc
Here, it found the literal string a$bc whether or not I escaped the $.
Tried things with double quotes:
$ grep "abc\$" <<<'abc$'
$ grep "abc\\$" <<<'abc$'
abc$
With a single \, the shell consumes the backslash and grep receives abc$, where $ is the end-of-line anchor. With two \\, the shell passes abc\$ to grep, which then treats the $ as a regular-expression literal.
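One way to see what grep actually receives is to print the argument instead of running grep. A minimal sketch using the three-line file from the question; show_arg and sample are names invented for this example.

```shell
# Print an argument exactly as a command such as grep would receive it.
show_arg() { printf '%s\n' "$1"; }

# The three-line file from the question.
sample() { printf '%s\n' 'abc$' '$' '$abc'; }

show_arg "c\$"     # double quotes, one backslash:  grep would get c$
show_arg "c\\$"    # double quotes, two backslashes: grep would get c\$
show_arg 'c\$'     # single quotes, one backslash:  grep would get c\$

sample | grep "c\$"     # pattern c$  : lines ending in c
sample | grep "c\\$"    # pattern c\$ : lines containing a literal c$
```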
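You might think that $ always needs to be escaped, but in a basic regular expression it is only special at the end of the pattern; elsewhere it matches itself.
The GNU grep manual describes a different group of characters:
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).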
I would suggest using fgrep if you want to search for literal $ and avoid escaping $ (which means end of line):
fgrep 'abc$' <<< 'abc$'
gives this output:
abc$
PS: fgrep is the same as grep -F; from man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
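A short sketch of the difference on the question's three-line file, written with grep -F rather than fgrep; the sample function is made up for this example.

```shell
# The three-line file from the question.
sample() { printf '%s\n' 'abc$' '$' '$abc'; }

# Regex mode: the trailing $ is an anchor, so this finds lines ending in c.
sample | grep 'c$'

# Fixed-string mode: $ is an ordinary character, so this finds a literal c$.
sample | grep -F 'c$'
```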
The $ sign has a special meaning in regexp patterns: end of line. So when you use double quotes
grep "c\$"
the shell expands the string to the two characters c and $, and grep reads that as the regexp "find all lines with c at the end".
With single quotes, every character is passed through as is, i.e. the
grep 'c\$'
command passes three characters: c, \ and $. grep receives all of them, so it sees the $ escaped as \$ and does what you expected.

using sed to replace ^[(s3B with blank space

I'm trying to use sed with perl to replace ^[(s3B with an empty string in several files.
s/^[(s3B// isn't working though, so I'm wondering what else I could try.
You need to quote the special characters:
$ echo "^[(s3B AAA ^[(s3B"|sed 's/\^\[[(]s3B//g'
AAA
$ echo "^[(s3B AAA ^[(s3B" >file.txt
$ perl -p -i -e 's/\^\[[(]s3B//g' file.txt
$ cat file.txt
AAA
The problem is that there are several characters that have a special meaning in regular expressions. ^ is a start-of-line anchor, [ opens a character class, and ( opens a capture.
You can escape all non-alphanumerics in a Perl string by preceding it with \Q, so you can safely use
s/\Q^[(s3B//
which is equivalent to, and more readable than
s/\^\[\(s3B//
If you're dealing with ANSI sequences (xterm color sequences, escape sequences), then ^[ is not '^' followed by '[' but rather an unprintable character ESC, ASCII code 0x1B.
To put that character into a sed expression you need to use \x1B in GNU sed, or see http://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/ . You can also insert special characters directly into your command line using ctrl+v in Bash line editing.
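If the file really does contain the ESC byte, a sketch like the following avoids the GNU-only \x1B escape by letting printf produce the byte and splicing it into the sed expression. The function name strip_font_seq is hypothetical, and the sample input is constructed for illustration.

```shell
# Hypothetical helper: delete every ESC ( s 3 B sequence from stdin.
strip_font_seq() {
  esc=$(printf '\033')      # the real ESC byte, 0x1B
  sed "s/${esc}(s3B//g"     # ( is literal in a basic regular expression
}

# Sample input with a real ESC byte before each (s3B.
printf '\033(s3B AAA \033(s3B\n' | strip_font_seq
```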
In a regex, "^", "[" and "(" (and many others) are special characters used for special regex features; if you are referring to the characters themselves, you should precede them with "\".
The correct substitution regex would be:
$string =~ s/\^\[\(s3B//g
if you want to replace all occurrences.