How to match any non-whitespace character except a particular one?

In Perl \S matches any non-whitespace character.
How can I match any non-whitespace character except a backslash \?

You can use a character class:
/[^\s\\]/
matches any character that is neither whitespace nor a backslash. Here's another example:
[abc] means "match a, b or c"; [^abc] means "match any character except a, b or c".
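As a minimal sketch, the same negated character class can be tried in Python's re module (Python is an assumption here; the answer itself is about Perl, and the sample string is made up for illustration):

```python
import re

# Negated class: any one character that is neither whitespace nor a backslash.
NOT_SPACE_OR_BACKSLASH = re.compile(r"[^\s\\]")

# Hypothetical sample containing a space, a backslash and a tab.
sample = "a b\\c\td"

matches = NOT_SPACE_OR_BACKSLASH.findall(sample)
print(matches)  # ['a', 'b', 'c', 'd']
```

Only the letters survive; the space, the tab and the backslash are all rejected by the class.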

You can use a lookahead:
/(?=\S)[^\\]/
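A small sketch, again in Python's re module rather than Perl (an assumption), suggesting that the lookahead form and the plain negated class pick out the same characters on a made-up sample:

```python
import re

char_class = re.compile(r"[^\s\\]")     # one negated class
lookahead = re.compile(r"(?=\S)[^\\]")  # "is non-whitespace" AND "is not a backslash"

# Hypothetical sample: letters, a space, a backslash and a newline.
sample = "x y\\z\n"

class_hits = char_class.findall(sample)
lookahead_hits = lookahead.findall(sample)
print(class_hits, lookahead_hits)  # ['x', 'y', 'z'] ['x', 'y', 'z']
```

The lookahead checks the next character without consuming it, so the two conditions are applied to the same single character.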

This worked for me using sed (edit: a comment below points out that sed does not support \s here; inside a bracket expression it is read as a literal backslash and the letter s):
[^ ]
while
[^\s]
did not.
# Delete everything except space and 'g'
echo "ghai ghai" | sed "s/[^\sg]//g"
gg
echo "ghai ghai" | sed "s/[^ g]//g"
g g
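The two sed results above can be imitated in Python as a rough sketch. This is an emulation, not sed itself: Python's re does treat \s inside brackets as whitespace, so the backslash is doubled below to reproduce the literal "backslash, s, g" reading that a POSIX bracket expression gives [^\sg].

```python
import re

text = "ghai ghai"

# POSIX-style reading of [^\sg]: everything except backslash, 's' and 'g'.
# The space is not excluded, so it gets deleted along with the other letters.
sed_like = re.sub(r"[^\\sg]", "", text)

# [^ g]: everything except a space and 'g', so the space is kept.
space_and_g = re.sub(r"[^ g]", "", text)

print(sed_like)     # gg
print(space_and_g)  # g g
```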

On my system (CentOS 5):
I can use \s outside of bracket expressions but have to use [:space:] inside them. In fact, [:space:] works only inside a bracket expression, so to match a single whitespace character with it I have to write [[:space:]], which is really strange.
echo a b cX | sed -r "s/(a\sb[[:space:]]c[^[:space:]])/Result: \1/"
Result: a b cX
The first space is matched with \s.
The second space is matched with [[:space:]] instead.
The X is matched with "anything but whitespace", [^[:space:]].
These two will not work:
a[:space:]b (use a\sb or a[[:space:]]b instead)
a[^\s]b (use a[^[:space:]]b instead)

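If you are using regular expressions in bash, grep or a similar tool rather than in Perl, \S may not be available to match non-whitespace characters. Written out as an explicit set, the equivalent of \S is [^\r\n\t\f\v ] (this assumes the tool understands backslash escapes inside brackets; strict POSIX bracket expressions treat the backslash literally).
So, instead of this:
[^\s\\]
...you would write this, which excludes the whitespace characters (regex: \r\n\t\f\v and a space) and the backslash (\; regex: \\):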
[^\r\n\t\f\v \\]
References:
[my answer] Unix & Linux: Any non-whitespace regular expression
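As a hedged illustration, the explicit set can be compared with the \s shorthand in Python's re module (an assumption; the answer is about bash and grep). Python understands \r, \n, \t, \f and \v inside brackets, and on the ASCII sample below both patterns keep the same characters. Python's \s also covers some further Unicode whitespace, so the equivalence is only claimed for this sample.

```python
import re

shorthand = re.compile(r"[^\s\\]")
explicit = re.compile(r"[^\r\n\t\f\v \\]")

# Hypothetical sample with tab, newline, backslash, space, CR, FF and VT.
sample = "a\tb\nc\\d e\r\f\v"

shorthand_hits = shorthand.findall(sample)
explicit_hits = explicit.findall(sample)
print(shorthand_hits)  # ['a', 'b', 'c', 'd', 'e']
print(explicit_hits)   # ['a', 'b', 'c', 'd', 'e']
```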

In this case, it's easier to restate "non-whitespace without the backslash" as "not whitespace or backslash", as the accepted answer shows:
/[^\s\\]/
However, for trickier problems, the regex set feature might be handy. You can perform set operations on character classes to get what you want. This one subtracts the set that is just the backslash from the set that is the non-whitespace characters:
use v5.18;
use experimental qw(regex_sets);
my $regex = qr/abc(?[ [\S] - [\\] ])/;
while( <DATA> ) {
chomp;
say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
}
__DATA__
abcd
abc d
abc\d
abcxyz
abc\\xyz
The output shows that neither whitespace nor the backslash matches after c:
[abcd] Matched
[abc d] Missed
[abc\d] Missed
[abcxyz] Matched
[abc\\xyz] Missed
This gets more interesting when the larger set would be difficult to express gracefully and set operations can refine it. I'd rather see the set operation in this example:
[b-df-hj-np-tv-z]
(?[ [a-z] - [aeiou] ])
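Python's re module has no set-subtraction syntax, so the sketch below swaps in a different technique, a negative lookahead, to get the same effect as the consonant example. It is an illustration under that assumption, not the Perl regex_sets feature itself.

```python
import re
import string

# Hand-written ranges covering the consonants.
consonant_ranges = re.compile(r"[b-df-hj-np-tv-z]")

# "A lowercase letter that is not a vowel": [a-z] minus [aeiou] via lookahead.
subtraction = re.compile(r"(?![aeiou])[a-z]")

letters = string.ascii_lowercase
by_ranges = consonant_ranges.findall(letters)
by_lookahead = subtraction.findall(letters)

print("".join(by_lookahead))  # bcdfghjklmnpqrstvwxyz
```

Both patterns select the same 21 letters; the lookahead version states the intent ("letters except vowels") more directly.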

Related

Replace "advanced" pattern in sed

I can't figure out how to change this:
\usepackage{scrpage2}
\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
to this using sed only
REPLACED
REPLACED REPLACEDREPLACEDREPLACED
REPLACED
I'm trying things like sed 's!\\.*\([.*]\)\?{.\+}!REPLACED!g' FILE
but that gives me
REPLACED
REPLACED
REPLACED
I think .* gets used and everything else in my pattern is just ignored, but I can't figure out how to go about this.
After I learned how to format a regex like that, my next step would be to change it to this:
\usepackage{scrpage2}
\usepackage{pgf}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
So I would appreciate any pointers in that direction too.
Here's some code that happens to work for the example you gave:
sed 's/\\[^\\[:space:]]\+/REPLACED/g'
I.e. match a backslash followed by one or more characters that are not whitespace or another backslash.
To make things more specific, you can use
sed 's/\\[[:alnum:]]\+\(\[[^][]*\]\)\?{[^{}]*}/REPLACED/g'
I.e. match a backslash followed by one or more alphanumeric characters, followed by an optional [ ] group, followed by a { } group.
The [ ] group matches [, followed by zero or more non-bracket characters, followed by ].
The { } group matches {, followed by zero or more non-brace characters, followed by }.
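The more specific pattern can be sketched in Python's re module (an assumption; the answer uses sed). Here [[:alnum:]] is approximated by the ASCII set [A-Za-z0-9], and the brackets inside the negated class are escaped for readability.

```python
import re

# backslash, command name, optional [ ] group, then a { } group
USEPACKAGE = re.compile(r"\\[A-Za-z0-9]+(?:\[[^\[\]]*\])?\{[^{}]*\}")

line = r"\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}"
replaced = USEPACKAGE.sub("REPLACED", line)
print(replaced)  # REPLACED REPLACEDREPLACEDREPLACED
```

Because neither group can run past its own closing delimiter, each \usepackage command is matched separately even when several share a line.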
Perl to the rescue! It features the "frugal quantifiers":
perl -pe 's!\\.*?\.?{.+?}!REPLACED!g' FILE
Note that I removed the capturing group as you didn't use it anywhere. Also, [.*] matches either a dot or an asterisk, but you probably wanted to match a literal dot instead.
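To illustrate the difference between greedy and frugal (non-greedy) quantifiers, here is a small Python sketch. Python is an assumption, and the pattern is simplified by leaving out the optional dot from the Perl one-liner.

```python
import re

line = r"\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}"

# Greedy: .* and .+ grab as much as possible, so one match spans the whole line.
greedy = re.sub(r"\\.*\{.+\}", "REPLACED", line)

# Frugal: .*? and .+? stop at the first place the rest of the pattern can match.
frugal = re.sub(r"\\.*?\{.+?\}", "REPLACED", line)

print(greedy)  # REPLACED
print(frugal)  # REPLACED REPLACEDREPLACEDREPLACED
```

The greedy result reproduces the single REPLACED per line seen in the question.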

Regex command is replacing two characters instead of one

I am attempting to replace the spaces in my string with an underscore. With my limited coding experience, I have come up with this:
s/\b[ ]\D/_/g
This command finds all of the appropriate places in my file; however, it replaces the space and the following character rather than only the space. How can I ensure it replaces only the whitespace and no additional characters?
Also, I would not like this to affect number characters (hence the \D).
The regex \b[ ]\D (which could also be written as \b \D, by the way) matches the space and the following non-digit character, so that's what's replaced with an underscore.
There are two (well, there are more, but these two are the straightforward ones) ways to go about fixing this in Perl:
With a capture group and back reference:
s/\b (\D)/_$1/g
Here the regex still matches the space and the non-digit character, but the non-digit character is captured as $1 and reused in the replacement.
With a lookahead zero-length assertion:
s/\b (?=\D)/_/g
(?=\D) matches the empty string if (and only if) it is followed by something matching \D, so the non-digit character is no longer part of the match and is not replaced.
Addendum: By the way, I suspect you meant to use \b\D instead of just \D. \D also matches spaces (because they are not digits), so with two spaces between bar and baz:
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\D)/_/g'
foo 123_bar_ baz_qux
as opposed to
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\b\D)/_/g'
foo 123_bar  baz_qux
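The original problem and both fixes can be reproduced in Python's re module as a minimal sketch (Python is an assumption; the answer is about Perl, and the sample string is made up):

```python
import re

text = "foo 123 bar baz"

# Original pattern: the non-digit after the space is part of the match and is lost.
buggy = re.sub(r"\b \D", "_", text)

# Fix 1: capture the non-digit and put it back in the replacement.
with_group = re.sub(r"\b (\D)", r"_\1", text)

# Fix 2: only look ahead at the non-digit, so it is never consumed.
with_lookahead = re.sub(r"\b (?=\D)", "_", text)

print(buggy)           # foo 123_ar_az
print(with_group)      # foo 123_bar_baz
print(with_lookahead)  # foo 123_bar_baz
```

The space before 123 is left alone in every case because it is followed by a digit.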
Try
s/\s/_/g
\s matches any single whitespace character.
If you are worried about runs of adjacent whitespace, use \s+
The + means one or more whitespace characters.
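A short Python sketch (an assumption; the answer is about Perl) showing how \s and \s+ differ on a made-up string with a double space and a tab:

```python
import re

text = "a  b\tc"  # two spaces, then a tab

one_each = re.sub(r"\s", "_", text)   # every whitespace character becomes "_"
collapsed = re.sub(r"\s+", "_", text) # each run of whitespace becomes one "_"

print(one_each)   # a__b_c
print(collapsed)  # a_b_c
```

Note that, unlike the pattern in the question, both of these also replace whitespace that precedes a digit.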

\S Regular Expression

Can someone give me an example of \S in a regular expression working? My understanding is that it should match any line that does not begin with \t, \n, etc.
If this is my file:
test
\ttesting
cat testfile | awk '/\S/ {print}'
Produces no output but I'd expect it to output the \ttesting. I haven't found a good example of what \S is supposed to do or how to get it to work.
As written, /\S/ matches if there is a non-whitespace character anywhere in the line. Thus it matches both lines. It sounds like you want to match on the beginning of the line:
$ cat testfile | awk '/^\S/ {print}'
test
$ cat testfile | awk '/^\s/ {print}'
testing
The caret ^ matches only at the beginning of a line. In the first example above, /^\S/ matches any line whose first character is a non-whitespace character. Thus, it matches the first line in your test file.
The second example does the opposite: it matches if the first character after the start of the line is a whitespace character (\s is the opposite of \S: it matches whitespace). Thus, it matches the line that starts with a tab.
The behavior of \S and \s is documented in section 3.5 of the GNU awk manual, which states:
\s
Matches any whitespace character. Think of it as shorthand for [[:space:]].
\S
Matches any character that is not whitespace. Think of it as shorthand for [^[:space:]].
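The difference between an unanchored and an anchored \S can be sketched in Python's re module (an assumption; the answer uses awk). The two lines below mirror the test file, assuming the second line starts with a real tab character.

```python
import re

lines = ["test", "\ttesting"]

# \S anywhere in the line: both lines contain a non-whitespace character.
anywhere = [l for l in lines if re.search(r"\S", l)]

# ^\S: the first character must be non-whitespace.
starts_with_non_space = [l for l in lines if re.search(r"^\S", l)]

# ^\s: the first character must be whitespace.
starts_with_space = [l for l in lines if re.search(r"^\s", l)]

print(anywhere)               # ['test', '\ttesting']
print(starts_with_non_space)  # ['test']
print(starts_with_space)      # ['\ttesting']
```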
I do not think the \S escape is supported in all implementations of awk. It is not listed under Regular Expression Operators in the documentation, so your version of awk may or may not support it.
Another simple command-line tool that does support it is grep. For your purposes, though, you only want to match non-whitespace at the beginning of the line, so you need the ^ anchor:
cat testfile | grep '^\S'
Output:
test
\S
When the UNICODE flag is not specified, matches any non-whitespace
character; this is equivalent to the set [^ \t\n\r\f\v]. The LOCALE
flag has no extra effect on non-whitespace match. If UNICODE is set,
then any character not marked as space in the Unicode character
properties database is matched.
https://docs.python.org/2/library/re.html
Here is the sample:
cat -A file
sdf$
$
test$
^Itesting$
$
$
^I^I^I^I$
asdf$
afd afd$
Running this in GNU awk 4.1:
awk '/\S/' file
sdf
test
testing
asdf
afd afd
It removes all empty lines and whitespace-only lines (lines containing only spaces, tabs, etc.).
Here is my awk version in Cygwin:
awk --version |head -1
GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)
Reference: The GNU Awk User's Guide
3.5 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\s
Matches any whitespace character. Think of it as shorthand for [[:space:]].
\S
Matches any character that is not whitespace. Think of it as shorthand for [^[:space:]].
\w
Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].
\W
Matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].
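The filtering shown with awk '/\S/' can be sketched in Python (an assumption; the answer uses gawk). The list mirrors the sample file, with real tabs where cat -A shows ^I.

```python
import re

lines = ["sdf", "", "test", "\ttesting", "", "", "\t\t\t\t", "asdf", "afd afd"]

# Keep a line only if it contains at least one non-whitespace character.
kept = [l for l in lines if re.search(r"\S", l)]

print(kept)  # ['sdf', 'test', '\ttesting', 'asdf', 'afd afd']
```

Empty lines and the tabs-only line are dropped; the line with a leading tab is kept because it also contains letters.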
\S matches everything that \s excludes.
\s means [\r\n\t\f ], so be careful: to match strings that begin with any of \r, \t, \n or \f (or a space) you need ^\s, and to skip strings that begin with \t you need ^\S.
In short, \S is NOT \s.
As you might guess, \s and \S together cover everything: [\s\S] matches any single character, including a newline, so [\s\S]* behaves like .* with . allowed to match newlines.
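A hedged Python sketch of how \s and \S together relate to the dot (Python's re is an assumption; engines differ in whether . matches a newline by default):

```python
import re

text = "a\nb"  # contains a newline

# By default . does not match a newline in Python, so .* cannot span the text.
dot_default = re.fullmatch(r".*", text)

# [\s\S] matches any character at all, newline included.
either_class = re.fullmatch(r"[\s\S]*", text)

# With DOTALL, . also matches newlines and behaves like [\s\S].
dot_all = re.fullmatch(r".*", text, re.DOTALL)

print(dot_default is None)       # True
print(either_class is not None)  # True
print(dot_all is not None)       # True
```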

Wildcard beginning of a line in perl

How to use wildcard for beginning of a line?
Example, I want to replace abc with def.
This is what my file looks like
abc
abc
abc
hg abc
Now I want abc to be replaced only in the first 3 lines. How do I do that?
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
What condition to be put before beginning of first space?
Thanks
What about:
s/(^ *)abc/$1def/g
(^ *) -> zero or more spaces at the start of the line
This will strictly replace abc with def.
Also note I've used a real space and not \s because you said "beginning of first space". \s matches more characters than only space.
You are making a couple of mistakes in your regex
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
You don't need /g (global, match as many times as possible) if you only want to replace from the beginning of the string (since that can only match once).
Inside a character class, all characters are literal except ], -, ^ and the backslash, so ['\s'] means "match whitespace or an apostrophe '"
Spaces inside the regex are interpreted literally, unless the /x modifier is used (which it is not)
Quantifiers apply to whatever they immediately follow, so \s* means "zero or more whitespace", but \s * means "exactly one whitespace, followed by zero or more spaces". Again, unless /x is used.
You do not need to supply $_ =~, since that is the variable any regex uses unless otherwise specified.
If you want to replace abc, and only abc when it is the first non-whitespace in a line, you can do this:
s/^\s*\Kabc/def/
An alternate for the \K (keep) escape is to capture and put back
s/^(\s*)abc/$1def/
If you want to keep the whitespace following the target string abc, you do not need to do anything. If you want it removed, just add \s* at the end
s/^\s*\Kabc\s*/def/
Also note that this is simply a way to condense logic into one statement. You can also achieve the same by using very simple building blocks:
if (/^\s*abc/) { # if abc is the first non-whitespace
s/abc/def/; # ...substitute it
}
Since the substitution happens only once (if the /g modifier is not used), and only the first match is affected, this will reliably replace abc with def.
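Python's re module has no \K, so the sketch below uses the capture-and-put-back variant. Python is an assumption (the answer is about Perl), and the sample lines are made up, with leading whitespace on some of them.

```python
import re

lines = ["abc", "  abc", "\tabc", "hg abc"]

# Replace abc only when it is the first non-whitespace text on the line,
# keeping whatever leading whitespace was captured in group 1.
result = [re.sub(r"^(\s*)abc", r"\1def", l) for l in lines]

print(result)  # ['def', '  def', '\tdef', 'hg abc']
```

The last line is untouched because hg stands between the start of the line and abc.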
Try this:
$_ =~ s/^['\s'] * abc ['\s'] * /def/g;
If you need to check from start of a line then use ^.
Also, I am not sure why you have ' and spaces in your regex. This should also work for you:
$_ =~ s/^[\s]*abc[\s]*/def/g;
Use the ^ character, and remove the unnecessary apostrophes, spaces and [ ]:
$_ =~ s/^\s*abc/def/g
If you want to keep the spaces that were before the "abc":
$_ =~ s/^(\s*)abc/$1def/g

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?
There are several issues here.
Parens in Vim regexes are not for capturing by default; you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (\+ means "1 or more of the previous").
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)
If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/
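For comparison, the same backreference substitution in Python's re module, where unescaped parentheses capture much as they do in Vim's very magic mode (Python is an assumption, and the sample word is made up):

```python
import re

text = "shop_orders"

# (.+) captures one or more characters after shop_; \1 reuses them twice.
duplicated = re.sub(r"shop_(.+)", r"shop_\1 wp_\1", text)

print(duplicated)  # shop_orders wp_orders
```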
@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.
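For contrast with the sed behaviour above, here is a sketch of the same group reordering in Python's re module (an assumption, not part of the comment). In Python, unescaped parentheses are the capturing ones, which is the opposite convention from basic sed and default Vim.

```python
import re

text = "123abc456"

# Three capture groups, written without backslashes, emitted in reverse order.
reordered = re.sub(r"^([0-9]*)([abc]*)([456]*)", r"\3\2\1", text)

print(reordered)  # 456abc123
```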