\S Regular Expression - regex

Can someone give me an example of \S in a regular expression working? My understanding is that it should match any line that does not begin with \t, \n, etc.
If this is my file:
test
\ttesting
cat testfile | awk '/\S/ {print}'
Produces no output but I'd expect it to output the \ttesting. I haven't found a good example of what \S is supposed to do or how to get it to work.

As written, /\S/ matches if there is a non-whitespace character anywhere in the line. Thus it matches both lines. It sounds like you want to match on the beginning of the line:
$ cat testfile | awk '/^\S/ {print}'
test
$ cat testfile | awk '/^\s/ {print}'
testing
The caret ^ matches only at the beginning of a line. In the first example above, /^\S/ matches any line whose first character is a non-whitespace character. Thus, it matches the first line in your test file.
The second example does the opposite: it matches if the first character after the start of the line is a whitespace character (\s is the opposite of \S: it matches whitespace). Thus, it matches the line that starts with a tab.
The behavior of \S and \s is documented in section 3.5 of the GNU awk manual, which states:
\s
Matches any whitespace character. Think of it as shorthand for [[:space:]].
\S
Matches any character that is not whitespace. Think of it as shorthand for [^[:space:]].
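For anyone who wants to check the three patterns without gawk at hand, here is a minimal sketch in Python, whose re module treats \s and \S the same way for this ASCII input. The two-element list is an assumption standing in for testfile, with a real tab in front of testing; the helper name matching is made up for the illustration.

```python
import re

# Assumed stand-in for testfile: the second line starts with a real tab.
lines = ["test", "\ttesting"]

def matching(pattern):
    """Return the lines in which `pattern` matches somewhere, like awk '/pattern/'."""
    return [line for line in lines if re.search(pattern, line)]

anywhere = matching(r"\S")        # a non-whitespace character anywhere in the line
starts_non_ws = matching(r"^\S")  # first character is not whitespace
starts_ws = matching(r"^\s")      # first character is whitespace

print(anywhere)
print(starts_non_ws)
print(starts_ws)
```

Only the anchored patterns tell the two lines apart; the unanchored \S accepts both.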

I do not think the \S operator is supported in all implementations of awk. It is a gawk extension and is not listed under the standard Regular Expression Operators in the documentation, so your version of awk may or may not support it.
Another simple command-line tool that does support it is grep. For your purposes, though, you want to match non-whitespace only at the beginning of the line, so you need the ^ anchor:
cat testfile | grep '^\S'
Output:
test

\S
When the UNICODE flags is not specified, matches any non-whitespace
character; this is equivalent to the set [^ \t\n\r\f\v] The LOCALE
flag has no extra effect on non-whitespace match. If UNICODE is set,
then any character not marked as space in the Unicode character
properties database is matched.
https://docs.python.org/2/library/re.html

Here is the sample:
cat -A file
sdf$
$
test$
^Itesting$
$
$
^I^I^I^I$
asdf$
afd afd$
Running this in GNU awk 4.1:
awk '/\S/' file
sdf
test
testing
asdf
afd afd
It removes all empty lines and whitespace-only lines (lines containing only spaces, tabs, and so on).
Here is my awk version in Cygwin:
awk --version |head -1
GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)
Reference: The GNU Awk User's Guide
3.5 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\s
Matches any whitespace character. Think of it as shorthand for [[:space:]].
\S
Matches any character that is not whitespace. Think of it as shorthand for [^[:space:]].
\w
Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].
\W
Matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].
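The filtering in the sample run above (keep a line only if it contains at least one non-whitespace character) can be sketched in Python as well. This is only an illustration: the list is my reconstruction of the cat -A listing, with ^I read as a tab and $ as the line end.

```python
import re

# The sample file, reconstructed from the cat -A listing (^I is a tab).
sample = ["sdf", "", "test", "\ttesting", "", "", "\t\t\t\t", "asdf", "afd afd"]

# Keep a line only if it contains at least one non-whitespace character.
kept = [line for line in sample if re.search(r"\S", line)]

for line in kept:
    print(line)
```

The empty lines and the tabs-only line drop out; the line that merely starts with a tab stays.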

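\S matches everything that \s excludes.
\s means [ \t\n\r\f\v], so watch out: it covers more than spaces and tabs. If you don't want to print the lines beginning with a tab, anchor the pattern and use ^\S.
For lines beginning with any of \r, \t, \n or \f you need ^\s.
So NOT \s is \S.
As you can guess, \s and \S together, written [\s\S], match any single character, newline included, so [\s\S]* is the equivalent of .* when . is allowed to match newlines.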
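A quick way to see how \s and \S combine, sketched in Python (the test string is arbitrary): the class [\s\S] accepts any single character, and [\s\S]* behaves like .* that is also allowed to cross newlines.

```python
import re

text = "ab\ncd"

# [\s\S] is one character that is either whitespace or not: any character at all.
single = re.findall(r"[\s\S]", text)

# Without re.DOTALL, "." stops at the newline; [\s\S]* does not.
dot_star = re.match(r".*", text).group()
any_star = re.match(r"[\s\S]*", text).group()

print(single)
print(repr(dot_star))
print(repr(any_star))
```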

Related

How to grep an exact string with slash in it?

I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
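I want to grep only the lines where superman is a complete path component (they were bolded in the original post): /superman, /superman/batman, /superman/wonderwoman, /batman/superman and /wonderwoman/superman.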
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?
You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.
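The same pattern can be tried outside grep. Here is a minimal Python sketch over the list from the question; the variable names are mine, and the pattern is used unchanged since this ERE syntax means the same thing in Python's re.

```python
import re

paths = """/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2""".splitlines()

# superman must be delimited by a slash or a line boundary on both sides.
component = re.compile(r"(^|/)superman($|/)")

whole = [p for p in paths if component.search(p)]
for p in whole:
    print(p)
```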
grep '/superman\>'
\> is the "end of word" marker. In "superman3" the word does not end right after "man", so that line does not match.
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character
In the case where the line is /batman/superman:
the pattern superman/ does not appear
the pattern /superman is at the end of the line, which is OK, but
it is preceded by the character "n", which is a word-constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
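To make the -w rule quoted above concrete, here is a rough Python emulation. It is a sketch of the rule as the man page words it, not of grep's actual implementation: the helper word_match is hypothetical, and it simply requires that the match is not touching a word character on either side.

```python
import re

def word_match(pattern, line):
    """Rough emulation of grep -w: no word character directly before or after the match."""
    return re.search(r"(?<!\w)(?:" + pattern + r")(?!\w)", line) is not None

# /superman preceded by "n" fails the test; at the start of the line it passes.
print(word_match("/superman", "/batman/superman"))
print(word_match("/superman", "/superman/batman"))

# Plain superman is preceded by "/", which is not a word character.
print(word_match("superman", "/batman/superman"))
print(word_match("superman", "/superman1"))
```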

Regex command is replacing two characters instead of one

I am attempting to replace the spaces in my string with an under-bar. With my limited coding experience, I have come up with this -
s/\b[ ]\D/_/g
This command finds all of the appropriate places in my file; however, it replaces the space and the following character rather than only the space. How can I ensure it replaces only the whitespace and no additional characters?
Also, I would not like this to affect number characters (hence the \D).
The regex \b[ ]\D (which could also be written as \b \D, by the way) matches the space and the following non-digit character, so that's what's replaced with an underscore.
There are two (well, there are more, but these two are the straightforward ones) ways to go about fixing this in Perl:
With a capture group and back reference:
s/\b (\D)/_$1/g
Here the regex still matches the space and the non-digit character, but the non-digit character is captured as $1 and put back as part of the replacement. (\1 also works on the replacement side, but Perl warns that it is better written as $1.)
With a lookahead zero-length assertion:
s/\b (?=\D)/_/g
(?=\D) matches the empty string if (and only if) it is followed by something matching \D, so the non-digit character is no longer part of the match and is not replaced.
Addendum: By the way, I suspect you meant to use \b\D instead of just \D. \D also matches spaces (because they are not digits), so with two spaces between bar and baz you get
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\D)/_/g'
foo 123_bar_ baz_qux
as opposed to
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\b\D)/_/g'
foo 123_bar  baz_qux
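For comparison, the variants discussed in this answer can be run side by side in Python, whose re module supports the same \b, \D, capture-group and lookahead syntax. This is a sketch; the input string is my own and has two spaces between bar and baz so that the difference between \D and \b\D shows up.

```python
import re

s = "foo 123 bar  baz qux"  # two spaces between bar and baz

# Original attempt: the non-digit is part of the match, so it is replaced too.
consumed = re.sub(r"\b \D", "_", s)

# Fix 1: capture the non-digit and put it back.
captured = re.sub(r"\b (\D)", r"_\1", s)

# Fix 2: only look at the non-digit without consuming it.
lookahead = re.sub(r"\b (?=\D)", "_", s)

# Stricter: the space must be followed by the start of a non-digit word.
strict = re.sub(r"\b (?=\b\D)", "_", s)

for result in (consumed, captured, lookahead, strict):
    print(result)
```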
Try
s/\s/_/g
\s is the character class that matches any whitespace character.
If you are worried about adjacent spaces, use \s+
The + means one or more whitespace characters.
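The difference between \s and \s+ is easy to see on a string with adjacent whitespace. A small Python sketch follows (the sample string is made up); note that, unlike the question's pattern, this one does not spare spaces in front of digits.

```python
import re

text = "a  b\tc"  # two spaces, then a tab

each = re.sub(r"\s", "_", text)   # one underscore per whitespace character
runs = re.sub(r"\s+", "_", text)  # one underscore per run of whitespace

print(each)
print(runs)
```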

Grep for a string that ends with specific character

Is there a way to use extended regular expressions to find a specific pattern at the end of a string?
I mean, I want to match first 3 lines but not the last:
file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
I know that in grep, metacharacter $ matches the end of a line but I'm not interested in matching a line end but string end. Groups in grep are very odd, I don't understand them well yet.
I tried with group matching, actually I have a similar REGEX but it does not work with grep -E
(\w+).pdf$
Is there a way to do string ending match in grep/egrep?
Your example works with matching the space after the string also:
grep -E '\.pdf ' input.txt
What you call a "string" is similar to what grep calls a "word". A word is a run of alphanumeric characters and underscores. The nice thing about words is that you can match a word end with the special \>, which matches at the end of a word with a match of zero length. That also matches at the end of a line. But the set of word characters cannot be changed and does not contain punctuation, so we cannot use it here.
If you need to match at the end of line too, where there is no space after the word, use:
grep -E '\.pdf |\.pdf$' input.txt
To include cases where the character after the file name is not a space character ' ' but other whitespace, such as a tab (\t), or where the name is directly followed by a comment starting with #, use:
grep -E '\.pdf[[:space:]#]|\.pdf$' input.txt
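The same idea (.pdf followed by whitespace, a #, or the end of the line) can be checked against the question's four lines in Python. This is a sketch: \s stands in for [[:space:]], and the two alternatives are folded into one group.

```python
import re

lines = [
    "file_number_one.pdf # comment",
    "file_number_two.pdf # not interesting",
    "testfile_number____three.pdf # some other stuff",
    "myfilezipped.pdf.zip some comments and explanations",
]

# ".pdf" must be followed by whitespace, "#", or nothing at all.
ends_with_pdf = re.compile(r"\.pdf(?:[\s#]|$)")

hits = [line for line in lines if ends_with_pdf.search(line)]
for line in hits:
    print(line)
```

The fourth line is rejected because .pdf is followed by a dot there.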
I will illustrate the matching of word boundaries too, because that would be the perfect solution, except that we cannot use it here because we cannot change the set of characters that are seen as parts of a word.
The input contains foo as separate word, and as part of longer words, where the foo is not at the end of the word, and therefore not at a word boundary:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n'
foo bar
foo.bar
foobar
foo_bar
foo
Now, to match the boundaries of words, we can use \< for the beginning, and \> to match the end:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n' | grep 'foo\>'
foo bar
foo.bar
foo
Note how _ is treated as a word character; otherwise, word characters are only the alphanumerics, [a-zA-Z0-9].
Also note how foo at the end of a line is matched, in the line containing only foo. We do not need a special case for the end of line.
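Python's re has no \> operator, but for this input \b after foo behaves the same way, so the demonstration can be reproduced as a sketch (treating \b as a stand-in for \> is my assumption, valid here because foo ends in a word character):

```python
import re

words = ["foo bar", "foo.bar", "foobar", "foo_bar", "foo"]

# foo must be followed by a non-word character or the end of the line.
at_word_end = [w for w in words if re.search(r"foo\b", w)]

for w in at_word_end:
    print(w)
```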
You can use \> operator
grep 'word\>' fileName
You need to escape the . in your regex. This regex will match anything that ends in .pdf (and only things that end in .pdf):
.*\.pdf$
Positive lookaheads are well suited to this kind of thing (the -P option needs a grep built with PCRE support). Try:
grep -P "(^\w+\.pdf)(?=\s)" file
I assume filenames will always be on the start of the line.

How to terminate a regular expression and start another

I have a file which has data something like this:
34sdf, 434ssdf, 43fef,
34sdf, 434ssdf, 43fef, sdfsfs,
I have to identify the sdfsfs, and replace it and/or print the line.
The exact condition is that the tokens are comma-separated, and the target token starts with a non-numeric character and runs until a comma is met.
I start with [^0-9] to match a leading non-numeric character, but the next character is unknown to me: it can be a number, a special character, a letter, or even a space. So I wanted (anything)*, but the preceding [] comes into play and spoils it. I tried [^0-9]*, [^0-9].*, [^0-9]\+.*, [^0-9]{1}*, [^0-9][^,]* and [^0-9]{1}[^\,]*; nothing has worked so far. So my question is how to write a regex for this (a non-numeric starting character, then any number of characters other than a comma, up to the comma). I am using grep and sed (GNU). Another question: does it make any difference whether the regex is POSIX or non-POSIX?
Something like that maybe?
(?:(?:^(\D.*?))|(?:,\s(\D.*?))),
This captures the string that starts with a non-numeric character. Tested here.
I'm not sure if sed supports \D, but you can easily replace it with [^0-9] if not, which you already know.
EDIT: Can be trimmed to:
(?:\s|^)(\D.*?),
With sed, and slight modifications to your last regex:
sed -n 's/.*,[ ]*\([^ 0-9][^\,]*\),/\1/p' input
I think pattern (\s|^)(\D[^,]+), will catch it.
It matches white-space or start of string and group of a non-digit followed by anything but comma, which is followed by comma.
You can use [^0-9] if \D is not supported.
This might work for you (GNU sed):
sed '/\b[^0-9,][^,]*/!d' file # only print lines that match
or:
sed -n 's/\b[^0-9,][^,]*/XXX/gp' file # substitute `XXX` for match
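For comparison, here is one way to express the rule "a token that starts with a non-digit and runs up to the next comma" in Python. This is a sketch under my own assumptions: tokens are separated by a comma plus optional whitespace, and the trailing comma is checked with a lookahead so that two qualifying tokens in a row are both found.

```python
import re

# Start of line or a comma, optional whitespace, then a token whose first
# character is not a digit, whitespace, or comma, running up to the next comma.
token = re.compile(r"(?:^|,)\s*([^\d\s,][^,]*)(?=,)")

rows = [
    "34sdf, 434ssdf, 43fef,",
    "34sdf, 434ssdf, 43fef, sdfsfs,",
]

found = [token.findall(row) for row in rows]
for row, tokens in zip(rows, found):
    print(row, "->", tokens)
```

Only the second row yields a token; the first row has nothing that starts with a non-digit.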

How to match any non white space character except a particular one?

In Perl \S matches any non-whitespace character.
How can I match any non-whitespace character except a backslash \?
You can use a character class:
/[^\s\\]/
matches anything that is not a whitespace character nor a \. Here's another example:
[abc] means "match a, b or c"; [^abc] means "match any character except a, b or c".
You can use a lookahead:
/(?=\S)[^\\]/
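Both forms can be checked quickly in Python, which accepts the same syntax as Perl for these two patterns. A small sketch with an arbitrary test string:

```python
import re

text = "a b\\c\td"  # a, space, b, backslash, c, tab, d

# Negated class: neither whitespace nor a backslash.
by_class = re.findall(r"[^\s\\]", text)

# Lookahead: must be non-whitespace, then consume anything but a backslash.
by_lookahead = re.findall(r"(?=\S)[^\\]", text)

print(by_class)
print(by_lookahead)
```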
This worked for me using sed [Edit: a comment points out that sed does not support \s inside a bracket expression]
[^ ]
while
[^\s]
didn't
# Delete everything except space and 'g'
echo "ghai ghai" | sed "s/[^\sg]//g"
gg
echo "ghai ghai" | sed "s/[^ g]//g"
g g
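The sed result above makes sense once you read [^\sg] the POSIX way: inside a bracket expression the backslash is not an escape, so the set is a backslash, s and g. Python is used below only to spell that reading out explicitly; it is an emulation of the sed behavior, not sed itself.

```python
import re

text = "ghai ghai"

# POSIX reading of [^\sg]: keep only backslash, "s" and "g".
posix_reading = re.sub(r"[^\\sg]", "", text)

# Literal space in the set, as in the second sed command.
literal_space = re.sub(r"[^ g]", "", text)

# Python (like Perl) does interpret \s inside a class.
python_reading = re.sub(r"[^\sg]", "", text)

print(repr(posix_reading))
print(repr(literal_space))
print(repr(python_reading))
```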
On my system: CentOS 5
I can use \s outside of collections but have to use [:space:] inside of collections. In fact I can use [:space:] only inside collections. So to match a single space using this I have to use [[:space:]]
Which is really strange.
echo a b cX | sed -r "s/(a\sb[[:space:]]c[^[:space:]])/Result: \1/"
Result: a b cX
first space I match with \s
second space I match alternatively with [[:space:]]
the X I match with "all but no space" [^[:space:]]
These two will not work:
a[:space:]b instead use a\sb or a[[:space:]]b
a[^\s]b instead use a[^[:space:]]b
If you are using regular expressions in bash, grep, or sed rather than Perl, \S is not guaranteed to work: it is a GNU extension, and neither \s nor escapes such as \t and \n are interpreted inside a bracket expression. The portable equivalent of \S is [^[:space:]].
So, instead of this:
[^\s\\]
...you'll have to do this instead, to match no whitespace characters ([:space:]) and no backslash (\; regex: \\)
[^[:space:]\\]
References:
[my answer] Unix & Linux: Any non-whitespace regular expression
In this case, it's easier to define the problem of "non-whitespace without the backslash" to be not "whitespace or backslash", as the accepted answer shows:
/[^\s\\]/
However, for trickier problems, the regex set feature might be handy. You can perform set operations on character classes to get what you want. This one subtracts the set that is just the backslash from the set of non-whitespace characters:
use v5.18;
use experimental qw(regex_sets);
my $regex = qr/abc(?[ [\S] - [\\] ])/;
while( <DATA> ) {
    chomp;
    say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
}
__DATA__
abcd
abc d
abc\d
abcxyz
abc\\xyz
The output shows that neither whitespace nor the backslash matches after c:
[abcd] Matched
[abc d] Missed
[abc\d] Missed
[abcxyz] Matched
[abc\\xyz] Missed
This gets more interesting when the larger set would be difficult to express gracefully and set operations can refine it. I'd rather see the set operation in this example:
[b-df-hj-np-tv-z]
(?[ [a-z] - [aeiou] ])
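Python's built-in re module has no set operations, so this is not the regex_sets feature itself. The usual workaround is a negative lookahead in front of the larger class, and the sketch below checks that it selects the same letters as the hand-written ranges.

```python
import re
import string

# "a lowercase letter that is not a vowel", written as subtraction via lookahead.
subtract = re.compile(r"(?![aeiou])[a-z]")

# The same set spelled out as explicit ranges.
explicit = re.compile(r"[b-df-hj-np-tv-z]")

letters = string.ascii_lowercase
via_lookahead = "".join(subtract.findall(letters))
via_ranges = "".join(explicit.findall(letters))

print(via_lookahead)
print(via_ranges)
```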