Escaping char in regex

I ran this example in my bash terminal:
echo "ab" | egrep '\a\b'
Output
ab
then I ran this one:
echo "a" | egrep '\a\b'
Output
a
I was confused. Why did I get output?
But then I input another one:
echo "a" | egrep '\ab'
and didn't get output.
What is the difference between \a\b and \ab for Extended RegEx?
P.S
Regex didn't ignore the first letter:
echo "b" | egrep '\a\b'
Output is empty
Regex acted as I expected, but what is the difference from the first case?
P.P.S
I found out that this example:
echo 'ab' | egrep '\a\b'
also does not output anything. In the first case (echo "ab" | egrep '\a\b'), the output was ab. How can quotes affect regex?

The pattern '\a\b' is the regex \a\b, where \a is an unknown escape sequence that matches a literal a, and \b is a word boundary.
echo "a" | egrep '\a\b' - correctly outputs a: there is an a that is not followed by a letter, digit, or underscore.
echo "a" | egrep '\ab' - correctly does not output anything: a does not contain ab.
echo "b" | egrep '\a\b' - correctly does not output anything: b does not contain a.
As for echo "ab" | egrep '\a\b', it should not output anything: ab does not contain an a at a word boundary, because b follows a.
Other letters that mean something else when escaped (GNU extensions, available in GNU grep's POSIX BRE and ERE modes):
\B - a position other than a word boundary position
\s - whitespace
\S - non-whitespace
\w - letters, digits, and underscores
\W - characters different from letters, digits, and underscores.
Also, there is \< (start of word) and \> (end of word).
If you use -P, PCRE pattern, there are even more escape sequences available (see pcresyntax and pcrepattern).
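The cases above can be checked with a small helper. This is only a sketch: the function name matches is made up for illustration, GNU grep is assumed for \b, and the stray backslash before a is dropped because some newer GNU grep releases warn about unknown escapes such as \a.

```shell
# Report whether an ERE pattern matches a line (GNU grep assumed for \b).
# "matches" is a hypothetical helper name, not part of grep.
matches() {
  # $1 = pattern, $2 = input line
  if printf '%s\n' "$2" | grep -Eq -- "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

matches 'a\b' 'a'     # match: nothing follows a, so there is a word boundary
matches 'a\b' 'ab'    # no match: b follows a, so there is no boundary after a
matches 'ab'  'a'     # no match: the input has no b
matches 'a\b' 'b'     # no match: the input has no a
matches 'a\b' 'a b'   # match: a space follows a
```

Single quotes around the pattern keep the shell from touching the backslashes, so grep sees them unchanged.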

Related

sed replaces word boundary with two periods/fullstops

Why would sed replace the word boundary with three periods/fullstops instead of one?
echo " 0/1:53,0,56:5:3,2 0/0:0,18,155:6:6,0 0/0:0,35,255:23:22,1" | sed 's/:[0-9,]\+\b/\./g'
returns 0/1... 0/0... 0/0...
This happens even when I use \> instead of \b for word boundary.
I'm running on
Operating System: Ubuntu 16.04.7 LTS
Kernel: Linux 4.15.0-128-generic
You get multiple dots because you have : in the number sequence as well. Do this:
$ echo " 0/1:53,0,56:5:3,2 0/0:0,18,155:6:6,0 0/0:0,35,255:23:22,1" | sed 's/:[0-9,:]\+/./g'
0/1. 0/0. 0/0.
In other words scan over [0-9,:]\+ instead of [0-9,]\+. Also there is no need to escape the dot in the replacement part.
You want to remove all non-space chars after each :, so use the POSIX BRE pattern [^ ]* instead of [0-9,]\+:
echo " 0/1:53,0,56:5:3,2 0/0:0,18,155:6:6,0 0/0:0,35,255:23:22,1" | \
sed 's/:[^ ]*/./g'
# => 0/1. 0/0. 0/0.
If there can be any whitespace, use sed 's/:[^[:space:]]*/./g'.
Note you do not need to escape the dot in the replacement pattern, it is a literal . char there.
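To see where the three dots come from, it may help to count the separate matches of the original pattern. The sketch below assumes GNU grep and sed (for \+ in a BRE); the helper names count_runs and strip_after_colon are made up for illustration.

```shell
line='0/1:53,0,56:5:3,2 0/0:0,18,155:6:6,0'

# Count how many separate matches the original pattern finds (GNU grep -o
# prints each match on its own line).
count_runs() { grep -o ':[0-9,]\+' | wc -l | tr -d ' '; }

# One substitution per field: consume everything up to the next whitespace.
strip_after_colon() { sed 's/:[^[:space:]]*/./g'; }

printf '%s\n' "$line" | count_runs          # 6, i.e. three matches per field
printf '%s\n' "$line" | strip_after_colon   # 0/1. 0/0.
```

Each :digits run is a match of its own because : is not in the bracket expression, so every field yields three replacements.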

To match all characters not ending with specified string

echo "xxabc jkl" | grep -onP '\w+(?!abc\b)'
1:xxabc
1:jkl
Why is the result not as below?
echo "xxabc jkl" | grep -onP '\w+(?!abc\b)'
1:jkl
The first string is xxabc, which ends with abc.
I want to extract all words that do not end with abc; why is xxabc matched?
How can I fix it, that is, get only 1:jkl as output?
Why doesn't '\w+(?!abc\b)' work?
The \w+(?!abc\b) pattern matches xxabc because \w+ matches 1 or more word chars greedily, and thus grabs xxabc at once. Then, the negative lookahead (?!abc\b) makes sure there is no abc with a trailing word boundary immediately to the right of the current location. Since there is no abc with a trailing word boundary after xxabc, the match succeeds.
To match all words that do not end with abc using a PCRE regex, you may use
echo "xxabc jkl" | grep -onP '\b\w+\b(?<!abc)'
Details
\b - a leading word boundary
\w+ - 1 or more word chars
\b - a trailing word boundary
(?<!abc) - a negative lookbehind that fails the match if the 3 letters immediately to the left of the current location are abc.
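The two patterns can be compared side by side. This is a sketch that assumes a GNU grep built with PCRE support (the -P option); the function names lookahead and not_ending_in_abc are made up for illustration.

```shell
# Greedy \w+ followed by a lookahead: the lookahead inspects what comes AFTER
# the word, so xxabc still matches.
lookahead() { grep -oP '\w+(?!abc\b)'; }

# Lookbehind placed after the trailing \b: it inspects the last three
# characters of the word itself.
not_ending_in_abc() { grep -oP '\b\w+\b(?<!abc)'; }

printf '%s\n' 'xxabc jkl' | lookahead           # xxabc, then jkl
printf '%s\n' 'xxabc jkl' | not_ending_in_abc   # jkl
```

The leading \b in the second pattern matters: without it, the regex engine could start a match in the middle of xxabc and report a shorter piece of the word.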
Without PCRE features, you can do it by piping through sed:
echo "xxabc jkl" | sed 's/[a-zA-Z]*abc//g' | grep -onE '[a-zA-Z]+'
or with awk:
echo "xxabc jkl" | awk -F'[^a-zA-Z]+' '{for(i=1;i<=NF;i++){ if ($i!~/abc$/) printf "%s: %s\n",NR,$i }}'
Another approach:
echo "xxabc jkl" | awk -F'([^a-zA-Z]|[a-zA-Z]*abc\\>)+' '{OFS="\n"NR": ";if ($1) printf OFS;$1=$1}1'

Grep only for lowercase and spaces

I need to grep files for lines containing only lowercase letters and spaces. Both conditions must be met at least once and no other characters are allowed.
I know how to grep only for lowercase or only for space but I don't know how to join those two conditions in one regexp/command.
I have only this right now:
egrep "[[:space:]]" $DIR/$file | egrep -vq "[[:upper:]]"
which of course will display lines with digits and/or special characters as well which is not what I want.
Thanks.
This is what you need:
The -x option matches whole lines only.
The first expression matches lines composed entirely of spaces and lowercase letters.
The second expression matches lines that contain both a space and a lowercase letter.
egrep -x '[[:lower:] ]*' $DIR/$file | egrep '( [[:lower:]])|([[:lower:]] )'
awk may be better to express such conditions:
awk '/^[ a-z]+$/ && /[a-z]/ && / /' file
That is, it checks that a line:
consists only of spaces and lowercase letters.
contains at least one lowercase letter.
contains at least one space.
Test
$ cat a
hello this is something simple
but SUDDENLY not
wah
wa ah
$ awk '/^[ a-z]+$/ && /[a-z]/ && / /' a
hello this is something simple
wa ah
First grep for all lines that consist only of lowercase characters and whitespace, and then for those that contain at least one whitespace character.
egrep -x '[[:lower:][:space:]]+' "$DIR/$file" | egrep '[[:space:]]+'
The [:space:] character class also matches tabs, and can be replaced with a plain space if desired. Note that a line made up only of whitespace also passes both filters.
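Both conditions can also be folded into a single expression. The sketch below relies on one observation: a line made only of lowercase letters and spaces that contains at least one of each must have a letter next to a space somewhere. The function name lower_and_space_lines is made up for illustration.

```shell
# Print lines that consist only of lowercase letters and spaces and contain
# at least one of each. Reads the files given as arguments, or stdin.
lower_and_space_lines() {
  grep -Ex '[[:lower:] ]*([[:lower:]] | [[:lower:]])[[:lower:] ]*' "$@"
}

printf '%s\n' \
  'hello this is something simple' \
  'but SUDDENLY not' \
  'wah' \
  'wa ah' \
  '   ' \
  | lower_and_space_lines
# prints:
# hello this is something simple
# wa ah
```

Unlike a pipeline that only checks for whitespace in the second stage, this rejects lines that contain nothing but spaces.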

Using sed in bash to match a specific character as long as it is not preceded or followed by any other character

How can I use sed to match only a lonely character (not part of a word)? For example, I want to match any instance of "a" in a file, but not match the "a" in "bag" or "contain".
With GNU sed, you can require that the a be surrounded by word boundaries. Use \b to denote a word boundary. For example:
$ echo 'a bag a part a.' | sed 's/\ba\b/A/g'
A bag A part A.
In addition to \b, GNU sed, among other seds, supports word-begin, \<, and word-end, \>. For example:
$ echo 'a bag a part a.' | sed 's/\<a\>/A/g'
A bag A part A.
For purposes of defining a word boundary, a word character can be any alphanumeric character or an underscore. The boundary is where a word character is adjacent to a non-word character.
POSIX sed
Without the word boundary extensions, the same thing can be done with three substitution commands:
$ echo 'a bag a part a.' | sed -E -e 's/^a([^[:alnum:]_])/A\1/g' -e 's/^a$/A/g' -e 's/([^[:alnum:]_])a([^[:alnum:]_])/\1A\2/g'
A bag A part A.
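The three-command version leaves some lone a's untouched, for example one at the very end of a line after a space, or the last one in "a a a". One possible variant, offered as a sketch rather than a definitive fix, anchors both sides with an alternation and runs the same substitution twice so that neighbours sharing a separator are all caught. It assumes a sed that accepts -E; the function name upcase_lone_a is made up for illustration.

```shell
# Replace every stand-alone "a" with "A" without using \b, \< or \>.
# The substitution is applied twice: in "a a a" the first pass consumes the
# separator after the first "a", so the middle "a" is only seen on pass two.
upcase_lone_a() {
  sed -E -e 's/(^|[^[:alnum:]_])a($|[^[:alnum:]_])/\1A\2/g' \
         -e 's/(^|[^[:alnum:]_])a($|[^[:alnum:]_])/\1A\2/g'
}

echo 'a bag a part a.' | upcase_lone_a   # A bag A part A.
echo 'a a a'           | upcase_lone_a   # A A A
echo 'contain a'       | upcase_lone_a   # contain A
```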

regular expression to extract number from string

I want to extract a number from a string. This is the string:
#all/30
All I want is 30. How can I extract it?
I tried to use:
echo "#all/30" | sed 's/.*\/([^0-9])\..*//'
But nothing happens.
How should I write the regular expression?
Sorry for bad english.
You may consider using grep to extract the numbers from a simple string like this.
echo "#all/30" | grep -o '[0-9]\+'
The -o option prints only the part of the line that matches the pattern.
You could try the sed command below:
$ echo "#all/30" | sed 's/[^0-9]*\([0-9]\+\)[^0-9]*/\1/'
30
[^0-9]* - [^...] is a negated character class that matches any character except those listed inside it, so [^0-9]* matches zero or more non-digit characters.
\([0-9]\+\) - captures one or more digit characters.
[^0-9]* - matches zero or more non-digit characters.
Replacing the match with the contents of group 1 gives the number 30.
echo "all/30" | sed 's/[^0-9]*\/\([0-9][0-9]*\)/\1/'
Avoid a leading '.*', as it greedily consumes as much of the string as it can; matches are greedy by default.
echo "all/30" | sed 's/[^0-9]*//g'
# OR
echo "all/30" | sed 's#.*/##'
# OR
echo "all/30" | sed 's#[^0-9]*\([0-9]*\).*#\1#'
Without more information about the possible input strings, we can only assume that the structure is #all/ followed by the number (and nothing else).
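The effect of greedy matching on the capture group can be seen directly. The sketch below uses only POSIX BRE syntax in sed plus a shell parameter expansion; the function name first_number is made up for illustration, and the input is assumed to look like #all/30.

```shell
s='#all/30'

# sed (POSIX BRE): skip leading non-digits, capture the first run of digits,
# discard the rest. A line without digits is printed unchanged.
first_number() { sed 's/[^0-9]*\([0-9][0-9]*\).*/\1/'; }

printf '%s\n' "$s" | first_number               # 30
printf '%s\n' "$s" | sed 's/.*\([0-9]*\)/\1/'   # empty line: .* took the digits

# Pure shell: strip everything up to and including the last slash.
printf '%s\n' "${s##*/}"                        # 30
```

The parameter expansion avoids starting a separate process, but it only works when the number is everything after the last slash.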