Regex behaviour with angle brackets

Please explain to me why the following expression doesn't output anything:
echo "<firstname.lastname#domain.com>" | egrep "<lastname#domain.com>"
but the following does:
echo "<firstname.lastname#domain.com>" | egrep "\<lastname#domain.com>"
The first behaves as expected, but I would not expect the second to print anything. Is the "\<" being ignored within the regex, or does it trigger some other special behaviour?

As @hwnd said, \< matches the beginning of a word. That is, a word boundary (\b) must exist at that position, and the character that follows \< in the input must be a word character.
In your example,
echo "<firstname.lastname#domain.com>" | egrep "<lastname#domain.com>"
Here egrep looks for a literal < character immediately before lastname. There isn't one (the preceding character is a dot), so it prints nothing.
$ echo "<firstname.lastname#domain.com>" | egrep "\<lastname#domain.com>"
<firstname.**lastname#domain.com>**
In this example there is a word boundary (\b) just before lastname, between the dot and the l, so the pattern matches and the line is printed (the matched part is marked with **).
Some more examples:
$ echo "namelastname#domain.com" | egrep "\<e#domain.com"
$ echo "namelastname#domain.com" | egrep "\<lastname#domain.com"
$ echo "namelastname#domain.com" | egrep "\<com"
namelastname#domain.**com**
$ echo "<firstname.lastname#domain.com>" | egrep "\<#domain.com>"
$ echo "n-ame-lastname#domain.com" | egrep "\<ame-lastname#domain.com"
n-**ame-lastname#domain.com**
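
The difference between a literal < and the \< anchor can be checked side by side. This is only a minimal sketch, assuming GNU grep (where \< is an extension) and using grep -E as the current spelling of egrep; the variable names are made up for illustration.

```shell
# Assumes GNU grep: \< is a GNU extension that anchors at the start of a word.
line='<firstname.lastname#domain.com>'

# Literal "<": the input has "." before "lastname", not "<", so nothing matches.
literal=$(printf '%s\n' "$line" | grep -E '<lastname#domain.com>' || true)

# "\<": zero-width start-of-word anchor; "." followed by "l" is a word boundary.
anchored=$(printf '%s\n' "$line" | grep -E '\<lastname#domain.com>' || true)

printf 'literal:  [%s]\n' "$literal"
printf 'anchored: [%s]\n' "$anchored"
```

The `|| true` only keeps the script going when grep finds nothing and returns a non-zero status.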

Related

Why [^\d\w\s,] matches "leonardo,davinci"?

I can't understand why the regexp:
[^\d\s\w,]
Matches the string:
"leonardo,davinci"
That is my test:
$ echo "leonardo,davinci" | egrep '[^\d\w\s,]'
leonardo,davinci
While this works as expected:
$ echo "leonardo,davinci" | egrep '[\S\W\D]'
$
Thanks very much

It's because egrep doesn't support the shorthand classes \d, \w and \s inside a bracket expression. There the backslash is just an ordinary character, so the set is read literally (anything except \, d, w, s or a comma):
leonardo,davinci
echo "leonardo,davinci" | egrep '[^a-zA-Z0-9 ,]'
This will indeed not match.
If you have it installed, you can use pcregrep instead:
echo "leonardo,davinci" | pcregrep '[^\w\s,]'
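
If pcregrep is not available, POSIX character classes are a portable alternative. The sketch below assumes a grep that accepts -E and the [[:alnum:]] / [[:space:]] classes, with [[:alnum:]_] standing in for \w; the variable names are only illustrative.

```shell
# Inside [...] a backslash is an ordinary character in POSIX ERE,
# so the portable spelling uses POSIX character classes instead of \w and \s.
input='leonardo,davinci'

# [^\d\w\s,] means "any character except \, d, w, s or comma": 'l' already matches.
if printf '%s\n' "$input" | grep -Eq '[^\d\w\s,]'; then
    backslash_class='match'
else
    backslash_class='no match'
fi

# [[:alnum:]_] stands in for \w and [[:space:]] for \s.
if printf '%s\n' "$input" | grep -Eq '[^[:alnum:]_[:space:],]'; then
    posix_class='match'
else
    posix_class='no match'
fi

echo "backslash class: $backslash_class"
echo "POSIX class:     $posix_class"
```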

grep to select strings that contains certain words

I have a list:
/device1/element1/CmdDiscovery
/device1/element1/CmdReaction
/device1/element1/Direction
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
How can I grep so that only the strings ending in "Field" followed by digits, or simply in NRepeatLeft, are returned (in my example, the last three strings)?
Expected output:
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft

Try doing this :
grep -E "(Field[0-9]*|NRepeatLeft$)" file.txt
      |  |           |           ||
      |  |          OR   end_line |
      | opening_choice   closing_choice
extended_grep
If you don't have the -E switch (ERE stands for Extended Regular Expression):
grep "\(Field[0-9]*\|NRepeatLeft$\)" file.txt
OUTPUT
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
That will print lines containing Field followed by zero or more digits, or lines with NRepeatLeft at the end. Is that what you expect?

I am not sure how to use grep for your purpose. You might prefer perl for this:
perl -lne 'print if /(Field\d+|NRepeatLeft)$/' your_file

$ grep -E '(Field[0-9]*|NRepeatLeft)$' file.txt
Output:
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft
Explanation:
Field # Match the literal word
[0-9]* # Followed by zero or more digits
| # Or
NRepeatLeft # Match the literal word
$ # Match the end of the string
You can see how this works with your example here.
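
Since the question asks for "Field" followed by digits, a slightly stricter variant uses + instead of *. This is a sketch under that reading of the question, run against the sample list; the variable names are illustrative.

```shell
# Sample list from the question, kept in a variable instead of file.txt.
list='/device1/element1/CmdDiscovery
/device1/element1/CmdReaction
/device1/element1/Direction
/device1/element1/MS-E2E003-COM14/Field2
/device1/element1/MS-E2E003-COM14/Field3
/device1/element1/MS-E2E003-COM14/NRepeatLeft'

# + requires at least one digit after "Field"; the trailing $ sits outside
# the group, so it anchors both alternatives to the end of the line.
matches=$(printf '%s\n' "$list" | grep -E '(Field[0-9]+|NRepeatLeft)$')

printf '%s\n' "$matches"
```

With * a line ending in a bare "Field" would also be selected; with + it is not.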

Why does grep match all lines for the pattern "\'"

In this SO question there is something that I cannot explain:
grep "\'" input_file
matches all lines in the given file. Does \' have a special meaning for grep?

grep regex GNU extension: ‘\'’ matches the end of the whole input

I did not know about this feature of regular expressions, but it is listed at regular-expressions.info as an end-of-string anchor.
It does not exist in all regex implementations, only in GNU Basic and Extended Regular Expressions; see this compatibility chart for more info.

That is really strange behaviour for grep. I can't explain it, but I must note that \' doesn't match any character. It looks like it has the same meaning as $:
$ echo x | grep "x\'"
x
$ echo xy | grep "x\'"
$ echo x | grep "\'x"
Update 1
As stated in http://www.gnu.org/software/findutils/manual/html_node/find_html/grep-regular-expression-syntax.html (thanks to Richard Sitze for the link), it really does have the same meaning as $. Meanwhile, though, I've noticed a difference between \' and $:
$ echo x | grep 'x$'
x
$ echo x | grep 'x$$'
$ echo x | grep "x\'"
x
$ echo x | grep "x\'\'"
x
$ echo x | grep "x\'\'\'"
x
You can specify \' as many times as you wish, but that is not so for $: only a $ at the very end of a basic regular expression is an anchor, and any earlier $ is taken as a literal dollar sign.
Another important remark. The manual says:
‘\'’ matches the end of the whole input
But strictly speaking that's not true, because \' matches not only the end of the whole input but also the end of every single line:
$ (echo x; echo y) | grep "\'"
x
y
Exactly how $ does.
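
The observations above can be collected into one short script. It is only a sketch and assumes GNU grep, since \' is a GNU extension and other grep implementations may treat it differently; the variable names are illustrative.

```shell
# Assumes GNU grep: \' is a GNU extension.
# grep tests each line on its own, so \' ends up acting at the end of every line.

# x followed by \': only the line where x is the last character is selected.
end_only=$(printf 'x\nxy\n' | grep "x\'" || true)

# A bare \' selects every line, and so does a bare $.
with_quote=$(printf 'x\ny\n' | grep -c "\'" || true)
with_dollar=$(printf 'x\ny\n' | grep -c '$' || true)

printf 'lines ending in x:              %s\n' "$end_only"
printf 'lines selected by the GNU form: %s\n' "$with_quote"
printf 'lines selected by $:            %s\n' "$with_dollar"
```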

\ is an escape character. This means that the ' should be considered text to search for, not a control character.

how to match "ABC-123" but not "XABC-123" in a regular expression

I have this egrep search:
egrep -is "(ABC-[0-9]+)"
which matches ABC-123 anywhere in a string.
I'd like it to ignore XABC-456 or YABC-789.
In other words, those examples should output "ok":
echo "ABC-123" | egrep -is "(ABC-[0-9]+)" && echo "ok"
echo "test ABC-123" | egrep -is "(ABC-[0-9]+)" && echo "ok"
But this shouldn't:
echo "XABC-123" | egrep -is "(<fill in>ABC-[0-9]+)" && echo "ok"
I tried this without any luck (no output):
echo "ABC-123" | egrep -is "(\bABC-[0-9]+)" && echo "ok"
(I'm running Solaris 10)
How can I do that?

It looks like you're looking for \bABC-[0-9]+ (word boundaries).
Another option is to use a negative lookbehind, which gives you more control over what can and cannot come before the match: (?<![a-z])ABC-[0-9]+. Note that lookbehind needs a PCRE-style engine; POSIX egrep does not support it.

This should do :
^(ABC-[0-9]+)
This way you require the line to start with your regexp.

If \b doesn't work for you, have you tried ((^| )ABC-[0-9]+)?

Try the following:
echo "XABC-123" | egrep -is "(\bABC-[0-9]+)" && echo "ok"
A couple of the solutions propose using ^ (starts with...); however, they will fail on " ABC-123", which you might want to catch. Word boundaries are probably what you want, unless you really are looking for "starts with".
Here's some sample output:
tim#Ikura ~
$ echo " ABC-123" | egrep -is "(\bABC-[0-9]+)" && echo "ok"
ABC-123
ok
tim#Ikura ~
$ echo "ABC-123" | egrep -is "(\bABC-[0-9]+)" && echo "ok"
ABC-123
ok
tim#Ikura ~
$ echo "XABC-123" | egrep -is "(\bABC-[0-9]+)" && echo "ok"
tim#Ikura ~
$
Update: Solaris issues... "Searching for a word isn't quite as simple as it at first appears. The string "the" will match the word "other". You can put spaces before and after the letters and use this regular expression: " the ". However, this does not match words at the beginning or end of the line. And it does not match the case where there is a punctuation mark after the word.
There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as they don't occupy a position of a character. They do "anchor" the expression between to only match if it is on a word boundary. The pattern to search for the word "the" would be "\<[tT]he\>". The character before the "t" must be either a new line character, or anything except a letter, number, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore or it could be the end of line character."
tim#Ikura ~
$ echo "XABC-123" | egrep -is "(\<ABC-[0-9]+\>)" && echo "ok"
tim#Ikura ~
$ echo " ABC-123" | egrep -is "(\<ABC-[0-9]+\>)" && echo "ok"
ABC-123
ok

echo "XABC-123" | egrep -is "^ABC-[0-9]+" && echo "ok"
EDIT: To accept ABC when anything but a letter precedes it:
echo "XABC-123" | egrep -is "(^|[^A-Z])ABC-[0-9]+" && echo "ok"
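
Where neither \b nor \< is available, the alternation idea from the answers can be written with a POSIX character class. A minimal sketch, assuming a grep that accepts -E and [[:alnum:]]; the function name has_abc_id is hypothetical.

```shell
# (^|[^[:alnum:]_]) means "start of line, or a character that cannot be
# part of a word", which approximates a leading word boundary.
# has_abc_id is a hypothetical helper name used only for this example.
has_abc_id() {
    printf '%s\n' "$1" | grep -Eiq '(^|[^[:alnum:]_])ABC-[0-9]+'
}

for candidate in 'ABC-123' 'test ABC-123' 'XABC-123'; do
    if has_abc_id "$candidate"; then
        echo "ok:      $candidate"
    else
        echo "skipped: $candidate"
    fi
done
```

Unlike a real word boundary, this pattern consumes the preceding character, which matters only if you print the matched text rather than testing for a match.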

matching a line with a literal asterisk "*" in grep

Tried
$ echo "$STRING" | egrep "(\*)"
and also
$ echo "$STRING" | egrep '(\*)'
and countless other variations. I just want to match a line that contains a literal asterisk anywhere in the line.

Try a character class instead
echo "$STRING" | egrep '[*]'

echo "$STRING" | fgrep '*'
fgrep treats the pattern as a fixed string rather than a regex, so special characters such as * are matched literally.

Simply escape the asterisk with a backslash:
grep "\*"

Use:
grep "*" file.txt
or
cat file.txt | grep "*"

Here's one way to match a literal asterisk:
$ echo "*" | grep "[*]"
*
$ echo "*" | egrep "[*]"
*
$ echo "asfd" | egrep "[*]"
$ echo "asfd" | grep "[*]"
$
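Wrapping a single special character in brackets usually lets you match it literally; this also works for a right bracket (placed first in the brackets) or a hyphen (placed first or last), for instance.
Be careful when the asterisk isn't inside a bracket expression. Here egrep matches every line, while grep treats the leading * as a literal asterisk: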
$ echo "hi" | egrep "*"
hi
$ echo "hi" | grep "*"
$
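
The three approaches suggested in this thread (fixed string, bracket expression, backslash escape) can be compared directly. This is a sketch with made-up function names, using grep -F and grep -E as the current spellings of fgrep and egrep.

```shell
# Three ways to ask "does this string contain a literal *?".
contains_star_fixed()   { printf '%s\n' "$1" | grep -Fq '*'; }    # -F: pattern is a fixed string
contains_star_bracket() { printf '%s\n' "$1" | grep -Eq '[*]'; }  # * is literal inside [...]
contains_star_escaped() { printf '%s\n' "$1" | grep -Eq '\*'; }   # backslash removes the quantifier meaning

for sample in 'a*b' 'hi'; do
    for check in contains_star_fixed contains_star_bracket contains_star_escaped; do
        if "$check" "$sample"; then result='yes'; else result='no'; fi
        printf '%-22s %-4s %s\n' "$check" "$sample" "$result"
    done
done
```

All three patterns are single-quoted so the shell passes the asterisk to grep unchanged.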

If you need to detect an asterisk in awk, you can use
awk '/\*/' file
Here, * is used in a regex, and thus, must be escaped since an unescaped * is a quantifier that means "zero or more occurrences". Once it is escaped, it no longer has any special meaning.
Alternatively, if you do not need to check for anything else, it makes sense to perform a fixed string check:
awk 'index($0, "*")' file
If * is found anywhere inside a "record" (i.e. a line) the current line will get printed.
