End of line char ($) doesn't work inside square brackets - regex

Putting $ inside square brackets doesn't work for grep.
~ $ echo -e "hello\nthere" > example.txt
~ $ grep "hello$" example.txt
hello
~ $ grep "hello[$]" example.txt
~ $
Is this a bug in grep or am I doing something wrong?

That's what it's supposed to do.
[$]
...defines a character class that matches one character, $.
Thus, this would match a line containing hello$.
See the POSIX RE Bracket Expression definition for the formal specification requiring that this be so. Quoting from that full definition:
A bracket expression (an expression enclosed in square brackets, "[]" ) is an RE that shall match a single collating element contained in the non-empty set of collating elements represented by the bracket expression.
Thus, any bracket expression matches a single element.
Moreover, in the BRE Anchoring Expression definition:
A dollar sign ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a dollar sign as an anchor when used as the last character of a subexpression. The dollar sign shall anchor the expression (or optionally subexpression) to the end of the string being matched; the dollar sign can be said to match the end-of-string following the last character.
Thus -- as of BRE, the regexp format which grep recognizes by default with no arguments -- if $ is not at the end of the expression, it is not required to be recognized as an anchor.
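One way to see the difference is to run both patterns over input that actually contains a literal dollar sign. This is a minimal sketch; the sample lines and variable names are made up for illustration.

```shell
# Sample input (assumed for illustration): a plain line and one ending in a literal "$".
sample=$(printf 'hello\nhello$\n')

# "$" as the last character of the BRE is an anchor: it selects the line ending in "hello".
anchor_match=$(printf '%s\n' "$sample" | grep 'hello$')

# "[$]" is a bracket expression: it selects the line with a literal "$" after "hello".
literal_match=$(printf '%s\n' "$sample" | grep 'hello[$]')

printf 'anchor:  %s\n' "$anchor_match"
printf 'literal: %s\n' "$literal_match"
```

The first pattern should select only the plain line, the second only the line that ends in a dollar sign.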

If you're trying to match either an end-of-line character or the end of the string, you can use a group with alternation, (|), like so: "ABC($|\n)".

You can, however, use $ in a parenthesis grouping, which facilitates the use of | (or), which can accomplish the same idea as a square bracket group.
Something like the following might be of interest to you:
~ $ cat example.txt
hello
there
helloa
hellob
helloc
~ $ grep "hello\($\|[ab]\)" example.txt
hello
helloa
hellob
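With grep -E (ERE syntax) the same idea can be written without backslashes, and POSIX defines $ as an anchor at the end of a subexpression there. A small sketch, feeding the same five sample lines through a pipe instead of a file; the variable name is illustrative.

```shell
# Same five lines as example.txt above, supplied on standard input.
matches=$(printf 'hello\nthere\nhelloa\nhellob\nhelloc\n' | grep -E 'hello($|[ab])')

# Print the selected lines: "hello" alone, or "hello" followed by a or b.
printf '%s\n' "$matches"
```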

Related

sed matching "$" literally without considering it regex

I was trying to use $ in a sed -e command and it works, e.g.:
sed -e 's/world$/test/g' test.txt
The above command replaces "world" at the end of a line.
What confused me is that the following matched literally:
sed -e 's/${projects.version}/20.0/g' test.txt
The above command replaced ${projects.version}. I can't explain how sed matched the $ literally; I expected it to be treated as a special character.
As the POSIX spec says:
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
so when it's not at the end of a BRE, it's just a literal $ character.
For EREs the 2nd paragraph is a little different:
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and the ERE "e$f" is valid, but can never match because the
'f' prevents the expression "e$" from matching ending at the last
character.
Note that last sentence - that means the $ is NOT treated literally in an ERE when not at the end of a regexp, it just can't match anything.
You should never have to worry about this, though. For clarity if nothing else, always escape any regexp metacharacter you want treated literally, so don't write:
's/$foo/bar/'
but write either of these instead:
's/\$foo/bar/'
's/[$]foo/bar/'
and then none of the semantics mentioned above matter.
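As a rough illustration of both escaping styles applied to the ${projects.version} case from the question (the input line below is invented for the example and is not the original test.txt):

```shell
# Illustrative input line (an assumption, not taken from the original test.txt).
line='version=${projects.version}'

# Backslash-escaped "$" and ".": both are matched literally.
escaped=$(printf '%s\n' "$line" | sed -e 's/\${projects\.version}/20.0/')

# Bracket expressions achieve the same thing.
bracketed=$(printf '%s\n' "$line" | sed -e 's/[$]{projects[.]version}/20.0/')

printf '%s\n%s\n' "$escaped" "$bracketed"
```

Escaping the dot as well keeps it from matching an arbitrary character, which the unescaped pattern in the question would allow.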
The rationale for the difference between the way $ is handled in BREs vs EREs in this context is explained at https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08, but basically it's just that the standards were written this way to accommodate the different historical behavior of the way people used $ in BREs vs EREs.
Thanks to #M.NejatAydin here on SO and #oguzismail in comp.unix.shell on usenet for helping clarify the rationale.

Regex which captures a pattern plus everything after until a character is reached

I want a regular expression which catches every time +other appears as well as everything until the next comma.
With
(word),+(other)(word),(code),(word),(other)(code),(example)
(code),+(other),+(other)(code)(word)(example),(example),+(example)
+(code),(other)(code)(word),(code),(word)
I want to return
+(other)(word)
+(other)
+(other)(code)(word)(example)
The command I would use looks something like egrep -o '\+\(other\).*,'.
The only problem is that the comma in this regex isn't necessarily the next comma. Right now the command returns
+(other)(word),(code),(word),(other)(code),
+(other),+(other)(code)(word)(example),(example),
With .*, you consume any 0+ characters, as many as possible, up to and including the last comma.
To stop before the first comma instead, use a negated bracket expression [^,] and apply the * quantifier to it:
egrep -o '\+\(other\)[^,]*'
The [^,]* pattern will match any 0+ characters other than ,.
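The negated-bracket pattern can be checked against the three sample lines from the question. A minimal sketch, using grep -oE in place of egrep -o; the variable names are illustrative.

```shell
# The three sample lines from the question.
input='(word),+(other)(word),(code),(word),(other)(code),(example)
(code),+(other),+(other)(code)(word)(example),(example),+(example)
+(code),(other)(code)(word),(code),(word)'

# -o prints each match on its own line; [^,]* stops before the next comma.
matches=$(printf '%s\n' "$input" | grep -oE '\+\(other\)[^,]*')

printf '%s\n' "$matches"
```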
If your grep supports Perl compatible regular expressions (PCRE), you can use non-greedy matching:
$ grep -Po '\+\(other\).*?,' infile
+(other)(word),
+(other),
+(other)(code)(word)(example),

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just messed up slightly. Any file I currently run through it evaluates to "yes", but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.
Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so it can match a suffix of the input. You likely want to match the whole string; to do that, add the ^ anchor at the beginning of the pattern.
Escape the command line arguments
If you still want to use egrep, quote its pattern argument, as you should quote any command-line argument that contains characters special to the shell, so that the shell does not reinterpret them. Wrap the argument in single or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo here: its handling of backslashes and of arguments that look like options (such as -n) differs between shells. Use printf instead:
printf '%s\n' "$1"
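Putting the quoting and printf advice together, a validation helper might look like the sketch below. check_name is a hypothetical function name, the sample file names are invented, and grep -E is used as the equivalent of egrep.

```shell
# check_name is a hypothetical helper: prints "yes" if its argument looks like
# Name_Major_Minor.extension, and "nope" otherwise.
check_name() {
    if printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'; then
        printf '%s\n' 'yes'
    else
        printf '%s\n' 'nope'
    fi
}

check_name 'Report-A_12_3.txt'   # has Name, Major, Minor and an extension
check_name 'report_12.txt'       # Minor is missing
```

Keeping the pattern in single quotes means the shell passes it to grep unchanged, including the backslash before the dot.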
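Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$. Note that the non-capturing group (?:...) is Perl-style syntax, so with grep it needs -P rather than -E.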
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end, with \.*, which matches a period (\.) any number of times. Also, [A-Z0-9-] only matches uppercase letters, which might not be what you wanted.

ask for explanation of sed regex

I struggle to understand the following two sed regex in a makefile:
sw_version:=software/module/1.11.0
sw:= $(shell echo $(sw_version) | sed -e 's:/[^/]*$$::;s:_[^/]*$$::g')
// sw is "software/module"
version:= $(shell echo $(sw_version) | sed -e 's:.*/::g;s/-.*$$//')
// version is "1.11.0"
I really appreciate a detailed explanation. Thanks!
In a makefile, $$ is replaced by a single $, so the sed expression looks like this:
sed -e 's:/[^/]*$::;s:_[^/]*$::g'
s/// is the substitution command. The delimiter doesn't need to be /; in your case it's a colon (:):
s:/[^/]*$::;
s:_[^/]*$::g
It matches pattern and replaces it with replacement:
s/pattern/replacement/
; is a separator that lets you pass multiple commands in the same call to sed.
So basically these are two substitutions: one replaces /[^/]*$ and the other replaces _[^/]*$, both with nothing.
[...] is a character class, which matches exactly one of the characters you put in it, e.g. [abc] will match either a or b or c. If ^ is at the beginning of the class, it matches any character except those in the class, e.g. [^abc] will match any character except a, b and c.
* repeats the preceding element zero or more times.
$ is the end of the line.
Let's apply what we know to the examples above (read bottom up):
s:/[^/]*$::;
#^^^^^ ^^ ^^
#||||| || |Delimiter
#||||| || Replace with nothing
#||||| |End of line
#||||| Zero or more times
#||||Literal slash
#|||Match everything but ...
#||Character class
#|Literal slash
#Delimiter used for the substitution command
/[^/]*$ will match a literal slash (/) followed by zero or more non-slash characters, up to the end of the line.
_[^/]*$ will match a literal underscore (_) followed by zero or more non-slash characters, up to the end of the line.
That was the first, the second is left as an exercise.
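To check the explanation, both sed expressions can be run directly in a shell, where each $$ from the makefile is written as a single $. A minimal sketch using the value from the question; printf stands in for the makefile's echo.

```shell
# Same value as in the makefile.
sw_version='software/module/1.11.0'

# Strip the last "/component", then any trailing "_suffix" (there is none here).
sw=$(printf '%s\n' "$sw_version" | sed -e 's:/[^/]*$::;s:_[^/]*$::g')

# Strip everything up to the last "/", then any trailing "-suffix" (none here).
version=$(printf '%s\n' "$sw_version" | sed -e 's:.*/::g;s/-.*$//')

printf 'sw=%s\nversion=%s\n' "$sw" "$version"
```

The results should agree with the values given in the makefile comments.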

Grep for a string that ends with specific character

Is there a way to use extended regular expressions to find a specific pattern that ends with a given string?
I mean, I want to match first 3 lines but not the last:
file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
I know that in grep, metacharacter $ matches the end of a line but I'm not interested in matching a line end but string end. Groups in grep are very odd, I don't understand them well yet.
I tried with group matching, actually I have a similar REGEX but it does not work with grep -E
(\w+).pdf$
Is there a way to do string ending match in grep/egrep?
For your example, it is enough to also match the space after the string:
grep -E '\.pdf ' input.txt
What you call "string" is similar to what grep calls a "word". A word is a run of alphanumeric characters. The nice thing about words is that you can match a word end with the special \>, which matches the end of a word with a zero-length match. That also matches at the end of a line. But the set of word characters cannot be changed and does not include punctuation, so we cannot use it here.
If you need to match at the end of line too, where there is no space after the word, use:
grep -E '\.pdf |\.pdf$' input.txt
To include cases where the character after the file name is not a space character ' ' but other whitespace, like a tab (\t), or where the name is directly followed by a comment starting with #, use:
grep -E '\.pdf[[:space:]#]|\.pdf$' input.txt
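Run against the four sample lines from the question, this pattern should keep the three .pdf lines and drop the .pdf.zip one. A minimal sketch; the input is supplied through a variable rather than input.txt, and the variable names are illustrative.

```shell
# The four sample lines from the question.
input='file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations'

# ".pdf" must be followed by whitespace or "#", or sit at the end of the line.
matches=$(printf '%s\n' "$input" | grep -E '\.pdf[[:space:]#]|\.pdf$')

printf '%s\n' "$matches"
```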
I will illustrate the matching of word boundaries too, because that would be the perfect solution, except that we cannot use it here because we cannot change the set of characters that are seen as parts of a word.
The input contains foo as separate word, and as part of longer words, where the foo is not at the end of the word, and therefore not at a word boundary:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n'
foo bar
foo.bar
foobar
foo_bar
foo
Now, to match the boundaries of words, we can use \< for the beginning, and \> to match the end:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n' | grep 'foo\>'
foo bar
foo.bar
foo
Note how _ is matched as a word char, but otherwise, word chars are only the alphanumerics, [a-zA-Z0-9].
Also note how foo at the end of a line is matched, in the line containing only foo. We do not need a special case for the end of line.
You can use the \> operator:
grep 'word\>' fileName
You need to escape the . in your regex. This regex will match anything that ends in .pdf (and only things that end in .pdf):
.*\.pdf$
A positive lookahead is well suited to this kind of thing. Try:
grep -P "(^\w+\.pdf)(?=\s)" file
I assume filenames will always be on the start of the line.