egrep -o : different behaviour in Linux and MacOS - regex

I have two scripts which I would like to run both in a Linux machine and in a MacOS machine. But a different behaviour of the egrep command makes those scripts generate different outputs.
In particular, this is what happens when I use egrep on Linux (Ubuntu):
$ echo ".test" | egrep "[a-zA-Z0-9]*"
.test
$ echo ".test" | egrep -o "[a-zA-Z0-9]*"
test
$
and this is what happens when I use egrep on MacOS
$ echo ".test" | egrep "[a-zA-Z0-9]*"
.test
$ echo ".test" | egrep -o "[a-zA-Z0-9]*"
$
The first behaviour is what I would expect, the second one (empty output) is unexpected. Perhaps this is a bug in the implementation of egrep with -o option under MacOS?
Or, if the second behaviour is correct as well, do you know a way to obtain the same behaviour for the second case?
I tried to look at the corresponding man pages for the two commands, this is extracted from the Linux man page:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each
such part on a separate output line.
and this is extracted from the man page for MacOS:
-o, --only-matching
Prints only the matching part of the lines.
Although the descriptions seem a bit different, the meaning of the two options seems to be the same, so why is egrep -o behaving differently in MacOS? Am I not considering any subtle aspect of this command?

This depends on how the different grep implementations deal with empty matches ([a-zA-Z0-9]* matches the empty string).
I wrote a longer text about this over at Unix&Linux.
In short, should all empty matches be returned? There are infinitely many such matches.

Related

The following code is showing error in AIX server but working fine in Red Hat server

string='binddn:cn=SxX.UXxxxM-E2A,OU=CA,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com'
The working peice of code in Red Hat
dn=($(grep -oi 'cn=[^():]*dc=com' <<< "$string"))
I modified the code for AIX and modified code is
dn=($(grep -xi 'cn=[^():]*dc=com' "$string"))
The code is working perfect in RedHat server, the output in redhat is
dn[0]="cn=SxX.UXxxxM-E2A,OU=CA,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com"
The error in AIX is
grep: can't open binddn:cn=SxX.UXxxxM-E2A,OU=CA,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com
Edited:
Another example:
string = "userbasedn:DC=i,DC=companyname,DC=com?subtree?(&(objectcategory=person)(uidNumber=*)(|(memberOf:1.2.840.113556.1.4.1941:=cn=example1,OU=GROUPS,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com)(memberOf:1.2.840.11.1.4.1941:=cn=example2,OU=GROUPS,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com)))
groupbasedn:DC=i,DC=companyname,DC=com?subtree?(&(objectcategory=group)(gidNumber=*))"
expected output
dn[0]=cn=example1,OU=GROUPS,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com
dn[1]=cn=example2,OU=GROUPS,OU=AI INFRASTRUCTURE,DC=i,DC=companyname,DC=com
If you can use awk, try this:
echo "$string" | awk -F"cn=" 'NF>1{$0=tolower($0);for (i=2;i<=NF;i++) {split($i,a,"dc=com)");print FS a[1]"dc=com"}}'
cn=example1,ou=groups,ou=ai infrastructure,dc=i,dc=companyname,dc=com
cn=example2,ou=groups,ou=ai infrastructure,dc=i,dc=companyname,dc=com
The second argument to grep is a file name, not a string. AIX correctly reports that it cannot find a file with that name. You would get the same error on Red Hat if you tried the same command.
Unfortunately, the -x option doesn't do what you hope; it checks whether the entire line of input matches the regex. Again, you will find exactly the same behavior on Red Hat.
According to the AIX grep manual page it supports the -o option just fine, though.
The Bash "here string" syntax <<<"string" is not available if you don't have Bash, but it is easy to rephrase portably:
printf '%s\n' "$string" |
grep -oi 'cn=[^():]*dc=com'
If you don't have grep -o, try with sed:
printf '%s\n' "$string" |
sed -n 's/.*\(cn=[^():]*dc=com\).*/\1/p'
This is not exactly the same, because the first .* is greedy. If you expect more than one match on a line, you will need a slightly more complex regex.

Case insensitive sed for pattern [duplicate]

I'm trying to use SED to extract text from a log file. I can do a search-and-replace without too much trouble:
sed 's/foo/bar/' mylog.txt
However, I want to make the search case-insensitive. From what I've googled, it looks like appending i to the end of the command should work:
sed 's/foo/bar/i' mylog.txt
However, this gives me an error message:
sed: 1: "s/foo/bar/i": bad flag in substitute command: 'i'
What's going wrong here, and how do I fix it?
Update: Starting with macOS Big Sur (11.0), sed now does support the I flag for case-insensitive matching, so the command in the question should now work (BSD sed doesn't reporting its version, but you can go by the date at the bottom of the man page, which should be March 27, 2017 or more recent); a simple example:
# BSD sed on macOS Big Sur and above (and GNU sed, the default on Linux)
$ sed 's/ö/#/I' <<<'FÖO'
F#O # `I` matched the uppercase Ö correctly against its lowercase counterpart
Note: I (uppercase) is the documented form of the flag, but i works as well.
Similarly, starting with macOS Big Sur (11.0) awk now is locale-aware (awk --version should report 20200816 or more recent):
# BSD awk on macOS Big Sur and above (and GNU awk, the default on Linux)
$ awk 'tolower($0)' <<<'FÖO'
föo # non-ASCII character Ö was properly lowercased
The following applies to macOS up to Catalina (10.15):
To be clear: On macOS, sed - which is the BSD implementation - does NOT support case-insensitive matching - hard to believe, but true. The formerly accepted answer, which itself shows a GNU sed command, gained that status because of the perl-based solution mentioned in the comments.
To make that Perl solution work with foreign characters as well, via UTF-8, use something like:
perl -C -Mutf8 -pe 's/öœ/oo/i' <<< "FÖŒ" # -> "Foo"
-C turns on UTF-8 support for streams and files, assuming the current locale is UTF-8-based.
-Mutf8 tells Perl to interpret the source code as UTF-8 (in this case, the string passed to -pe) - this is the shorter equivalent of the more verbose -e 'use utf8;'.Thanks, Mark Reed
(Note that using awk is not an option either, as awk on macOS (i.e., BWK awk and BSD awk) appears to be completely unaware of locales altogether - its tolower() and toupper() functions ignore foreign characters (and sub() / gsub() don't have case-insensitivity flags to begin with).)
A note on the relationship of sed and awk to the POSIX standard:
BSD sed and awk limit their functionality mostly to what the POSIX sed and
POSIX awk specs mandate, whereas their GNU counterparts implement many more extensions.
Editor's note: This solution doesn't work on macOS (out of the box), because it only applies to GNU sed, whereas macOS comes with BSD sed.
Capitalize the 'I'.
sed 's/foo/bar/I' file
Another work-around for sed on Mac OS X is to install gsedfrom MacPorts or HomeBrew and then create the alias sed='gsed'.
If you are doing pattern matching first, e.g.,
/pattern/s/xx/yy/g
then you want to put the I after the pattern:
/pattern/Is/xx/yy/g
Example:
echo Fred | sed '/fred/Is//willma/g'
returns willma; without the I, it returns the string untouched (Fred).
The sed FAQ addresses the closely related case-insensitive search. It points out that a) many versions of sed support a flag for it and b) it's awkward to do in sed, you should rather use awk or Perl.
But to do it in POSIX sed, they suggest three options (adapted for substitution here):
Convert to uppercase and store original line in hold space; this won't work for substitutions, though, as the original content will be restored before printing, so it's only good for insert or adding lines based on a case-insensitive match.
Maybe the possibilities are limited to FOO, Foo and foo. These can be covered by
s/FOO/bar/;s/[Ff]oo/bar/
To search for all possible matches, one can use bracket expressions for each character:
s/[Ff][Oo][Oo]/bar/
The Mac version of sed seems a bit limited. One way to work around this is to use a linux container (via Docker) which has a useable version of sed:
cat your_file.txt | docker run -i busybox /bin/sed -r 's/[0-9]{4}/****/Ig'
Use following to replace all occurrences:
sed 's/foo/bar/gI' mylog.txt
I had a similar need, and came up with this:
this command to simply find all the files:
grep -i -l -r foo ./*
this one to exclude this_shell.sh (in case you put the command in a script called this_shell.sh), tee the output to the console to see what happened, and then use sed on each file name found to replace the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method, as I didn't like having all the timestamps changed for files not modified. feeding the grep result allows only the files with target text to be looked at (thus likely may improve performance / speed as well)
be sure to backup your files & test before using. May not work in some environments for files with embedded spaces. (?)
Following should be fine:
sed -i 's/foo/bar/gi' mylog.txt

Making a mistake with | (or) in regular expressions

I'm sure I'm doing something really obviously wrong here, but I can't figure out what. Using grep from a bash shell, I have a file test.txt:
ABC123
ABC456
ABC789
DEF123
DEF456
DEF789
Now at the command line:
$ grep ABC test2.txt
ABC123
ABC456
ABC789
$ grep DEF test2.txt
DEF123
DEF456
DEF789
So those work great. Now, I expect the following command to print the whole file, but:
$ grep ABC\|DEF test2.txt
$ grep (ABC)\|(DEF) test2.txt
-bash: syntax error near unexpected token `ABC'
$ grep \(ABC\)\|\(DEF\) test2.txt
$ grep 'ABC|DEF' test2.txt
What am I doing wrong?
Turn on the extended regex with -E:
grep -E "ABC|DEF" test2.txt
As others have pointed out, standard grep command does not support the or syntax. Unfortunately, from there, things are a mishmash.
Some systems have a egrep that does offer or syntax.
Some systems can use a -E or -P (for Perl) flag to extend grep syntax.
Some systems have both the -E and egrep that do the same thing. This implies that there are systems out there where grep -E and egrep are not the same. (sad but true).
Some systems now use the extended regular expressions in their standard grep command. Apparently, your system doesn't.
Read your manpages to see what your system does support. Some systems have a manpage for re_formatthat will explain what they support and don't support in extended format.
Then again, you could always just use a Perl one-liner:
$ perl -ne "print if /(ABC)|(DEF)/" test.txt
At least you know all the stuff that supports.
I don't think standard syntax supports it. You could use -P switch if available:
grep -P "(ABC|DEF)" test2.txt
Use egrep instead, which is the same as using grep -E:
egrep 'ABC|DEF' test2.txt

grep with regexp: whitespace doesn't match unless I add an assertion

GNU grep 2.5.4 on bash 4.1.5(1) on Ubuntu 10.04
This matches
$ echo "this is a line" | grep 'a[[:space:]]\+line'
this is a line
But this doesn't
$ echo "this is a line" | grep 'a\s\+line'
But this matches too
$ echo "this is a line" | grep 'a\s\+\bline'
this is a line
I don't understand why #2 does not match (whereas # 1 does) and #3 also shows a match. Whats the difference here?
Take a look at your grep manpage. Perl added a lot of regular expression extensions that weren't in the original specification. However, because they proved so useful, many programs adopted them.
Unfortunately, grep is sometimes stuck in the past because you want to make sure your grep command remains compatible with older versions of grep.
Some systems have egrep with some extensions. Others allow you to use grep -E to get them. Still others have a grep -P that allows you to use Perl extensions. I believe Linux systems' grep command can use the -P extension which is not available in most Unix systems unless someone has replaced the grep with the GNU version. Newer versions of Mac OS X also support the -P switch, but not older versions.
grep doesn't support the complete set of regular expressions, so try using -P to enable perl regular expressions. You don't need to escape the + i.e.
echo "this is a line" | grep -P 'a\s+line'

Why does this `grep -o` fail, and how should I work around it?

Given the input
echo abc123def | grep -o '[0-9]*'
On one computer (with GNU grep 2.5.4), this returns 123, and on another (with GNU grep 2.5.1) it returns the empty string. Is there some explanation for why grep 2.5.1 fails here, or is it just a bug? I'm using grep -o in this way in a bash script that I'd like to be able to run on different computers (which may have different versions of grep). Is there a "right way" to get consistent behavior?
Yes, 2.5.1's -o handling was buggy:
http://www.mail-archive.com/bug-grep#gnu.org/msg00993.html
Grep is probably not the right tool for this; sed or tr or even perl might be better depending on what the actual task is.
you can use the shell. its faster
$ str=abc123def
$ echo ${str//[a-z]/}
123
I had the same issue and found that egrep was installed on that machine. A quick solution was using
echo abc123def | egrep -o '[0-9]*'
This will give similar results:
echo abc123def | sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'
Your question is a near-duplicate of this one.
Because you are using a regex so you must use either:
grep -E
egrep (like Sebastian posted).
Good luck!