How can I match square brackets in a regex with grep? - regex

I am trying to match both [ and ] with grep, but have only succeeded in matching [. No matter what I try, I can't seem to get it to match ].
Here's a code sample:
echo "fdsl[]" | grep -o "[ a-z]\+" #this prints fdsl
echo "fdsl[]" | grep -o "[ \[a-z]\+" #this prints fdsl[
echo "fdsl[]" | grep -o "[ \]a-z]\+" #this prints nothing
echo "fdsl[]" | grep -o "[ \[\]a-z]\+" #this prints nothing
Edit: My original regex, on which I need to do this, is this one:
echo "fdsl[]" | grep -o "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]\+"
#this prints nothing
N.B: I have tried all the answers from this post but that didn't work on this particular case. And I need to use those brackets inside [].

According to the BRE/ERE Bracket Expression section of the POSIX regex specification:
[...] The right-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
and
[...] If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.
Therefore, your regex should be:
echo "fdsl[]" | grep -Eo "[][ a-z]+"
Note the -E flag, which tells grep to use ERE, where the + quantifier is supported. The + quantifier is not available in BRE (the default mode).
The solution in Mike Holt's answer, "[][a-z ]\+" with an escaped +, works because it's run on GNU grep, which extends the grammar so that \+ means "repeat once or more". It's actually undefined behavior according to the POSIX standard (which means that the implementation can give meaningful behavior and document it, throw a syntax error, or do anything else).
If you are fine with assuming that your code will only run in a GNU environment, then it's totally fine to use Mike Holt's answer. Using sed as an example: you are stuck with BRE when you use POSIX sed (there is no flag to switch over to ERE), and it's cumbersome to write even simple regular expressions in POSIX BRE, where + is not available (only * and the \{m,n\} interval are defined).
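If you must stay within strict POSIX BRE, one hedged workaround is the interval expression \{1,\}, which repeats the preceding item one or more times and is defined by the standard:
echo "fdsl[]" | grep -o "[][ a-z]\{1,\}" #this should print fdsl[]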
Original regex
Note that grep consumes the input file line by line, then checks whether each line matches the regex. Therefore, even if you use the -P flag with your original regex, \n is always redundant, as the regex can't match across lines.
While it is possible to match a horizontal tab without the -P flag, I think it is more natural to use the -P flag for this task.
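For completeness, a hedged sketch of the no -P route: the POSIX [:blank:] class matches a space or a tab inside a bracket expression:
echo -e "fds\tl[]" | grep -Eo "[][[:blank:]a-z]+" #this should print fds, a tab, then l[]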
Given this input:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89"
fds l[]kSAJD<>?,./:";'{}|[]\!##$%^&*()_+-=~`89
The original regex in the question works with little modification (unescape + at the end):
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
Though we can remove \n (since it is redundant, as explained above), and a few other unnecessary escapes:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\ta-zA-Z/:.0-9_~\"'+,;*=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89

One issue is that [ is a special character in the expression and cannot be escaped with \ (at least not in the grep flavors I have tried). The solution is to define it as [[].
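For example (a quick sketch), a bracket expression containing nothing but a left bracket:
echo "fdsl[]" | grep -o "[[]" #this prints [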

According to regular-expressions.info:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
... and ...
The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
So, assuming that the particular flavor of regular expression syntax supported by grep conforms to this, I would have expected "[ a-z[\]]\+" to work.
However, my version of grep (GNU grep 2.14) only matches the "[]" at the end of "fdsl[]" with this regex.
So I tried the other technique mentioned in that quote (putting the ] in a position within the character class where it cannot take on its special meaning), and it seems to have worked:
$ echo "fdsl[]" | grep -o "[][a-z ]\+"
fdsl[]

Related

In bash/sed, how do you match on a lowercase letter followed by the SAME letter in uppercase?

I want to delete all instances of "aA", "bB" ... "zZ" from an input string.
e.g.
echo "foObar" |
sed -Ee 's/([a-z])\U\1//'
should output "fbar"
But the \U syntax only works in the latter half (the replacement part) of the sed expression - it fails to resolve in the matching clause.
I'm having difficulty converting the matched character to upper case to reuse in the matching clause.
If anyone could suggest a working regex which can be used in sed (or awk) that would be great.
Scripting solutions in pure shell are ok too (I'm trying to think of solving the problem this way).
Working PCRE (Perl-compatible regular expressions) are ok too but I have no idea how they work so it might be nice if you could provide an explanation to go with your answer.
Unfortunately, I don't have perl or python installed on the machine that I am working with.
You may use the following perl solution:
echo "foObar" | perl -pe 's/([a-z])(?!\1)(?i:\1)//g'
See the online demo.
Details
([a-z]) - Group 1: a lowercase ASCII letter
(?!\1) - a negative lookahead that fails the match if the next char is the same as captured with Group 1
(?i:\1) - the same char as captured with Group 1, but in a different case (which the preceding lookahead guarantees).
The -e option lets you supply the Perl code to execute directly on the command line, and the -p option prints the contents of $_ each time around the implicit input loop.
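To see why the lookahead matters, here is a hedged illustration: without (?!\1), a same-case pair such as "oo" would also be stripped:
echo "foobar" | perl -pe 's/([a-z])(?i:\1)//g' #prints fbar - the "oo" is wrongly removed
echo "foobar" | perl -pe 's/([a-z])(?!\1)(?i:\1)//g' #prints foobar - left intact, as intended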
This might work for you (GNU sed):
sed -r 's/aA|bB|cC|dD|eE|fF|gG|hH|iI|jJ|kK|lL|mM|nN|oO|pP|qQ|rR|sS|tT|uU|vV|wW|xX|yY|zZ//g' file
A programmatic solution:
sed 's/[[:lower:]][[:upper:]]/\n&/g;s/\n\(.\)\1//ig;s/\n//g' file
This marks each lower-case character followed by an upper-case character by inserting a newline in front of the pair. It then removes the marker together with any pair whose characters match via a back reference, irrespective of case. Finally, any remaining newlines are removed, leaving untouched the pairs whose letters are not the same.
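A hedged way to try it on the sample string from the question (GNU sed assumed, since \n in the replacement and the case-insensitive s/// flag are GNU extensions):
echo "foObar" | sed 's/[[:lower:]][[:upper:]]/\n&/g;s/\n\(.\)\1//ig;s/\n//g' #this should print fbar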
Here is a verbose awk solution as OP doesn't have perl or python available:
echo "foObar" |
awk -v ORS= -v FS='' '{
for (i=2; i<=NF; i++) {
if ($(i-1) == tolower($i) && $i ~ /[A-Z]/ && $(i-1) ~ /[a-z]/) {
i++
continue
}
print $(i-1)
}
print $(i-1)
}'
fbar
There's an easy lex for this,
%option main 8bit
	#include <ctype.h>
%%
[[:lower:]][[:upper:]] if ( toupper(yytext[0]) != yytext[1] ) ECHO;
(that's a tab before the #include, markdown loses those). Just put that in e.g. that.l and then make that. Easy-peasy lexes are a nice addition to your toolkit.
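A hedged build-and-run sketch, assuming flex is installed and relying on make's built-in .l rules (the LEX=flex override is an assumption; plain lex may not accept %option):
$ make that LEX=flex
$ echo "foObar" | ./that
fbar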
Note: This solution is (unsurprisingly) slow, based on OP's feedback:
"Unfortunately, due to the multiple passes - it makes it rather slow. "
If there is a character sequence¹ that you know won't ever appear in the input, you could use a 3-stage replacement to accomplish this with sed:
echo 'foObar foobAr' | sed -E -e 's/([a-z])([A-Z])/KEYWORD\1\l\2/g' -e 's/KEYWORD(.)\1//g' -e 's/KEYWORD(.)(.)/\1\u\2/g'
gives you: fbar foobAr
Replacement stages explained:
Look for lowercase letters followed by ANY uppercase letter and replace them with both letters as lowercase with the KEYWORD in front of them foObar foobAr -> fKEYWORDoobar fooKEYWORDbar
Remove KEYWORD followed by two identical characters (both are lowercase now, so the back-reference works) fKEYWORDoobar fooKEYWORDbar -> fbar fooKEYWORDbar
Strip remaining² KEYWORD from the output and convert the second character after it back to its original, uppercase version fbar fooKEYWORDbar -> fbar foobAr
¹ In this example I used KEYWORD for demonstration purposes. A single character or at least shorter character sequence would be better/faster. Just make sure to pick something that cannot possibly ever be in the input.
² The remaining occurrences are those where the lowercase versions of the letters were not identical, so we have to revert them back to their original state.
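The intermediate outputs described in the stages above can be checked by running the stages one at a time (a hedged walk-through; GNU sed is assumed for \l and \u):
$ echo 'foObar foobAr' | sed -E 's/([a-z])([A-Z])/KEYWORD\1\l\2/g'
fKEYWORDoobar fooKEYWORDbar
$ echo 'foObar foobAr' | sed -E -e 's/([a-z])([A-Z])/KEYWORD\1\l\2/g' -e 's/KEYWORD(.)\1//g'
fbar fooKEYWORDbar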

Can't understand this awk regex

I'm trying to understand a particular line of code from a Unix talk, and can't seem to understand what the awk portion is doing.
The full line is: man ls | col -b | grep '^[[:space:]]*ls \[' | awk -F '[][]' '{print $2}'. The text passed to awk (if for some reason you don't have the man program) is: ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]. Somehow, awk is able to just pull out the list of options to ls, but I can't really understand how this regex [][] actually works & what it matches for.
My best guess is that the outer brackets denote a character class whose contents contain ][. If that's the case, why can't the inner brackets be written as []? Is it because pairs of brackets [[]] have a different meaning in awk?
Thanks in advance!
In POSIX regular expressions [...] is called a bracket expression.
It is very similar to a character class in other regex flavors. One key difference is that the backslash is NOT a metacharacter in a POSIX bracket expression.
If you want to include [ and ] in a bracket expression, they need to be placed correctly, i.e. the ] right at the start and the [ anywhere after it.
As per the linked article:
To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ].
In your example:
awk -F '[][]' '...'
awk sets (input) field separator as single literal [ or ] character.
If you had [[]] it would mean a bracket expression [[] (containing just a literal [) followed by a literal ], so the field separator would be the two-character sequence []:
$ echo a[]b | awk -F'[[]]' '{print $2}'
b
But with the brackets the other way around:
$ echo a][b | awk -F'[][]' '{print $3}'
b
Now $2 is empty (it is the text between the ] and the [) and $3 == b.
Your hunch about character classes is correct. If you want certain characters to be field separators, then you can list them between brackets. Using awk -F '[abc]' ... would specify the a and b and c characters as separators. Order is irrelevant; you could use awk -F '[cab]' ... and get the same results.
But what if you want the separating characters to be left and right brackets themselves? The documentation for regular expressions (man re_format on many systems) says this:
To include a literal `]' in the list, make it the first character ...
Which makes sense, given how the expression will be parsed. As the parser is scanning the expression, it's looking for the end, the right bracket. It doesn't care about seeing another left bracket or a comma or a space or whatever, but a right bracket would mark the end unless there's some way to tell the parser to take it literally. Since brackets with nothing between them, [], would be useless, a right bracket as the first character is defined to mean something else: this can't be the end, so take this right-bracket literally.
So if you want brackets as field-separating characters, you list [ and ] between brackets, but you put the right bracket first in the list so it'll be taken literally, per the instructions: [][]
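To see the splitting in action, here is a hedged sample (the man-page line is shortened for readability):
$ echo 'ls [-ABCFGHLOPRSTUW] [file ...]' | awk -F '[][]' '{print $2}'
-ABCFGHLOPRSTUW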

Why does the order of replacing things matter in sed?

I have a file like this:
(paren)
[sharp]
And I try to replace like this:
sed "s/(/[/g" some_file.txt
And it works just fine:
[paren)
[sharp]
Then I try to replace like this:
sed "s/[/(/g" some_file.txt
And it gives me the error:
sed: 1: "s/[/(/g": unbalanced brackets ([])
I cannot find any evidence as to why this would error out. Why does the order of [ and ( matter?
Thank you very much.
The [ is a part of a bracket expression that must have a closing counterpart (]).
Escape the [ to match a literal [ symbol:
echo "[sharp]" | sed 's/\[/(/g'
See IDEONE demo
The reason it matters is because you're replacing a regex with a literal string.
So the bracket is viewed as a character when used after the second slash. It is viewed as part of an invalid regex when used between the first and second slash.
So in this expression the '[' is taken as a character:
s/(/[/g
In this expression it's not:
s/[/(/g
The first parameter in a replacement with sed must be a regex pattern: s/regex_pattern/replacement_string/
The opening square bracket has a special meaning in a regex pattern, since it is the beginning of a character class, for example [a-z]. That is why you obtain this error message that has nothing to do with the order of your replacements: unbalanced brackets ([]) (an opened character class must be closed.)
To obtain a literal opening square bracket, you need to escape it: \[
sed 's/\[/(/' file
If your goal is to translate characters into others, there is a simpler way, using a translation, which avoids the problem of circular replacements:
a='(paren)
[sharp]'
using tr
echo "$a" | tr '[]()' '()[]'
or with sed:
echo "$a" | sed 'y/[]()/()[]/'

How do spaces work in a regular expression with grep in the bash shell?

The way I would read the regular expression below:
a space char
a slash char
a 'n' char
zero or more space chars
end of line
But this test fails:
$ echo "Some Text \n " | grep " \\n *$"
If I delete the leading space from the regular expression, it does not fail:
$ echo "Some Text \n " | grep "\\n *$"
Some Text \n
Try this:
echo "Some Text \n " | grep ' \\n *$'
Note the single quotes. This serverfault question has more information about single vs. double quotes. With single quotes, the string is treated literally.
Here's an explanation. When you do:
echo "Test\n", you get Test\n as the output, because echo doesn't translate the escape sequences (unless you send it the -e flag).
echo "Test\n" | grep '\n', it only matches the n. This is because \n is an "escaped n" (it doesn't seem to translate into an actual newline). If you want it to match the \ and the n, you need to do echo "Test\n" | grep '\\n'.
When using regular expressions you have to be mindful of the context in which you are using them. Several characters are treated specially by the regular expression engine, and also by the mechanism you use to invoke it.
In your case you are using bash. Depending on how you quote things you may have to escape special characters twice. Once to prevent bash from interpreting the special character and once again to get the regex behavior you desire.
To solve problems like this you should first ask yourself, "What must the expression look like?" You must then also ask, "How must I prepare that expression so that the regular expression engine actually gets that pattern?" This involves understanding the effect that quoting has on the expression. Specifically, in this case, the difference between single quotes and double quotes (and the other less common quoting mechanisms).

How to look for lines which don't end with a certain character

How to look for lines which don't end with a ."
description="This has a full stop."
description="This has a full stop."
description="This line doesn't have a full stop"
You can use a character class to describe the occurrence of any character except .:
[^\n.](\n|$)
This will match any character that is neither a . nor a newline character, and that is either followed by a newline character or by the end of the string. If multiline mode is supported, you can also use just $ instead of (\n|$).
It depends on your environment. On Linux/Unix/Cygwin you would do something like this:
grep -n -v '\."$' <file.txt
or
grep -n -v '\."[[:space:]]*$' <file.txt
if trailing whitespace is fine.
I guess the regular expression pattern you are looking for is the following:
\."$
\. means a literal dot (as opposed to ., which means any character except \n).
" is the double quote that ends the line in your example.
$ means end of line.
The way you will use this pattern depends on the environment you are using, so give us more precision for a more precise answer :-)
In general, a regular expression matches; it is not easy to express a "don't match". The general solution for this kind of thing is to invert the truth value. For example:
grep: grep -v '\.$'
Perl: $line !~ /\.$/
Tcl: ![regexp {\.$} $line]
In this specific case, since it is just one character, you can use the character class syntax, [], which accepts a ^ modifier to signify anything other than the specified characters:
[^.]$
so, in Perl it would be something like:
$line =~ /[^.]$/
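For completeness, a hedged grep take on the same idea, assuming the three sample lines from the question are stored in file.txt (the file name is just for illustration): match lines whose character before the closing quote is not a dot:
$ grep '[^.]"$' file.txt
description="This line doesn't have a full stop"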