Can't understand this awk regex

I'm trying to understand a particular line of code from a Unix talk, and can't seem to understand what the awk portion is doing.
The full line is: man ls | col -b | grep '^[[:space:]]*ls \[' | awk -F '[][]' '{print $2}'. The text passed to awk (if for some reason you don't have the man program) is: ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]. Somehow, awk is able to just pull out the list of options to ls, but I can't really understand how this regex [][] actually works & what it matches for.
My best guess is that the outer brackets denote a character class whose contents contain ][. If that's the case, why can't the inner brackets be written as []. Is it because pairs of brackets [[]] have a different meaning in awk?
Thanks in advance!

In POSIX regular expressions [...] is called a bracket expression.
It is very similar to a character class in other regex flavors. One key difference is that the backslash is NOT a meta-character in a POSIX bracket expression.
If you want to include [ and ] in a bracket expression, they need to be placed correctly: ] right at the start, and [ anywhere after it.
As per the linked article:
To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ].
In your example:
awk -F '[][]' '...'
awk sets the (input) field separator to a single literal [ or ] character.
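To see what that separator does to the synopsis line from the question, here is a minimal sketch that prints every field awk produces. The variable names are only illustrative, and the sample line is hard-coded so that man is not needed.

```shell
# Synopsis line copied from the question, so `man ls` is not required.
synopsis='ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]'

# Split on any single [ or ] and show each field with its index.
printf '%s\n' "$synopsis" |
  awk -F '[][]' '{ for (i = 1; i <= NF; i++) printf "$%d = <%s>\n", i, $i }'

# $2 is the text between the first [ and the first ].
options=$(printf '%s\n' "$synopsis" | awk -F '[][]' '{ print $2 }')
printf '%s\n' "$options"
```

The loop makes the empty and whitespace-only fields visible, which helps when deciding which field number to print.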

If you had [[]] instead, it would be parsed as the bracket expression [[] (matching a literal [) followed by a literal ], so the field separator would be the two-character string []:
$ echo a[]b | awk -F'[[]]' '{print $2}'
b
With [][], each bracket is a separator on its own, which shows up when the brackets in the input are the other way around:
$ echo a][b | awk -F'[][]' '{print $3}'
b
Now $2 is empty and $3 is b, because ] and [ each act as a separator and there is nothing between them.

Your hunch about character classes is correct. If you want certain characters to be field separators, then you can list them between brackets. Using awk -F '[abc]' ... would specify the a and b and c characters as separators. Order is irrelevant; you could use awk -F '[cab]' ... and get the same results.
But what if you want the separating characters to be left and right brackets themselves? The documentation for regular expressions (man re_format on many systems) says this:
To include a literal `]' in the list, make it the first character ...
Which makes sense, given how the expression will be parsed. As the parser is scanning the expression, it's looking for the end, the right bracket. It doesn't care about seeing another left bracket or a comma or a space or whatever, but a right bracket would mark the end unless there's some way to tell the parser to take it literally. Since brackets with nothing between them, [], would be useless, a right bracket as the first character is defined to mean something else: this can't be the end, so take this right-bracket literally.
So if you want brackets as field-separating characters, you list [ and ] between brackets, but you put the right bracket first in the list so it'll be taken literally, per the instructions: [][]
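The same placement rule applies outside awk. As a small sketch (the sample text and variable names are made up for illustration), sed can use the list to delete the brackets, and the negated form [^][] to delete everything else:

```shell
text='ls [-l] [file ...]'

# ] first, then [: a list containing both brackets. Delete them.
no_brackets=$(printf '%s\n' "$text" | sed 's/[][]//g')

# Negated list: ^ first, then ], then [. Delete everything except brackets.
only_brackets=$(printf '%s\n' "$text" | sed 's/[^][]//g')

printf '%s\n' "$no_brackets" "$only_brackets"
```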

Related

Why does the order of replacing things matter in sed?

I have a file like this:
(paren)
[sharp]
And I try to replace like this:
sed "s/(/[/g" some_file.txt
And it works just fine:
[paren)
[sharp]
Then I try to replace like this:
sed "s/[/(/g" some_file.txt
And it gives me the error:
sed: 1: "s/[/(/g": unbalanced brackets ([])
I cannot find any evidence as to why this would error out. Why does the order of [ and ( matter?
Thank you very much.
The [ is a part of a bracket expression that must have a closing counterpart (]).
Escape the [ to match a literal [ symbol:
echo "[sharp]" | sed 's/\[/(/g'
The order matters because you're replacing a regex match with a literal string.
So the bracket is viewed as a character when used after the second slash. It is viewed as part of an invalid regex when used between the first and second slash.
So in this expression the '[' is taken as a character:
s/(/[/g
In this expression it's not:
s/[/(/g
The first parameter in a replacement with sed must be a regex pattern: s/regex_pattern/replacement_string/
The opening square bracket has a special meaning in a regex pattern, since it begins a character class, for example [a-z]. That is why you get this error message, which has nothing to do with the order of your replacements: unbalanced brackets ([]) (an opened character class must be closed).
To obtain a literal opening square bracket, you need to escape it: \[
sed 's/\[/(/' file
If your goal is to translate characters into others, there is a simpler way, using transliteration, which also avoids the problem of circular replacements:
a='(paren)
[sharp]'
Using tr:
echo "$a" | tr '[]()' '()[]'
or with sed:
echo "$a" | sed 'y/[]()/()[]/'
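To illustrate the circular-replacement problem, the sketch below swaps ( with [ in two ways on the sample input. The variable names are only for illustration. Chaining two s commands lets the second one undo part of the first, whereas y maps all characters in a single pass.

```shell
input='(paren)
[sharp]'

# Two substitutions in sequence: the second one also rewrites the [
# characters that the first one has just produced.
chained=$(printf '%s\n' "$input" | sed 's/(/[/g; s/\[/(/g')

# y maps every character in one pass, so nothing is replaced twice.
swapped=$(printf '%s\n' "$input" | sed 'y/[]()/()[]/')

printf '%s\n' "$chained" "$swapped"
```

Note that y takes plain character lists, not regular expressions, so the brackets need no special placement there.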

regex match square brackets once

There is a text file with the following info:
[[parent]]
[son]
[daughter]
How to get only [son] and [daughter]?
$0 ~ /\[([a-z])*\]/ ???
Your regex is almost right. Just put the * inside the round brackets (in order to have the whole text inside the only group) and remember to use the ^ and $ delimiters (to avoid matching [[parent]]):
^\[([a-z]*)\]$
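One possible way to apply this anchored pattern in awk is sketched below. The capture group is dropped because a plain awk pattern does not expose groups, so substr is used to strip the outer brackets instead. The sample data comes from the question and the variable names are illustrative.

```shell
sample='[[parent]]
[son]
[daughter]'

# Keep lines that are exactly [lowercase-letters]; strip the outer brackets.
names=$(printf '%s\n' "$sample" |
  awk '/^\[[a-z]*\]$/ { print substr($0, 2, length($0) - 2) }')

printf '%s\n' "$names"
```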
Match an opening square bracket at the beginning of the line where the next character is a lowercase letter.
awk '/^\[[a-z]/' file
You might want to add uppercase and/or numbers to the character class, depending on what your real requirements are. (Your examples show only lowercase, so I have assumed that's a valid generalization.)
You can use this awk command:
awk -F '[][]+' 'NF && !/\[\[/{print $2}' file
son
daughter
awk command breakup:
-F '[][]+' # set input field separator as 1 or more of [ or ]
NF # only if at least one field is found
!/\[\[/ # when input doesn't contain [[

sed regex : remove occurrences of [ and ] from string

I am using Linux Mint 17.
I am able to replace everything from the test string except [ and ] (opening and closing square brackets).
Here is my expression, which is within a bash script, which I run from the command line
video_title=$(echo $video_title | sed 's|[?![]]||g')
I have tried placing \ before both square brackets and this does not work.
If I remove the [ and the ] the expression replaces the ? and the ! just fine.
Anyone any ideas?
According to the manual, to include ] in the list, it needs to be the first character.
A leading `^' reverses the meaning of LIST, so that it matches any
single character _not_ in LIST. To include `]' in the list, make
it the first character (after the `^' if needed), to include `-'
in the list, make it the first or last; to include `^' put it
after the first character.
So try something like this:
$ echo '[!2015?]' | sed 's|[][?!]||g'
2015
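To show why the expression in the question removes nothing, here is a small comparison of it with the corrected list. The sample title is made up for illustration.

```shell
title='What? [HD] Really!'

# Original attempt: parsed as the list [?![] followed by a literal ],
# so it only matches ?], !] or [] and leaves this title untouched.
broken=$(printf '%s\n' "$title" | sed 's|[?![]]||g')

# ] placed first in the list: all four characters are removed.
fixed=$(printf '%s\n' "$title" | sed 's|[][?!]||g')

printf '%s\n' "$broken" "$fixed"
```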
For just deleting characters from the input tr is a more appropriate tool than sed. In your case you can just use
video_title=$(echo "$video_title" | tr -d '?![]')
You may try this,
$ echo 'foo?![bar]' | sed 's~[?!]\|\]\|\[~~g'
foobar

How can I match square bracket in regex with grep?

I am trying to match both [ and ] with grep, but only succeeded to match [. No matter how I try, I can't seem to get it right to match ].
Here's a code sample:
echo "fdsl[]" | grep -o "[ a-z]\+" #this prints fdsl
echo "fdsl[]" | grep -o "[ \[a-z]\+" #this prints fdsl[
echo "fdsl[]" | grep -o "[ \]a-z]\+" #this prints nothing
echo "fdsl[]" | grep -o "[ \[\]a-z]\+" #this prints nothing
Edit: My original regex, on which I need to do this, is this one:
echo "fdsl[]" | grep -o "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]\+"
#this prints nothing
N.B: I have tried all the answers from this post but that didn't work on this particular case. And I need to use those brackets inside [].
According to BRE/ERE Bracketed Expression section of POSIX regex specification:
[...] The right-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
and
[...] If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.
Therefore, your regex should be:
echo "fdsl[]" | grep -Eo "[][ a-z]+"
Note the -E flag, which selects ERE, where the + quantifier is supported. The + quantifier is not supported in BRE (the default mode).
The solution in Mike Holt's answer "[][a-z ]\+" with escaped + works because it's run on GNU grep, which extends the grammar to support \+ to mean repeat once or more. It's actually undefined behavior according to POSIX standard (which means that the implementation can give meaningful behavior and document it, or throw a syntax error, or whatever).
If you are fine with assuming that your code will only run in a GNU environment, then it's totally fine to use Mike Holt's answer. Using sed as an example: POSIX sed has traditionally had no flag to switch over to ERE, so you are stuck with BRE, where the only quantifiers defined are * and the interval expression \{m,n\}, which makes even simple regular expressions cumbersome to write.
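As a sketch of a more portable spelling: POSIX BRE does define the interval expression \{m,n\}, so "one or more" can be written as \{1,\} without relying on the GNU \+ extension. The -o option is kept from the question and is assumed to be available (GNU and BSD grep both provide it).

```shell
input='fdsl[]'

# ERE: + is a standard quantifier.
ere=$(printf '%s\n' "$input" | grep -Eo '[][ a-z]+')

# BRE: the interval \{1,\} means "one or more" without relying on \+.
bre=$(printf '%s\n' "$input" | grep -o '[][ a-z]\{1,\}')

printf '%s\n' "$ere" "$bre"
```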
Original regex
Note that grep consumes the input file line by line, then checks whether the line matches the regex. Therefore, even if you use the -P flag with your original regex, \n is always redundant, as the regex can't match across lines.
While it is possible to match a horizontal tab without the -P flag, I think it is more natural to use the -P flag for this task.
Given this input:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89"
fds l[]kSAJD<>?,./:";'{}|[]\!##$%^&*()_+-=~`89
The original regex in the question works with little modification (unescape + at the end):
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
Though we can remove \n (since it is redundant, as explained above), and a few other unnecessary escapes:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\ta-zA-Z/:.0-9_~\"'+,;*=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
One issue is that [ is a special character in a regular expression and cannot be escaped with \ inside a bracket expression (at least not in my flavors of grep). The solution is to write it as [[].
According to regular-expressions.info:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
... and ...
The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
So, assuming that the particular flavor of regular expressions syntax supported by grep conforms to this, then I would have expected that "[ a-z[\]]\+" should have worked.
However, my version of grep (GNU grep 2.14) only matches the "[]" at the end of "fdsl[]" with this regex.
However, I tried the other technique mentioned in that quote (putting the ] in a position within the character class where it cannot take on its normal meaning), and it seems to have worked:
$ echo "fdsl[]" | grep -o "[][a-z ]\+"
fdsl[]
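The difference between the two attempts can be reproduced side by side. In grep's POSIX bracket expressions the backslash is an ordinary character, so in the escaped version the list ends at the first ] and the second ] is a literal outside it. This sketch uses the BRE interval \{1,\} in place of \+ and assumes grep supports -o, as in the question.

```shell
input='fdsl[]'

# The backslash is an ordinary list member here: the list is [ a-z[\]
# and the ] after it is a literal that the interval repeats.
escaped=$(printf '%s\n' "$input" | grep -o '[ a-z[\]]\{1,\}')

# With ] first, both brackets are members of the list.
positional=$(printf '%s\n' "$input" | grep -o '[][a-z ]\{1,\}')

printf '%s\n' "$escaped" "$positional"
```

The first pattern therefore means "a space, a lowercase letter, a [ or a backslash, followed by one or more ]", which is why only the trailing [] matches.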

sed regex pattern for some tricky line (ini section)

I want parse something like a section entry in an *.ini file:
line=' [ fdfd fdf f ] '
What could be the sed pattern (???) for this line to split the
'fdfd fdf f'
out?
So:
echo "${line}" | sed -E 's/???/\1/g'
How can I describe all chars except [[:space:]], [ and ] ? This doesn't work for me: [^[[:space:]]\[]* .
When you use the [[:space:]] syntax, the outer brackets are normal "match one character from this list" brackets, the same as in [aeiou] but the inner brackets are part of [:space:] which is an indivisible unit.
So if you wanted to match a single character which either belongs to the space class or is an x you'd use [[:space:]x] or [x[:space:]]
When one of the characters you want to match is a ], it will terminate the bracketed character list unless you give it some special treatment. You've guessed that you need a backslash somewhere; a good guess but wrong. The way you include a ] in the list is to put it first. [ab]c] is a bracketed list containing the 2 characters ab, followed by 2 literal-match characters c], so it matches "ac]" or "bc]" but []abc] is a bracketed list of the 4 characters ]abc so it matches "a", "b", "c", or "]".
In a negated list the ] comes immediately after the ^.
So putting that all together, the way to match a single char from the set of all chars except the [:space:] class and the brackets is:
[^][:space:][]
The first bracket and the last bracket are a matching pair, even if you think it doesn't look like they should be.
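A quick way to check how that expression is parsed is to let sed mark what it matches. This is only a sketch using the line from the question. The X marker and the variable names are arbitrary choices.

```shell
line=' [ fdfd fdf f ] '

# Negated list: every character that is neither whitespace nor a bracket
# is replaced with X, so the untouched characters show what was excluded.
marked=$(printf '%s\n' "$line" | sed 's/[^][:space:][]/X/g')

# Same members without ^: whitespace and brackets are deleted instead.
stripped=$(printf '%s\n' "$line" | sed 's/[][:space:][]//g')

printf '%s\n' "$marked" "$stripped"
```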
$ echo "$line" | sed "s/^.*\[[[:space:]]*\([^]]*[^][:space:]]\)[[:space:]]*\].*$/'\1'/"
You can split the pattern into two:
$ echo "$line" | sed "s/^.*\[[[:space:]]*/'/; s/[[:space:]]*\].*$/'/"
awk works too:
$ echo "$line" | awk -F' *[[\]] *' -vQ="'" '{print Q$2Q}'
When you say split, do you mean split into an array, or do you mean filter out all spaces and brackets?
Assuming the value of line number 1 in file.ini is:
[ fdfd fdf f ]
If you mean array,
linenumber=1
array=($(sed -n "${linenumber}p" file.ini | sed 's/[][]*//g'))
will split line number 1 of file.ini into an array and return the values:
${array[0]} = fdfd
${array[1]} = fdf
${array[2]} = f
If you mean filter spaces and brackets,
linenumber=1
sed -n "${linenumber}p" file.ini | sed 's/[][ ]*//g'
will return:
fdfdfdff
and if neither of those is what you meant, please specify the exact output you are looking to extract from the initial value so that we can address it correctly.