sed regex: remove occurrences of [ and ] from string

I am using Linux Mint 17.
I am able to replace everything from the test string except [ and ] (opening and closing square brackets).
Here is my expression, which is within a bash script, which I run from the command line
video_title=$(echo $video_title | sed 's|[?![]]||g')
I have tried placing \ before both square brackets and this does not work.
If I remove the [ and the ] the expression replaces the ? and the ! just fine.
Anyone any ideas?

According to the manual, to include ] in the list, it needs to be the first character.
A leading `^' reverses the meaning of LIST, so that it matches any
single character _not_ in LIST. To include `]' in the list, make
it the first character (after the `^' if needed), to include `-'
in the list, make it the first or last; to include `^' put it
after the first character.
So try something like this:
$ echo '[!2015?]' | sed 's|[][?!]||g'
2015
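To see the ordering rule in action, here is a small sketch comparing the bracket expression above with the one from the question. The sample title is made up for illustration.

```shell
# Hypothetical sample title, not taken from the question.
video_title='What? [Part 1] Really!'

# ']' comes first inside the brackets, so it is taken literally.
cleaned=$(printf '%s\n' "$video_title" | sed 's|[][?!]||g')
printf '%s\n' "$cleaned"

# The pattern from the question parses as the list "?![" followed by a
# literal ']', so it only matches one of ? ! [ when a ']' comes right after it.
broken=$(printf '%s\n' "$video_title" | sed 's|[?![]]||g')
printf '%s\n' "$broken"
```

With this input the first command should print the title without any of the four characters, while the second leaves it untouched because no ?, ! or [ is directly followed by a ].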

For just deleting characters from the input, tr is a more appropriate tool than sed. In your case you can just use
video_title=$(echo "$video_title" | tr -d '?![]')

You may try this,
$ echo 'foo?![bar]' | sed 's~[?!]\|\]\|\[~~g'
foobar

Related

Can't understand this awk regex

I'm trying to understand a particular line of code from a Unix talk, and can't seem to understand what the awk portion is doing.
The full line is: man ls | col -b | grep '^[[:space:]]*ls \[' | awk -F '[][]' '{print $2}'. The text passed to awk (if for some reason you don't have the man program) is: ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]. Somehow, awk is able to just pull out the list of options to ls, but I can't really understand how this regex [][] actually works & what it matches for.
My best guess is that the outer brackets denote a character class whose contents contain ][. If that's the case, why can't the inner brackets be written as []. Is it because pairs of brackets [[]] have a different meaning in awk?
Thanks in advance!
In POSIX regular expressions [...] is called a bracket expression.
It is very similar to a character class in other regex flavors. One key difference is that the backslash is NOT a meta-character in a POSIX bracket expression.
If you want to include [ and ] in a bracket expression, they need to be placed correctly, i.e. ] right at the start and [ anywhere after it.
As per the linked article:
To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ].
In your example:
awk -F '[][]' '...'
This sets the (input) field separator to a single literal [ or ] character.
If you had [[]] instead, it would be parsed as the bracket expression [[] (a list containing only [) followed by a literal ], so the field separator would be the two-character string []:
$ echo 'a[]b' | awk -F'[[]]' '{print $2}'
b
With the brackets the other way around, each bracket is a separator on its own:
$ echo 'a][b' | awk -F'[][]' '{print $3}'
b
Now $2 is empty (it is the empty string between ] and [) and $3 is b.
Your hunch about character classes is correct. If you want certain characters to be field separators, then you can list them between brackets. Using awk -F '[abc]' ... would specify the a and b and c characters as separators. Order is irrelevant; you could use awk -F '[cab]' ... and get the same results.
But what if you want the separating characters to be left and right brackets themselves? The documentation for regular expressions (man re_format on many systems) says this:
To include a literal `]' in the list, make it the first character ...
Which makes sense, given how the expression will be parsed. As the parser is scanning the expression, it's looking for the end, the right bracket. It doesn't care about seeing another left bracket or a comma or a space or whatever, but a right bracket would mark the end unless there's some way to tell the parser to take it literally. Since brackets with nothing between them, [], would be useless, a right bracket as the first character is defined to mean something else: this can't be the end, so take this right-bracket literally.
So if you want brackets as field-separating characters, you list [ and ] between brackets, but you put the right bracket first in the list so it'll be taken literally, per the instructions: [][]
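As a rough check of that reading, the synopsis line from the question can be fed to awk directly, without man or col. The variable names below are only for illustration.

```shell
# Synopsis line quoted in the question.
synopsis='ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]'

# -F '[][]' makes every single '[' or ']' a field separator, so the fields are:
#   $1 = "ls ", $2 = the option letters, $3 = " ", $4 = "file ...", $5 = ""
options=$(printf '%s\n' "$synopsis" | awk -F '[][]' '{print $2}')
operand=$(printf '%s\n' "$synopsis" | awk -F '[][]' '{print $4}')

printf '%s\n' "$options" "$operand"
```

Field 2 is the option list because it is the text between the first [ and the first ].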

Why does the order of replacing things matter in sed?

I have a file like this:
(paren)
[sharp]
And I try to replace like this:
sed "s/(/[/g" some_file.txt
And it works just fine:
[paren)
[sharp]
Then I try to replace like this:
sed "s/[/(/g" some_file.txt
And it gives me the error:
sed: 1: "s/[/(/g": unbalanced brackets ([])
I cannot find any evidence as to why this would error out. Why does the order of [ and ( matter?
Thank you very much.
The [ is a part of a bracket expression that must have a closing counterpart (]).
Escape the [ to match a literal [ symbol:
echo "[sharp]" | sed 's/\[/(/g'
The reason it matters is because you're replacing a regex with a literal string.
So the bracket is viewed as a character when used after the second slash. It is viewed as part of an invalid regex when used between the first and second slash.
So in this expression the '[' is taken as a character:
s/(/[/g
In this expression it's not:
s/[/(/g
The first parameter in a replacement with sed must be a regex pattern: s/regex_pattern/replacement_string/
The opening square bracket has a special meaning in a regex pattern, since it is the beginning of a character class, for example [a-z]. That is why you obtain this error message that has nothing to do with the order of your replacements: unbalanced brackets ([]) (an opened character class must be closed.)
To obtain a literal opening square bracket, you need to escape it: \[
sed 's/\[/(/' file
If your goal is to translate characters into others, there is a simpler way, using a translation, that avoids the problem of circular replacements:
a='(paren)
[sharp]'
using tr
echo "$a" | tr '[]()' '()[]'
or with sed:
echo "$a" | sed 'y/[]()/()[]/'
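The "circular replacement" problem can be sketched like this, assuming the same two-line input. Chaining two s commands rewrites the same character twice, while y maps every character in one pass.

```shell
a='(paren)
[sharp]'

# Two chained substitutions: the second one also rewrites the '[' that the
# first one just produced, so every opening bracket ends up as '('.
# Note the '[' is escaped on the pattern side but not on the replacement side.
chained=$(printf '%s\n' "$a" | sed -e 's/(/[/g' -e 's/\[/(/g')

# y/// translates character by character in a single pass.
swapped=$(printf '%s\n' "$a" | sed 'y/[]()/()[]/')

printf '%s\n' "$chained" "$swapped"
```

The chained version should give (paren) and (sharp], which is not a swap, whereas the y version gives [paren] and (sharp).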

How to match until the last occurrence of a character in bash shell

I am using curl and cut on an output like below.
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | grep - | head -n1 | cut -d'-' -f -1, -3)
The variable var gets one of two kinds of values (one at a time).
HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b
HIX-R1-1-3b5126629f67892110165c524gbc5d5g1808c9b5
I am actually trying to get everything until the last '-', i.e. HIX_MAIN or HIX-R1-1.
The command shown works fine to get HIX-R1-1.
But I figured this is the wrong way to do it when there is only one - in the variable; it gets me the entire variable value (e.g. HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b).
How do I go about getting everything up to the last '-' into the variable var?
This removes everything from the last - to the end:
sed 's/\(.*\)-.*/\1/'
As examples:
$ echo HIX_MAIN-7ae52 | sed 's/\(.*\)-.*/\1/'
HIX_MAIN
$ echo HIX-R1-1-3b5126629f67 | sed 's/\(.*\)-.*/\1/'
HIX-R1-1
How it works
The sed substitute command has the form s/old/new/ where old is a regular expression. In this case, the regex is \(.*\)-.*. This works because \(.*\)- is greedy: it will match everything up to the last -. Because of the escaped parens,\(...\), everything before the last - will be saved in group 1 which we can refer to as \1. The final .* matches everything after the last -. Thus, as long as the line contains a -, this regex matches the whole line and the substitute command replaces the whole line with \1.
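A quick sketch of the "as long as the line contains a -" caveat. The helper name strip_last_dash is made up for this example.

```shell
# Hypothetical helper, for illustration only.
strip_last_dash() {
    printf '%s\n' "$1" | sed 's/\(.*\)-.*/\1/'
}

with_dashes=$(strip_last_dash 'HIX-R1-1-3b5126629f67')
one_dash=$(strip_last_dash 'HIX_MAIN-7ae52')
# No '-' at all: the regex does not match, so sed prints the line unchanged.
no_dash=$(strip_last_dash 'HIX_MAIN')

printf '%s\n' "$with_dashes" "$one_dash" "$no_dash"
```

So input without any dash passes through as-is rather than being emptied.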
You can use bash string manipulation:
$ foo=a-b-c-def-ghi
$ echo "${foo%-*}"
a-b-c-def
The operators, # and % are on either side of $ on a QWERTY keyboard, which helps to remember how they modify the variable:
#pattern trims off the shortest prefix matching "pattern".
##pattern trims off the longest prefix matching "pattern".
%pattern trims off the shortest suffix matching "pattern".
%%pattern trims off the longest suffix matching "pattern".
where pattern matches the bash pattern matching rules, including ? (one character) and * (zero or more characters).
Here, we're trimming off the shortest suffix matching the pattern -*, so ${foo%-*} will get you what you want.
Of course, there are many ways to do this using awk or sed, possibly reusing the sed command you're already running. Variable manipulation, however, can be done natively in bash without launching another process.
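For comparison, here are all four operators applied to the same string, followed by the two values from the question. The variable names are only for illustration. These expansions are plain POSIX shell, so they are not limited to bash.

```shell
foo=a-b-c-def-ghi

shortest_suffix=${foo%-*}     # trims "-ghi"
longest_suffix=${foo%%-*}     # trims "-b-c-def-ghi"
shortest_prefix=${foo#*-}     # trims "a-"
longest_prefix=${foo##*-}     # trims "a-b-c-def-"

printf '%s\n' "$shortest_suffix" "$longest_suffix" "$shortest_prefix" "$longest_prefix"

# Applied to the two values from the question:
v1=HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b
v2=HIX-R1-1-3b5126629f67892110165c524gbc5d5g1808c9b5
r1=${v1%-*}
r2=${v2%-*}
printf '%s\n' "$r1" "$r2"
```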
You can reverse the string with rev, cut from the second field and then rev again:
rev <<< "$VARIABLE" | cut -d"-" -f2- | rev
For HIX-R1-1----3b5126629f67892110165c524gbc5d5g1808c9b5, prints:
HIX-R1-1---
I think you should be using sed, at least after the tr:
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | sed -n '/-/{s/-[^-]*$//;p;q}')
The -n means "don't print by default". The /-/ looks for a line containing a dash; it then executes s/-[^-]*$// to delete the last dash and everything after it, followed by p to print and q to quit (so it only prints the first such line).
I'm assuming that the output from curl intrinsically contains multiple lines, some of them with unwanted double quotes in them, and that you need to match only the first line that contains a dash at all (which might very well not be the first line). Once you've whittled the input down to the sole interesting line, you could use pure shell techniques to get the result that's desired, but getting the sole interesting line is not as trivial as some of the answers seem to be assuming.
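To try this without the real endpoint, the curl output can be replaced by a fixed string. The JSON shape below is purely an assumption for illustration; the actual response is not shown in the question. The extra ; before } is only there so the command is also accepted by sed implementations that require it.

```shell
# Stand-in for the curl output; this JSON layout is an assumption.
fake_response='{"build":{"version":"HIX-R1-1-3b5126629f67892110165c524gbc5d5g1808c9b5","name":"demo"}}'

# tr puts each quoted token on its own line; sed then acts on the first
# line containing a dash, strips the last dash and what follows, and quits.
var=$(printf '%s\n' "$fake_response" | tr '"' '\n' | sed -n '/-/{s/-[^-]*$//;p;q;}')

printf '%s\n' "$var"
```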

Extract string located after or between matched pattern(s)

Given a string "pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words"
Is it possible to extract string between "crop=" and the following space using bash and grep?
So if I match "crop=" how can I extract anything after it and before the following white space?
Basically, I need "720:568:0:4" to be printed.
I'd do it this way:
grep -o -E 'crop=[^ ]+' | sed 's/crop=//'
It uses sed which is also a standard command. You can, of course, replace it with another sequence of greps, but only if it's really needed.
I would use sed as follows:
echo "pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words" | sed 's/.*crop=\([0-9.:]*\)\(.*\)/\1/'
Explanation:
s/ : substitute
.*crop= : everything up to and including "crop="
\([0-9.:]*\) : match only numbers and '.' and ':' - I call this the backslash-bracketed expression
\(.*\) : match 'everything else' (probably not needed)
/\1/ : and replace with the first backslash-bracketed expression you found
Plain awk has no \1 backreferences, so an awk version needs match() and substr() instead:
awk '{ if (match($0, /crop=[0-9:]+/)) print substr($0, RSTART + 5, RLENGTH - 5) }'
yet another way with bash pattern substitution
PAT="pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words"
RES=${PAT#*crop=}
echo "${RES%% *}"
The first expansion (#) removes the shortest prefix matching *crop=, i.e. everything up to and including the first crop=.
The second (%%) removes the longest suffix matching ' *', i.e. everything from the first space onward.
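Here is a self-contained sketch of the two-step expansion, with the grep/sed pipeline from the earlier answer next to it as a cross-check. One caveat worth noting: if the string contains no crop= at all, the first expansion leaves it unchanged and the second step then returns the first word, so a real script may want to test for crop= first.

```shell
PAT="pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words"

RES=${PAT#*crop=}   # now "720:568:0:4 some more words"
RES=${RES%% *}      # now "720:568:0:4"
printf '%s\n' "$RES"

# Cross-check with the grep/sed pipeline.
RES2=$(printf '%s\n' "$PAT" | grep -o -E 'crop=[^ ]+' | sed 's/crop=//')
printf '%s\n' "$RES2"
```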

sed regex pattern for some tricky line (ini section)

I want to parse something like a section entry in an *.ini file:
line=' [ fdfd fdf f ] '
What could be the sed pattern (???) for this line to split the
'fdfd fdf f'
out?
So:
echo "${line}" | sed -E 's/???/\1/g'
How can I describe all chars except [[:space:]], [ and ] ? This doesn't work for me: [^[[:space:]]\[]* .
When you use the [[:space:]] syntax, the outer brackets are normal "match one character from this list" brackets, the same as in [aeiou] but the inner brackets are part of [:space:] which is an indivisible unit.
So if you wanted to match a single character which either belongs to the space class or is an x you'd use [[:space:]x] or [x[:space:]]
When one of the characters you want to match is a ], it will terminate the bracketed character list unless you give it some special treatment. You've guessed that you need a backslash somewhere; a good guess but wrong. The way you include a ] in the list is to put it first. [ab]c] is a bracketed list containing the 2 characters ab, followed by 2 literal-match characters c], so it matches "ac]" or "bc]" but []abc] is a bracketed list of the 4 characters ]abc so it matches "a", "b", "c", or "]".
In a negated list the ] comes immediately after the ^.
So putting that all together, the way to match a single char from the set of all chars except the [:space:] class and the brackets is:
[^][:space:][]
The first bracket and the last bracket are a matching pair, even if you think it doesn't look like they should be.
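One way to convince yourself is to let grep print every run of characters matched by that list. This is only a sketch and assumes a grep that supports -o, as GNU and BSD grep do.

```shell
line=' [ fdfd fdf f ] '

# Each maximal run of characters that are neither brackets nor whitespace
# is printed on its own line.
words=$(printf '%s\n' "$line" | grep -oE '[^][:space:][]+')
printf '%s\n' "$words"

# The non-negated counterpart deletes exactly the brackets and whitespace.
squeezed=$(printf '%s\n' "$line" | sed 's/[][[:space:]]//g')
printf '%s\n' "$squeezed"
```

In both expressions the ] sits first in the list (right after the ^ in the negated one), which is what keeps it from closing the bracket expression early.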
$ echo "$line" | sed "s/^.*\[[[:space:]]*\([^]]*[^][:space:]]\)[[:space:]]*\].*$/'\1'/"
You can split the pattern into two:
$ echo "$line" | sed "s/^.*\[[[:space:]]*/'/; s/[[:space:]]*\].*$/'/"
awk works too:
$ echo "$line" | awk -F' *[[\]] *' -vQ="'" '{print Q$2Q}'
When you say split, do you mean split into an array, or do you mean filter out all spaces and brackets?
Assuming the value of line number 1 in file.ini is:
[ fdfd fdf f ]
If you mean array,
linenumber=1
array=($(sed -n "${linenumber}p" file.ini | sed 's/[][]*//g'))
will split line number 1 of file.ini into an array and return the values:
${array[0]} = fdfd
${array[1]} = fdf
${array[2]} = f
If you mean filter spaces and brackets,
linenumber=1
sed -n "${linenumber}p" file.ini | sed 's/[][ ]*//g'
will return:
fdfdfdff
and if neither of those is what you meant, please specify the exact output you are looking to extract from the initial value so that we can address it correctly.