Why does the order of replacing things matter in sed?

I have a file like this:
(paren)
[sharp]
And I try to replace like this:
sed "s/(/[/g" some_file.txt
And it works just fine:
[paren)
[sharp]
Then I try to replace like this:
sed "s/[/(/g" some_file.txt
And it gives me the error:
sed: 1: "s/[/(/g": unbalanced brackets ([])
I cannot find any evidence as to why this would error out. Why does the order of [ and ( matter?
Thank you very much.

The [ is a part of a bracket expression that must have a closing counterpart (]).
Escape the [ to match a literal [ symbol:
echo "[sharp]" | sed 's/\[/(/g'

The order matters because sed replaces matches of a regex with a literal string.
A bracket after the second slash is in the replacement, so it is treated as an ordinary character. A bracket between the first and second slash is in the pattern, where an unclosed [ makes the regex invalid.
So in this expression the '[' is taken as a character:
s/(/[/g
In this expression it's not:
s/[/(/g
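A minimal sketch of that distinction, assuming a POSIX sed and the two-line sample from the question (the variable names are only for illustration):

```shell
# Sample input from the question.
input='(paren)
[sharp]'

# "[" in the replacement half is a plain character.
to_bracket=$(printf '%s\n' "$input" | sed 's/(/[/g')

# "[" in the pattern half must be escaped ...
escaped=$(printf '%s\n' "$input" | sed 's/\[/(/g')

# ... or wrapped in a bracket expression of its own.
bracketed=$(printf '%s\n' "$input" | sed 's/[[]/(/g')

printf '%s\n' "$to_bracket" "$escaped" "$bracketed"
```

Both pattern-side forms should give the same result; which one reads better is a matter of taste.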

The first parameter in a replacement with sed must be a regex pattern: s/regex_pattern/replacement_string/
The opening square bracket has a special meaning in a regex pattern, since it is the beginning of a character class, for example [a-z]. That is why you obtain this error message that has nothing to do with the order of your replacements: unbalanced brackets ([]) (an opened character class must be closed.)
To obtain a literal opening square bracket, you need to escape it: \[
sed 's/\[/(/' file
If your goal is to translate characters into others, there is a simpler way, using a translation, that avoids the problem of circular replacements:
a='(paren)
[sharp]'
using tr
echo "$a" | tr '[]()' '()[]'
or with sed:
echo "$a" | sed 'y/[]()/()[]/'
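To make the "circular replacements" problem concrete, here is a small sketch that compares two chained s commands with a single y command. It assumes a POSIX sed; the variable names are illustrative only.

```shell
input='(paren)
[sharp]'

# Chained substitutions: the second command also rewrites the "[" that the
# first one just produced, so every opening bracket ends up as "(".
chained=$(printf '%s\n' "$input" | sed -e 's/(/[/g' -e 's/\[/(/g')

# y maps each character exactly once, so the swap is not circular.
swapped=$(printf '%s\n' "$input" | sed 'y/[]()/()[]/')

printf 'chained:\n%s\nswapped:\n%s\n' "$chained" "$swapped"
```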

Related

How can I replace anything before first forward slash using bash script?

Using GitHub workflow I have the following command
echo MY_DIR=$(echo "${GITHUB_REF#refs/heads/}" | tr '[:upper:]' '[:lower:]')
This would return a value like something/something-else/another
I am looking to add to this script to replace everything before the first forward slash with thisword
Which would output thisword/something-else/another
Can regex be used on the single line script to do this replace? I believe I could use the following regex /^[^/]+/ but unsure how to combine with the current script.

Depending on the version and distro of sed (apologies, but there are many with different syntax and flags), you might be able to do something like:
echo MY_DIR=$(echo "${GITHUB_REF#refs/heads/}" | tr '[:upper:]' '[:lower:]' | sed 's/^[a-z]*\//thisword\//' )
Sed is finding-and-replacing a string of text starting from the beginning of the line ^ which contains any number of occurrences * of lowercase characters in any order [a-z] which are then followed by the first slash. The slashes can be escaped by using the backslash character \. To clarify sed's use of /, here's the same expression omitting the regex and slashes forming part of your search string: sed 's/find/replace/'.
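A variant worth considering, sketched under the assumption that the first path segment may contain digits, hyphens, or dots: it uses the [^/] class from the question and | as the delimiter, so no slash needs escaping. The ref value below is a hypothetical stand-in for GITHUB_REF.

```shell
# Hypothetical stand-in for "$GITHUB_REF".
ref='refs/heads/Something/something-else/another'
branch=${ref#refs/heads/}

# [^/]* matches any run of non-slash characters, so digits and hyphens in
# the first segment are covered too.
my_dir=$(printf '%s\n' "$branch" | tr '[:upper:]' '[:lower:]' | sed 's|^[^/]*/|thisword/|')

echo "MY_DIR=$my_dir"
```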

Try the below regex
^([a-z]*)(\/)
function formatData() {
var str = "something/something-else/another";
var res = str.replace(/^([a-z]*)!?(\/)/gim, "otherword/");
document.getElementById("demo").innerHTML = res;
}

Assuming MY_DIR holds something/something-else/another, you can use
MY_DIR="something/something-else/another"
MY_DIR="thisword/${MY_DIR#*/}"
echo "$MY_DIR"
This is an example of parameter expansion, where # means "remove the shortest match of the pattern from the left", and the glob */ matches any text up to and including the first /.
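As a quick illustration of how # differs from ##, here is a sketch using the same sample value (the names shortest and longest are only for this example):

```shell
MY_DIR="something/something-else/another"

# "#" strips the shortest prefix matching */ (up to the first slash).
shortest=${MY_DIR#*/}

# "##" strips the longest prefix matching */ (up to the last slash).
longest=${MY_DIR##*/}

MY_DIR="thisword/$shortest"
echo "$MY_DIR"
```

Only the single # form keeps the rest of the path intact, which is why it is the one used above.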

Can't understand this awk regex

I'm trying to understand a particular line of code from a Unix talk, and can't seem to understand what the awk portion is doing.
The full line is: man ls | col -b | grep '^[[:space:]]*ls \[' | awk -F '[][]' '{print $2}'. The text passed to awk (if for some reason you don't have the man program) is: ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]. Somehow, awk is able to just pull out the list of options to ls, but I can't really understand how this regex [][] actually works & what it matches for.
My best guess is that the outer brackets denote a character class whose contents contain ][. If that's the case, why can't the inner brackets be written as []. Is it because pairs of brackets [[]] have a different meaning in awk?
Thanks in advance!

In POSIX regular expressions [...] is called a bracket expression.
It is very similar to a character class in other regex flavors. One key difference is that the backslash is NOT a metacharacter in a POSIX bracket expression.
If you want to include [ and ] in a bracket expression, they need to be placed correctly, i.e. ] right at the start and [ after it.
As per the linked article:
To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ].
In your example:
awk -F '[][]' '...'
awk sets (input) field separator as single literal [ or ] character.

If you had [[]] it would mean that [ is in brackets [], like [[] followed by a ] so the field separator would be []:
$ echo a[]b | awk -F'[[]]' '{print $2}'
b
But with the brackets the other way around:
$ echo a][b | awk -F'[][]' '{print $3}'
b
Now $2 is empty and $3 is b, because ] and [ each count as a separator.

Your hunch about character classes is correct. If you want certain characters to be field separators, then you can list them between brackets. Using awk -F '[abc]' ... would specify the a and b and c characters as separators. Order is irrelevant; you could use awk -F '[cab]' ... and get the same results.
But what if you want the separating characters to be left and right brackets themselves? The documentation for regular expressions (man re_format on many systems) says this:
To include a literal `]' in the list, make it the first character ...
Which makes sense, given how the expression will be parsed. As the parser is scanning the expression, it's looking for the end, the right bracket. It doesn't care about seeing another left bracket or a comma or a space or whatever, but a right bracket would mark the end unless there's some way to tell the parser to take it literally. Since brackets with nothing between them, [], would be useless, a right bracket as the first character is defined to mean something else: this can't be the end, so take this right-bracket literally.
So if you want brackets as field-separating characters, you list [ and ] between brackets, but you put the right bracket first in the list so it'll be taken literally, per the instructions: [][]
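A short sketch of the resulting field layout, using the synopsis line quoted in the question (the variable names are illustrative, and the behavior assumes an awk that treats a multi-character -F value as a regex, as POSIX awk does):

```shell
synopsis='ls [-ABCFGHLOPRSTUW#abcdefghiklmnopqrstuwx1] [file ...]'

# Both "[" and "]" split fields, so $1 is "ls ", $2 is the option list,
# $3 is the space between the two groups, and $4 is "file ...".
options=$(printf '%s\n' "$synopsis" | awk -F '[][]' '{print $2}')
operands=$(printf '%s\n' "$synopsis" | awk -F '[][]' '{print $4}')

printf '%s\n' "$options" "$operands"
```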

Conditional in perl regex replacement

I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?

You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Furthermore, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program [1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
[1] Either avoid single quotes in your program [2], or use '\'' to "escape" them.
[2] You can usually use q{} instead of single quotes. You can also switch to using double quotes. Inside of double quotes, you can use \x27 for an apostrophe.
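The quoting points can be checked in the shell alone, without running perl. The sketch below assumes a POSIX shell and sets two made-up positional parameters to show what the shell does to $2 before any program sees it:

```shell
# Made-up positional parameters, as any script or function might have.
set -- first second

# Double quotes: the shell replaces $2 before the program ever sees it.
double="s/(ab)(cd)?/$2/"

# Single quotes: $2 is passed through untouched.
single='s/(ab)(cd)?/$2/'

# '\'' closes the quoted string, adds an escaped apostrophe, and reopens it.
apostrophe='it'\''s'

printf '%s\n' "$double" "$single" "$apostrophe"
```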

Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1

And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
Find (?:abcd)*+\Kab
Replace ""
These use regex wisely.
There is really no need nowadays to have to use the eval form of the regex substitution construct s///e in conjunction with defined(). This is especially true when using the perl command line.
Good luck...

How can I match square bracket in regex with grep?

I am trying to match both [ and ] with grep, but only succeeded to match [. No matter how I try, I can't seem to get it right to match ].
Here's a code sample:
echo "fdsl[]" | grep -o "[ a-z]\+" #this prints fdsl
echo "fdsl[]" | grep -o "[ \[a-z]\+" #this prints fdsl[
echo "fdsl[]" | grep -o "[ \]a-z]\+" #this prints nothing
echo "fdsl[]" | grep -o "[ \[\]a-z]\+" #this prints nothing
Edit: My original regex, on which I need to do this, is this one:
echo "fdsl[]" | grep -o "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]\+"
#this prints nothing
N.B: I have tried all the answers from this post but that didn't work on this particular case. And I need to use those brackets inside [].

According to BRE/ERE Bracketed Expression section of POSIX regex specification:
[...] The right-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
and
[...] If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.
Therefore, your regex should be:
echo "fdsl[]" | grep -Eo "[][ a-z]+"
Note the -E flag, which selects ERE, which supports the + quantifier. The + quantifier is not supported in BRE (the default mode).
The solution in Mike Holt's answer "[][a-z ]\+" with escaped + works because it's run on GNU grep, which extends the grammar to support \+ to mean repeat once or more. It's actually undefined behavior according to POSIX standard (which means that the implementation can give meaningful behavior and document it, or throw a syntax error, or whatever).
If you are fine with the assumption that your code can only be run on GNU environment, then it's totally fine to use Mike Holt's answer. Using sed as example, you are stuck with BRE when you use POSIX sed (no flag to switch over to ERE), and it's cumbersome to write even simple regular expression with POSIX BRE, where the only defined quantifier is *.
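For comparison, here is a sketch of the ERE form next to a BRE form that avoids the GNU-only \+ by repeating the bracket expression and using *. Note the assumption that -o is available: it is not part of POSIX grep, although GNU and BSD grep both provide it.

```shell
# ERE: "]" first, then "[", then the rest of the list.
ere=$(echo "fdsl[]" | grep -Eo '[][ a-z]+')

# BRE: "one or more" spelled as the expression once, then zero or more times.
bre=$(echo "fdsl[]" | grep -o '[][ a-z][][ a-z]*')

printf '%s\n' "$ere" "$bre"
```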
Original regex
Note that grep consumes the input file line by line, then checks whether the line matches the regex. Therefore, even if you use the -P flag with your original regex, \n is always redundant, as the regex can't match across lines.
While it is possible to match a horizontal tab without the -P flag, I think it is more natural to use -P for this task.
Given this input:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89"
fds l[]kSAJD<>?,./:";'{}|[]\!##$%^&*()_+-=~`89
The original regex in the question works with little modification (unescape + at the end):
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
Though we can remove \n (since it is redundant, as explained above), and a few other unnecessary escapes:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\ta-zA-Z/:.0-9_~\"'+,;*=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89

One issue is that [ is a special character in the expression and cannot be escaped with \ (at least not in my flavors of grep). The solution is to write it as [[].

According to regular-expressions.info:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
... and ...
The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
So, assuming that the particular flavor of regular expressions syntax supported by grep conforms to this, then I would have expected that "[ a-z[\]]\+" should have worked.
However, my version of grep (GNU grep 2.14) only matches the "[]" at the end of "fdsl[]" with this regex.
However, I tried the other technique mentioned in that quote (putting the ] in a position within the character class where it cannot take on its normal meaning), and it seems to have worked:
$ echo "fdsl[]" | grep -o "[][a-z ]\+"
fdsl[]

Extract string located after or between matched pattern(s)

Given a string "pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words"
Is it possible to extract string between "crop=" and the following space using bash and grep?
So if I match "crop=" how can I extract anything after it and before the following white space?
Basically, I need "720:568:0:4" to be printed.

I'd do it this way:
grep -o -E 'crop=[^ ]+' | sed 's/crop=//'
It uses sed which is also a standard command. You can, of course, replace it with another sequence of greps, but only if it's really needed.

I would use sed as follows:
echo "pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words" | sed 's/.*crop=\([0-9.:]*\)\(.*\)/\1/'
Explanation:
s/ : substitute
.*crop= : everything up to and including "crop="
\([0-9.:]*\) : match only digits, '.' and ':' - I call this the backslash-bracketed expression
\(.*\) : match 'everything else' (probably not needed)
/\1/ : and replace with the first backslash-bracketed expression you found
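One caveat with a bare s command is that lines without "crop=" are printed unchanged. A possible variant, sketched here with a POSIX sed and illustrative variable names, uses -n together with the p flag so that only lines where the substitution succeeded are printed:

```shell
line='pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words'

# -n suppresses the default output; the p flag prints only lines where the
# substitution succeeded. [^ ]* takes everything up to the next space.
crop=$(printf '%s\n' "$line" | sed -n 's/.*crop=\([^ ]*\).*/\1/p')

# A line without "crop=" produces no output at all.
none=$(printf '%s\n' 'no match here' | sed -n 's/.*crop=\([^ ]*\).*/\1/p')

echo "$crop"
```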

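awk has no \1-style backreference for a pattern like this, but match() together with RSTART and RLENGTH does the job:
awk 'match($0, /crop=[^ ]+/) { print substr($0, RSTART + 5, RLENGTH - 5) }'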

Yet another way, with bash parameter expansion:
PAT="pos:665181533 pts:11360 t:11.360000 crop=720:568:0:4 some more words"
RES=${PAT#*crop=}
echo "${RES%% *}"
First, # removes the shortest prefix matching *crop=, i.e. everything up to and including crop=.
Then %% removes the longest suffix matching " *", i.e. everything from the first space to the end.
