unix sed command regular expression - regex

Can anyone explain me how the regular expression works in the sed substitute command.
$ cat path.txt
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:
/opt/omni/lbin:/opt/omni/sbin:/root/bin
$ sed 's/\(\/[^:]*\).**/\1/g' path.txt
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
From the above sed command they used back reference and save operator concept.
Can anyone explain me how the regular expression especially /[^:]* work in the substitute command to get only the first path in each line.

I think you wrote an extra asterisk * in your sed code, so it should be like this:
$ sed 's/\(\/[^:]*\).*/\1/g' file
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
To change the delimiter will help to understand it a little bit better:
sed 's#\(/[^:]*\).*#\1#g'
The s#something#otherthing#g is a basic sed command that looks for something and changes it for otherthing all over the file.
If you do s#(something)#\1#g then you "save" that something and then you can print it back with \1.
Hence, what it is doing is to get a pattern like /[^:]* and then print is back. /[^:]* means / and then every char except :. So it will get / + all the string until it finds a semicolon :. It will store that piece of the string and then print it back.
Small examples:
# get every char
$ echo "hello123bye" | sed 's#\([a-z]*\).*#\1#g'
hello
# get everything until it finds the number 3
$ echo "hello123bye" | sed 's#\([^3]*\).*#\1#g'
hello12

[^:]*
in regex would match all characters except for :, so it would match until this:
/usr/kbos/bin
also it would match these,
/usr/local/bin
/usr/jbin
/usr/bin
/usr/sas/bin
As, these all contains characters, that are not :
.* match any character, zero or more times.
Thus, this regex [^:]*.*, would match all this expressions:
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/jbin:/usr/bin:/usr/sas/bin
/usr/bin:/usr/sas/bin
However, you get only the first field (ie,/usr/kbos/bin, by using back reference in sed), because, regular expression output the longest possible match found.

Related

Convert regex positive look ahead to sed operation

I would like to sed to find and replace every occurrence of - with _ but only before the first occurrence of = on every line.
Here is a dataset to work with:
ke-y_0-1="foo"
key_two="bar"
key_03-three="baz-jazz-mazz"
key-="rax_foo"
key-05-five="craz-"
In the end the dataset should look like this:
ke_y_0_1="foo"
key_two="bar"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
I found this regex will match properly.
\-(?=.*=)
However the regex uses positive lookaheads and it appears that sed (even with -E, -e or -r) dose not know how to work with positive lookaheads.
I tried the following but keep getting Invalid preceding regular expression
cat dataset.txt | sed -r "s/-(?=.*=)/_/g"
Is it possible to convert this in a usable way with sed?
Note, I do not want to use perl. However I am open to awk.
You can use
sed ':a;s/^\([^=]*\)-/\1_/;ta' file
See the online demo:
#!/bin/bash
s='ke-y_0-1="foo"
key_two="bar"
key_03-three="baz-jazz-mazz"
key-="rax_foo"
key-05-five="craz-"'
sed ':a; s/^\([^=]*\)-/\1_/;ta' <<< "$s"
Output:
ke_y_0_1="foo"
key_two="bar"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
Details:
:a - setting a label named a
s/^\([^=]*\)-/\1_/ - find any zero or more chars other than a = char from the start of string (while capturing into Group 1 (\1)) and then matches a - char, and replaces with Group 1 value (\1) and a _ (that replaces the found - char)
ta - jump to lable a location upon successful replacement. Else, stop.
You might also use awk setting the field separator to = and replace all - with _ for the first field.
To print only the replaced lines:
awk 'BEGIN{FS=OFS="="}gsub("-", "_", $1)' file
Output
ke_y_0_1="foo"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
If you want to print all lines:
awk 'BEGIN{FS=OFS="="}{gsub("-", "_", $1);print}' file

Property File with Sed regex - Ignore first character for match

I have a test property file with this in it:
-config.test=false
config.test=false
I'm trying to, using sed, update the values of these properties whether they have the - in front of them or not. Originally I was using this, which worked:
sed -i -e "s/#*\(config.test\)\s*=\s*\(.*\)/\1=$(echo "true" | sed -e 's/[\/&]/\\&/g')/" $FILE_NAME
However, since I was basically ignoring all characters before the match, I found that when I had properties with keys that ended in the same value, it'd give me problems. Such as:
# The regex matches both of these
config.test=true
not.config.test=true
Is there a way to either ignore the first character for a match or ignore the initial - specifically?
EDIT:
Adding a little clarification in terms of what I'd want the regex to match:
config.test=false # Should match
-config.test=false # Should match
not.config.test=false # Should NOT match
sed -E 's/^(-?config\.test=).*/\1true/' file
? means zero or 1 repetitions of so it means the - can be present or not when matching the regexp.
I found some solution for a regex of a specific length instead of ignoring the first character with sed and awk. Sometimes the opposite does the same by an easier way.
If you only have the alternative to use sed I have two workaround depending on your file.
If your file looks like this
$ cat file
config.test=false
-config.test=false
not.config.test=false
you can use this one-liner
sed 's/^\(.\{11,12\}=\)\(.*$\)/\1true/' file
sed is looking at the beginning ^ of each line and is grouping \( ... \) for later back referencing every character . that occurs 11 or 12 times \{11,12\} followed by a =.
This first group will be replaced with the back reference \1.
The second group that match every character after the = to the end of line \(.*$\) will be dropped. Instead of the second group sed replaces with your desired string true.
This also means, that every character after the new string true will be chopped.
If you want to avoid this and your file looks like
$ cat file
config.test=true # Should match
-config.test=true # Should match
not.config.test=false # Should NOT match
you can use this one-liner
sed 's/^\(.\{11,12\}=\)\(false\)\(.*$\)/\1true\3/' file
This is like the example before but works with three groups for back referencing.
The content of the former group 2 is now in group 3. So no content after a change from false to true will be chopped.
The new second group \(false\) will be dropped and replaced by the string true.
If your file looks like in the example before and you are allowed to use awk, you can try this
awk -F'=' 'length($1)<=12 {sub(/false/,"true")};{print}'
For me this looks much more self-explanatory, but is up to your decision.
In both sed examples you invoke only one time the sed command which is always good.
The first sed command needs 39 and the second 50 character to type.
The awk command needs 52 character to type.
Please tell me if this works for you or if you need another solution.

How to match until the last occurrence of a character in bash shell

I am using curl and cut on a output like below.
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | grep - | head -n1 | cut -d'-' -f -1, -3)
Varible var gets have two kinds of values (one at a time).
HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b
HIX-R1-1-3b5126629f67892110165c524gbc5d5g1808c9b5
I am actually trying to get everything until the last '-'. i.e HIX-MAIN or HIX-R1-1.
The command shown works fine to get HIX-R1-1.
But I figured this is the wrong way to do when I have something something like only 1 - in the variable; it is getting me the entire variable value (e.g. HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b).
How do I go about getting everything up to the last '-' into the variable var?
This removes everything from the last - to the end:
sed 's/\(.*\)-.*/\1/'
As examples:
$ echo HIX_MAIN-7ae52 | sed 's/\(.*\)-.*/\1/'
HIX_MAIN
$ echo HIX-R1-1-3b5126629f67 | sed 's/\(.*\)-.*/\1/'
HIX-R1-1
How it works
The sed substitute command has the form s/old/new/ where old is a regular expression. In this case, the regex is \(.*\)-.*. This works because \(.*\)- is greedy: it will match everything up to the last -. Because of the escaped parens,\(...\), everything before the last - will be saved in group 1 which we can refer to as \1. The final .* matches everything after the last -. Thus, as long as the line contains a -, this regex matches the whole line and the substitute command replaces the whole line with \1.
You can use bash string manipulation:
$ foo=a-b-c-def-ghi
$ echo "${foo%-*}"
a-b-c-def
The operators, # and % are on either side of $ on a QWERTY keyboard, which helps to remember how they modify the variable:
#pattern trims off the shortest prefix matching "pattern".
##pattern trims off the longest prefix matching "pattern".
%pattern trims off the shortest suffix matching "pattern".
%%pattern trims off the longest suffix matching "pattern".
where pattern matches the bash pattern matching rules, including ? (one character) and * (zero or more characters).
Here, we're trimming off the shortest suffix matching the pattern -*, so ${foo%-*} will get you what you want.
Of course, there are many ways to do this using awk or sed, possibly reusing the sed command you're already running. Variable manipulation, however, can be done natively in bash without launching another process.
You can reverse the string with rev, cut from the second field and then rev again:
rev <<< "$VARIABLE" | cut -d"-" -f2- | rev
For HIX-R1-1----3b5126629f67892110165c524gbc5d5g1808c9b5, prints:
HIX-R1-1---
I think you should be using sed, at least after the tr:
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | sed -n '/-/{s/-[^-]*$//;p;q}')
The -n means "don't print by default". The /-/ looks for a line containing a dash; it then executes s/-[^-]*$// to delete the last dash and everything after it, followed by p to print and q to quit (so it only prints the first such line).
I'm assuming that the output from curl intrinsically contains multiple lines, some of them with unwanted double quotes in them, and that you need to match only the first line that contains a dash at all (which might very well not be the first line). Once you've whittled the input down to the sole interesting line, you could use pure shell techniques to get the result that's desired, but getting the sole interesting line is not as trivial as some of the answers seem to be assuming.

Regular Expression in Linux command sed

I have a shell variable:
all_apk_file="a 1 2.apk x.apk y m.apk"
I want to replace the a 1 2.apk with TEST, using the command:
echo $all_apk_file | sed 's/(.*apk ){1}/TEST/g'
The .*apk means end with apk, {1} means only match one time, but it doesn't work; I only got the original variable as output: a 1 2.apk x.apk y m.apk
Can anyone tell me why?
First, to enable the regular expressions you're familiar with in sed, you need to use the -r switch (sed -r ...):
echo $all_apk_file | sed -r 's/(.*apk ){1}/TEST/g'
# returns TESTy m.apk
Look at what it returns: TESTy m.apk. This is because the .* is greedy, so it matches as much as it possibly can. That is, the .* matches a 1 2.apk x, and you've said you want to replace .*apk, being a 1 2.apk x.apk with 'TEST', resulting in TESTy m.apk (note the following space after the '.apk' in your regular expression, which is why the match doesn't extend all the way to the last '.apk', which has no space following it).
Usually one could change the .* to .*? to make it non-greedy, but this behaviour is not supported in sed.
So, to fix it you just have to make your regex more restrictive.
It is hard to tell what you want to do - remove the first three words where the third ends in '.apk' and replace with 'TEST'? In that case, one could use the regular expression:
[a-z0-9]+ +[a-z0-9]+ +[a-z0-9]+\.apk
in combination with the 'i' switch (case insensitive).
You will have to give your logic for deciding what to remove (first three words, any number of words up to the first '.apk' word, etc) in order for us to help you further with the regex.
Secondly, you've put the 'g' switch in your regex. This means that all matching patterns will be replaced, and you seem to only want the first to be replaced. So remove the 'g' switch.
Finally, all of thse in combination:
echo $all_apk_file | sed -r 's/[a-z0-9]+ +[a-z0-9]+ +[a-z0-9]+\.apk/TEST/i'
# TEST x.apk y m.apk
This might work for you:
echo "$all_apk_file" | sed 's/apk/\n/;s/.*\n/TEST/'
TEST x.apk y m.apk
As to why your regexp did not work see #mathematical.coffee and #Jonathan Leffler's excellent explanations.
s/apk/\n/ is synonymous with s/apk/\n/1 which means replace the first occurence of apk with \n. As sed uses the \n as a record separator we know that it cannot occur in any initial strings passed to the sed commands. With these two facts under our belts we can split strings.
N.B. If you wanted to replace upto the second apk then s/apk/\n/2 would fit the bill. Of course for the last occurence of apk then .*apk comes into play.
One part of the problem is that in regular sed, the () and {} are ordinary characters in patterns until escaped with backslashes. Since there are no parentheses in the variable's value, the regex never matches. With GNU sed, you can also enable extended regular expressions with the -r flag. If you fix that problem, you will then run into the problem that .* is greedy, and the g modifier actually doesn't change anything:
$ echo $all_apk_file | sed 's/\(.*apk \)\{1\}/TEST/g'
TESTy m.apk
$ echo $all_apk_file | sed -r 's/(.*apk ){1}/TEST/g'
TESTy m.apk
$ echo $all_apk_file | sed -r 's/(.*apk ){1}/TEST/'
TESTy m.apk
$
It only stops there because there isn't a space after m.apk in the echoed value of the variable.
The issue now is: what is it that you want replaced? It sounds like 'everything up to and including the first occurrence of apk at the end of a word. This is probably most easily done with trailing context or non-greedy matching as found in Perl regular expressions. If switching to Perl is an option, do so. If not, it is not trivial in normal sed regular expressions.
$ echo $all_apk_file | sed 's/^[^.]* [^.][^.]*\.apk /TEST /'
TEST x.apk y m.apk
$
This looks for anything without dots in it, followed by a blank, followed by no dots again, and .apk; this means that the first dot allowed is the one in 2.apk. It works for the sample data; it would not work if the variable contained:
all_apk_file="a 1.2 2.apk m.apk y.apk 37"
You'll need to tune this to meet your requirements.

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitely printed", and the p in s/\.png//p forces the print if substitution was done, but does not force it otherwise
That is pretty easy to do with sed and you not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This commands is just a replacement command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. Also, it matches all non-period chars before .png", using the group brackets \( and \), so we can get what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
So follows the replacement part of the command. The & command just inserts everything that was matched by #"\([^.]*\)\.png", in the changed content. If it was the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character - represented by the backslash \ followed by an actual newline - and in the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/.png/p;s/.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input