Sed replace entire line with replacement - regex

I posted this question on superuser a bit ago, but I haven't gotten an answer I like. Here it is, slightly modified.
I was hoping for a way to make sed replace the entire line with the replacement (rather than just the match) so I could do something like this:
sed -e "/$some_complex_regex_with_a_backref/\1/"
and have it only print the back-reference.
From this question, it seems like the way to do it is to mess around with the regex so it matches the entire line, or to use some other tool (like perl). Simply changing the regex to .*regex.* doesn't always work (as mentioned in that question). For example:
$ echo $regex
\([[:alpha:]]*\)day
$ echo "$phrase"
it is Saturday tomorrow
Warm outside...
$ echo "$phrase" | sed "s/$regex/\1/"
it is Satur tomorrow
Warm outside...
$ echo "$phrase" | sed "s/.*$regex.*/\1/"
Warm outside...
$ # what I'd like to have happen
$ echo "$phrase" | [[[some command or string of commands]]]
Satur
Warm outside...
I'm looking for the most concise way to replace the entire line (not just the matching part) assuming the following:
The regex is in a variable, so it can't be changed on a case-by-case basis.
I'd like to do this without using perl or other beefier languages (sed, awk, grep, etc. are OK).
The solution can't remove lines that don't match the original regex (sorry, grep -o doesn't cut it).
What I thought of was the following (but it's ugly, so I'd like to know if there is something better):
$ sentinel=XXX
$ echo "$phrase" | sed "s/$regex/$sentinel\1$sentinel/" |
> sed "s/^.*$sentinel\(.*\)$sentinel.*$/\1/"

This might work for you:
sed '/'"$regex"'/!b;s//\n\1\n/;s/.*\n\(.*\)\n.*/\1/' file
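To see how this behaves on the sample from the question, here is a minimal sketch. It assumes GNU sed (the \n in the replacement is a GNU extension) and writes the bracket class as [[:alpha:]]; the variable names are only for illustration. The first command skips lines that don't match, the second wraps the captured group in newlines (which cannot otherwise occur inside a line), and the third keeps only what sits between them.

```shell
#!/bin/sh
# Sample data from the question; the regex lives in a variable.
regex='\([[:alpha:]]*\)day'
phrase=$(printf 'it is Saturday tomorrow\nWarm outside...')

# 1. /regex/!b     : lines that do not match are printed untouched
# 2. s//\n\1\n/    : reuse the same regex, fence the capture with newlines
# 3. s/.*\n...\n.*/: keep only the text between the two newlines
result=$(printf '%s\n' "$phrase" |
    sed '/'"$regex"'/!b;s//\n\1\n/;s/.*\n\(.*\)\n.*/\1/')

printf '%s\n' "$result"
```

Under those assumptions the first line is reduced to Satur and the second line passes through unchanged.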

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially, because that solution only works if I run it pretty much exactly as the answerer wrote it.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only lines that start with the word "maria", and not lines where maria is not the first word.
I want to chop out a specific URL in a file, for example "cnn.com". The file also has a bunch of localhost addresses (127.0.0.1) and 0.0.0.0 entries, and both kinds sometimes have a single space in front. I don't want to chop out subdomains like ads.cnn.com, so that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to whack out cnn.com! My file is called raw.txt.
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex that means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
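One possible reason the chained command from the question misses cnn.com, assuming raw.txt is a hosts-style list with lines such as "0.0.0.0 cnn.com" (the question never shows the raw input, so this is a guess): once the address prefix is substituted away, the line still begins with a space, so a pattern anchored at ^cnn no longer matches. A sketch that strips the prefix together with the surrounding spaces before the delete runs; the sample lines are invented for illustration:

```shell
#!/bin/sh
# Invented sample input in the assumed hosts-file layout.
raw=$(printf '0.0.0.0 cnn.com\n 127.0.0.1 ads.cnn.com\n0.0.0.0 xcnn.com\n# a comment')

# The -e commands run in order on every line:
#   1. strip an optional leading space, the address, and the spaces after it
#   2. drop comment lines
#   3. drop the line that is exactly cnn.com (subdomains are kept)
cleaned=$(printf '%s\n' "$raw" |
    sed -E -e 's/^ *(127\.0\.0\.1|0\.0\.0\.0) *//' \
           -e '/#/d' \
           -e '/^cnn\.com$/d')

printf '%s\n' "$cleaned"
```

With that input only ads.cnn.com and xcnn.com remain.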
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to get sed to act as if it does; see "Is it possible to escape regex metacharacters reliably with sed". To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
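If you have more than just a couple of words then the approach is to list them all, create a hash table indexed by them, and then test for the first word of your line being an index of the hash table. split() fills an array indexed by 1..n, so the words have to be copied into the indices of a second array:
awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file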
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
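As a quick check of the hash-table approach, here is a small sketch on invented input. The word list and sample lines are assumptions for illustration. Note that awk's `in` operator tests array indices, not values, so the words must end up as indices of badWords.

```shell
#!/bin/sh
# Invented sample lines; the first word decides whether a line is dropped.
input=$(printf 'foo one\nbar two\nfoobar three\n other four\nkeep five')

kept=$(printf '%s\n' "$input" | awk '
    BEGIN {
        # split() yields tmp[1]="foo", tmp[2]="bar", ...
        n = split("foo bar other word", tmp)
        # Referencing badWords[word] creates that index.
        for (i = 1; i <= n; i++) badWords[tmp[i]]
    }
    # Print only lines whose first field is not a listed word.
    !($1 in badWords)
')

printf '%s\n' "$kept"
```

Because awk ignores leading blanks when splitting fields, the line " other four" is dropped too, while "foobar three" is kept: the comparison is on the whole first word, not a prefix.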

Regex and sed in sh script not evaluating properly

First post here. Trying to capture just the integer output from an SNMP reply with regex. I've used a regex tester to come up with the correct pattern match but sed refuses to output the result. This is just a primitive fact finding script right now, it'll grow into something more complex but right now this is my stumbling block.
The replies to the two snmpget statements are:
IF-MIB::ifInOctets.1001 = Counter32: 692749329
IF-MIB::ifOutOctets.1001 = Counter32: 3119381688
I want to capture just the value after "Counter32: " and the regex (?<=: )(\d+) accomplishes that in the testers I could find online.
#!/bin/sh
SED_IFACES="-e '/(?<=: )(\d+)/g'"
INTERNET_IN=`snmpget -v 2c -c public 123.45.678.9 1.3.6.1.2.1.2.2.1.10.1001` | eval sed $SED_IFACES
INTERNET_OUT=`snmpget -v 2c -c public 123.45.678.9 1.3.6.1.2.1.2.2.1.16.1001` | eval sed $SED_IFACES
echo $INTERNET_IN
echo $INTERNET_OUT
$ cat file
IF-MIB::ifInOctets.1001 = Counter32: 692749329
IF-MIB::ifOutOctets.1001 = Counter32: 3119381688
$ awk '{print $NF}' file
692749329
3119381688
$ sed 's/.* //' < file
692749329
3119381688
You can do:
sed 's/^.*Counter32: \(.*\)$/\1/'
This captures the value and prints it via \1.
Also note that your example uses Perl regular expression constructs (the (?<=...) lookbehind and \d), which sed does not support. It is also missing the substitution "s/" part.
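Putting this together, a minimal sketch of the script might look as follows. Since snmpget needs a live device, a stand-in function (fake_snmpget, a hypothetical name) prints the reply quoted in the question. Besides the regex, the sketch moves the pipe inside the command substitution: in the original script the assignment sits on the left side of the pipe, so sed receives no input and the variable is never set in the script's own shell.

```shell
#!/bin/sh
# Hypothetical stand-in for:
#   snmpget -v 2c -c public <host> 1.3.6.1.2.1.2.2.1.10.1001
fake_snmpget() {
    printf 'IF-MIB::ifInOctets.1001 = Counter32: 692749329\n'
}

# The whole pipeline lives inside $( ... ), so the variable
# receives sed's output rather than the raw reply.
INTERNET_IN=$(fake_snmpget | sed 's/^.*Counter32: \(.*\)$/\1/')

echo "$INTERNET_IN"
```

No eval is needed once the sed expression is written inline.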

regexp (sed) suppress "no match" output

I'm stuck on that and can't wrap my head around it: How can I tell sed to return the value found, and otherwise shut up?
It's really beyond me: Why would sed return the whole string if it found nothing? Do I have to run another test on the returned string to verify it? I tried using "-n" from the (very short) man page but it effectively suppresses all output, including matched strings.
This is what I have now :
echo plop-02-plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/'
which returns
02 (and that is fine and dandy, thank you very much), but:
echo plop-02plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/'
returns
plop-02plop (when it should return this = "" nothing! Dang, you found nothing so be quiet!
For crying out loud !!)
I tried checking for a return value, but this failed too ! Gasp !!
$ echo plop-02-plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/' ; echo $?
02
0
$ echo plop-02plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/' ; echo $?
plop-02plop
0
$
This last one I cannot even believe. Is sed really the tool I should be using? I want to extract a needle from a haystack, and I want a needle or nothing..?
sed by default prints all lines.
What you want to do is
/patt/!d;s//repl/
In other words: delete the lines that don't match your pattern, and for those that do match, replace the line with the part you want, for instance by referring to a capturing group by number. The empty pattern in s// reuses the last regex. In your case it will be:
sed -e '/^.*\(.\)\([0-9][0-9]\)\1.*$/!d;s//\2/'
You can also use the -n option to suppress the default printing of every line. Then a line is printed only when you explicitly ask for it (here with the p flag on the substitution). In practice scripts using -n are usually longer and more cumbersome to maintain. Here it will be:
sed -ne 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/p'
There is also grep, but your example shows why sed is sometimes better.
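Both variants can be checked against the two inputs from the question. The sketch below is only an illustration (the variable names are made up). It also shows one way to get the "needle or nothing" test the question asks about: sed exits with status 0 either way, so test whether the captured output is empty instead.

```shell
#!/bin/sh
# Variant 1: delete non-matching lines, then substitute.
hit=$(echo plop-02-plop | sed -e '/^.*\(.\)\([0-9][0-9]\)\1.*$/!d;s//\2/')
miss=$(echo plop-02plop | sed -e '/^.*\(.\)\([0-9][0-9]\)\1.*$/!d;s//\2/')

# Variant 2: print nothing by default (-n), print only on a successful s///p.
hit_n=$(echo plop-02-plop | sed -n -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/p')
miss_n=$(echo plop-02plop | sed -n -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/p')

printf 'hit=[%s] miss=[%s]\n' "$hit" "$miss"

# An empty result means "no needle found".
if [ -n "$miss" ]; then
    echo "found: $miss"
else
    echo "no match"
fi
```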
Perhaps you can use egrep -o?
input.txt:
blooody
aaaa
bbbb
odor
qqqq
E.g.
sehe@meerkat:/tmp$ egrep -o o+ input.txt
ooo
o
o
sehe@meerkat:/tmp$ egrep -no o+ input.txt
1:ooo
4:o
4:o
Of course egrep uses extended regex syntax, which differs slightly from sed's default basic syntax (for example, grouping parentheses are written without backslashes); neither offers non-greedy operators. I'll let you do the translation, if you like the approach.

grep on unix / linux: how to replace or capture text?

So I'm pretty good with regular expressions, but I'm having some trouble with them on unix. Here are two things I'd love to know how to do:
1) Replace all text except letters, numbers, and underscore
In PHP I'd do this: (works great)
preg_replace('#[^a-zA-Z0-9_]#','',$text).
In bash I tried this (with limited success); seems like it doesn't allow you to use the full set of regex:
text="my #1 example!"
${text/[^a-zA-Z0-9_]/''}
I tried it with sed but it still seems to have problems with the full regex set:
echo "my #1 example!" | sed s/[^a-zA-Z0-9\_]//
I'm sure there is a way to do it with grep, too, but it was breaking it into multiple lines when I tried:
echo abc\!\#\#\$\%\^\&\*\(222 | grep -Eos '[a-zA-Z0-9\_]+'
And finally I also tried using expr but it seemed like that had really limited support for extended regex...
2) Capture (multiple) parts of text
In PHP I could just do something like this:
preg_match('#(word1).*(word2)#',$text,$matches);
I'm not sure how that would be possible in *nix...
Part 1
You are almost there with sed: just add the g modifier so that the replacement happens globally. Without the g, the replacement happens just once. Quoting the script also keeps the shell from treating the brackets as a glob.
$ echo "my #1 example!" | sed 's/[^a-zA-Z0-9_]//g'
my1example
$
You made the same mistake with your bash pattern replacement too: the replacement isn't global:
$ text="my #1 example!"
# non-global replacement. Only the first match (the space) is deleted.
$ echo ${text/[^a-zA-Z0-9_]/''}
my#1 example!
# global replacement by adding an additional /
$ echo ${text//[^a-zA-Z0-9_]/''}
my1example
Part 2
Capturing works the same in sed as it did in PHP's regex: enclosing the pattern in parentheses triggers capturing:
# swap foo and bar's number using capturing and back reference.
$ echo 'foo1 bar2' | sed -r 's/foo([0-9]+) bar([0-9]+)/foo\2 bar\1/'
foo2 bar1
$
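If the captured parts are needed as separate shell variables (roughly what $matches gives you in PHP), one option is to have sed print the groups separated by a space and split them afterwards. This is only a sketch: the variable names are made up, and it assumes the captures themselves contain no spaces.

```shell
#!/bin/sh
text='foo1 bar2'

# -n plus the p flag: print only if the pattern matched.
captures=$(printf '%s\n' "$text" |
    sed -n 's/^foo\([0-9][0-9]*\) bar\([0-9][0-9]*\)$/\1 \2/p')

# Split "1 2" on the single space with parameter expansion.
first=${captures% *}    # everything before the space
second=${captures#* }   # everything after the space

printf 'first=%s second=%s\n' "$first" "$second"
```

If the input does not match, captures is empty, which doubles as a "no match" test.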
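As an alternative to codaddict's nice answer using sed, you could also use tr for the first part of your question.
echo "my #1 _ example!" | tr -d -C '[:alnum:]_'
I've also made use of the [:alnum:] character class, just to show another option. Note that tr takes a plain set of characters rather than a bracket expression, so there are no outer [ ] around the set (they would be kept as literal brackets), and that the trailing newline is deleted as well.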
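A small sketch of the tr approach, capturing the result in a variable (the name cleaned is just illustrative). The set is written without outer brackets, since tr treats [ and ] in a set as ordinary characters; the sample string includes brackets to show that they are removed along with the other punctuation.

```shell
#!/bin/sh
# -c complements the set, -d deletes: everything that is NOT
# alphanumeric or underscore is removed.
cleaned=$(printf '%s' 'my #1 _ [example]!' | tr -cd '[:alnum:]_')

printf '%s\n' "$cleaned"
```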
What do you mean you can't use the regex syntax for bash?
$ text="my #1 example!"
$ echo ${text//[^a-zA-Z0-9_]/}
my1example
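You have to use // for more than one replacement.
For your second question, with bash 3.2 or later, keep the regex in a variable and leave it unquoted (a quoted pattern is matched as a literal string):
$ re='(my).*(example)'
$ [[ $text =~ $re ]]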
$ echo ${BASH_REMATCH[1]}
my
$ echo ${BASH_REMATCH[2]}
example

matching a specific substring with regular expressions using awk

I'm dealing with specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option is a GNU grep extension, so it may not be available outside Linux. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz} | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
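The same multi-pass idea can also be sketched with shell parameter expansion alone, with no external tools. This relies on two assumptions taken from the question: the name always has four underscore-separated fields before RANDOMSTR, and "-W" occurs only at the start of the optional suffix. The function name extract_randomstr is made up for illustration.

```shell
#!/bin/sh
extract_randomstr() {
    name=${1%.raw.gz}        # drop the trailing .raw.gz
    name=${name#*_*_*_*_}    # drop the four leading fields (shortest match)
    name=${name%-W*}         # drop the optional -W... suffix, if present
    printf '%s\n' "$name"
}

a=$(extract_randomstr "20100613_M4_28007834.005_F_SOME-STRING.raw.gz")
b=$(extract_randomstr "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz")

printf '%s\n%s\n' "$a" "$b"
```

Unlike the regex versions, this does not validate the date or the numeric fields; it only cuts the string at the expected positions.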
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO