The sed command is not working with regex

I'm parsing the output of an HTTP GET request with sed to retrieve the contents of a given HTML tag. The result of that request is like this:
"<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"
And I want to retrieve the version number inside the p element.
However, sed seems to have a bug in regex parsing.
When I use:
sed 's/.*<p>//'
It correctly replaces the text at the left of the version (i.e., it outputs "v1.0.4-b</p></body></html>"). But, when I try to use regex groups, with
sed 's/.*<p>(.*)<\/p>.*/\1/'
It fails to match and gives an error:
sed: -e expression #1, char 20: invalid reference \1 on `s' command's RHS.
Despite that, when I test the regex on online regex validators it works.
Thank you in advance

You can use either of the following (POSIX BRE first, POSIX ERE second):
sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p'
sed -n -E 's~.*<p>([^<]*)</p>.*~\1~p'
See the online demo:
#!/bin/bash
sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< \
"<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"
## => v1.0.4-b
The sed 's/.*<p>(.*)<\p>.*/\1/' command would not work because:
You are using a POSIX BRE pattern, where unescaped ( and ) are treated as literal parenthesis characters, not as a capturing group. In POSIX BRE, you need \(...\) to define a capturing group (this is why you get the invalid reference \1 error).
If you add the -E option to enable POSIX ERE, you can use (...) to define a capturing group.
You are not matching /p; you have \p in the pattern.
As there are slashes in the pattern, it is more convenient to choose a regex delimiter other than /; I chose ~ here.
Also, I used the -n option to suppress the default line output and the p flag to print only the result of the substitution.
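To see what -n and the p flag do together, here is a minimal sketch that wraps the BRE command in a shell function. The name extract_p and the second input line are assumptions made for illustration, not part of the original answer. A line without a p element produces no output instead of being echoed unchanged.

```shell
# extract_p is a hypothetical helper name used only for this illustration.
extract_p() {
  # -n suppresses automatic printing; the p flag prints only lines
  # on which the substitution actually succeeded.
  sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p'
}

match=$(printf '%s\n' '<html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>' | extract_p)
# Hypothetical input without a <p> element.
no_match=$(printf '%s\n' '<html><body><h1>Hello!</h1></body></html>' | extract_p)

printf 'match: [%s]\n' "$match"       # match: [v1.0.4-b]
printf 'no match: [%s]\n' "$no_match" # no match: []
```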

Related

bash tool to search and replace text (while leaving text in the middle the same)

I have text files that look like this:
foo(bar(some_id))
I want to replace that with
bleh(some_id)
I can come up with the regex to find the instances, which is: foo\(bar\([a-zA-Z0-9_]+\)\). But I don't know how to express that I want to keep the text in the middle the same.
Any suggestions? (I'm thinking of using sed or awk or any standard bash tool, whichever is easier.)
You can use
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/'
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/'
The first pattern is POSIX ERE compliant, hence the -E option.
The foo\(bar\(([^()]*).* POSIX ERE pattern matches foo(bar(, then captures zero or more chars other than ( and ) into Group 1 (\1 refers to this group value from the replacement pattern), and then matches the rest of the string. After the replacement, the Group 1 value remains. Text before foo(bar( is not matched and is therefore kept; you may add .* at the start if that text should be removed as well.
The second sed command is the POSIX BRE equivalent of the above command.
See an online demo:
s='foo(bar(some_id))'
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
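Because the patterns above end in .*, everything after the captured text is dropped from the line. If the surrounding text must survive, one possible variant (an assumption on my part, not taken from the answer) matches the closing )) explicitly and adds the g flag for repeated occurrences. The sample string below is made up for the illustration.

```shell
# Hypothetical input with text around two occurrences.
s='x = foo(bar(some_id)); y = foo(bar(other_id));'

# Match the closing "))" explicitly instead of ".*", so text after the
# match is kept; g replaces every occurrence on the line.
out=$(printf '%s\n' "$s" | sed -E 's/foo\(bar\(([^()]*)\)\)/bleh(\1)/g')

printf '%s\n' "$out"
# => x = bleh(some_id); y = bleh(other_id);
```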
Using sed
$ sed 's/.*\(([^)]*)\).*/bleh\1/' input_file
bleh(some_id)

Bash script to enclose words in single quotes

I'm trying to write a bash script to enclose words contained in a file with single quotes.
Word - Hello089
Result - 'Hello089',
I tried the following regex but it doesn't work. This works in Notepad++ with find and replace. I'm not sure how to tweak it to make it work in bash scripting.
sed "s/(.+)/'$1',/g" file.txt > result.txt
Replacement backreferences (also called placeholders) are written with the \n syntax in sed, not $n (which is the Perl-like backreference syntax).
Note that you do not need groups here, though, since you want to wrap whole lines in single quotation marks. This is also why you do not need the g flag: it is only necessary when you expect multiple matches on the same line.
You can use the following POSIX BRE and ERE (the one with -E) solutions:
sed "s/..*/'&',/" file.txt > result.txt
sed -E "s/.+/'&',/" file.txt > result.txt
In the POSIX BRE (first) solution, ..* matches any char and then any 0 or more chars (thus emulating the common .+ PCRE pattern). The POSIX ERE (second) solution uses the .+ pattern to do the same. The & in the right-hand side inserts the whole match (GNU sed also accepts \0 for this). Certainly, you may enclose the whole match in capturing parentheses and then use \1, but that is redundant:
sed "s/\(..*\)/'\1',/" file.txt > result.txt
sed -E "s/(.+)/'\1',/" file.txt > result.txt
Note the escaping: capturing parentheses in POSIX BRE must be escaped.
See the online sed demo.
s="Hello089";
sed "s/..*/'&',/" <<< "$s"
# => 'Hello089',
sed -E "s/.+/'&',/" <<< "$s"
# => 'Hello089',
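$1 is expanded by the shell before sed sees it, but it's the wrong back reference anyway. You need \1. You also need to escape the parentheses that define the capture group. Because the sed script is enclosed in double quotes, backslashes deserve care: the shell treats a backslash specially only before $, `, ", \ or a newline, so \1 reaches sed unchanged, and doubling the backslashes before the parentheses (as done here) is harmless.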
$ echo "Hello089" | sed "s/\\(.*\\)/'\1',/g"
'Hello089',
(I don't recall if there is a way to specify single quotes using an ASCII code instead of a literal ', which would allow you to use single quotes around the script.)
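One common way to keep the script in single quotes, sketched here under the assumption of a POSIX-compatible shell, is to end the quoted string, add a backslash-escaped single quote, and reopen the string. This is a shell quoting idiom, not a sed feature.

```shell
s='Hello089'

# Each '\'' closes the single-quoted string, adds an escaped single quote,
# and reopens the string, so sed receives the script s/..*/'&',/
out=$(printf '%s\n' "$s" | sed 's/..*/'\''&'\'',/')

printf '%s\n' "$out"
# => 'Hello089',
```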

Extract QueryString value using sed

I have the following lines in an Apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract the MSISDN value only, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command, but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - the key/value pair separator in URL query strings, so you can rely on it
([^&]+) - 1st captured group, containing one or more characters other than &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
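The command above requires an & directly before MSISDN and echoes non-matching lines unchanged. A possible variant, sketched under the assumption that the parameter may also come first in the query string, accepts either ? or & before the name and uses -n with the p flag to skip lines without it. The function name extract_msisdn and the second and third sample lines are hypothetical.

```shell
# extract_msisdn is a hypothetical helper name used only for this illustration.
extract_msisdn() {
  # [?&] accepts the parameter at the start or in the middle of the query string;
  # -n plus the p flag prints only lines where the substitution succeeded.
  sed -n -E 's/.*[?&]MSISDN=([^&]+).*/\1/p'
}

out=$(printf '%s\n' \
  '/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah' \
  '/sms/receiveHLRLookup?MSISDN=647930229656&Ported=No' \
  '/index.html' | extract_msisdn)

printf '%s\n' "$out"
# => 647930229655
# => 647930229656
```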
-o: print only the matching part of the line, not the whole line.
-P: enable PCRE regex syntax.
\K: discard everything matched so far from the reported match; the text to its left must still be present in the input.
\d: a digit; + means one or more of them.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed command may also help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with an empty string in the current line.
; the semicolon separates this command from the next one to be executed.
s/&.*//: substitutes everything from the first remaining & to the end of the line with an empty string.
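To make the two substitutions easier to follow, this short sketch runs them one at a time on the first log line and keeps the intermediate result. The variable names step1 and step2 are only illustrative.

```shell
line='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'

# Step 1: remove everything up to and including "MSISDN=".
step1=$(printf '%s\n' "$line" | sed 's/.*MSISDN=//')

# Step 2: remove everything from the first remaining "&" to the end of the line.
step2=$(printf '%s\n' "$step1" | sed 's/&.*//')

printf '%s\n' "$step1"  # 647930229655&blah
printf '%s\n' "$step2"  # 647930229655
```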
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
The -o option shows only the matched output.
The -P option enables PCRE (Perl Compatible Regular Expressions).
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike normal regex constructs. Hence the only matched output you get is \d+, which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe cut into cut:
cut -d '&' -f3 Input_file |cut -d '=' -f2

Backreferences in sed returning wrong value

I am trying to replace an expression using sed. The regex works in vim but not in sed. I'm replacing the last dash before the number with a slash so
/www/file-name-1
should return
/www/file-name/1
I am using the following command but it keeps outputting /www/file-name/0 instead
sed 's/-[0-9]/\/\0/g' input.txt
What am I doing wrong?
You must enclose the data in parentheses to reference it later, and sed starts counting groups at 1. To insert the whole match without using parentheses, use the & symbol.
sed 's/-\([0-9]\)/\/\1/g' input.txt
That yields:
/www/file-name/1
You need to capture using parentheses before you can back-reference (numbering starts at \1). Try sed -r 's|(.*)-|\1/|':
$ sed -r 's|(.*)-|\1/|' <<< "/www/file-name-1"
/www/file-name/1
You can use any delimiter with sed, so / isn't the best choice when the substitution contains /. The -r option enables extended regexps, so the parentheses don't need to be escaped.
It seems sed under OS X starts counting backreferences at 1. Try \1 instead of \0
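The difference between & and \1 in this case can be checked with a small sketch. It uses | as the delimiter to avoid escaping the slash, and the variable names are only illustrative.

```shell
s='/www/file-name-1'

# & inserts the whole match, including the dash that should be replaced.
with_amp=$(printf '%s\n' "$s" | sed 's|-[0-9]|/&|')

# \1 inserts only the digit captured by \( \).
with_group=$(printf '%s\n' "$s" | sed 's|-\([0-9]\)|/\1|')

printf '%s\n' "$with_amp"    # /www/file-name/-1
printf '%s\n' "$with_group"  # /www/file-name/1
```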

PCRE regex to sed regex

First of all, sorry for my bad English. I'm a German guy.
The code given below is working fine in PHP:
$string = preg_replace('/href="(.*?)(\.|\,)"/i','href="$1"',$string);
Now I need the same for sed. I thought it should be:
sed 's/href="(.*?)(\.|\,)"/href="{$\1}"/g' test.htm
But that gives me this error:
sed: -e expression #1, char 36: invalid reference \1 on `s' command's RHS
sed does not support non-greedy regex match.
sed -e 's|href=\"\(.[^"][^>]*\)\([.,]\)\">|href="\1">|g' file
You need a backslash in front of the parentheses you want to reference, thus
sed 's/href="\(.*?\)(.|\,)"/href="{$\1}"/g' test.htm
You have to escape the block selector characters ( and ) as follows.
sed 's/href="\(.*?\)\(.|\,\)"/href="{$\1}"/g' test.htm
Here is a solution. It is not perfect, as it only deals with the case of one extra "," or ".":
sed -r -e 's/href="([^"]*)([.,]+)"/href="\1"/g' test.htm
If you want to match a literal ".", you need to escape it or use it in a character class. As an alternative to backslash-escaping the capturing parentheses (which you need to do with basic REs), you can use the -E option to tell sed to use extended REs. Lastly, the REs used by sed use \N to refer to subpatterns, where N is a digit.
sed -E "s/href=([\"'])([^\"']*)[.,]\1/href=\1\2\1/i"
This has its own issue that will prevent matches of href attributes that use both types of quotes.
man sed and man re_format will give more information on REs as used in sed.
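Since sed has no non-greedy quantifier, a common workaround is a negated character class: [^"]* cannot run past the closing quote, which is what .*? followed by " achieves in PCRE. The sketch below assumes double-quoted href values only, and the example.com URLs are made up for the illustration.

```shell
# Hypothetical input with a trailing "." and "," inside the href values.
html='<a href="http://example.com/page.">x</a> <a href="http://example.com/b,">y</a>'

# [^"]* stays inside the quoted value, so each href is handled separately;
# the trailing "." or "," is matched outside the group and therefore dropped.
out=$(printf '%s\n' "$html" | sed -E 's/href="([^"]*)[.,]"/href="\1"/g')

printf '%s\n' "$out"
# => <a href="http://example.com/page">x</a> <a href="http://example.com/b">y</a>
```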