SED Regex rewriting between two patterns using backreferences - regex

I have text in many html files that is in this format:
<!-- BEGIN FOOTER -->
<div id="footer">
<p align="right"> Slogan<br />
5555 Street East <br />
City, State 99999 <br />
Call Us (555)555-5555 <br />
</p>
<div align="center">
<a class="footer" href="http://www.example.com" title="Site">Site</a>
</div>
<br>
</div>
<!--END FOOTER-->
I am using this:
sed -E -i 's/(<!-- BEGIN FOOTER -->)(.|\n)*(<!--END FOOTER-->)/\1 <br>REPLACE<br> \3 /m' file.html
but it does not capture and backreference the two markers:
\1 <!-- BEGIN FOOTER -->
\3 <!--END FOOTER-->
And insert this in between them:
REPLACE
So I am trying to end up with this:
<!-- BEGIN FOOTER -->
<br>REPLACE<br>
<!--END FOOTER-->

This might work for you (GNU sed):
sed '/<!-- BEGIN FOOTER -->/{:a;N;/<!--END FOOTER-->/!ba;s/\n.*\n/\n<br>REPLACE<br>\n/}' file
This gathers up lines between footers and replaces the lines between with the required string.
An alternative (similar to revo's answer):
sed '/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/!b;/<!-- BEGIN FOOTER -->/b;/<!--END FOOTER-->/!d;i\<br>REPLACE<br>' file
Another way:
sed '/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/!b;/<!--END FOOTER-->/p' file | sed '/<!-- BEGIN FOOTER -->/p;/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/c\<br>REPLACE<br>'

sed reads the input file one line at a time and writes the result to standard output. This means the pattern space never contains a \n character to match, unless one of a few special commands (such as N) has been used. You could use an address range together with the insert command i to get the desired output:
sed '/BEGIN FOOTER/,/END FOOTER/{ /END FOOTER/{i\<br>REPLACE<br>
b}; /BEGIN FOOTER/b; d;}' file
Note the line break after the first line. i inserts text before the current line. b skips the remaining commands and starts a new cycle. /BEGIN FOOTER/,/END FOOTER/ is the address range within which the other commands run.
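The range-based approach can be tried on a small sample. This is only a sketch: the scratch file and its contents are invented for the demonstration, and the script is spread over several lines purely for readability.

```shell
# Scratch file with made-up content around the two markers.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<p>before</p>
<!-- BEGIN FOOTER -->
<div id="footer">old footer</div>
<!--END FOOTER-->
<p>after</p>
EOF

# Inside the range: keep the BEGIN line, insert the new text before the
# END line and keep that line too, delete everything else.
result=$(sed '/BEGIN FOOTER/,/END FOOTER/{
  /BEGIN FOOTER/b
  /END FOOTER/{
    i\
<br>REPLACE<br>
    b
  }
  d
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Lines outside the range are never touched, so the text before and after the footer passes through unchanged.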

You can first save the lines with BEGIN FOOTER and END FOOTER in beg and end variables:
beg=$(grep -n "BEGIN FOOTER" inputfile | cut -d: -f1)
end=$(grep -n "END FOOTER" inputfile| cut -d: -f1)
Then use sed's c\ command:
sed -i.bak "$((beg+1)),$((end-1))c\<br>REPLACE<br>" inputfile
Result:
<!-- BEGIN FOOTER -->
<br>REPLACE<br>
<!--END FOOTER-->
Alternatively use the single command:
sed "/BEGIN FOO/,/END FOO/c\<!-- BEGIN FOOTER -->\n<br>REPLACE<br>\n<!--END FOOTER-->" inputfile
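The line-number approach assumes that each marker occurs exactly once and that at least one line sits between them. A hedged sketch of a guard for those assumptions follows; the helper name replace_footer and the sample file are hypothetical, not part of the answer above.

```shell
# replace_footer FILE: hypothetical helper. Replaces the lines between the
# markers only if each marker matches exactly one line.
replace_footer() {
  file=$1
  beg=$(grep -n 'BEGIN FOOTER' "$file" | cut -d: -f1)
  end=$(grep -n 'END FOOTER' "$file" | cut -d: -f1)
  # Empty (no match) or non-numeric (several matches, one per line): give up.
  case $beg in ''|*[!0-9]*) return 1 ;; esac
  case $end in ''|*[!0-9]*) return 1 ;; esac
  # There must be at least one line between the markers.
  [ $((end - beg)) -ge 2 ] || return 1
  sed -i.bak "$((beg + 1)),$((end - 1))c\\
<br>REPLACE<br>" "$file"
}

# Usage on a scratch file (contents are made up for the demonstration).
tmp=$(mktemp)
printf '%s\n' '<!-- BEGIN FOOTER -->' '<p>old</p>' '<p>older</p>' '<!--END FOOTER-->' > "$tmp"
replace_footer "$tmp"
result=$(cat "$tmp")
printf '%s\n' "$result"
rm -f "$tmp" "$tmp.bak"
```

Because the function returns non-zero when the markers are missing or repeated, it can be called in a loop over many files without rewriting the ones that do not fit the expected layout.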

Related

Search pattern between tags in html

I need to get value from a tag with specific title.
I have this command.
sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html
This is part of index.html, and I need the text 'Everything in life is luck':
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
Everything in life is luck.
Donald Trump
</div>
And I need all these values to fill an array in bash.
Your sed command is mostly good; it is just missing .* at each end of the regex to remove the text before and after the match.
This command extracts all values with your specific title:
sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html
To put into an array:
IFS=$'\n' array=( $(sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html) )
To verify your result array:
for ((i=0;i<${#array[@]};i++)); do
echo "${array[$i]}"
done
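As a hedged alternative to the IFS assignment, bash 4 or later can fill the array with mapfile, which leaves IFS alone and does not glob-expand the extracted values. The sample file below is invented for the demonstration and puts each link on a single line.

```shell
#!/usr/bin/env bash
# Scratch file with made-up content; each quote sits on one line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a title="view quote">Everything in life is luck.</a>
<a title="view quote">Second quote.</a>
<p>not a quote</p>
EOF

# One array element per line printed by sed (mapfile requires bash 4+).
mapfile -t array < <(sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' "$tmp")

for ((i = 0; i < ${#array[@]}; i++)); do
  printf '%s\n' "${array[$i]}"
done
rm -f "$tmp"
```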

Sed substitute pattern within a line

How can I substitute characters only within a specific pattern, preferably in sed but awk or otherwise if there's an easier option? I would like to replace spaces in my html h3 ids with hyphens (-), but I don't want it to hyphen the entire line.
Eg, in my foo.html:
<p>This is a paragraph which likes its spaces.</p>
<h3 id="No spaces in this id please">Keep spaces in this title</h3>
<p>Here's another paragraph with spaces.</p>
<h3 id="Another id that should be spaceless">Spaces please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
What I want are h3s like this:
<h3 id="Another-id-that-should-be-spaceless">Spaces please!</h3>
I've tried
sed -e "/^<h3 id=\"/,/\">/s/ /-/g;" <foo.html >bar.html
But this greedily adds hyphens to lines (2nd p) and parts (h3 content) which shouldn't have hyphens! Bar.html:
<p>This is a paragraph which likes its spaces.</p>
<h3-id="No-spaces-in-this-id-please">Keep-spaces-in-this-title</h3>
<p>Here's-another-paragraph-with-spaces.</p>
<h3-id="Another-id-that-should-be-spaceless">Spaces-please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
Note I'm using GNU sed. Thanks!
This sed command replaces one space at a time in the id value of h3 tags. When a substitution succeeds, the t command loops back to the :a label to look for remaining spaces to replace:
sed -e ':a;s/\(<h3[^>]*id="[^"> ]*\) \(.*\)/\1-\2/;ta;' < foo.html > bar.html
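A small run of the loop on two sample lines, as a sketch; the variable names are only for the demonstration, and the script is split across several -e options so the label stands on its own.

```shell
html='<p>Keep these spaces.</p>
<h3 id="No spaces in this id please">Keep spaces in this title</h3>'

# :a marks a label; the s command replaces the first space found inside the
# id value; t jumps back to :a only if that substitution changed something.
result=$(printf '%s\n' "$html" |
  sed -e ':a' -e 's/\(<h3[^>]*id="[^"> ]*\) \(.*\)/\1-\2/' -e 'ta')

printf '%s\n' "$result"
```

The bracket expression [^"> ]* cannot cross the closing quote of the id, so once the id contains no more spaces the substitution fails and the loop ends, leaving the heading text untouched.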

Replacing content between two tags first occurence

I am trying to replace some content between tags with the following expression
sed -i -e ':a;N;$!ba' -e 's#.*#{{{ svg "/myLogo.svg" 100 100 }}} <img src="/logo.png">#' $file
The problem I am facing is I apply it to a text like:
<div class="bar">
<a href="/" class="logo">
<svg viewBox="13.195 149.965 803 267.334">
<path fill="#6A5B53" d="M233.773,218.468l-6.19,65.33c-3.429-29.427-19.904-64.188-41.427-82.473
c0.667-1.048-0.381-8,0.476-8.856L266.773,218.468z"/>
<path fill="#FFC3B7" d="M260.583,285.894c0.476-63.331-51.236-116.662-115.424-116.377/>
</svg>
<image src="/logo.jpg"> </svg>
</a>
<ul class="newmenu"><li>Char</li>
<li>Price</li>
<li>Account</li>
<li>Login</li>
</ul> <div class="log">.....
So after I execute it, the script replaces it until the last </a> instead of replacing up to the first </a>.
How can I prevent from replacing all the way up to the last tag?
Your .* in ...class="logo">.*</a> is greedy and matches any characters up to the last </a> found.
If your opening and closing a tags are on different lines, you can use the sed c command:
sed -i -e '/<a href="\/" class="logo">/,/<\/a>/ c\
{{{ svg "/myLogo.svg" 100 100 }}} <img src="/logo.png">
' file
Explanation:
/<a href="\/" class="logo">/,/<\/a>/ : this address range matches all lines from /<a href="\/" class="logo">/ to the next /<\/a>/
c\ : the change command, which replaces the matched lines with the text that follows (the \ lets the replacement text start on the next line)
The replacement text follows the c command. If you want to replace with multiple lines, you must add a trailing \ to each line (except the last one).
To illustrate this last point:
sed -e '/fromtext/,/totext/ c\
add line 1\
add line 2\
addline 3
' file
In some sed versions (GNU sed, for example), you can write the replacement text on the same line as the c command, e.g. sed '/fromtext/ c newtext' file.
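To see that the range stops at the first closing tag, here is a sketch on a shortened, made-up version of the input; the later link on the last line is there to show that it survives.

```shell
# Scratch file with invented content: the logo link spans several lines,
# and another </a> appears further down.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div class="bar">
<a href="/" class="logo">
<svg>old logo</svg>
</a>
<ul><li><a href="/price">Price</a></li></ul>
EOF

# The range ends at the first line containing </a> after the opening tag.
result=$(sed -e '/<a href="\/" class="logo">/,/<\/a>/ c\
{{{ svg "/myLogo.svg" 100 100 }}} <img src="/logo.png">
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Note that the range would start again if a later line matched the opening pattern, so this relies on the logo link appearing only once.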

AWK sub function syntax

I have a file with the contents:
aaa.bbb.ccc ddd.eee.fff.ggg h.i.j.k
If I use the code:
awk '{sub(/\.$/, ""); print $1}' test.txt
returns aaa.bbb.ccc
awk '{sub(/\.$/, ""); print $3}' test.txt
Returns: h.i.j.k
I understand the sub function is used as: sub(regexp, replacement, target)
I don't understand this part, .$/, from the sub function.
What is the .$?
Thanks
UPDATE
Ok, i like your way of explaining things - thank you!
If i apply this to a real example,
/usr/bin/host 172.0.0.10
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
/usr/bin/host 172.0.0.10 | /bin/awk '{sub(/.$/, ""); print $5}'
gives: hostname.domain.com
/usr/bin/host 172.0.0.10| /bin/awk '{sub(/.$/, ""); print $1}'
gives: 10.0.0.172.in-addr.arpa
- The sub function will match at the end of the line, as there is a "."
- What is the "" doing?
- I don't understand how awk is splitting things into columns.
sub(/regexp/, replacement, target)
sub(/\.$/, replacement, target)
Your regexp is \.$, not .$/
\ is the escape character. It escapes the character that follows it, thus stripping it from the regex meaning and processing it literally.
. in regex matches any single character. Unless it's escaped by \ like in your example, thus it just matches the dot character .
$ simply means the end of the line.
Putting this together, \.$ is an escaped dot at the end of the line. This would match for example any end of paragraph that ends in a period.
In your example, the sub doesn't substitute anything because there is no . at the end of the line (your input ends with .k). So your first awk just prints the 1st column, and the other one prints the 3rd column.
Update
For your updated question.
Awk splits a line into columns on whitespace by default. Thus in your input, the columns are like this:
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
|----------$1---------| |-$2-| |$3| |-$4--| |--------$5--------|
In your sub command, awk finds the dot at the end of the line and replaces it with "", which is the empty string (i.e. it just deletes it).
Your first command, {sub(/.$/, ""); print $5}, prints the 5th column, which is hostname.domain.com once the . at the end has been replaced with nothing (deleted). Note that in this regex the . is no longer escaped, so the pattern matches any character at the end of the line and deletes it (it happens to be a . in your input).
Your other command, {sub(/.$/, ""); print $1}, deletes the character at the very end of the line and then prints the first column, 10.0.0.172.in-addr.arpa.
You can also set custom column separators in awk. I recommend reading an introduction or tutorial on awk to get a better understanding of how it works. E.g. simple awk tutorial
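The points above can be checked with a few one-liners. This is a sketch; the variable names and the 'abc def' input are only for the demonstration.

```shell
line='10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.'

# Default splitting on whitespace: NF is the number of columns.
nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# /\.$/ removes only a literal dot at the end of the line.
escaped=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, ""); print $5 }')

# /.$/ removes whatever the last character is, dot or not.
unescaped=$(printf '%s\n' 'abc def' | awk '{ sub(/.$/, ""); print $2 }')

# -F sets a custom separator; here the line is split on dots instead.
first=$(printf '%s\n' "$line" | awk -F. '{ print $1 }')

printf '%s\n' "$nf" "$escaped" "$unescaped" "$first"
```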
sub(regexp, replacement, target)
So here we used the regex \.$, which matches a dot at the end. In sub(/\.$/, "") we didn't give a target, so it uses $0, i.e. the whole line. If you specify a target, it removes the trailing dot only from that particular column.
awk '{sub(/\.$/, ""); print $1}' test.txt
Removes a dot only if it is present at the end of the line, then prints column 1. If there is no dot at the end, no replacement occurs.
awk '{sub(/\.$/, ""); print $3}' test.txt
Removes the dot at the end of the line and prints only column 3. Because there is no dot at the end, it returns the third (and last) column as it is.
Example:
$ cat file
aaa.bbb.ccc. ddd.eee.fff.ggg h.i.j.k.
$ awk '{sub(/\.$/, ""); print $1}' file
aaa.bbb.ccc.
$ awk '{sub(/\.$/, ""); print $3}' file
h.i.j.k
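The effect of the optional target can be seen by printing the whole line with and without it. A minimal sketch; the variable names are only for the demonstration.

```shell
line='aaa.bbb.ccc. ddd.eee.fff.ggg h.i.j.k.'

# No target: sub works on $0, so only the dot at the end of the line goes.
whole=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, ""); print }')

# Target $1: the trailing dot of the first column is removed instead.
field=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, "", $1); print }')

printf '%s\n' "$whole" "$field"
```

Assigning to a field makes awk rebuild $0 with single spaces between the columns, which does not matter here because the input is already single-spaced.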
I had one table with this format
<table width="700" border="1" align="center" cellpadding="0" cellspacing="0" bordercolor="ffcc00" bgcolor="ffcc00">
<tbody>
<th colspan="7" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style2">
Exciter Power Supply</div>
</th>
</tr>
<tr>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">+ 3 V </div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1"> + 5 V</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
</tr>
</tbody>
When I get the value of + 3 V:
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; print}'
I get the output OK'&nbsp'; + 5 V
To remove the blank space and the text of the other field, I use sub() to change the characters plus tr to remove the characters:
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; sub(/+ 5 V/, ""); print}' | tr " " " "
My output is only OK

delete html comment tags using regexp

This is how my text (html) file looks like
<!--
| |
| This is a dummy comment |
| please delete me |
| asap |
| |
________________________________
| -->
this is another line
in this long dummy html file...
please do not delete me
I'm trying to delete the comment using sed :
cat file.html | sed 's/.*<!--\(.*\)-->.*//g'
It doesn't work :( What am I doing wrong?
Thank you very much for your help!
patrickmdnet has the correct answer. Here it is on one line using extended regex:
cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
Here is a good resource for learning more about sed. This sed is an adaptation of one-liner #92
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
You would be better off using existing code instead of rolling your own.
http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed
#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
(from http://sed.sourceforge.net/grabbag/scripts/)
See this link for various ways to use perl modules for removing HTML comments (using Regexp::Common, HTML::Parser, or File::Comments.) I am sure there are methods using other utilities.
http://www.perlmonks.org/?node_id=500603
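To try the grabbag script without saving it to a file, its commands can be passed inline. This sketch uses a made-up sample; note that an empty line is left where the comment used to be.

```shell
# Scratch file with invented content: a multi-line comment, then text.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<!--
| please delete me |
-->
this is another line
please do not delete me
EOF

# Lines without <!-- are printed as is. Otherwise lines are appended with N
# until --> is seen, and the whole comment is then removed in one go.
result=$(sed '
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because .* is greedy, two comments that end up in the same pattern space are removed together with any text between them.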
I think you can do this with awk if you want. Start:
[~] $ more test.txt
<!--
An HTML style comment
-->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
Result of the awk:
[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} {if (off==0) print; if (off==2) off=0}'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
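The same idea can be written with a single flag, which may be easier to follow. This is a variant sketch rather than the command above; the name in_comment is an assumption, and like the original it drops every line that contains any part of a comment.

```shell
text='<!--
An HTML style comment
-->
Some other text
<!-- Whoops
Another comment -->
<span>Something</span>'

result=$(printf '%s\n' "$text" | awk '
  /<!--/      { in_comment = 1 }   # a comment opens on this line
  !in_comment { print }            # outside comments: keep the line
  /-->/       { in_comment = 0 }   # the comment closes on this line
')

printf '%s\n' "$result"
```

The order of the three rules matters: the print rule runs after the opening check and before the closing check, so both the first and the last line of a comment are suppressed.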
Improving (hopefully) on the awk-based answer provided by eldarerathis --
The code below addresses the concern raised by john-jones.
In this version, the prefix leading up to the start of the html comment is preserved, as is the suffix following the close of the html comment.
$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
for example
$ cat test.txt
<!--
An HTML style comment
-->
<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->
<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>
$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
<meta charset="utf-8">
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
<div>
foo
</div>
<div> </div>