Sed substitute pattern within a line - regex

How can I substitute characters only within a specific pattern, preferably in sed, though awk or another tool is fine if that's easier? I would like to replace the spaces in my HTML h3 ids with hyphens (-), but I don't want to hyphenate the entire line.
E.g., in my foo.html:
<p>This is a paragraph which likes its spaces.</p>
<h3 id="No spaces in this id please">Keep spaces in this title</h3>
<p>Here's another paragraph with spaces.</p>
<h3 id="Another id that should be spaceless">Spaces please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
What I want are h3s like this:
<h3 id="Another-id-that-should-be-spaceless">Spaces please!</h3>
I've tried
sed -e "/^<h3 id=\"/,/\">/s/ /-/g;" <foo.html >bar.html
But this greedily adds hyphens to lines (the 2nd p) and parts (the h3 content) that shouldn't have hyphens! bar.html:
<p>This is a paragraph which likes its spaces.</p>
<h3-id="No-spaces-in-this-id-please">Keep-spaces-in-this-title</h3>
<p>Here's-another-paragraph-with-spaces.</p>
<h3-id="Another-id-that-should-be-spaceless">Spaces-please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
Note I'm using GNU sed. Thanks!

This sed command replaces one space at a time in the id value of h3 tags. Whenever a substitution succeeds, the t command loops back to the :a label to look for remaining spaces to replace:
sed -e ':a;s/\(<h3[^>]*id="[^"> ]*\) \(.*\)/\1-\2/;ta;' < foo.html > bar.html
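As a quick check, the loop can be run against two lines of the sample input. This is only a sketch: the variable names input and result are illustrative, and the script is split into separate -e parts on the assumption that this is friendlier to non-GNU seds, which tend to be stricter about what may follow a label.

```shell
# Two lines from the question's foo.html.
input='<p>This is a paragraph which likes its spaces.</p>
<h3 id="No spaces in this id please">Keep spaces in this title</h3>'

# :a marks the loop start; each pass replaces the first remaining space
# inside the id value; ta jumps back to :a as long as a substitution happened.
result=$(printf '%s\n' "$input" |
  sed -e ':a' -e 's/\(<h3[^>]*id="[^"> ]*\) \(.*\)/\1-\2/' -e 'ta')

printf '%s\n' "$result"
```

The paragraph line and the h3 title keep their spaces, because [^>]* cannot cross the > that closes the opening tag.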

Related

SED - remove attribute from HTML tag

I want to remove a specific attribute (name in my example) from an HTML tag; the attribute might be in a different position on each line of my file.
Example Input:
<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">
Expected output:
<img src="https://websiteurl.com/286.jpg" alt="img">
My code:
sed 's/name=".*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
but it only prints <img >; I am losing the src attribute as well.
Notes:
name attribute might be in any position in a tag (not necessarily at the beginning)
you can use sed, awk, Perl, or anything you like, it should work on the command line
Your sed expression matches everything up to the last " in the line, because .* is greedy. It should have been
sed 's/ name="[^"]*"//g'
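To see the difference, here is a small side-by-side run of the greedy expression from the question and the negated-class version. The variable names tag, greedy and fixed are just illustrative.

```shell
tag='<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'

# .* runs to the LAST double quote on the line, so src and alt are removed too.
greedy=$(printf '%s\n' "$tag" | sed 's/name=".*"//g')

# [^"]* stops at the first closing quote, so only the name attribute goes.
fixed=$(printf '%s\n' "$tag" | sed 's/ name="[^"]*"//g')

printf '%s\n%s\n' "$greedy" "$fixed"
```

Since the pattern includes the space before name, it works wherever the attribute sits after the tag name.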
With your shown samples, please try the following. Written and tested in GNU awk.
awk '/^<img/ && match($0,/src.*/){print substr($0,1,4),substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: Using the sub (substitute) function of awk.
awk '/^<img/{sub(/name="[^"]*" /,"")} 1' Input_file
Explanation:
1st solution: Uses awk's match function to match from src to the end of the line, then prints the first 4 characters of the line (<img), a space, and the matched text.
2nd solution: If the line starts with <img, substitutes everything from name=" through the next " (plus the following space) with an empty string, then prints the current line.

Replacing content between two tags first occurence

I am trying to replace some content between tags with the following expression
sed -i -e ':a;N;$!ba' -e 's#.*#{{{ svg "/myLogo.svg" 100 100 }}} <img src="/logo.png">#' $file
The problem I am facing is I apply it to a text like:
<div class="bar">
<a href="/" class="logo">
<svg viewBox="13.195 149.965 803 267.334">
<path fill="#6A5B53" d="M233.773,218.468l-6.19,65.33c-3.429-29.427-19.904-64.188-41.427-82.473
c0.667-1.048-0.381-8,0.476-8.856L266.773,218.468z"/>
<path fill="#FFC3B7" d="M260.583,285.894c0.476-63.331-51.236-116.662-115.424-116.377/>
</svg>
<image src="/logo.jpg"> </svg>
</a>
<ul class="newmenu"><li>Char</li>
<li>Price</li>
<li>Account</li>
<li>Login</li>
</ul> <div class="log">.....
So after I execute it, the script replaces everything up to the last </a> instead of only up to the first </a>.
How can I prevent it from replacing all the way up to the last tag?
Your .* in ...class="logo">.*</a> is greedy and matches any characters up to the last </a> found.
If your opening and closing a tags are on different lines, you can use the sed c command:
sed -i -e '/<a href="\/" class="logo">/,/<\/a>/ c\
{{{ svg "/myLogo.svg" 100 100 }}} <img src="/logo.png">
' file
Explanation :
/<a href="\/" class="logo">/,/<\/a>/ : this address range matches all lines from /<a href="\/" class="logo">/ to the next /<\/a>/
c\ : the change command, which replaces the matching lines with the text that follows (the \ lets the replacement text start on the next line)
The replacement text follows the c command. If you want to replace with multiple lines, you must add a trailing \ to each line (except the last one).
To illustrate this last point :
sed -e '/fromtext/,/totext/ c\
add line 1\
add line 2\
add line 3
' file
In some sed versions (such as GNU sed), you can write the replacement text on the same line after the c (e.g. sed '/fromtext/ c newtext' file).
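A minimal run of the range-plus-c approach on a cut-down input, to show that the range ends at the first </a> and that a later link is left alone. The sample markup and the variable names input and result are made up for illustration; -i is left out so nothing is edited in place.

```shell
# Hypothetical cut-down input: the logo link followed by an unrelated link.
input='<div class="bar">
<a href="/" class="logo">
<svg>old logo</svg>
</a>
<a href="/about">About</a>
</div>'

# The range starts at the logo link and ends at the first following line
# that contains </a>; c\ replaces the whole range with the text below it.
result=$(printf '%s\n' "$input" | sed -e '/<a href="\/" class="logo">/,/<\/a>/ c\
<img src="/logo.png">
')

printf '%s\n' "$result"
```

The replacement text is emitted once for the whole range, not once per matching line.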

Find a regexp in awk

I have a file with a line like this:
<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;"> </span><span class="link" role="link"> </span></div>
The important bit I want to catch is the 159 in:
,6">159</div>
I can catch it fine with grep:
cat c |grep ',6\">[0-9]\+<'
Now, what I want to do, is actually catch the number itself (159) and print it out.
Note that the actual file I have has several of those lines. Ideally, only the numbers will print out.
I thought I could do it with awk:
cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '
But nope, nothing gets printed out.
Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?
This one-liner is an alternate way to do that (using an XPath expression which matches div elements containing a cellposition attribute value ending with ',6'):
xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159
A pragmatic approach:
cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
-o causes grep to only report the matching part of each line.
awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
As for why your awk command didn't work:
awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
Even with that fixed, the command wouldn't work, because, by default, awk splits each line into fields by whitespace, so $1 will simply report the 1st whitespace-separated token on each matching line, irrespective of the regular expression that caused the match.
The grep-based solution above even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:
cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'
The non-POSIX gensub() replaces regex matches and returns the modified string, while crucially also supporting backreferences to capture groups, which the POSIX sub() and gsub() functions do not.
The above matches the entire line, then replaces it with the captured number only (via (escaped) backreference \1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.
While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
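For comparison, a POSIX-only version might look roughly like the sketch below. It uses match() with RSTART and RLENGTH and then trims the fixed prefix and suffix with sub() instead of split(). The shortened sample line and the variable names line, result and m are illustrative, and like the gawk version it handles at most one match per line.

```shell
# Shortened, made-up version of the sample line.
line='<div cellposition="15,2"><div cellposition="15,6">159</div></div>'

result=$(printf '%s\n' "$line" | awk '
  match($0, /,6">[0-9]+</) {
    m = substr($0, RSTART, RLENGTH)  # the whole match, prefix and suffix included
    sub(/^,6">/, "", m)              # drop the leading ,6">
    sub(/<$/, "", m)                 # drop the trailing <
    print m
  }')

printf '%s\n' "$result"
```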
Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.

Sed grabbing tags & newlines (Mac OSX)

I have this text where I need to remove the page numbers:
<p class="p3">El gabinete se iba iluminando lentamente ... Por delante de las</p>
<p class="p5"><span class="s4"><i>32</i></span> grandes nubes de un color violeta obscuro...</p>
<p class="p3">
I need to remove
</p>
<p class="p5"><span class="s4"><i>32</i></span>
from it.
So far I have this
sed -E -i '' 's/</p>\n<p class="p[0-9]+"[^>]*><span class=".+">.+<\/span> / /g' Capítulo1.html
But that is not working. It works without the </p>\n part, but I really need to capture and replace the </p> too.
Note this is on Mac and sed seems to be a bit different from Linux.
Also, the paragraph classes can be anything starting with p followed by a number, and similarly the span class is s followed by a number; the italic tags may or may not be there, and in between is the page number.
Unless the newlines really matter, you could try stripping them out first:
tr -d '\n' | sed ...
You missed escaping the forward slash of the closing paragraph tag; try this:
's/<\/p>\r?\n<p class="p\d+"[^>]*><span class=".+">.+<\/span> / /g' Capítulo1.html
For a more complete match as you've described, try this:
's/<\/p>\r?\n<p class="p\d+"[^>]*?><span class="s\d+">(<i>)?\d+(<\/i>)?<\/span>/ /g' Capítulo1.html
This more specifically narrows down the span class matching, and adds non-greediness to stop any unexpected surprises when a huge chunk of data is removed between a span opening tag and the furthest matching span closing tag.

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up until the <div id="stupid">. I know there are whitespace characters so I used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -p -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
You need to let . match across newlines with the /s modifier (/m only changes where ^ and $ match); otherwise .* stops at the end of each line.
perl -p
works on the file line by line (see perl.com).
That means your regex never sees all the lines at once: it only matches on the line that starts with <div id="footer">, and it no longer matches on the following lines.