Sed insert right after <body...> - regex

I want to add a line of HTML right after the opening <body...> tag. However, the file contains no newlines, which makes this somewhat difficult.
For example, I want to insert <p>This is the first line</p>, which I read from a text file, after the <body...> tag:
<html><head><title>Some title</title></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Some text goes here
Everything I've tried so far only inserts after the first newline of the matched string. How can I insert after body when the body styling is always different?

It is not advisable to manipulate HTML with sed. However, you can try:
sed -i.bak 's~<body[^>]*>~&<p>This is the first line</p>~' file.html
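Since the question reads the line from a text file, here is a minimal sketch of that variant. In the command above, [^>]* matches whatever attributes the body tag carries and & re-inserts the matched tag. The sample HTML and the variable names below are assumptions made for illustration; the snippet is escaped first so that any \, & or ~ in it is not interpreted by sed.

```shell
# Assumed sample input: a one-line HTML document with a styled <body> tag.
html='<html><head><title>Some title</title></head><body style="word-wrap: break-word; ">Some text goes here'

# Assumed snippet; in practice it could come from a file, e.g. snippet=$(cat snippet.txt).
snippet='<p>This is the first line</p>'

# Escape backslash, ampersand and the ~ delimiter, which are special in a sed replacement.
escaped=$(printf '%s' "$snippet" | sed 's/[\\&~]/\\&/g')

# & stands for the matched <body ...> tag, so the snippet lands right after it.
result=$(printf '%s\n' "$html" | sed "s~<body[^>]*>~&${escaped}~")
printf '%s\n' "$result"
```

Add -i.bak and a file name instead of the pipe to edit a file in place while keeping a backup.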

Using grep to extract multiple matchings in a line

I am working on data extraction from an html document with various <p...>data data</p> in the same line. I want to extract the data in each paragraph on a new line. How can I do this? I looked at this answer but the problem here is that it specifies the end with a single character and does not work with a set of characters.
Example:
<p...> data1 <b>imp</b> data2 </p>
should give me data1 <b>imp</b> data2 but instead gives data1 as it catches < of the bold tag.
EDIT : Here is one more example:
<p class="cb-col cb-col-90 cb-com-ln">Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.</p> should give me Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.
Assuming GNU grep (macOS ships BSD grep, which does not support -P, so this will not work there):
grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)'
This finds <p...> and then "forgets" it because of \K, which drops everything matched so far from the reported match. It then matches lazily, one character at a time, until the lookahead sees </p>. If your <p>...</p> blocks are very large, this can take a long time.
The -o flag returns only the matched text rather than the whole line.
The -z flag makes grep treat the input as NUL-terminated records instead of lines, so a match can span newlines between <p> and </p>. Each match in the output is then also terminated by a NUL rather than a newline.
Caveat: <p>...stuff here...<p>this here</p>...more here...</p> will return
...stuff here...<p>this here
because the pattern stops at the first </p> and does not account for nested <p> elements.
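A small sketch of the command in use, assuming GNU grep built with PCRE support. The sample input is made up for illustration, and tr turns the NUL separators produced by -z into newlines so that each paragraph ends up on its own line.

```shell
# Assumed sample input: two paragraphs on a single line.
html='<p class="a">data1 <b>imp</b> data2</p><p>second</p>'

# -P: Perl-compatible regex, -o: print only the match, -z: NUL-separated records.
# tr converts the NUL terminators of the matches into newlines.
result=$(printf '%s' "$html" | grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)' | tr '\0' '\n')
printf '%s\n' "$result"
```

Note that <p[^>]*> would also match other tags whose names start with p, such as <pre>, so the pattern may need tightening for real pages.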

sed (regex) not working properly

I have to separate out an expression from the following piece of HTML code:
<div class="summary">
<h3>Why is executing Java code in comments allowed?</h3>
<div class="tags t-java t-unicode">
java unicode
</div>
<div class="started">
modified <span title="2015-06-15 17:43:58Z" class="relativetime">yesterday</span>
zwol <span class="reputation-score" title="reputation score 52560" dir="ltr">52.6k</span>
</div>
</div>
The part I want starts at .... 'title="the following code produces the outp ..........executing Java code in comments allowed?' and runs all the way up to the end of the 'a' and 'h3' tags.
For various reasons, I can only use sed or awk.
I have tried various regular expressions. Since the required part may sometimes span multiple lines, and .* matches only up to a newline character, I have used the following sed command:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
I am getting no results with this. However, if I remove the end part:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)/\1/p;}' Trial.html
I am able to catch the beginning of my required string, and it prints up to the end.
I have also referred to this serverfault.com question, for help:
https://serverfault.com/questions/315145/regex-for-sed-to-grab-multiple-lines-or-a-better-way
Edit:
There could be other similar blocks also. I don't have to stop at the first result. I have taken the html from this page:
https://stackoverflow.com/?tab=month
This is another question which is very similar to mine!
https://unix.stackexchange.com/questions/64645/text-between-two-tags
Your line
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(\.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
That line collects the whole file in the hold space and, once the last line has been read, copies it back into the pattern space so that the substitution can work across multiple lines.
One idea for a modification: the group \(\.*\) is not correct, because the '.' is escaped there, so it matches a literal '.' rather than any character.
Instead you could use title="\([^<]*\), which captures every character up to the first '<'.
Also, if title=" occurs only once in the file, then the long first part of the pattern is unnecessary; ^.*title=" is enough.
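A minimal sketch of that suggestion. The markup below is an assumed stand-in, since the anchor tag is not visible in the HTML quoted in the question; the href and class values are hypothetical. It has a single title=" whose value spans two lines.

```shell
# Assumed sample markup with one title attribute that contains a newline.
html='<div class="summary">
<h3><a href="/questions/1/example" class="question-hyperlink" title="first line
second line">Why?</a></h3>
</div>'

# 1h;1!H collects every line in the hold space; on the last line, g copies it back
# so the substitution sees the whole file. [^<]* captures up to the first '<'.
result=$(printf '%s\n' "$html" | sed -n '1h;1!H;${g;s/^.*title="\([^<]*\).*$/\1/p;}')
printf '%s\n' "$result"
```

Because ^.* is greedy, this picks the last title=" in the input, so it only behaves as intended when there is one such attribute.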

Convert html tag using sed

I have a tag such as the following:
<div style="position:absolute;opacity:0.5" class="header">Home</div>
(there may or may not be a style or other attribute) and using sed I need to convert it to a span where the id of the span is the class of the div:
<span style="position:absolute;opacity:0.5" id="header">Home</span>
I know how to do this in PHP but unfortunately my Linux is lacking :).
The regex to find the eligible DIVs is something along the lines of:
#<div .* id=(.*)>.*</div>#
but I don't know how to write the replacement part, mainly because I need to keep the content between the div tag name and the id. It's 4:45 am so that may have something to do with it as well :p.
I'd appreciate any help on this, thank you.
Using sed, and if you want more specific handling:
sed '/<div/{s/<div /<span /;s/ class *=/ id=/;s!</div!</span!}' input
Still, this assumes that the start and close tags are on the same line and that there is a single div tag on that line. It also assumes that there is only one class attribute on that line.
A stricter command is:
sed 's!<div\([^>]*\) class *= *\([^>]*\)>\([^<]*\)</div>!<span\1 id=\2>\3</span>!g' input
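A quick sketch of the stricter command on two sample lines, one with class as the last attribute and one with class first. The second input line is an assumption added for illustration.

```shell
# Sample input: the div from the question plus an assumed one with class listed first.
html='<div style="position:absolute;opacity:0.5" class="header">Home</div>
<div class="footer" style="color:red">End</div>'

# \1 = attributes before class, \2 = the class value plus anything after it, \3 = the text.
result=$(printf '%s\n' "$html" | sed 's!<div\([^>]*\) class *= *\([^>]*\)>\([^<]*\)</div>!<span\1 id=\2>\3</span>!g')
printf '%s\n' "$result"
```

Divs that contain nested tags are left untouched, because \([^<]*\) only accepts plain text between the opening and closing tag.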
sed 's/div/span/;s/class/id/' foo.html
will output
<span style="position:absolute;opacity:0.5" id="header">Home</div>
where foo.html is your document.
PAY ATTENTION
This replaces only the first occurrence of div and class on each line. To replace all of them, add the "g" flag at the end of each substitution, like s/div/span/g
Just as important, if you want to overwrite your document (that is, replace the occurrences "in place"), proceed as follows: sed -i.bak 's/div/span/;s/class/id/' foo.html
One last thing: as Basile Starynkevitch correctly says in the comments, sed may not be the best choice.

regular expression to copy an html tag to another place

In HTML, I need to copy the text inside the title tag to a new line, as an h1 tag, in a certain place.
Example:
<TITLE>any text here, including special characters like .,:- or whataever</TITLE>
The output should leave that title as is. Then, under this line:
<font face="Arial"><span style="font-size: 14pt">
it adds
<h1 align="center">any text here, including special characters like .,:- or whataever</h1>
I use Notepad++ to search and replace, but I don't think it would work in this case, so please suggest any other simple and effective Windows program to use, with steps if any. I need to make this edit in bulk, for many files.
Thanks in advance,
As this question is still open, and you have a pretty bad accept rate, I'll try to offer you a solution:
Search for:
<TITLE>(.*)</TITLE>(.*?)<font face="Arial"><span style="font-size: 14pt">
Make sure to check the ". matches newline" option in the search box. Then use the following replacement:
<TITLE>\1</TITLE>\2<font face="Arial"><span style="font-size: 14pt">\n<h1>\1</h1>
Worked in my notepad++.
The regex you need to use should be like:
/^<TITLE>(.*)<\/TITLE>/
Now I'll give you a small Perl snippet that takes in a line, matches it against the above regex and prints the text in the title tags wrapped in <h1>...</h1>. Remember that in regexes, parentheses are used to "group" matches, or in layman's terms, to store them for later reference. The script:
#!/usr/bin/perl
use strict;
use warnings;
my $line = <STDIN>;   # read one line from standard input
chomp $line;          # drop the trailing newline
# print the captured title only if the line actually matched
print "<h1>$1</h1>\n" if $line =~ /^<TITLE>(.*)<\/TITLE>/;
Here, all text between TITLE tags is grouped and hence stored in the variable $1.
When I do:
echo "<TITLE>lama</TITLE>" | perl script.pl
I get:
<h1>lama</h1>
Hope that helps.
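For the bulk case, the same capture-and-reinsert idea can be sketched with sed, assuming a sed port is available on the machine and that the title and the font line each sit on a line of their own. The sample input below is made up for illustration.

```shell
# Assumed sample document: title on one line, the target font line further down.
html='<TITLE>any text here</TITLE>
<font face="Arial"><span style="font-size: 14pt">
body'

# On the title line: save it (h), turn the pattern space into the <h1> line (s),
# then swap (x) so the title prints unchanged and the <h1> line waits in the hold space.
# On the font line: G appends the held <h1> line underneath it.
result=$(printf '%s\n' "$html" | sed \
  -e '/<TITLE>/{h;s/.*<TITLE>\(.*\)<\/TITLE>.*/<h1 align="center">\1<\/h1>/;x;}' \
  -e '/<font face="Arial"><span style="font-size: 14pt">/G')
printf '%s\n' "$result"
```

If the font line occurs more than once in a file, every occurrence gets the heading appended, so the address may need narrowing for such files.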

regex for all characters on yahoo pipes

I have an apparently simple regex query for Pipes: I need to truncate each item from its (<img>) tag onwards. I thought a loop with a string regex replacing <img[.]* with a blank field would take care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially, I only want the title text and hyperlink; that's why I'm chopping the rest off.
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract the href from a certain link in the page (in this case the first one) using regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.
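You need to remove the line returns first: use a Regex module to replace the pattern [\r\n] with empty text in the content or description field, so that it becomes a single line of text. Then you can use the .* wildcard, which runs to the end of the line. Note also that inside brackets, as in <img[.]*, the dot is a literal '.', so the pattern should be <img.* instead.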
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
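The two steps can be tried outside Pipes with standard tools. This is only a sketch of the regex logic, not of a Pipes module, and the item text is a shortened, assumed version of the sample in the question.

```shell
# Assumed item content spanning several lines.
item='sample text title
<a rel="nofollow" href="http://example.com"><img border="0" src="http://example.com/image.png"/></a>
<a href="http://example.com/other">other</a>'

# Step 1: delete line returns so the item becomes one line.
# Step 2: drop everything from the first <img to the end of that line.
result=$(printf '%s\n' "$item" | tr -d '\r\n' | sed 's/<img.*//')
printf '%s\n' "$result"
```

Without step 1, .* would stop at the end of the line holding the img tag, and the later lines would survive.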