Using grep to extract multiple matchings in a line - regex

I am working on data extraction from an html document with various <p...>data data</p> in the same line. I want to extract the data in each paragraph on a new line. How can I do this? I looked at this answer but the problem here is that it specifies the end with a single character and does not work with a set of characters.
Example:
<p...> data1 <b>imp</b> data2 </p>
should give me data1 <b>imp</b> data2 but instead gives data1 as it catches < of the bold tag.
EDIT : Here is one more example:
<p class="cb-col cb-col-90 cb-com-ln">Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.</p> should give me Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.

Assuming GNU grep (Mac users have BSD grep, and this will not work):
grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)'
This finds <p...> and then "forgets" it due to \K. It then matches lazily, consuming as few characters as possible, until it reaches the </p>. If your <p>...</p> blocks are going to be very large, this can take a long time.
The -o flag returns "only" the text you want.
The -z flag keeps grep from stopping at the end of each line; it instead treats the input as records terminated by a null byte. If your text contains newlines between <p> and </p>, the match can still span them.
Caveat: <p>...stuff here...<p>this here</p>...more here...</p> will return
...stuff here...<p>this here
since it doesn't test that the first <p> may contain nested <p>'s.
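For readers without GNU grep (for example on a Mac), the same idea can be sketched in Python. This is only an illustration and not part of the original answer: the names P_BLOCK and paragraph_contents are made up here, and the \b after <p is an added assumption so that tags such as <pre> are not picked up.

```python
import re

# Lazy match between an opening <p ...> and the nearest </p>.
# re.DOTALL lets "." cross newlines, which plays the role of grep's -z flag.
# The \b (an addition, not in the grep pattern) avoids matching <pre> and similar tags.
P_BLOCK = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)


def paragraph_contents(html):
    """Return the inner text of every <p>...</p> block, in document order."""
    return P_BLOCK.findall(html)


sample = (
    '<p class="a"> data1 <b>imp</b> data2 </p>'
    "<p>second\nparagraph</p>"
)

for content in paragraph_contents(sample):
    print(repr(content))
```

The same caveat applies as with grep: for nested input such as <p>a<p>b</p>c</p> the lazy match stops at the first </p>.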

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to parse HTML with regex, but I think this is a special case that should work, apart from the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote giving the same result as seen in Apple Notes i.e. each "header/title" line is split in multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above from the unnecessary HTML header <h1> or <h2> (all but the first in every line) and </h1> or </h2> (all but the last in every line), to get the following result that can be imported to OneNote without problem.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to Regex, especially advanced Regex, but I think I have found a way to clean the erroneous HTML code using TWO different Regex expressions as follows.
Both work well when tested using regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and is a Positive Lookahead function (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one is used to remove unnecessary <h1> or <h2> elements and is a Positive Lookbehind function (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
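To try the one-pass pattern outside Notepad++, here is a minimal Python sketch. It assumes Python 3.5 or later, where an unmatched group in the replacement (here \1 after the start of the line) becomes an empty string; the names HEADER_TAGS and merge_headers are hypothetical. Notepad++ uses a different regex engine, so behaviour there may differ.

```python
import re

# The one-pass pattern from the answer. re.MULTILINE makes ^ and $ apply
# to each line, matching how the pattern is used line by line in the editor.
HEADER_TAGS = re.compile(
    r"(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)",
    re.MULTILINE,
)


def merge_headers(html):
    """Keep the first <hN> and the last </hN> on each line and drop the others."""
    return HEADER_TAGS.sub(r"\1\2", html)


line1 = "<div><h1>TEST </h1><h1>Title<br></h1></div>"
line2 = (
    "<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2>"
    "<h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>"
)

print(merge_headers(line1))
print(merge_headers(line2))
```

On the two example lines from the question this yields the desired output shown there.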
In the two lines of your examples all the header levels were the same on the line and this regex is fine. If the first open and last close don't match, then you have a problem but I think your solutions also have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*<h)[1-6] and replace it with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
In any event here is a REGEX101 screenprint with this regex working on your examples:

Use sed to delete all occurrences of <a name="foo">

I have multiple html documents and each one has many occurrences of
<a name="pIDsomestring">
where 'somestring' varies with each occurrence.
I want to delete the entire tag, as well as the </a> closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.
Is there an easy way to do this with sed?
HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:
<a name="foo">bar</a>
<A
NAME = "foo"
><!-- </A> --bar</>-- -->
(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the <!-- comment -- not-a-comment -- comment --> notation.)
The above is a pathological example, of course, but even perfectly-ordinary real-world HTML often has line-breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.
But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:
sed 's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'
which will usually work with GNU sed (\| is a GNU extension) as long as the whole thing is on one line.
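The long alternation in the sed expression is there because sed has no lazy quantifier. As a rough comparison, and with the same one-line, verify-by-hand caveats, a Python sketch could look like this (NAMED_ANCHOR and strip_named_anchors are names invented for the example):

```python
import re

# Matches <a name="..."> ... </a> on a single line and captures the inner text.
# The lazy (.*?) stops at the first </a>, so no hand-written alternation is needed.
NAMED_ANCHOR = re.compile(r'<a name="[^"]*">(.*?)</a>')


def strip_named_anchors(line):
    """Remove <a name="..."> and its closing </a>, keeping the text in between."""
    return NAMED_ANCHOR.sub(r"\1", line)


print(strip_named_anchors('x <a name="pIDabc">Title</a> y <a href="u">link</a>'))
```

Anchors without a name attribute, such as the href link in the sample, are left untouched.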

sed (regex) not working properly

I have to separate out an expression from the following piece of HTML code:
<div class="summary">
<h3>Why is executing Java code in comments allowed?</h3>
<div class="tags t-java t-unicode">
java unicode
</div>
<div class="started">
modified <span title="2015-06-15 17:43:58Z" class="relativetime">yesterday</span>
zwol <span class="reputation-score" title="reputation score 52560" dir="ltr">52.6k</span>
</div>
</div>
The part which I want starts from .... 'title="the following code produces the outp ..........executing Java code in comments allowed?' all the way up to the end of the 'a' and 'h3' tags.
Due to various reasons, I have to only use either sed or awk.
I have tried various regular expressions. Since the required part may sometimes span multiple lines, I have used the following sed command (since .* matches only up to a newline character):
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
I am getting no results with this. However, if I remove the end part:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)/\1/p;}' Trial.html
I am able to catch the beginning of my required string, and it prints up to the end.
I have also referred to this serverfault.com question, for help:
https://serverfault.com/questions/315145/regex-for-sed-to-grab-multiple-lines-or-a-better-way
Edit:
There could be other similar blocks also. I don't have to stop at the first result. I have taken the html from this page:
https://stackoverflow.com/?tab=month
This is another question which is very similar to mine!
https://unix.stackexchange.com/questions/64645/text-between-two-tags
Your line
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(\.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
That line puts everything in the hold space; then, once the file has been read, g copies it back to the pattern space so it can be parsed as multi-line text.
A suggested modification: the group \(\.*\) is not correct as written, because the '.' is escaped there and so matches a literal '.', not any character.
Instead, you could use title="\([^<]*\), which will capture all characters up to the first '<'.
Also, if title=" is present only once in the file, then there is no need for so much text in the first part of the pattern; ^.*title=" will be enough.
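The [^<]* suggestion can be tried quickly in Python, where a negated character class also matches newlines, so no hold-space trick is needed. This is only a sketch: the sample markup below is hypothetical (the HTML quoted in the question does not show the anchor tag), and TITLE_TO_H3_END and extract_titles are invented names.

```python
import re

# Capture everything from just after title=" up to the closing </a></h3>.
# [^<]* cannot run past the next "<", so each match stays inside one anchor,
# and it happily spans line breaks.
TITLE_TO_H3_END = re.compile(r'<h3><a href="/questions/[^>]*title="([^<]*)</a></h3>')


def extract_titles(html):
    """Return the captured part of every matching <h3><a ...> block."""
    return TITLE_TO_H3_END.findall(html)


# Hypothetical input shaped like the pattern in the question's sed command.
html = (
    '<h3><a href="/questions/1/a" class="question-hyperlink" '
    'title="first\nsummary">Q one?</a></h3>\n'
    "<div>unrelated</div>\n"
    '<h3><a href="/questions/2/b" class="question-hyperlink" '
    'title="second">Q two?</a></h3>'
)

for captured in extract_titles(html):
    print(repr(captured))
```

Because findall returns every match, this also covers the case from the edit where several similar blocks appear in one file.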

Convert html tag using sed

I have a tag such as the following:
<div style="position:absolute;opacity:0.5" class="header">Home</div>
(there may or may not be a style or other attribute) and using sed I need to convert it to a span where the id of the span is the class of the div:
<span style="position:absolute;opacity:0.5" id="header">Home</span>
I know how to do this in PHP but unfortunately my Linux is lacking :).
The regex to find the eligible DIVs is something along:
#<div .* id=(.*)>.*</div>#
but I don't know how to write the replacement part, mainly because I need to keep the content between the div tag name and the id. It's 4:45 am so that may have something to do with it as well :p.
I'd appreciate any help on this, thank you.
Using sed, and if you want more specific handling:
sed '/<div/{s/<div /<span /;s/ class *=/ id =/;s!</div!</span!}' input
still, this assumes start and close tags are on the same line, and there is a single div tag on that line. Also it assumes that the class attribute is the only one on that line.
A more strict command is:
sed 's!<div\([^>]*\) class *= *\([^>]*\)>\([^<]*\)</div>!<span\1 id=\2>\3</span>!g' input
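Since the question mentions being more at home outside sed, here is the stricter substitution sketched in Python for comparison. It makes the same assumptions as the sed command (the whole element is on one line and has no nested tags) plus one more: the class value is double-quoted. DIV_WITH_CLASS and div_to_span are hypothetical names.

```python
import re

# Groups: 1 = attributes before class, 2 = quoted class value,
#         3 = attributes after class, 4 = text content (no nested tags).
DIV_WITH_CLASS = re.compile(
    r'<div\b([^>]*?)\s+class\s*=\s*("[^"]*")([^>]*)>([^<]*)</div>'
)


def div_to_span(html):
    """Turn <div ... class="x">text</div> into <span ... id="x">text</span>."""
    return DIV_WITH_CLASS.sub(r"<span\1 id=\2\3>\4</span>", html)


print(div_to_span('<div style="position:absolute;opacity:0.5" class="header">Home</div>'))
print(div_to_span('<div class="header">Home</div>'))
```

A div without a class attribute does not match and is returned unchanged.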
sed 's/div/span/;s/class/id/' foo.html
will output
<span style="position:absolute;opacity:0.5" id="header">Home</div>
where foo.html is your document.
PAY ATTENTION
This will replace only the first occurrence of div and class on each line. If you want to replace all of them, you have to add the "g" flag at the end of each substitution, like s/div/span/g
And, no less important, if you want to overwrite your document (that is, replace the occurrences "in place"), you have to proceed in the following way: sed -i -e 's/div/span/;s/class/id/' foo.html
Last thing: as Basile Starynkevitch correctly says in the comments, maybe sed isn't the best choice.

REGEX Pattern - How do I match upto a certain tag in html

I have some html which I want to grab between 2 tags. However, nested tags exist in the html, so looking for </div> wouldn't work, as it would stop at the first nested div.
Basically I want my regex to..
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
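The greedy versus lazy difference is easy to see side by side. A small Python sketch on the same input as the PHP one-liner (the variable names are just for illustration):

```python
import re

text = "<div>Test<div>foo</div></div>"

# Greedy: .* runs to the LAST </div> it can reach.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy: .*? stops at the FIRST </div>.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Lazy, stopping at the start of the first nested div instead.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(greedy)         # Test<div>foo</div>
print(lazy)           # Test<div>foo
print(before_nested)  # Test
```

Note that neither variant understands nesting; each only changes where the match stops.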
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/
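For comparison, the parser route recommended above copes with nesting because it can count how deep it is inside the target div. A minimal sketch using Python's standard html.parser module follows; the class DivContent and the helper div_text are hypothetical names, the input is a condensed, well-formed version of the question's example, and only the text (not the inner markup) is collected.

```python
from html.parser import HTMLParser


class DivContent(HTMLParser):
    """Collect the text inside the <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # a nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1               # entering the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1              # reaches 0 when the target div closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    parser = DivContent(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = (
    '<div id="test" class="whatever"><div class="wrapper">'
    '<fieldset>Test</fieldset><div class="testclass">some info</div>'
    '</div><!-- end test div--></div>'
    '<div id="test2" class="endFind">after</div>'
)

print(div_text(html, "test"))
```

The depth counter is what the regex cannot do: it keeps collecting until the div that was opened with id="test" is actually closed.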