sed (regex) not working properly

I have to separate out an expression from the following piece of HTML code:
<div class="summary">
<h3>Why is executing Java code in comments allowed?</h3>
<div class="tags t-java t-unicode">
java unicode
</div>
<div class="started">
modified <span title="2015-06-15 17:43:58Z" class="relativetime">yesterday</span>
zwol <span class="reputation-score" title="reputation score 52560" dir="ltr">52.6k</span>
</div>
</div>
The part I want starts at .... 'title="the following code produces the outp ..........executing Java code in comments allowed?' and runs all the way up to the end of the 'a' and 'h3' tags.
Due to various reasons, I have to only use either sed or awk.
I have tried various regular expressions. Since the required part may sometimes span multiple lines, and .* only matches up to a newline character, I have used the following sed command:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
I am getting no results with this. However, if I remove the end part:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)/\1/p;}' Trial.html
I am able to catch the beginning of my required string, and it prints up to the end.
I have also referred to this serverfault.com question, for help:
https://serverfault.com/questions/315145/regex-for-sed-to-grab-multiple-lines-or-a-better-way
Edit:
There could be other similar blocks also. I don't have to stop at the first result. I have taken the html from this page:
https://stackoverflow.com/?tab=month
This is another question which is very similar to mine!
https://unix.stackexchange.com/questions/64645/text-between-two-tags

Your line
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(\.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
That line appends every input line to the hold space; then, once the whole file has been read, it copies the hold space back into the pattern space so that the substitution can work across multiple lines.
As a modification: instead of grouping \(\.*\), which by the way is not correct because you've escaped the '.' so that it matches a literal '.' rather than any character,
you could use title="\([^<]*\), which will catch all characters up to the first '<'.
Also, if title=" is present only once in the file, then there is no need for so many characters in the first part of the pattern; ^.*title=" alone will be enough.
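A minimal sketch of the [^<]* idea, written with Python's re module instead of sed so the behaviour is easy to check. The sample markup, the href values, and the names SAMPLE, TITLE_TO_CLOSE and extract_titles are all hypothetical; the sample only imitates the <h3><a href="/questions/..." title="..."> shape that the question's pattern targets. Because [^<] also matches newline characters, a title that spans two lines is still captured once the whole file is held in one string.

```python
import re

# Hypothetical sample: two question blocks; the first title spans two lines.
SAMPLE = '''<div class="summary">
<h3><a href="/questions/1/first" class="question-hyperlink" title="first
excerpt">First question?</a></h3>
</div>
<div class="summary">
<h3><a href="/questions/2/second" class="question-hyperlink" title="second excerpt">Second question?</a></h3>
</div>'''

# [^<]* cannot run past a '<', so each capture ends at its own </a>
# instead of swallowing everything up to the last </a></h3> in the file.
TITLE_TO_CLOSE = re.compile(r'<h3><a href="/questions/[^>]*?title="([^<]*)</a></h3>')


def extract_titles(html):
    """Return everything from just after title=" up to </a></h3>, per block."""
    return TITLE_TO_CLOSE.findall(html)


for captured in extract_titles(SAMPLE):
    print(repr(captured))
```

Unlike a greedy .*, the negated class gives one result per block, which matters when the page contains several similar blocks.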

Related

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to use regex to parse HTML, but I think this is a special case that should work - without the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote giving the same result as seen in Apple Notes i.e. each "header/title" line is split in multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above from the unnecessary HTML header <h1> or <h2> (all but the first in every line) and </h1> or </h2> (all but the last in every line), to get the following result that can be imported to OneNote without problem.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to Regex, especially advanced Regex, but I think I have found a way to clean the erroneous HTML code using TWO different Regex expressions as follows.
Both work well when tested using regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and is a Positive Lookahead function (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one is used to remove unnecessary <h1> or <h2> elements and is a Positive Lookbehind function (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
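As a rough check of the one-pass pattern outside Notepad++, here is a sketch that applies it with Python's re module to the two example lines from the question. This is a different regex engine from the one Notepad++ uses, so it only illustrates the pattern's logic and says nothing about the Replace All behaviour described in the question. The names ONE_PASS, collapse_headers and EXAMPLE are made up for the illustration.

```python
import re

# The one-pass pattern from the answer. re.MULTILINE makes ^ and $ apply
# to each line rather than to the whole string.
ONE_PASS = re.compile(
    r'(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)',
    re.MULTILINE,
)


def collapse_headers(html):
    # Group 1 only participates at the start of a line; elsewhere it is
    # unmatched and Python (3.5+) substitutes an empty string for it.
    return ONE_PASS.sub(r'\1\2', html)


EXAMPLE = (
    '<div><h1>TEST </h1><h1>Title<br></h1></div>\n'
    '<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2>'
    '<h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>'
)

print(collapse_headers(EXAMPLE))
```

On these two lines the result keeps only the first opening header tag and the last closing header tag per line, which is the cleaned form the question asks for.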
In the two lines of your examples, all the header levels on a line are the same, so this regex is fine. If the first open tag and the last close tag don't match, then you have a problem, but I think your solutions also have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6] and replace it with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
In any event here is a REGEX101 screenprint with this regex working on your examples:

Using grep to extract multiple matchings in a line

I am working on data extraction from an html document with various <p...>data data</p> in the same line. I want to extract the data in each paragraph on a new line. How can I do this? I looked at this answer but the problem here is that it specifies the end with a single character and does not work with a set of characters.
Example:
<p...> data1 <b>imp</b> data2 </p>
should give me data1 <b>imp</b> data2 but instead gives data1 as it catches < of the bold tag.
EDIT : Here is one more example:
<p class="cb-col cb-col-90 cb-com-ln">Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.</p> should give me Aniket Choudhary to Warner, <b>SIX</b>, .. and Warner makes the most of the free-hit.
Assuming GNU grep (Mac users have BSD grep, and this will not work):
grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)'
This finds <p...> and then "forgets" it due to \K. Then it matches lazily, taking as few characters as possible, until it reaches the </p>. If your <p>...</p> blocks are going to be very large, this will take a long time to accomplish.
The reason for the -o flag is to return "only" the text you want.
The reason for the -z flag is so that it doesn't stop at the end of each line; it instead considers each input to go until it finds a null. If your text contains newlines between <p> and </p>, this should try to find it.
Caveat: <p>...stuff here...<p>this here</p>...more here...</p> will return
...stuff here...<p>this here
since it doesn't test that the first <p> may contain nested <p>'s.
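For comparison, a sketch of the same extraction with Python's re module, which may help where GNU grep is not available. It is not a translation of the grep flags: a capture group stands in for \K and -o, and re.DOTALL stands in for -z by letting . cross newlines. The names P_BODY and paragraph_bodies are hypothetical.

```python
import re

# Lazy match of a paragraph body. The capture group plays the role of \K/-o
# (return only the body); re.DOTALL lets '.' match newlines.
P_BODY = re.compile(r'<p[^>]*>(.*?)</p>', re.DOTALL)


def paragraph_bodies(html):
    """Return the text between each <p ...> and the next </p>."""
    return P_BODY.findall(html)


line = ('<p class="cb-col cb-col-90 cb-com-ln">Aniket Choudhary to Warner, '
        '<b>SIX</b>, .. and Warner makes the most of the free-hit.</p>')
print(paragraph_bodies(line))

# The nesting caveat applies here too: the match ends at the first </p>.
print(paragraph_bodies('<p>...stuff here...<p>this here</p>...more here...</p>'))
```

The inner <b> tag no longer cuts the match short, because the end of the match is tied to the full </p> sequence rather than to a single character.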

Use sed to delete all occurrences of <a name="foo">

I have multiple html documents and each one has many occurrences of
<a name="pIDsomestring">
where 'somestring' varies with each occurrence.
I want to delete the entire tag, as well as the
</a>
closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.
Is there an easy way to do this with sed?
HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:
<a name="foo">bar</a>
<A
NAME = "foo"
><!-- </A> --bar</>-- -->
(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the <!-- comment -- not-a-comment -- comment --> notation.)
The above is a pathological example, of course, but even perfectly-ordinary real-world HTML often has line-breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.
But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:
's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'
which will usually work as long as the whole thing is on one line.
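To see what that expression does, here is a sketch of the same alternation carried over to Python's re syntax (unescaped parentheses and |). It is meant only as an illustration of the pattern, under the same assumption that each anchor sits on one line with exactly this attribute layout; ANCHOR and strip_named_anchors are made-up names.

```python
import re

# Same structure as the sed expression: after <a name="...">, repeatedly
# accept anything that is not the start of a literal </a>, then require </a>.
ANCHOR = re.compile(r'<a name="[^"]*">((?:[^<]|<[^/]|</[^a]|</a[^>])*)</a>')


def strip_named_anchors(html):
    """Drop <a name="..."> and its closing </a>, keeping the text inside."""
    return ANCHOR.sub(r'\1', html)


sample = '<p><a name="pIDabc">Some <b>text</b></a> and <a href="x">a link</a></p>'
print(strip_named_anchors(sample))
```

Other tags inside the anchor, such as the <b> element above, survive because the alternation only refuses to step over a literal </a>, and anchors without a name attribute are left alone.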

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
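A short sketch of both patterns applied to the sample text, using Python's re module as an assumed environment (the question does not name a language). With the extended pattern, the wanted text is in the second capture group; with the lookahead pattern, every segment is returned and the wanted one is picked by index. The variable names are hypothetical.

```python
import re

text = 'Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>'

# Extended pattern: group 2 is the text between the second and third <br>.
extended = re.search(r'<br>(.*?)<br>(.*?)<br>', text)
second_segment = extended.group(2).strip()

# Lookahead pattern: the closing <br> is not consumed, so it can also open
# the next match; \s* on both sides trims the surrounding spaces.
segments = re.findall(r'<br>\s*(.*?)(?=\s*<br>)', text)

print(second_segment)
print(segments)
```

The lookahead version scales to any occurrence: segments[n - 1] is the text after the nth <br>, as long as another <br> follows it.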

Match line breaks with a regular expression

The text:
<li><a href="#">Animal and Plant Health Inspection Service Permits
Provides information on the various permits that the Animal and Plant Health Inspection Service issues as well as online access for acquiring those permits.
I want to use a regular expression to insert </a> at the end of Permits. It just so happens that all of my similar blocks of HTML/text already have a line break in them. I believe I need to find a line break \n where the line contains (or starts with) <li><a href="#">.
You could search for:
<li><a href="#">[^\n]+
And replace with:
$0</a>
Where $0 is the whole match. The exact semantics will depend on the language you are using, though.
WARNING: You should avoid parsing HTML with regex. Here's why.
By default . (any character) does not match newline characters.
This means you can simply match zero or more of any character then append the end tag.
Find: <li><a href="#">.*
Replace: $0</a>
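As an illustration of how the replacement syntax varies by language, here is the same find-and-replace sketched in Python, chosen only as an example since the question does not say which tool is in use. Python writes the whole match as \g<0> rather than $0. The function name close_anchor and the shortened sample text are assumptions.

```python
import re


def close_anchor(text):
    # '.' stops at the newline by default, so the match covers only the rest
    # of the <li> line; \g<0> re-inserts that whole match before </a>.
    return re.sub(r'<li><a href="#">.*', r'\g<0></a>', text)


sample = ('<li><a href="#">Animal and Plant Health Inspection Service Permits\n'
          'Provides information on the various permits.')
print(close_anchor(sample))
```

Lines that do not start the anchor are untouched, so the description on the following line stays as it was.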