Find a block of descriptive text inside HTML using regex

I'm trying to figure out how to use look-ahead to try to capture the descriptive text in an html page such as
<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (<?php ?> <%php ?> <% %>). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />
I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.
The closest I've gotten so far is:
(([^.<]){1,500})<
Which still misses on things like periods and other characters before and after the string.

Your regex will match anything that's neither "." nor "<" 1 to 500 times, then a "<".
Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:
<div class="itemBanner"> - explicit match
() - parenthetical wrap (capturing group) for referencing, e.g. match[1]
.*? - any length of characters, non-greedily (as few as possible)
<\/div> - explicit match, with escaped '/'
to form this Ruby regex:
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/
match = item_banner_div_regex.match(html)
inside_item_banner_div = match && match[1]
Note: The exact regex will depend on the implementation you're using.
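One thing to watch for with the sample markup: the text sits on different lines from the opening tag, and in Ruby "." does not match a newline unless the /m flag is set. Here is a minimal sketch of the difference. The closing </div> tags and the shortened text in the sample are assumptions, since the snippet in the question is cut off before them.

```ruby
# Sample markup based on the question; the closing </div> tags are assumed,
# because the original snippet ends before them.
html = <<~HTML
  <div class="itemBanner" style="float:left; padding:10px">
  <div style="padding-right:5px; padding-bottom:5px">
  <div class="itemBanner">
  HTML Tags Stripper is designed to strip HTML tags from the text.
  <b>Known issues:</b><br />
  </div>
  </div>
  </div>
HTML

# Without /m, "." stops at newlines, so nothing matches across lines.
single_line_match = /<div class="itemBanner">(.*?)<\/div>/.match(html)

# With /m, "." also matches newlines, so the lazy group can span several lines.
multi_line_match = /<div class="itemBanner">(.*?)<\/div>/m.match(html)
inside_item_banner_div = multi_line_match && multi_line_match[1].strip

puts single_line_match.inspect
puts inside_item_banner_div
```

Note that the explicit opening tag only matches the inner div, because the outer one carries an extra style attribute.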

Related

Regular Expression to exclude a String around the required String

In between a HTML code:
...<div class="..."><a class="..." href="...">I need this String only</a></div>...
How do I write Regular Expression (for Rainmeter which uses Perl RegEx) such that:
- the required string "I need this String only" is captured in a group so it can be extracted,
- the HTML link tag <a>...</a> may be absent or present, may appear in between the required string, and may appear multiple times.
My attempt:
(?siU) <div class="...">.*[>]{0,1}(.*)[</a>]{0,1}</div>
where:
.* = captures every character except newline {<a class ... "}
[>]{0,1} = accepts 0 or 1 occurrences of > {up to >}
(.*) = captures my string
[</a>]{0,1} = accepts 0 or 1 occurrences of </a>
This, of course, doesn't work as I want: the output includes the HTML link markup preceding my string.
So my question is: how do I write a better (and working) regex?
Even though I agree with the advice to use a real parser for this problem, this regular expression should solve your problem:
<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>
Logic:
require <div ...> at the beginning and </div> at the end.
allow and ignore <a ...> before the matched text arbitrarily many times
allow and ignore </a> after the matched text arbitrarily many times
ignore any text before any <a ...> with [^<>]* in front of it. Using .* would also work, but then it would skip all text arbitrarily up to the last instance of <a ...> in your string.
I use [^<>]* instead of .* to match non-tag text in a protected way, since literal < and > are not allowed.
I use (?:...) to group without capturing. If that is not supported in your programming language, just use (...) instead, and adjust which match you use.
Caveat: this won't be fully general but should work for your problem as described.
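To check the logic above, here is a small sketch in Ruby rather than Rainmeter, so the flags from the question are left out. The class and href values in the sample strings are placeholders I made up for illustration.

```ruby
# %r{...} avoids having to escape the "/" in the closing tags.
pattern = %r{<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>}

# Hypothetical inputs: one with the link tag, one without it.
with_link    = '<div class="x"><a class="y" href="z">I need this String only</a></div>'
without_link = '<div class="x">I need this String only</div>'

from_link  = pattern.match(with_link)[1]
from_plain = pattern.match(without_link)[1]

puts from_link   # I need this String only
puts from_plain  # I need this String only
```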

How to Match Redundant Lines From Contenteditable Div in Regex

I'm trying to process the html inside a contenteditable div. It might look like:
<div>Hi I'm Jack...</div>
<div><br></div>
<div><br></div>
<div>More text.</div> *<div><br></div>*
*<div><br></div>**<div><br></div>*
*<div><br></div>*
*<div>
<br>
</div>*
What regular expression would match all trailing <div><br></div> but not the ones sandwiched between useful divs containing text, i.e., <div> text (not html) </div>?
I have enclosed all expressions I want to match in asterisks. The asterisks are for reference only and are not part of my string.
Thanks,
Jack
You can use the pattern:
(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)
Let me know if this works for all your cases and I will write a detailed explanation of the pattern.
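For what it's worth, here is a rough sketch of that pattern run against the sample from the question, written in Ruby as an assumption about the environment. Two details: [\n\s] is equivalent to \s, so the sketch uses the shorter form, and the lookahead has to see past line breaks, which in Ruby requires the /m flag (other engines call this "dotall" or the s flag). Without it the lookahead stops at the end of the current line.

```ruby
# The sample from the question, without the reference asterisks.
html = <<~HTML
  <div>Hi I'm Jack...</div>
  <div><br></div>
  <div><br></div>
  <div>More text.</div> <div><br></div>
  <div><br></div><div><br></div>
  <div><br></div>
  <div>
  <br>
  </div>
HTML

# An empty div matches only if no text-only div appears anywhere after it.
trailing_empty_divs = /<div>\s*<br>\s*<\/div>(?!.*?<div>[^<]+<\/div>)/m

trailing = html.scan(trailing_empty_divs)
puts trailing.size  # 5

# Removing the matches leaves the two leading empty divs in place.
cleaned = html.gsub(trailing_empty_divs, "").rstrip
puts cleaned
```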

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match "here. And some".
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
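Both variants can be tried quickly. The sketch below uses Ruby, which is my assumption since the question doesn't name a language. With the lookahead version the closing <br> is not consumed, so it can serve as the opening tag of the next match, and the nth segment is simply the nth element of the result.

```ruby
text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"

# Extended pattern: the second capture group is the second segment (untrimmed).
second_by_group = text[/<br>(.*?)<br>(.*?)<br>/, 2].strip

# Lookahead pattern: scan collects every segment between consecutive <br> tags.
segments = text.scan(/<br>\s*(.*?)(?=\s*<br>)/).flatten
second_by_index = segments[1]

puts second_by_group   # here. And some
puts segments.inspect  # ["text goes", "here. And some", "more can", "go here."]
```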

groovy - regex to retrieve inner html tag

I want to match the inner part of the string between the span tags, where it is guaranteed that the id of each span tag starts with blk.
How can I match this with Groovy?
Example :
<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>
According to the example above, I want to get:
match
between
starts
I tried the following, but it returns null:
def html='''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>'''
html=html.findAll(/<span id="blk(.)*">(.)*<\/span>/).join();
println html;
Rather than messing around with Regular Expressions, why not just parse the HTML and then extract the nodes from it?
@Grab( 'net.sourceforge.nekohtml:nekohtml:1.9.18' )
import org.cyberneko.html.parsers.SAXParser
def html = '''<p>
| I wanted to try to <span id="blk1">match</span> the inner part
| of the string<span id="blk2"> between </span> the span tags <span>where</span>
| it is guaranteed that the id of this span tags <span id="blk3">starts</span>
| with blk.
|</p>'''.stripMargin()
def content = new XmlSlurper( new SAXParser() ).parseText( html )
List<String> spans = content.'**'.findAll { it.name() == 'SPAN' && it.@id?.text()?.startsWith( 'blk' ) }*.text()
You seem to have span on one side and strong on the other.
In addition, you should be careful with using .* alone: quantifiers are greedy by default, so it will match as much of the string as it can in one go. You should usually make it lazy by using .*?
When you use (.)* to match the text between tags, you will not get the actual text from that group, but only the last character that was matched. You need to put the quantifier inside the capturing group, as in (.*?).
Using [^<>]+ is a much better way to match text between HTML tags. It is similar to .* except for a few points:
It will match any character except "<" and ">".
It requires at least one character, so it will not match an empty span.
Furthermore, if you can ensure that what follows "blk" will always be an integer, I recommend using \d+ to match it.
html=html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/).join();
That being said, I have little experience in Groovy. If you want just those three words, the following regex uses a lookbehind and a lookahead so that the match itself contains only the text between the tags.
html=html.findAll(/(?<=span id="blk\d">)([^<>]+)(?=<\/span>)/).join();
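Since I can't vouch for the Groovy side, here is the same idea sketched in Ruby, where the regex syntax for these patterns is the same. It shows both the capture-group point (the quantifier outside the group keeps only the last character) and the [^<>]+ approach.

```ruby
html = '<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>'

# Quantifier outside the group: the group is overwritten on every repetition,
# so only the last matched character survives.
last_char_only = html[/<span id="blk1">(.)*?<\/span>/, 1]

# Quantifier inside the group, [^<>]+ for the text, \d+ for the numeric suffix.
words = html.scan(/<span id="blk\d+">([^<>]+)<\/span>/).flatten.map(&:strip)

puts last_char_only  # h
puts words.inspect   # ["match", "between", "starts"]
```

The span without an id ("where") is skipped because the pattern requires the blk prefix.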

Get html content with RegEx between several tags

For example, given this HTML: <div id="test"><div><div>testtest</div></div></div></div></div></div>
From that HTML, I need to get this: <div id="test"><div><div>testtest</div></div></div>
My current regex: /<div id=\"test\">.*(<\/div>){3}/gim
Since you have the specific requirement of needing exactly three closing tags, this regular expression should do the trick:
(<div.*?>)+.*?(</div>){3}
The trick here is to use the lazy star (*?) to keep the catch-all (.) character from matching more than you'd like.
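A quick way to see the difference between the greedy and lazy versions, sketched in Ruby (an assumption, since the question's /gim flags suggest another language, but the matching behavior shown here is the same):

```ruby
html = '<div id="test"><div><div>testtest</div></div></div></div></div></div>'

# Greedy: ".*" runs to the end and backtracks just enough to leave three
# closing tags, so all six closing tags end up inside the match.
greedy = html[%r{<div id="test">.*(</div>){3}}]

# Lazy: ".*?" stops at the first position where three closing tags follow.
lazy = html[%r{(<div.*?>)+.*?(</div>){3}}]

puts greedy  # the whole input string
puts lazy    # <div id="test"><div><div>testtest</div></div></div>
```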