Regex to pull out HTML items

Given the following HTML block, what would be the best regex pattern to create the following list? (Keep the URL links in the Matches collection.)
Abdominal Aortic Aneurysm see Aortic Aneurysm
Abdominal Pain
Abdominal Pregnancy see Ectopic Pregnancy
Abnormalities see Birth Defects
ABO Blood Groups see Blood and Blood Disorders
Abortion
About Your Medicines see Medicines; Over-the-Counter Medicines
ABPA see Aspergillosis
Abscess
Abuse see Child Abuse; Domestic Violence; Elder Abuse
Here is the raw input:
<li><span class="formod5"> </span></li>
<li class="item">Abdominal Aortic Aneurysm see Aortic Aneurysm</li>
<li class="item">Abdominal Pain</li>
<li class="item">Abdominal Pregnancy see Ectopic Pregnancy</li>
<li class="item">Abnormalities see Birth Defects</li>
<li class="item">ABO Blood Groups see Blood and Blood Disorders</li>
<li><span class="formod5"> </span></li>
<li class="item">Abortion</li>
<li class="item">About Your Medicines see Medicines; Over-the-Counter Medicines</li>
<li class="item">ABPA see Aspergillosis</li>
<li class="item">Abscess</li>
<li class="item">Abuse see Child Abuse; Domestic Violence; Elder Abuse</li>
<li><span class="formod5"> </span></li>
TIA

Ignore these DOM guys. They don’t know what they’re talking about, and even if they do, they haven’t answered your question, which is rude.
If that’s really all you’re trying to do, which I believe is strip tags and leave the rest, you can strip those particular tags up there that don’t contain fancy stuff with a simple:
s/<.*?>//g;
and you’ll have to convert the entities, like
s/&nbsp;//g
On arbitrary HTML, you have to be a lot more careful than this of course, because you have <script> tags and <style> tags and CDATA sections and alt=">" and all that jazz, but on the sample you presented, this will work just fine.
Don’t you have better ways of converting HTML to text than this, though?
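For what it's worth, the same two substitutions can be sketched in Python. This is only a minimal sketch on a shortened copy of the sample above: RAW and strip_tags are names made up for the illustration, and the &nbsp; in the spacer row is an assumption about what the original markup contained.

```python
import html
import re

# Shortened copy of the question's input; the &nbsp; in the spacer row is assumed.
RAW = """<li><span class="formod5">&nbsp;</span></li>
<li class="item">Abdominal Aortic Aneurysm see Aortic Aneurysm</li>
<li class="item">Abdominal Pain</li>
<li class="item">Abscess</li>"""


def strip_tags(markup):
    """Strip simple tags line by line and return the non-empty text lines."""
    items = []
    for line in markup.splitlines():
        text = re.sub(r"<.*?>", "", line)      # same idea as s/<.*?>//g
        text = html.unescape(text)             # decode entities such as &nbsp;
        text = text.replace("\xa0", " ").strip()
        if text:                               # spacer rows end up empty
            items.append(text)
    return items


print(strip_tags(RAW))
```

As with the one-liner, this is only safe on markup as plain as the sample; script blocks, comments, or a ">" inside an attribute value will break it.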

Do not use regex for this kind of thing (you wouldn't use a hammer instead of a wrench to tighten a bolt, would you?). Use a tool built for this kind of operation: an HTML DOM parser (http://simplehtmldom.sourceforge.net/) or something similar.
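The library linked above is for PHP. As a rough illustration of the same parser-based idea, here is a sketch that swaps in Python's standard html.parser module instead. ItemCollector and RAW are hypothetical names, and the sample is a shortened copy of the question's input with an assumed &nbsp; in the spacer row.

```python
from html.parser import HTMLParser

# Shortened copy of the question's input; the &nbsp; is an assumption.
RAW = """<li><span class="formod5">&nbsp;</span></li>
<li class="item">Abdominal Pain</li>
<li class="item">Abdominal Pregnancy see Ectopic Pregnancy</li>
<li class="item">Abortion</li>"""


class ItemCollector(HTMLParser):
    """Collect the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means we are outside an item

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "item":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None


collector = ItemCollector()
collector.feed(RAW)
print(collector.items)
```

The spacer rows are skipped because their li has no class="item", so no entity clean-up is needed.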

Related

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

I have just discovered around a thousand posts on our site with hidden links. They are all contained in divs with styles like this:
<div style='width:10px;height:13px;overflow:hidden'>
<div style='overflow:hidden;width:7px;height:13px'>
The width and height are all different; the only identifier is overflow:hidden.
Here is one example
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
<p>The Mirror (London, England) July 8, 2004 Byline: IAN MARKHAM-SMITH HOLLYWOOD legend Marlon Brando changed his will days before his death, it emerged last night.</p>
<p>Movie mogul Mike Medavoy revealed that before the eccentric 80-year-old succumbed to illness on Friday, he summoned lawyers and some friends to make significant changes to his estate. lastnightmovienow.net last night movie</p>
</div>
How do I create a RegEx that finds every div whose style contains overflow:hidden, and then matches any characters up until the closing div?
I tried this, but it didn't work:
<div style='.*overflow:hidden'>(.*)</div>
I think it's due to not escaping the normal HTML.
I'm a RegEx noob.
Thanks
Ollie

Thanks mate, very detailed response :)
As you say it's sketchy, worked on some posts and not others.
We solved this by adding this to the functions.php file to strip all the problematic divs out server side.
RegEx was the incorrect approach.
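function my_the_content_filter( $content ) {
    // Remove every div whose opening tag mentions overflow:hidden, together with its contents.
    // The lazy .*? stops at the first </div>, so nested divs are not handled.
    $content = preg_replace("#<div[^>]*overflow:hidden[^>]*>.*?</div>#is", "", $content);
    return $content;
}
add_filter( 'the_content', 'my_the_content_filter');
?>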
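For anyone who wants to try the pattern outside WordPress, here is a minimal Python sketch of the same substitution. HIDDEN_DIV, strip_hidden_divs and the sample post are made up for the illustration; the flags mirror the "is" modifiers of the PHP version (case-insensitive, dot matches newlines).

```python
import re

# Same pattern as the PHP filter above, with equivalent flags.
HIDDEN_DIV = re.compile(
    r"<div[^>]*overflow:hidden[^>]*>.*?</div>",
    re.IGNORECASE | re.DOTALL,
)


def strip_hidden_divs(content):
    """Remove every div whose opening tag mentions overflow:hidden."""
    return HIDDEN_DIV.sub("", content)


# Hypothetical post body with the two style orderings from the question.
post = (
    "<p>Real content</p>"
    "<div style='width:10px;height:13px;overflow:hidden'>\n<p>spam one</p>\n</div>"
    "<p>More content</p>"
    "<div style='overflow:hidden;width:7px;height:13px'><p>spam two</p></div>"
)

print(strip_hidden_divs(post))
```

Because [^>]* on both sides of overflow:hidden stays inside the opening tag, the order of the style properties does not matter. A spam div that itself contains another div would still be cut short at the first closing tag.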

Regex: Remove a <p></p> paragraph that has curly brackets inside

I would like to remove any paragraph from the article body that has curly brackets inside.
For example, from this piece of content:
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
I would like to remove this part:
<p>five − = 2 .hide-if-no-js { display: none !important; } </p>
Using the following regex: <p>.*?\{.*?\}.*?</p>
It removes the whole article instead of this paragraph that contains curly braces, for some strange reason...
What am I doing wrong with the regex code?
Thanks!

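Lazy and greedy quantifiers do not always work as intended here: the lazy .*? still starts at the first <p> and keeps expanding across tags until it finds a curly bracket. Instead of relying on them, match any character except <. This works for me: <p>[^<]*\{[^<]*</p>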
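A quick way to see the difference between the two patterns is to run both on a shortened version of the article. This is only a sketch in Python; ARTICLE is a trimmed stand-in for the question's content, and the variable names are made up.

```python
import re

# Trimmed stand-in for the article body from the question.
ARTICLE = (
    "<p>First paragraph.</p> <h2>Related Posts</h2> <p>Name</p> "
    "<p>five = 2 .hide-if-no-js { display: none !important; } </p>"
    "<h2>Recent Posts</h2>"
)

lazy = re.compile(r"<p>.*?\{.*?\}.*?</p>")   # pattern from the question
negated = re.compile(r"<p>[^<]*\{[^<]*</p>")  # pattern from this answer

# The lazy version begins at the very first <p> and runs on to the braces.
lazy_match = lazy.search(ARTICLE).group(0)

# The negated class cannot cross a "<", so it stays inside one paragraph.
negated_match = negated.search(ARTICLE).group(0)

cleaned = negated.sub("", ARTICLE)
print(lazy_match)
print(negated_match)
print(cleaned)
```

The negated class only works while the paragraph holds plain text; a paragraph with inline tags around the braces would need a parser.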

Try this:
var str = '<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>';
var result = str.replace(/<p>[^<]*\{[^<]*<\/p>/g, '');
console.log(result);

I'd suggest a two step approach (parsing and analyzing the text node).
Below you'll find examples for both Python and PHP (could be adopted for other languages, obviously):
Python:
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
html = """
<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>
"""
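soup = BeautifulSoup(html, 'lxml')
regex = r'{[^}]+}'
# find_all with string= only matches <p> tags whose sole child is a text node
for p in soup.find_all('p', string=re.compile(regex)):
    p.decompose()  # remove the paragraph from the tree
print(soup)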
PHP:
<?php
$html = "<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>";
$html = str_replace('&nbsp;', ' ', $html); // only because of the &nbsp; entities
$xml = simplexml_load_string($html);
# look for p tags
$lines = $xml->xpath("//p");
# the actual regex - match anything between curly brackets
$regex = '~{[^}]+}~';
for ($i = 0; $i < count($lines); $i++) {
    if (preg_match($regex, $lines[$i]->__toString())) {
        # unset it if it matches
        unset($lines[$i][0]);
    }
}
// vanished without a trace...
print_r($xml);
// convert it back to a string
$html = $xml->asXML();
echo $html;
?>

Regex to match only the first occurrence of an html element

Yes yes, I know, "don't parse HTML with Regex". I'm doing this in notepad++ and it's a one-time thing so please bear with me for a moment.
I'm trying to simplify some HTML code by using some more advanced techniques. Notably, I have "inserts" or "callouts" or whatever you call them, in my documentation, indicating "note", "warning" and "technical" short phrases to grab the attention of the reader on important information:
<div class="note">
<p><strong>Notes</strong>: This icon shows you something that complements
the information around it. Understanding notes is not critical but
may be helpful when using the product.</p>
</div>
<div class="warning">
<p><strong>Warnings</strong>: This icon shows information that may
be critical when using the product.
It is important to pay attention to these warnings.</p>
</div>
<div class="technical">
<p><strong>Technical</strong>: This icon shows technical information
that may require some technical knowledge to understand. </p>
</div>
I want to simplify this HTML into the following:
<div class="box note"><strong>Notes</strong>: This icon shows you something that complements
the information around it. Understanding notes is not critical but
may be helpful when using the product.</div>
<div class="box warning"><strong>Warnings</strong>: This icon shows information that may
be critical when using the product.
It is important to pay attention to these warnings.</div>
<div class="box technical"><strong>Technical</strong>: This icon shows technical information
that may require some technical knowledge to understand.</div>
I almost have the regex necessary to do a nice global search & replace in my project from notepad++, but it's not picking up "only" the first div, it's picking up all of them - if my cursor is at the beginning of my file, the "select" when I click Find is from the first <div class="something"> up until the last </div>, essentially.
Here's my expression: <div class="(.*[^"])">[^<]*<p>(.*?)<\/p>[^<]*<\/div> (notepad++ "automatically" adds the / / around it, kinda).
What am I doing wrong, here?

You have a greedy dot-quantifier while matching the class attribute — that's the evil guy who's causing your problems.
Make it non-greedy: <div class="(.*?[^"])"> or change it to a character class: <div class="([^"]*)">.
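The difference is easy to reproduce outside Notepad++. Here is a small Python sketch, assuming the ". matches newline" option was on (re.DOTALL plays that role here); HTML is a shortened two-callout sample and the variable names are made up.

```python
import re

# Shortened sample with two callouts.
HTML = (
    '<div class="note">\n'
    "<p><strong>Notes</strong>: first callout.</p>\n"
    "</div>\n"
    '<div class="warning">\n'
    "<p><strong>Warnings</strong>: second callout.</p>\n"
    "</div>"
)

# Pattern from the question: the greedy .* in the class swallows whole divs.
greedy = re.compile(
    r'<div class="(.*[^"])">[^<]*<p>(.*?)</p>[^<]*</div>', re.DOTALL
)
# Character-class fix: the class value can no longer run past its closing quote.
fixed = re.compile(
    r'<div class="([^"]*)">[^<]*<p>(.*?)</p>[^<]*</div>', re.DOTALL
)

print(len(greedy.findall(HTML)))  # one match covering both divs
print(len(fixed.findall(HTML)))   # one match per div

simplified = fixed.sub(r'<div class="box \1">\2</div>', HTML)
print(simplified)
```

In Notepad++ the replacement would be written the same way, with \1 and \2 referring to the class name and the paragraph body.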

Prestashop, Smarty template renders too many buttons

I've got this code in my PrestaShop template. There is no loop, only a conditional, yet I get 5 back buttons (the first li tag in the elseif section). Why does this happen?
{if $node.children|@count > 0 && ($smarty.get.controller!='product' && $smarty.get.controller!='category')}
<li class = "li-parent">
<asset class="menu-arrow-left"></asset>
<p><span>{$node.name|escape:'htmlall':'UTF-8'}</span></p>
{elseif $node.children|@count > 0 && ($smarty.get.controller=='product' || $smarty.get.controller=='category')}
<li class="li-back"><asset class="menu-arrow-right"></asset><p class="class="border-bottom-grandiet-small"><span>Back</span></p></li>
<li class = "li-parent">
<p><span>{$node.children[0].name|escape:'htmlall':'UTF-8'}</span></p>
{/if}

I don't see anything in this code that could display 5 back buttons. I suspect the code is included inside some kind of loop, and that's why it's rendered 5 times.
To check, replace the whole block above with:
testonly
and then look at the page or its source and count how many times testonly appears.
If the code really is inside a loop, you may need an extra condition. For example, instead of:
<li class="li-back"><asset class="menu-arrow-right"></asset><p class="class="border-bottom-grandiet-small"><span>Back</span></p></li>
you should use
{if $node.children|#iteration eq 1}
<li class="li-back"><asset class="menu-arrow-right"></asset><p class="class="border-bottom-grandiet-small"><span>Back</span></p></li>
{/if}
and probably the rest should be more similar to the first condition so instead of:
<li class = "li-parent">
<p><span>{$node.children[0].name|escape:'htmlall':'UTF-8'}</span></p>
you should use:
<li class = "li-parent">
<p><span>{$node.name|escape:'htmlall':'UTF-8'}</span></p>
but it's really hard to say without knowing the data structure and what exactly you want to achieve. If it still doesn't work, add more details to your question: explain what you want to achieve, what data your variables hold, and so on.

How to write this Regex

HTML:
<dt>
<a href="#profile-experience" >Past</a>
</dt>
<dd>
<ul class="past">
<li>
President, CEO & Founder <span class="at">at</span> China Connection
</li>
<li>
Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
</li>
<li>
Nurse & Clinic Manager <span class="at">at</span> <span>USAF</span>
</li>
</ul>
</dd>
I want to match the <li> nodes.
I wrote this regex:
<dt>.+?Past+?</dt>\s+?<dd>\s+?<ul class=""past"">\s+?(?:<li>\s*?([\W\w]+?)+?\s*?</li>)+\s+?</ul>
But it does not work.

Do not parse HTML using a regex as if it were just a big pile of text. Using a DOM parser is the proper way.

Don't use regular expressions to parse HTML...

Don't use a regular expression to match an html document. It is better to parse it as a DOM tree using a simple state machine instead.
I'm assuming you're trying to get html list items. Since you're not specifying what language you use here's a little pseudo code to get you going:
Pseudo code:
while (iterating through the text)
    if (<li> matched)
        find the position of the next </li>
        put the substring between <li> and </li> into a variable
There are of course numerous third-party libraries that do this sort of thing. Depending on your development environment, you might have a function that does this already (e.g. javascript).
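Since no language was given, here is one way the pseudo code above could look in Python, purely as a sketch. extract_list_items and SAMPLE are hypothetical names, and it assumes bare <li> tags without attributes and no nested lists, as in the question's snippet.

```python
def extract_list_items(text):
    """Return the raw contents of each <li>...</li> pair, scanning left to right."""
    open_tag, close_tag = "<li>", "</li>"
    items = []
    position = 0
    while True:
        start = text.find(open_tag, position)
        if start == -1:
            break  # no more list items
        end = text.find(close_tag, start)
        if end == -1:
            break  # unclosed item: stop rather than guess
        items.append(text[start + len(open_tag):end].strip())
        position = end + len(close_tag)
    return items


# Shortened copy of the question's markup.
SAMPLE = """<ul class="past">
<li>
President, CEO & Founder <span class="at">at</span> China Connection
</li>
<li>
Nurse & Clinic Manager <span class="at">at</span> <span>USAF</span>
</li>
</ul>"""

for item in extract_list_items(SAMPLE):
    print(item)
```

The inner span tags are kept as they are; removing them would be a second pass or a job for a real parser.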

Which language do you use?
If you use Python, you should try lxml: http://lxml.de. With lxml, you can search for the node with tag ul and class "past". You then retrieve its children, which are li, and get text of those nodes.
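The steps described above (find the ul with class "past", walk its li children, read their text) can be sketched as follows. To keep the example free of third-party installs it swaps in the standard library's xml.etree.ElementTree rather than lxml itself; lxml offers a similar ElementTree-style interface, but treat the mapping as an assumption. MARKUP is a tidied copy of the question's HTML with & written as &amp; so that it parses as XML, and past_positions is a made-up name.

```python
import xml.etree.ElementTree as ET

# Tidied copy of the question's markup, made well-formed for an XML parser.
MARKUP = """<dl>
<dt><a href="#profile-experience">Past</a></dt>
<dd>
<ul class="past">
<li>President, CEO &amp; Founder <span class="at">at</span> China Connection</li>
<li>Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span></li>
</ul>
</dd>
</dl>"""


def past_positions(markup):
    """Return the text of each <li> under <ul class="past">."""
    root = ET.fromstring(markup)
    positions = []
    for li in root.findall(".//ul[@class='past']/li"):
        text = "".join(li.itertext())            # text of the li and its children
        positions.append(" ".join(text.split()))  # normalise whitespace
    return positions


print(past_positions(MARKUP))
```

Real-world HTML is rarely well-formed XML, which is where a forgiving HTML parser such as lxml's earns its keep.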

If you are trying to extract from or manipulate this HTML, xPath, xsl, or CSS selectors in jQuery might be easier and more maintainable than a regex. What exactly is your goal and in what framework are you operating?

please learn to use jQuery for this sort of thing
