Regex: Remove a <p></p> paragraph that has curly brackets inside

Regex: Remove a <p></p> paragraph that has curly brackets inside - regex

I would like to remove any paragraph for article body that has curly brackets inside.
For example, from this piece of content:
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
I would like to remove this part:
<p>five − = 2 .hide-if-no-js { display: none !important; } </p>
Using the following regex: <p>.*?\{.*?\}.*?</p>
It removes the whole article instead of this paragraph that contains curly braces, for some strange reason...
What am I doing wrong with the regex code?
Thanks!

Lazy / greedy quantifiers not always work as intended, instead of them match the string excluding <, this works for me: <p>[^<]*\{[^<]*</p>

Try this:
var str = '<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>';
var result = str.replace(/(<p>[^<]*\{.*<\/p>)/, '');
console.log(result);
Regex Demo

I'd suggest a two step approach (parsing and analyzing the text node).
Below you'll find examples for both Python and PHP (could be adopted for other languages, obviously):
Python:
# -*- coding: utf-8> -*-
import re
from bs4 import BeautifulSoup
html = """
<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
regex = r'{[^}]+}'
for p in soup.find_all('p', string=re.compile(regex)):
p.replaceWith('')
print soup
PHP:
<?php
$html = "<html>
<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology & What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five − = 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>";
$html = str_replace(' ', ' ', $html); // only because of the
$xml = simplexml_load_string($html);
# look for p tags
$lines = $xml->xpath("//p");
# the actual regex - match anything between curly brackets
$regex = '~{[^}]+}~';
for ($i=0;$i<count($lines);$i++) {
if (preg_match($regex, $lines[$i]->__toString())) {
# unset it if it matches
unset($lines[$i][0]);
}
}
// vanished without a sight...
print_r($xml);
// convert it back to a string
$html = echo $xml->asXML();
?>

I'd suggest a two step approach (parsing and analyzing the text node). Below you'll find examples for both Python and PHP (could be adopted for other languages, obviously):

Related

Remove all HTML tags from a cell

I'm trying to remove all the HTML tags and comments within the following cell in Google Sheets:
<div class="prod-desc" itemprop="description">
<div class="row">
<div class="col-md-8">
<p>This is a 100 count box of the ACC-DX01A Proximity Card to be used with any of our DX line of Access Control Readers. It is the size of a credit card so it can easily fit into your wallet. Use these like a proximity card and carry them on your key ring for easy access. </p>
<p> Please note: To add a DX Card or FOB to the DX Access Control System, you must use the Auto/Add Function. If you need assistance, FREE US based tech support is just a phone call away. </p>
</div>
<!-- Description Side Bar START ************************************ -->
<div class="col-md-4"> <img src="/images_templ/Accesss-Control_product-image.jpg"> <span class="boxtitle ">Full Line of Access Control</span> <span style="font-size: 18px; font-family: inherit; font-weight: 400">Access Control Proximity Card Readers and Electronic Door Locks and more!</span> </div>
<!-- Description Side Bar END ************************************ -->
</div>
</div>
So ideally the input should come out as:
This is a 100 count box of the ACC-DX01A Proximity Card to be used with any of our DX line of Access Control Readers. It is the size of a credit card so it can easily fit into your wallet. Use these like a proximity card and carry them on your key ring for easy access.
Please note: To add a DX Card or FOB to the DX Access Control System, you must use the Auto/Add Function. If you need assistance, FREE US based tech support is just a phone call away.
Full Line of Access Control Access Control Proximity Card Readers and Electronic Door Locks and more!
I've searched around found several answers, however, none of them seems to be working for me, maybe it's because of the new lines and carriage returns? I don't know. What I want to do is remove all the HTML and keep all the newlines and carriage returns in the text. Here are some posts that I was following:
Remove HTML In Google Sheets Cells
https://superuser.com/questions/564727/html-tags-in-google-spreadsheet

try like this:
=ARRAYFORMULA(TEXTJOIN(CHAR(10), 1,
TRIM(SPLIT(REGEXREPLACE(A1, "</?\S+[^<>]*>", ), CHAR(10)))))

Yes. Besides the answer that #player0 gave, you can also use 'Search and Replace' ctrl+H And then just paste all you wish to change/remove and replace it with nothing. It works for more than 1 cell too.
Its more laborious but you can target the entire book or ranges if needed.

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

I have just discovered around a thousands posts on our site with hidden links. They are all contained in divs the styles like this:
<div style='width:10px;height:13px;overflow:hidden'>
<div style='overflow:hidden;width:7px;height:13px'>
The width and height are all different, the only identifier is the overflow:hidden
Here is one example
<div style='width:10px;height:13px;overflow:hidden'>
<p>BRANDO CHANGED WILL IN LAST DAYS.(News)</p>
<p>The Mirror (London, England) July 8, 2004 Byline: IAN MARKHAM-SMITH HOLLYWOOD legend Marlon Brando changed his will days before his death, it emerged last night.</p>
<p>Movie mogul Mike Medavoy revealed that before the eccentric 80-year-old succumbed to illness on Friday, he summoned lawyers and some friends to make significant changes to his estate. lastnightmovienow.net last night movie</p>
</div>
How do I create a RegEx that finds every day with the style that contains overflow:hidden then any character, set of character etc up until the closing div.
I tried this, but didn't work
<div style='.*overflow:hidden'>(.*)</div>
I think it's due to not escaping the normal HTML.
I'm a RegEx noob.
Thanks
Ollie

Thanks mate, very detailed response :)
As you say it's sketchy, worked on some posts and not others.
We solved this by adding this to the functions.php file to strip all the problematic divs out server side.
RegEx was the incorrect approach.
function my_the_content_filter( $content ) {
$content = preg_replace("#<div[^>]*overflow:hidden[^>]*>.*?</div>#is", "", $content);
return $content;
}
add_filter( 'the_content', 'my_the_content_filter');
?>

Re.Sub Expected string or buffer

I read and try this, this and this solutions. But my error not solving.
def read_news(self, news_url):
try:
self.checkRequests(news_url)
result = self.soup.find("div", {'class', 'm-article__entry'})
if isinstance(result, bs4.element.Tag):
for r in result:
re.sub("<.*?>", "", r)
print ('result', result)
return result.lower()
return
except Exception, e:
self.insertErrorLog('thevergetech.read_news', news_url, e)
This is code's part occuring error. I run my code in debugging mode and result return this:
<div class="m-article__entry">
<p>Malloy Aeronautics, a UK-based company, has been slowly and publicly developing a hoverbike over the last few years — it even used Kickstarter to raise funds. But it looks like the project is now headed in a new direction, because the US Department of Defense just announced a deal with Malloy to develop the vehicle for the US Army.</p>
<p>The DoD is interested in the technology for a few reasons. For one, it's safe. The hoverbike's rotors are guarded so they won't tear into humans and other objects. It's also a cheaper option than, say, a helicopter. And it's more maneuverable in tight spaces, with options to operate it autonomously or with a human pilot.</p>
<div class="m-ad m-ad__article-body">
<div class="dfp_ad" data-cb-ad-id="Mobile article body" data-cb-dfp-id="unit=mobile_article_body" id="div-gpt-ad-mobile_article_body">
<script type="text/javascript">
SBN.Ads.showAd("mobile_article_body");
</script>
</div>
</div>
<p><q class="right">From Kickstarter to the US Military-industrial complex</q></p>
<p>Developers of the hoverbike told Reuters that they consider it ideal for search and rescue or cargo delivery missions. It could also be used for surveillance — plans for the full-scale version include an attachable humanoid figure with a head-mounted camera.</p>
<p>The first step in the deal will be to build a functioning full-scale model, and from there the DoD will reportedly design military-grade prototypes. In the meantime, Malloy Aeronautics will continue to make scale models while developing a commercial version of the hoverbike.</p>
<hr/>
<div class="video-wrap p-scalable-video"><div class="chorus-video-embed" data-analytics-placement="entry:middle" data-chorus-video-id="57712" id="chorus-video-57712"></div></div>
<b>Verge Video</b> <i>We rode a hoverboard</i>
<ul class="m-article__sources">
<li class="source"><strong>Source</strong>Reuters</li>
<li class="tags"><strong>Related Items</strong>
us army
department of defense
drones
hoverbike
malloy
</li>
</ul>
</div>
I try for loop just like for r in return but for cycle 2 times and get same error.

Looks like you have 8 space intend than 4

I fix my code. This code not get error;
def read_news(self, news_url):
try:
self.checkRequests(news_url)
result = self.soup.find("div", {'class', 'm-article__entry'})
if isinstance(result, bs4.element.Tag):
result = str(result)
result = re.sub("<.*?>", "", result)
return result.lower()
return
except Exception, e:
self.insertErrorLog('thevergetech.read_news', news_url, e)

Zurb Foundation 5 hiding content in Offcanvas

If you copy and paste Zurb's "Basic" code implementation of Foundation's Offcanvas layout, the paragraph content doesn't scroll. Doesn't this defeat the utility of this functionality? What am I missing here?
http://foundation.zurb.com/docs/components/offcanvas.html
This is the code I'm copy-pasting from that page:
<div class="off-canvas-wrap">
<div class="inner-wrap">
<a class="left-off-canvas-toggle" >Menu</a>
<!-- Off Canvas Menu -->
<aside class="left-off-canvas-menu">
<!-- whatever you want goes here -->
<ul>
<li>Item 1</li>
...
</ul>
</aside>
<!-- main content goes here -->
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<p>THESE PARAGRAPHS WILL NOT SCROLL Set in the year 0 F.E. ("Foundation Era"), The Psychohistorians opens on Trantor, the capital of the 12,000-year-old Galactic Empire. Though the empire appears stable and powerful, it is slowly decaying in ways that parallel the decline of the Western Roman Empire. Hari Seldon, a mathematician and psychologist, has developed psychohistory, a new field of science and psychology that equates all possibilities in large societies to mathematics, allowing for the prediction of future events.</p>
<!-- close the off-canvas menu -->
<a class="exit-off-canvas"></a>
</div>
</div>

Make sure to include javascript files :
In the head section
<script src="/js/vendor/custom.modernizr.js"></script>
Just before the closing body tag
<script src="/js/vendor/jquery.js"></script>
<script src="/js/vendor/fastclick.js"></script>
<script src="/js/foundation.js"></script>
and the right dependency, in your case add this:
<script src="/js/foundation.offcanvas.js"></script>
for more details go to : http://foundation.zurb.com/docs/javascript.html
regards

Regex to pull out HTML items

Given the following HTML block, what would be the best Regex pattern to create the following list: (keep the url links in the Matches collection.
Abdominal Aortic Aneurysm see Aortic Aneurysm
Abdominal Pain
Abdominal Pregnancy see Ectopic Pregnancy
Abnormalities see Birth Defects
ABO Blood Groups see Blood and Blood Disorders
Abortion
About Your Medicines see Medicines; Over-the-Counter Medicines
ABPA see Aspergillosis
Abscess
Abuse see Child Abuse; Domestic Violence; Elder Abuse
Here is the raw input:
<li><span class="formod5"> </span></li>
<li class="item">Abdominal Aortic Aneurysm see Aortic Aneurysm</li>
<li class="item">Abdominal Pain</li>
<li class="item">Abdominal Pregnancy see Ectopic Pregnancy</li>
<li class="item">Abnormalities see Birth Defects</li>
<li class="item">ABO Blood Groups see Blood and Blood Disorders</li>
<li><span class="formod5"> </span></li>
<li class="item">Abortion</li>
<li class="item">About Your Medicines see Medicines; Over-the-Counter Medicines</li>
<li class="item">ABPA see Aspergillosis</li>
<li class="item">Abscess</li>
<li class="item">Abuse see Child Abuse; Domestic Violence; Elder Abuse</li>
<li><span class="formod5"> </span></li>
TIA

Ignore these DOM guys. They don’t know what they’re talking about, and even if they do, they haven’t answered your question, which is rude.
If that’s really all you’re trying to do, which I believe is strip tags and leave the rest, you can strip those particular tags up there that don’t contain fancy stuff with a simple:
s/<.*?>//g;
and you’ll have to convert the entities like
s/ //g
On arbitrary HTML, you have to be a lot more careful than this of course, because you have <script> tags and <style> tags and CDATA sections and alt=">" and all that jazz, but on the sample you presented, this will work just fine.
Don’t you have better ways of converting HTML to text than this, though?

Do not use regex for this kind of stuff (i think that you don't use hammer instead of the wrench when you need to screw a bolt?), use special tools that are used for this kind of operations : HTML DOM parser (http://simplehtmldom.sourceforge.net/) or something similar.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex: Remove a <p></p> paragraph that has curly brackets inside - regex

Lazy / greedy quantifiers not always work as intended, instead of them match the string excluding <, this works for me: <p>[^<]\{[^<]</p>

I'd suggest a two step approach (parsing and analyzing the text node). Below you'll find examples for both Python and PHP (could be adopted for other languages, obviously):

Related

Remove all HTML tags from a cell

RegEx for removing all spam links in a <div> The only identifier is overflow:hidden

Re.Sub Expected string or buffer

Zurb Foundation 5 hiding content in Offcanvas

Regex to pull out HTML items

Categories

Resources