Delete all the content between <strong></strong> in Yahoo Pipes feed - regex

I'm pulling a feed from GMA, don't ask why. I'm using Yahoo Pipes because I can filter out certain articles based on their title. Then I run the feed through feedenlarger.com so I can get the full text pretty easily.
The problem I'm having is that the feeds contain bold links in them that are disrupting the articles. Each one is surrounded by a <strong>....</strong>. I am trying to just delete any content that exists between the <strong></strong>, but I can't seem to get it right.
I have tried:
item.description replace <strong>*?</strong> with (and left blank)
as well as
item.description replace <strong>*?</strong> with (also left blank)
I know regex and html are not meant for one another, but if someone has a suggestion or direction, I'd appreciate it very much.
Thanks

I'm not familiar with exactly what you are doing, but I would first try removing just the <strong> tag to check whether escaping is needed. By that I mean see whether the unescaped or the escaped form of <strong> matches, to make sure you are on the right track.
I believe the source of the issue is that >*? applies the quantifier to the > character, so you are matching a run of > rather than the actual contents between the tags. Try using .*? (or [^<]*? if you know there will be no other tags within the tags) instead.
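To illustrate the difference outside of Yahoo Pipes, here is a minimal Python sketch. The sample description string is made up for illustration, and Python's re module is only a stand-in for the Pipes regex operator, which may behave differently in details.

```python
import re

# Hypothetical feed description; the real GMA markup may differ.
description = (
    "First paragraph. <strong><a href='/x'>Read more</a></strong> "
    "Second paragraph. <strong>Another link</strong> End."
)

# '>*?' quantifies only the '>' character, so the pattern can never
# span the text between the opening and closing tags.
broken = re.sub(r"<strong>*?</strong>", "", description)

# '.*?' lazily matches any characters up to the nearest closing tag.
# re.DOTALL lets '.' cross newlines in case a tag spans several lines.
fixed = re.sub(r"<strong>.*?</strong>", "", description, flags=re.DOTALL)

print(broken == description)  # the broken pattern changes nothing here
print(fixed)
```

In Pipes the equivalent would presumably be replacing <strong>.*?</strong> with an empty string, with the "s" option checked if the tags can span lines.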

Related

RegEx for excluding a match with prefix

I first wanted to match only the first instance, but soon realized that is not possible. The tool I'm using only supports RegEx, so I have no other options.
Basically I have text with HTML tags in it and I want to match the first paragraph's tags without the following tags.
For example out of this:
<p>erkfoijwdocndoufhwroguh</p><p>pijgoijkuohuhogiougwtg</p><p>pijgoijkuohuhogiougwtg</p><p>pijgoijkuohuhogiougwtg</p>
I want to match the first <p></p>
and nothing else.
So I figured I could exclude the tags that have a tag right next to them using negative lookahead. As in:
(?!>)(<|<\/)p>
But for some reason this still matches every <p> and </p> tag instead of leaving out those that have another tag before them. Any suggestions?
Edit to add: I only need to match the tags, not the text inside the tags. And lookbehind doesn't work with the tool I'm using. It seems that everything that works here, works also in my tool.
Second edit: I solved my problem, but I'm leaving the question open since the solution wasn't an answer, this seems like an interesting question, and I might bump into a similar problem in the future. Basically, if someone figures out how I can refer to a <p> that doesn't have a > before it but also include the first </p>, I'd like to hear it.
I'm not sure I understood what you are trying to achieve, but would this work?
^<p>.*?<\/p>
Demo here: https://regex101.com/r/ZXgMPV/1
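As a rough check of the anchored pattern, here is a small Python sketch using the sample text from the question (shortened to three paragraphs). The capture groups around the tags are my own addition, on the assumption that only the tags themselves are wanted.

```python
import re

html = (
    "<p>erkfoijwdocndoufhwroguh</p>"
    "<p>pijgoijkuohuhogiougwtg</p>"
    "<p>pijgoijkuohuhogiougwtg</p>"
)

# (?!>) is tested at the position where '<' comes next, so it always
# succeeds; that is why the original attempt matches every tag.
all_tags = re.findall(r"(?!>)(?:<|</)p>", html)

# Anchoring at the start of the string and matching lazily stops at the
# first closing tag. Groups 1 and 2 hold just the two tags.
first = re.match(r"^(<p>).*?(</p>)", html)
first_paragraph = first.group(0) if first else None
tags = first.groups() if first else ()

print(len(all_tags))
print(first_paragraph)
print(tags)
```

This only works when the first paragraph starts at the very beginning of the text, which is what the ^ anchor assumes.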

Regex for an anchor tag which contains a specific inline style

Unfortunately my blog was hacked and 1000+ posts have been infected with links to spam sites. As part of the cleaning process I'm trying to use a regex to find and replace the bad links in an XML file in Sublime Text.
The only consistency I can see is all the bad links contain an inline style changing the text colour to #676c6c, so I'm trying but failing to create a regular expression that can highlight all anchor tags containing this hex value - #676c6c
<a[\s]+([^>]+)>((?:.(?!\<\/a\>))*.)</a>
So far I've got this, which I believe highlights all anchor tags, can anyone help expand this to include anchors containing #676c6c between the first angled brackets? Here's an example of one of the bad links
spam keyword
I appreciate any help! Cheers.
Maybe you could use <a[^>]+#676c6c[^>]+>[^<]*<\/a>.
Try it out here.
If the anchor tag may contain other tags, use (?s)<a[^>]+#676c6c[^>]+>.*?<\/a> instead.
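Here is a minimal Python sketch comparing the two patterns. The sample markup and the example.com / spam.example URLs are invented for illustration, since the original bad link is not shown; Sublime Text's regex engine may differ from Python's in small ways.

```python
import re

# Hypothetical export fragment: one good link and two infected ones.
xml = (
    '<p>Good <a href="https://example.com/ok">link</a> and '
    '<a style="color: #676c6c;" href="https://spam.example/x">spam keyword</a> and '
    '<a href="https://spam.example/y" style="color:#676c6c"><b>nested</b> spam</a>.</p>'
)

# [^<]* only allows plain text inside the anchor.
simple = re.compile(r"<a[^>]+#676c6c[^>]+>[^<]*</a>")

# (?s) plus a lazy .*? also allows nested tags and line breaks.
nested = re.compile(r"(?s)<a[^>]+#676c6c[^>]+>.*?</a>")

simple_hits = simple.findall(xml)
nested_hits = nested.findall(xml)
cleaned = nested.sub("", xml)

print(len(simple_hits), len(nested_hits))
print(cleaned)
```

Because [^>]+ cannot cross a >, the colour value has to appear inside the opening <a ...> tag itself, so the good link is left alone.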

Best regex string to grab links from many different structures of html page

I am parsing links from many different structures of HTML data, and my current regex string has so far worked for most of them. However, I've come across a domain with link structures like this, and I would like to be able to grab them too.
<font face="Verdana" size="2"></font>
My regex string:
<[aA].*?[hrefHREF]=["']?([^'">\ ]*)["']?[^>]*>([\s\S]*?)<\/[aA]>
I'm pretty sure the problem is just that it has other html tags inside the a href. But I tried adding
.*?
inside the anchor text, but that didn't work, maybe I did it wrong?
This works for the sample you provided:
/<a.*?href=["']?([^'">\ ]*)["']?[^>]*>.*<\/a>/is
https://regex101.com/r/yrdKi9/1
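For comparison, a Python sketch of the same idea. The sample HTML is hypothetical (the anchor in the question's sample did not survive), and I have swapped the greedy .* before the closing tag for a lazy .*? so that two links in the same input are not merged into one match; that is my change, not part of the pattern above.

```python
import re

# Hypothetical input: one link wrapping a <font> tag, one bare link.
html = (
    '<a href="http://example.com/one"><font face="Verdana" size="2">One</font></a>\n'
    '<A HREF=http://example.com/two>Two</A>'
)

# IGNORECASE replaces [aA] / [hrefHREF]; DOTALL lets '.' cross newlines.
link_re = re.compile(
    r"""<a\b.*?href=["']?([^'"> ]*)["']?[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

links = link_re.findall(html)

# Strip any tags left inside the anchor text.
texts = [re.sub(r"<[^>]+>", "", inner) for _, inner in links]

for (url, _), text in zip(links, texts):
    print(url, text)
```

Note that [hrefHREF] in the original pattern is a character class matching a single letter, not the word href, which is why a case-insensitive flag is the safer route.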

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so it will just trail on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that Regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete it and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
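A small Python sketch of the effect of the s modifier; the XML fragment is made up for illustration. One difference from TextWrangler to be aware of: recent Python versions require a global inline flag such as (?s) at the very start of the pattern, so it is placed there in this sketch.

```python
import re

# Hypothetical CMS output with the page content spread over several lines.
xml = (
    "<item>\n"
    "  <title>Menu entry</title>\n"
    "  <page-content>\n"
    "    <html>...entire page...</html>\n"
    "  </page-content>\n"
    "</item>\n"
)

# Without the s flag, '.' stops at the first newline, so nothing matches.
without_flag = re.findall(r"<page-content>.*?</page-content>", xml)

# With (?s), '.' also matches newlines. The hyphen needs no escaping
# because it is only special inside a character class.
with_flag = re.findall(r"(?s)<page-content>.*?</page-content>", xml)

stripped = re.sub(r"(?s)<page-content>.*?</page-content>", "", xml)

print(len(without_flag), len(with_flag))
print(stripped)
```

This would also explain why the <content> test found its matches: presumably those elements sat on a single line.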

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs, assuming the HTML isn't laden with quotes and such. Note that the + means that empty <b> tags are ignored. Also, HTML is not truly parsable via regex, so this will only work for basic tags. It should work even if the tag has an ID or a class attribute, but there are certainly ways to break this regex.
/<b[^>]*>([^<]+)<\/b>/
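A quick Python sketch of how that pattern behaves on the sample from the question. The second and third inputs, and the helper name extract_bold, are hypothetical and only there to show the attribute and empty-tag cases.

```python
import re

# Same pattern as above, without the surrounding / delimiters.
bold_re = re.compile(r"<b[^>]*>([^<]+)</b>")

def extract_bold(text):
    """Return the text inside the first <b> tag, or None if there is none."""
    match = bold_re.search(text)
    return match.group(1) if match else None

plain = extract_bold("Lake Harmony Estates <b>Sleeps: 16</b>")
with_class = extract_bold('Some Cabin <b class="info">Sleeps: 8</b>')
empty = extract_bold("Nothing here <b></b>")

print(plain)
print(with_class)
print(empty)
```

One way it can break: <b[^>]*> also matches the opening of tags such as <body>, so adding \b after the b would be a little stricter.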
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and have the g flag checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.
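For reference, roughly the same replacement in Python, where the backreference is written \1 instead of $1. The trailing text in the sample is invented to show that everything outside the tag is discarded.

```python
import re

# Sample from the question plus a hypothetical extra line of listing text.
description = "Lake Harmony Estates <b>Sleeps: 16</b>\nMore listing text"

# (?s) lets '.' cross newlines; the pattern consumes the whole string
# and the replacement keeps only the captured group.
result = re.sub(r"(?s)^.*<b>(.*?)</b>.*", r"\1", description)

print(result)
```

Because the leading .* is greedy, this keeps the contents of the last <b> tag when the text contains more than one.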