remove target and rel attributes from links with yahoo pipes regex - regex

I'm trying to remove rel and target attributes from an RSS description using regex but can't seem to figure it out. I'm able to remove classes and styles, but rel and target just won't work. I've tried quite a few patterns that work in gskinner's tester, but they do not work when I bring them over to Pipes.
I've tried...
<a(.*?) target="(.*?)" replacing with <a$1
which works when adapted to remove classes or styles from tags, but not for this.
Any suggestions??
Here's a sample...
<p>Find out more at <a rel="nofollow" target="_blank" href="http://someurl.org/">someurl.org</a>. </p>
Which I would like to be
<p>Find out more at someurl.org. </p>

I believe this will do what you want:
(.*)rel=".*" target=".*" (.*)
Then your replacement would just be \1\2, or $1$2 in tools that use that syntax (as in your own attempt). Note that .* is greedy, so this assumes there is only one link per line.
Example:
http://regexr.com?353i5
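As a rough sketch of the same idea outside Pipes, here is how the attributes could be dropped one at a time in Python. This uses Python's re module, not the Pipes regex operator, and the function name strip_rel_target is made up for illustration. It assumes double-quoted attribute values and will also hit matching text outside of tags.

```python
import re

# Sample taken from the question.
html = ('<p>Find out more at <a rel="nofollow" target="_blank" '
        'href="http://someurl.org/">someurl.org</a>. </p>')

# Match one rel="..." or target="..." attribute plus the whitespace before it.
# [^"]* cannot run past the closing quote, unlike a greedy .*
ATTR = re.compile(r'\s+(?:rel|target)="[^"]*"', re.IGNORECASE)

def strip_rel_target(text):
    """Remove rel and target attributes (hypothetical helper, regex-only)."""
    return ATTR.sub('', text)

print(strip_rel_target(html))
```

Because each attribute is matched on its own, the order of rel and target in the tag doesn't matter.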

Related

Regex to match HTML with or without link

I would like to be able to get "Target" out of this block of HTML when it appears in a page:
<h3>
<a href="http://link"> Target
</a> </h3>
I can count on the spacing being reliably there. What I can't count on is that "Target" will always be included in an anchor tag. Sometimes, it looks like this:
<h3>
Target
</h3>
I can match the first version and extract "Target" pretty easily with this regex:
/<h3>\s+<a href=.*>\s+(.*)\s+<\/a>\s+<\/h3>/
But I'm struggling to write one that will match both. Any ideas?
Don't use regular expressions to parse HTML. It is more painful than it is worth in most cases. Use a library designed to parse HTML.
#!/usr/bin/perl
use v5.16;
use strict;
use warnings;
use HTML::TreeBuilder;
my $data = qq{<body><h3>
<a href="http://link"> Target
</a> </h3></body>
};
my $otherdata = qq{<body><h3>
Target
</h3></body>
};
my $t = HTML::TreeBuilder->new_from_content($data);
say $t->look_down(_tag => "h3")->as_text();
$t = HTML::TreeBuilder->new_from_content($otherdata);
say $t->look_down(_tag => "h3")->as_text();
Just to put my two cents in, why not use an XPath query with a decent DOM library?
//html/body/h3/text()[contains(.,'Target')]
The actual query may vary depending upon your html structure.
Try this one as a regex:
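<h3>\s+(?:<a href=[^>]*>\s+)?(.*?)\s+(?:<\/a>\s+)?<\/h3>
It should match both your cases, with the text in group 1. Each optional tag is grouped together with the whitespace that follows it, so the pattern doesn't demand extra whitespace when the tag is absent.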
Even though this is not a recommended way to search html, if this is what you want to try, I won't stop you.
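If you want to check a pattern like this against both inputs, a quick test along these lines may help. It is only a sketch in Python's re module using the two samples from the question; the helper name h3_text is made up, and real pages with different spacing or extra attributes may well not match.

```python
import re

# Optional <a ...> and </a>, each grouped with the whitespace that follows it.
H3 = re.compile(r'<h3>\s+(?:<a href=[^>]*>\s+)?(.*?)\s+(?:</a>\s+)?</h3>')

with_link = '<h3>\n<a href="http://link"> Target\n</a> </h3>'
without_link = '<h3>\nTarget\n</h3>'

def h3_text(html):
    """Return the captured heading text, or None if the pattern doesn't match."""
    m = H3.search(html)
    return m.group(1) if m else None

for sample in (with_link, without_link):
    print(repr(h3_text(sample)))
```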

Regex, I just don't get it right

I just don't get my Regex right:
I have the following template:
<!-- Defines the template for the tabs. -->
{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<div class="tabs">
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
</div>
{{Render:Tabs}}
I would like to find everything between {{ }} except the tags that begin with {{BOS, {{EOS, {{TMPL, or {{Render.
Here are a couple approaches:
Attempt 1:
({{).*(}})
This selects everything between {{ }} tags, which is not good.
Attempt 2:
({{)[^TMPL][^BOS][^EOS][^Render].*(}})
With this, {{TabType}} and {{TabFile}} are no longer selected, and I don't know why.
With some other regex, {{TabType}}" id="{{tabId}} is selected as one match.
Does anyone have a clue how to solve this? I really need a regex guru :-)
You can use negative lookahead based regex like this:
{{(?!TMPL|[BE]OS|Render).*?}}
You have to use the following regex to get the content between braces:
\{\{(.*?)\}\}
If you want to exclude the tags you listed, you can use an alternation technique: match what you don't want first, and capture what you do want in the last alternative of the regex:
\{\{BOS:Sequence\}\}|\{\{EOS:Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
By the way, a shorter version of the above regex is:
\{\{(?:BOS|EOS):Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
This is a very useful technique for pattern exclusion that I was glad to learn from Anubhava and zx81 (they rock at regex patterns). With this technique, the content you need ends up in capturing group 1; when one of the excluded tags matches, group 1 is left unset.
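Using [^TMPL] and the like won't work because these are character classes: [^TMPL] matches a single character that is not T, M, P, or L, not "anything other than the string TMPL". That is also why {{TabType}} and {{TabFile}} stopped matching, since both start with T. You could use a negative lookahead, though (or even a lookbehind, depending on the regex library you are using).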
\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}
Still I get the feeling that you want the BOS, EOS, etc. to just be strings in the template with {{ and other values to be interpolated. If you are using handlebars or something, you can have strings interpolated:
{{'{{BOS:Sequence}}'}}
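To see the negative lookahead at work on a cut-down copy of the template, something like the following can be run. It is a sketch in Python's re module (your engine may differ slightly); the variable names are made up for illustration.

```python
import re

# A shortened version of the template from the question.
template = '''{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
{{Render:Tabs}}'''

# (?!...) rejects a tag whose name starts with one of the reserved prefixes;
# the lazy (.*?) stops at the first closing }}.
PLACEHOLDER = re.compile(r'\{\{(?!TMPL|[BE]OS|Render)(.*?)\}\}')

names = PLACEHOLDER.findall(template)
print(names)  # ['TabType', 'tabId', 'TabFile']
```

The lazy quantifier is what keeps {{TabType}}" id="{{tabId}} from being swallowed as a single match.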

Using SED to remove specific anchor tags within html in database

I've got a table which contains hundreds of guides with screenshots. The screenshot images were wrapped in anchor tags because they used to be clickable, but now I need to remove the anchor tags. All the anchor tags to be removed have an href=#screenshot followed by a number, as in the example below. My plan is to dump the table using mysqldump and then use sed to find and replace the correct strings.
<p>Choose components to install and click next.</p>
<div class="screen">
<img src="/images/screens/install/step3.jpg" alt="Step 3">
</div>
Should be
<p>Choose components to install and click next.</p>
<div class="screen">
<img src="/images/screens/install/step3.jpg" alt="Step 3">
</div>
I can match the first tag using <a\shref\=\"#screenshot\d+\"\> but I also need to match its second closing tag so that both can be removed whilst not removing other anchor tags. Any help would be greatly appreciated!
You can try replacing
<a\shref\=\"#screenshot\d+\"\>(.*)<\/a>
with \1.
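The parentheses capture everything that is found between them, so you can restore it using \1, \2... Two caveats: (.*) is greedy, so if a line contains more than one </a> the match will run to the last one, and sed itself does not understand \d, so use [0-9] there.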
Keep in mind though that regexes are not the right weapon to use when trying to modify HTML. Read this (and the comments around it) for an explanation.
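For comparison, here is a sketch of the same unwrap done in Python rather than sed, where a lazy quantifier is available. The href value #screenshot3 and the second link are invented for the example, since the question only says the href is #screenshot followed by a number, and unwrap_screenshots is a made-up helper name.

```python
import re

# Hypothetical input: one screenshot anchor to unwrap, one ordinary link to keep.
html = ('<div class="screen">\n'
        '<a href="#screenshot3"><img src="/images/screens/install/step3.jpg" alt="Step 3"></a>\n'
        '</div>\n'
        '<p>See the <a href="/docs">docs</a>.</p>')

# (.*?) is lazy, so it stops at the first </a> after the screenshot anchor.
SCREENSHOT_LINK = re.compile(r'<a\s+href="#screenshot\d+"\s*>(.*?)</a>', re.DOTALL)

def unwrap_screenshots(text):
    """Replace each screenshot anchor with just its contents."""
    return SCREENSHOT_LINK.sub(r'\1', text)

print(unwrap_screenshots(html))
```

Only anchors whose href matches #screenshot plus digits are touched; other links are left alone.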

ColdFusion Regex for finding empty html tags

Hey all, I'm trying to dynamically strip out some empty html tags. I'm kind of new to Regex, and it seems like the engine for coldfusion isn't as robust/similar to other regex engines (like javascript and as3).
What's the trick for building a regex that ignores spaces in coldfusion 8? So, if I build this thing out I want it to work on either of the examples below.
<p > </p>
<p> </p>
<P></p>
Any help would be greatly appreciated!
This should work: <\w+[^>]*(/>|>\s*?</\w+>). I think. There are no complex, language-specific features (i.e. lookaheads, lookbehinds, etc.)
Modified from here: Regular expression to remove empty <span> tags
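A slightly stricter variant ties the closing tag to the opening one with a backreference, so <img ...> </div> isn't mistaken for an empty element. The sketch below uses Python's re module just to show the behaviour on the three examples from the question; it is not ColdFusion code, and whether the same pattern behaves identically in ColdFusion 8 is an assumption to verify.

```python
import re

# \1 must repeat the tag name captured by (\w+); IGNORECASE lets <P> pair with </p>.
EMPTY_TAG = re.compile(r'<(\w+)[^>]*>\s*</\1\s*>', re.IGNORECASE)

def remove_empty_tags(text):
    """Strip elements that contain nothing but whitespace (single pass)."""
    return EMPTY_TAG.sub('', text)

for sample in ['<p > </p>', '<p> </p>', '<P></p>', '<p>text</p>']:
    print(repr(remove_empty_tags(sample)))
```

A single pass leaves behind any parent that only becomes empty after its child is removed, so you may need to repeat the replacement until nothing changes.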

replacing html link with regex and yahoo pipes

my question is similar to the one here:
Regular expression on Yahoo! pipes
I have my Gmail status hooked up to Twitter through FriendFeed, but unfortunately they truncate the link text, and my links aren't working once they get to Twitter. I need to be able to take this:
<div style="margin-top:2px;color:black;">/good jquery tips
<a rel="nofollow" style="text-decoration:none;color:#00c;" target="_blank" href="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/" title="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/">
http://james.padolsey.com/javascr...
</a>
</div>
and replace the truncated link with the href attribute, so it looks like this:
<div style="margin-top:2px;color:black;">/good jquery tips
<a rel="nofollow" style="text-decoration:none;color:#00c;" target="_blank" href="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/" title="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/">
http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/
</a>
</div>
thanks for the help!
Anthony's warning stands: you can't parse HTML safely with regexes, so please weigh up the risks involved if you choose to use Pipes.
Assuming you want to replace only this specific structure, and expect it to break if anything in the source changes subtly, the following will work well enough for your purposes:
replace:
(<a\s[^>]*href=")(.*?)("[^>]*>).*?(</a>)
with:
$1$2$3$2$4
(Using s and i options)
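If you'd like to try the replacement before wiring it into Pipes, here is a sketch of the same pattern in Python. Python writes the backreferences as \1 rather than $1, and re.S | re.I stand in for the s and i options. The input is the anchor from the question.

```python
import re

url = 'http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/'

# The anchor from the question, with the truncated link text.
snippet = ('<a rel="nofollow" style="text-decoration:none;color:#00c;" target="_blank" '
           'href="' + url + '" title="' + url + '">\n'
           'http://james.padolsey.com/javascr...\n'
           '</a>')

# Group 2 captures the href value; it is emitted twice in the replacement,
# once in place and once as the new link text.
LINK = re.compile(r'(<a\s[^>]*href=")(.*?)("[^>]*>).*?(</a>)', re.S | re.I)

result = LINK.sub(r'\1\2\3\2\4', snippet)
print(result)
```

Note that the whitespace around the old link text is replaced along with it, since it falls inside the .*? between the opening and closing tags.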