Replacing an HTML link with regex and Yahoo Pipes

My question is similar to the one here:
Regular expression on Yahoo! pipes
I have my Gmail status hooked up to Twitter through FriendFeed, but unfortunately the link text gets truncated along the way, so my links no longer work once they reach Twitter. I need to be able to take this:
<div style="margin-top:2px;color:black;">/good jquery tips
<a rel="nofollow" style="text-decoration:none;color:#00c;" target="_blank" href="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/" title="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/">
http://james.padolsey.com/javascr...
</a>
</div>
and replace the truncated link text with the value of the href attribute, so it looks like this:
<div style="margin-top:2px;color:black;">/good jquery tips
<a rel="nofollow" style="text-decoration:none;color:#00c;" target="_blank" href="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/" title="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/">
http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/
</a>
</div>
Thanks for the help!

Anthony's warning stands: you can't parse HTML safely with regexes, so please weigh the risks involved if you choose to use Pipes.
Assuming you want to replace only this specific structure, and expect it to break if anything in the source changes subtly, the following will work well enough for your purposes:
replace:
(<a\s[^>]*href=")(.*?)("[^>]*>).*?(</a>)
with:
$1$2$3$2$4
(Using s and i options)
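For readers who want to try the same substitution outside Pipes, here is a minimal Python sketch. It reuses the pattern from the answer unchanged; the function name expand_link_text and the shortened sample markup are illustrative assumptions, and Python writes the backreferences as \1 rather than $1. The s and i options correspond to re.S and re.I.

```python
import re

# Pattern from the answer above. re.S lets "." span newlines, re.I ignores case.
LINK = re.compile(r'(<a\s[^>]*href=")(.*?)("[^>]*>).*?(</a>)', re.S | re.I)

def expand_link_text(html: str) -> str:
    """Replace each link's visible text with the value of its href attribute."""
    # \1 = tag up to href=", \2 = URL, \3 = rest of the opening tag, \4 = </a>
    return LINK.sub(r"\1\2\3\2\4", html)

# Sample markup shortened from the question (style attribute omitted).
sample = (
    '<div style="margin-top:2px;color:black;">/good jquery tips\n'
    '<a rel="nofollow" target="_blank" '
    'href="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/" '
    'title="http://james.padolsey.com/javascript/things-you-may-not-know-about-jquery/">\n'
    'http://james.padolsey.com/javascr...\n'
    '</a>\n'
    '</div>'
)

print(expand_link_text(sample))
```

As the answer warns, this only holds for this specific structure; a link whose opening tag contains a literal > inside an attribute value would break it.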

Related

How to Match only url from a tag node js

I have a tag <span style="color: rgb(255,255,255);">[1]</span>
I am using the regex <a href="(.*)">(.*)<\/a>, but it does not capture only the URL. It also captures <span style="color: rgb(255,255,255);">[1]</span>
How can I get only the URL from a tags?
Easily! Capture everything between the href keyword and the closing >.
href=(.*?)>
If you don't want to capture the quotes, try this one:
href="(.*?)">
I don't have much experience with Node.js, but I think the following should work, and adapting it won't be hard if you know regex.
var pattern = new RegExp(/href="(.*?)">/);
Here is Regex101.
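The pattern above only matches when href is the last attribute before the closing >. A slightly more tolerant variant, sketched here in Python rather than Node.js, stops the capture at the closing quote instead. The name extract_hrefs and the example.com URL are placeholders, since the question does not show the full tag.

```python
import re

# Capture the href value up to its closing quote, wherever href sits in the tag.
HREF = re.compile(r'<a\b[^>]*\bhref="([^"]*)"', re.I)

def extract_hrefs(html: str) -> list[str]:
    """Return the href value of every <a> tag that uses double-quoted attributes."""
    return HREF.findall(html)

# Hypothetical input: the question's span wrapped in an a tag with a made-up URL.
sample = (
    '<a href="http://example.com/page">'
    '<span style="color: rgb(255,255,255);">[1]</span></a>'
)

print(extract_hrefs(sample))
```

This still assumes double-quoted attribute values; single-quoted or unquoted hrefs would need a different pattern or a real HTML parser.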

Replacing an open-ended HTML regex match in Sublime Text

I've been doing a lot of finding and replacing in Sublime Text and decided I needed to learn RegEx. So far, so good. I'm no expert by any means, but I'm learning quickly.
The trouble is knowing how to replace open-ended HTML matches.
For example, I wanted to find all <button>s that didn't have a role attribute.
After some hacking and searching, I came up with the following pattern (see in action):
<button(?![^>]+role).*?>(.*?)
Great! Except, in the code base I'm working in, there are tons of results.
How do I go about replacing the results safely by injecting role="button" at the end of <button, just before the closing > of the opening tag?
Desired results
Before: <button type="button">
After: <button type="button" role="button">
Before: <button class="btn-lg" type="button">
After: <button class="btn-lg" type="button" role="button">
You can capture everything before the ending > and put it back, before the insertion of role=button:
<(button(?![^>]+role).*?)>
This captures everything in the tag.
Replace by:
<$1 role="button">
$1 contains whatever the first capture group matched.
See the updated regexr.
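To check the replacement outside Sublime Text, the same pattern can be run through Python's re module. This is only a sketch: the function name add_button_role is made up for illustration, and Python uses \1 where Sublime uses $1.

```python
import re

# Pattern from the answer: capture the tag contents unless "role" appears in the tag.
BUTTON = re.compile(r'<(button(?![^>]+role).*?)>')

def add_button_role(html: str) -> str:
    """Append role="button" to every <button> opening tag that has no role."""
    return BUTTON.sub(r'<\1 role="button">', html)

print(add_button_role('<button type="button">'))
print(add_button_role('<button class="btn-lg" type="button">'))
print(add_button_role('<button role="tab" type="button">'))  # left unchanged
```

Two limits are worth knowing: the lookahead rejects any tag that contains the text "role" anywhere (for example in a class name), and because . does not match newlines by default, opening tags split across lines are skipped.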

remove target and rel attributes from links with yahoo pipes regex

I'm trying to remove rel and target attributes from a RSS description using regex but can't seem to figure it out. I'm able to remove classes and styles, but rel and target just won't work. I've tried quite a few ways with gskinner that work there, but do not work when I bring them over to Pipes.
I've tried...
<a(.*?) target="(.*?)" replacing with <a$1
which works when adapted to remove classes or styles from tags, but not for target.
Any suggestions?
Here's a sample...
<p>Find out more at <a rel="nofollow" target="_blank" href="http://someurl.org/">someurl.org</a>. </p>
Which I would like to be
<p>Find out more at someurl.org. </p>
I believe this will do what you want:
(.*)rel=".*" target=".*" (.*)
Your replacement would then just be \1\2, or whatever backreference syntax your software uses.
Example:
http://regexr.com?353i5
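The pattern above depends on rel and target appearing in that exact order and uses greedy .* throughout. A hedged alternative, sketched in Python, removes either attribute wherever it appears inside an a tag. The names A_TAG, REL_OR_TARGET and strip_rel_and_target are illustrative, and whether Pipes accepts an equivalent expression is untested here. Note that it keeps the link itself and drops only the two attributes, as the question title describes.

```python
import re

# Match each opening <a ...> tag, then strip rel/target attributes inside it only.
A_TAG = re.compile(r'<a\b[^>]*>', re.I)
REL_OR_TARGET = re.compile(r'\s+(?:rel|target)="[^"]*"', re.I)

def strip_rel_and_target(html: str) -> str:
    """Remove double-quoted rel and target attributes from every <a> tag."""
    return A_TAG.sub(lambda m: REL_OR_TARGET.sub("", m.group(0)), html)

sample = (
    '<p>Find out more at <a rel="nofollow" target="_blank" '
    'href="http://someurl.org/">someurl.org</a>. </p>'
)

print(strip_rel_and_target(sample))
```

Restricting the inner substitution to the matched opening tag keeps text such as rel="..." that happens to appear in the body copy untouched.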

A regular expression question

I have content something like
<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>
What I want is to match the inner HTML of div.c2. Its contents may vary a lot. The only problem I am facing is how to make the match end at the right closing div.
You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
However, some regex engines have special support for balanced pair matching. See, e.g., here (.NET). Even then, your regex will only handle a subset of syntactically correct input (e.g., what if a </div> is embedded in a comment?). You need an HTML parser to get reliable results.
Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
Delete the first line, delete the last line. Problem solved. No need for RegEx.
The following pattern works well with .Net RegEx implementation:
\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>
And we replace that with \1.
Input:
<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>
Output:
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
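For comparison, the parser route suggested in the earlier answers can be sketched with Python's standard html.parser module. The class and function names here are made up for illustration. It counts nested div tags, so the right closing div is found regardless of the contents. Comments inside the div are simply dropped by this sketch, and void tags such as br written without a slash are not specially handled.

```python
from html.parser import HTMLParser

class C2Inner(HTMLParser):
    """Rebuild the inner HTML of the first <div class="c2"> by tracking div depth."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0       # 0 = outside the target div
        self.done = False    # True once the matching </div> has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and "c2" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

def inner_html_of_c2(html: str) -> str:
    parser = C2Inner()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

sample = '<div class="c2">\n<div class="c3">\n<p>...</p>\n</div>\n</div>'
print(inner_html_of_c2(sample).strip())
```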

Regular expression to parse html links

I have HTML with snippets like the one below all over:
<li><label for="summary">Summary:</label></li>
<li class="in">
<textarea class="ta" id="summary" name="summary" rows="4" cols="10" tabindex="4">
${fieldValue(bean: book, field: 'summary')}</textarea>
<a href="#" class="tt">
<img src="<g:createLinkTo dir='images/buttons/' file='icon.gif'/>" alt="Help icon for the summary field">
<span class="tooltip">
<span class="top"></span>
<span class="middle">Help text for summary</span>
<span class="bottom"></span>
</span>
</a>
</li>
I want to pull out the alt value and the text between XXXX, and replace the a tag with the code below.
This is my stab at the regex:
<a href="#" class="tt">.*alt="(.*)".*<span class="middle">(.*)<\/span><\/a>
Output with the callbacks
<ebs:cssToolTip alt="$1" text="$2"/>
I tried it out on http://rubular.com/ and it does not quite work. Any suggestions?
You may want to ensure your regexp isn't greedily picking up characters - use ".*?" rather than straight ".*".
What do you mean, "it does not quite work"? How does it fail?
A suggestion (not tested your regexp): note that * is a greedy operator, so .* is rarely a good idea because it may match a lot more than what you intended.
Try:
<a href="#" class="tt">.*alt="([^"]*)".*<span class="middle">([^"]*)<\/span><\/a>
I think I solved it using an idea from another Stack Overflow question:
<a href="#" class="tt">.*alt="([^"]*)".*<span class="middle">([^<]*).*<\/a>
This seems to work on the http://rubular.com/ site
Here you go:
http://rubular.com/regexes/8434
You were facing two potential problems. First, without the /m option, '.' will not match newline characters (in Ruby, which Rubular uses). Second, you were using greedy matching. Switching to the lazy '*?' quantifier fixes that.
/<a href="#" class="tt">.*?alt="([^"]*)">.*?<span class="middle">(.*?)<\/span>/m
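To see the whole replacement end to end, here is a small Python sketch built on the last pattern. It is an adaptation, not the answerer's exact expression: a trailing .*?</a> is added so the entire a element is consumed, Ruby's /m becomes re.S (Python's flag for letting . match newlines), and the function name to_css_tooltip is hypothetical.

```python
import re

# Lazy quantifiers plus re.S so "." can cross the line breaks inside the a element.
TOOLTIP = re.compile(
    r'<a href="#" class="tt">.*?alt="([^"]*)">.*?'
    r'<span class="middle">(.*?)</span>.*?</a>',
    re.S,
)

def to_css_tooltip(html: str) -> str:
    """Replace each tooltip link with the ebs:cssToolTip tag from the question."""
    return TOOLTIP.sub(r'<ebs:cssToolTip alt="\1" text="\2"/>', html)

# Snippet taken from the question, trimmed to the a element and its parent li.
sample = '''<li class="in">
<a href="#" class="tt">
<img src="<g:createLinkTo dir='images/buttons/' file='icon.gif'/>" alt="Help icon for the summary field">
<span class="tooltip">
<span class="top"></span>
<span class="middle">Help text for summary</span>
<span class="bottom"></span>
</span>
</a>
</li>'''

print(to_css_tooltip(sample))
```

Like the original, this assumes alt is the last attribute of the img tag and that neither captured value contains a double quote.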