Remove content from a WordPress.com feed using Yahoo Pipes - regex

I am using Yahoo Pipes to get content matching a certain category from my WordPress.com blog. Everything is working fine, but WordPress adds "share" links to the bottom of the feed that I would like to remove.
Here is what's being added:
<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/1.0/gocomments/bandonrandon.wordpress.com/87/">
<img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bandonrandon.wordpress.com/87/"/></a>
<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bandonrandon.wordpress.com&blog=1046814&post=87&subd=bandonrandon&ref=&feed=1" width="1" height="1"/>
I edited out some of the services, but you get the idea. I tried to use regex to remove this content. What I tried was this:
<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/.*?><img alt="" border="0" src="http://feeds.wordpress.com.*?></a>
and
<img alt="" border="0" src="http://stats.wordpress.com.*?>
However, it didn't filter the results at all.
Using this filters ALL images and works fine:
<a.*?><img.*?></a>

<a[^>]+href="http://feeds.wordpress.com[^"]*"[^>]*>\s*<img[^>]+src="http://feeds.wordpress.com/[^"]*"[^>]*>\s*</a>\s*<img[^>]+src="http://stats.wordpress.com/[^"]*"[^>]*>
Regex updated. Try that to match the whole block.
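To check the pattern outside Yahoo Pipes, here is a minimal Python sketch that applies it to the sample markup from the question. Python's `re` module is only a stand-in for the Pipes Regex module, and the `<p>Post body</p>` line and the names `SHARE_BLOCK`, `item`, and `cleaned` are made up for illustration. One visible difference from the attempt in the question is that this pattern allows whitespace (`\s*`) between the tags, which may be why it matches where the earlier one did not.

```python
import re

# Pattern from the answer above, split across lines for readability only.
SHARE_BLOCK = re.compile(
    r'<a[^>]+href="http://feeds.wordpress.com[^"]*"[^>]*>\s*'
    r'<img[^>]+src="http://feeds.wordpress.com/[^"]*"[^>]*>\s*</a>\s*'
    r'<img[^>]+src="http://stats.wordpress.com/[^"]*"[^>]*>'
)

# Hypothetical feed item: a made-up post body followed by the share markup
# quoted in the question.
item = (
    '<p>Post body</p>\n'
    '<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/1.0/gocomments/bandonrandon.wordpress.com/87/">\n'
    '<img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bandonrandon.wordpress.com/87/"/></a>\n'
    '<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bandonrandon.wordpress.com'
    '&blog=1046814&post=87&subd=bandonrandon&ref=&feed=1" width="1" height="1"/>'
)

# Replace the whole share block with nothing.
cleaned = SHARE_BLOCK.sub('', item)
print(cleaned)
```

In Pipes itself, the same pattern would go in the "replace" field with an empty replacement.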

Related

Regex for modifying html 'img' tag

I have a requirement where I need to modify HTML 'img' tags in an HTML string that do not end with '/>'.
For example, <img src=""> needs to be changed to <img src=""/>
I am using the following regex: <img(.*[^/])> and replacing with <img$1/>
This works fine; however, for cases like <center><img src=""/></center> the regex returns <center><img src=""></center/>
Any suggestions on how to limit this regex so it only matches up to the end of the img tag? Thanks.
You may use this:
<\s*img\s+([^>]*=(?:\".*?\"|\'.*?\'))[\s\w\-]*>
with the following replacement:
<img $1/>
This will match these simple and complex cases:
<img src="images/a.jpg" title="test"><br/>
<img src="a/b.jpg" >
<span><img src="a.jpg"></span>
<img src="" title="">
<img src="" data-val>
<img src="a.jpg" title="a'>b">
<img src="a.jpg" title='a">b'>
<img src="a.jpg" title='a>=b"=>' >
but not the following:
<img src="a.jpg" />
<imgXTag src="b.jpg" >
<img src="a.jpg" / >
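As a rough check of the behaviour described above, this Python sketch runs the answer's pattern over a few of the listed cases. The helper name `close_img_tags` is hypothetical, and Python writes the backreference as `\1` where the answer uses `$1`.

```python
import re

# Pattern from the answer, verbatim.
IMG_OPEN = re.compile(r'''<\s*img\s+([^>]*=(?:\".*?\"|\'.*?\'))[\s\w\-]*>''')


def close_img_tags(html: str) -> str:
    """Rewrite unclosed <img ...> tags as self-closing <img .../> tags."""
    return IMG_OPEN.sub(r'<img \1/>', html)


# Cases the answer says should be rewritten.
print(close_img_tags('<img src="a/b.jpg" >'))
print(close_img_tags('<span><img src="a.jpg"></span>'))

# Cases the answer says should be left alone, plus the one from the question.
print(close_img_tags('<img src="a.jpg" />'))
print(close_img_tags('<imgXTag src="b.jpg" >'))
print(close_img_tags('<center><img src=""/></center>'))
```

Each case is tested as its own string here. Because `.*?` can run past a tag boundary when several tags share a line, longer documents are worth checking separately.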

Need help to write a regular expression statement (Newbie alert!)

I use Photobucket to host the images for my eBay ads when I sell things, so I copy the HTML out of Photobucket into Notepad. I'm always left with the <img> tag wrapped in Photobucket's <a> tag, and I have to go through each line and manually delete each <a></a>, which on 26 lines across multiple items can soon add up to hundreds of "highlight and delete" actions.
I already search for the closing tag </a> and "replace" it with nothing, thus removing it. What I cannot work out how to remove is the opening tag, because the image file name is different on every line.
So it's essentially the section of the anchor tag up to and including the > that I need to be able to remove on a mass scale, as the following example demonstrates. Any help would be greatly appreciated!
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC02424c_zpslt9m0cuu.jpg" border="0" alt=" photo DSC05653_zpslt9m0cuu.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC04444_zpspkgjw6vf.jpg" border="0" alt=" photo DSC05654_zpspkgjw6vf.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC05655_zpsxuev7czs.jpg" border="0" alt=" photo DSC05655_zpsxuev7czs.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC06624_zpsifjidypy.jpg" border="0" alt=" photo DSC05656_zpsifjidypy.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC07777_zpsacyjrnnr.jpg" border="0" alt=" photo DSC05663_zpsacyjrnnr.jpg"/>
<a href="[^"]+?" target="_blank">
would do what you want, or even more general:
<a href=[^>]+?>
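For anyone doing this outside a text editor, here is a small Python sketch of the same idea: remove the opening tag with the more general pattern, then remove the literal closing tag. The sample line uses a made-up `example.com` URL because the original anchor markup is not shown in the question, and `unwrap_images` is a hypothetical helper name.

```python
import re

# The more general pattern from the answer: any opening <a href=...> tag.
OPEN_A = re.compile(r'<a href=[^>]+?>')


def unwrap_images(html: str) -> str:
    """Drop opening <a href=...> tags and closing </a> tags, keeping the contents."""
    without_open = OPEN_A.sub('', html)
    return without_open.replace('</a>', '')


# Hypothetical input line; the real Photobucket link markup may differ.
line = (
    '<a href="http://example.com/album/photo.jpg.html" target="_blank">'
    '<img src="http://example.com/album/photo.jpg" border="0" alt="photo"/></a>'
)
print(unwrap_images(line))
```

In an editor with regex search and replace, the equivalent is to search for the pattern and replace it with an empty string.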

How to skip a particular tag and crawl other tag's text in Beautifulsoup

I am crawling a webpage using BeautifulSoup. In one case I want to skip the content of one particular tag and get the contents of the other tags. In the code below, I don't want the div tag's contents, but I couldn't solve this. Please help me.
HTML code:
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I have tried this:
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it also fetches the unwanted text inside the div tag.
Use the clear function like this:
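from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find('blockquote', attrs={'class': 'messagetext'})
# Empty every <div> inside the blockquote so its text is not returned
for tag in content.findChildren():
    if tag.name == 'div':
        tag.clear()
print(content.text)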
This yields:
Text..............
text
text
text
,text

Adding class value to sc:image doesn't show up

I am adding a class value to sc:image, but when it renders it doesn't show up correctly. Here is how it looks in the markup before rendering:
<a href="/">
<sc:Image ID="Logo" runat="server" Field="Header Logo" class="logo" />
</a>
But when it renders to the webpage it shows up like this:
<a href="/">
<img src="/~/media/logo.png" alt="" width="196" height="34">
</a>
However, I want to accomplish something like this:
<a href="/">
<img src="/~/media/logo.png" alt="" width="196" height="34" class="logo">
</a>
How should I approach this problem?
One way to apply a class to the image would be to place the CSS class at a higher block level that is not a web control, perhaps on a wrapping DIV. This might allow you to leverage styling across the whole block and not just the image itself.
To apply the class directly to the IMG tag, you should use the CssClass property of the Image control so that it will render out as a "class" attribute:
<a href="/">
<sc:Image ID="Logo" runat="server" Field="Header Logo" CssClass="logo" />
</a>

replace all image url in my post

I have deleted all <a> tags in my blog with this regular expression:
<a\s[^>]*> $1
Now I need to change all my image URLs from:
<img src="http://files.tampo.ua/files/news/part_38/388705/1.jpg" width="500" height="291" border="0" class="c24" />
to:
<img src="https://dl.dropbox.com/u/85819604/1.jpg" width="500" height="291" border="0" class="c24" />
So I need to replace the main path to the image server.
http://clip2net.com/s/22PwP
Try replacing http://files.tampo.ua/files/news/[^/]*/[^/]*/ with https://dl.dropbox.com/u/85819604/
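A quick way to try this replacement is the Python sketch below, which applies it to the example tag from the question. The names `OLD_PREFIX`, `NEW_PREFIX`, and `rewrite_image_urls` are hypothetical. The dots in the pattern are left unescaped as in the answer, so they match any character, which is harmless for this input.

```python
import re

# Pattern and replacement from the answer: the two [^/]* parts stand for the
# variable folders (for example part_38/388705) under /files/news/.
OLD_PREFIX = re.compile(r'http://files.tampo.ua/files/news/[^/]*/[^/]*/')
NEW_PREFIX = 'https://dl.dropbox.com/u/85819604/'


def rewrite_image_urls(html: str) -> str:
    """Swap the old image server path for the new one, keeping the file name."""
    return OLD_PREFIX.sub(NEW_PREFIX, html)


before = ('<img src="http://files.tampo.ua/files/news/part_38/388705/1.jpg" '
          'width="500" height="291" border="0" class="c24" />')
after = rewrite_image_urls(before)
print(after)
```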