Need help to write a regular expression statement (Newbie alert!) - regex

I use Photobucket to host the images for my eBay ads, so I copy the HTML out of Photobucket into Notepad. Every <img> tag ends up wrapped in Photobucket's <a> tag, and I have to go through each line and manually delete each <a></a> pair, which on 26 lines across multiple items can soon add up to hundreds of "highlight and delete" actions.
I already search for the closing tag </a> and replace it with nothing, which removes it. What I cannot work out how to remove is the opening tag, because the image file name in it is different on every line, as the following example demonstrates:
So it's essentially the part of the anchor tag up to and including the > that I need to remove on a mass scale. Any help would be greatly appreciated!
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC02424c_zpslt9m0cuu.jpg" border="0" alt=" photo DSC05653_zpslt9m0cuu.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC04444_zpspkgjw6vf.jpg" border="0" alt=" photo DSC05654_zpspkgjw6vf.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC05655_zpsxuev7czs.jpg" border="0" alt=" photo DSC05655_zpsxuev7czs.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC06624_zpsifjidypy.jpg" border="0" alt=" photo DSC05656_zpsifjidypy.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC07777_zpsacyjrnnr.jpg" border="0" alt=" photo DSC05663_zpsacyjrnnr.jpg"/>

<a href="[^"]+?" target="_blank">
would do what you want, or even more general:
<a href=[^>]+?>
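As a rough illustration of the more general pattern, here is a minimal Python sketch. The input line and its example.com URLs are hypothetical placeholders (they are not taken from the question), and it assumes each anchor tag starts with <a href=, as the pattern requires.

```python
import re

# Hypothetical input line; the URLs are placeholders, not real Photobucket links.
line = ('<a href="http://example.com/page.html" target="_blank">'
        '<img src="http://example.com/DSC0001.jpg" border="0" alt="photo"/></a>')

# Remove the opening anchor tag using the general pattern from the answer,
# then remove the closing tag with a plain string replace.
without_open = re.sub(r'<a href=[^>]+?>', '', line)
cleaned = without_open.replace('</a>', '')

print(cleaned)
```

The same pattern can be pasted into the "find" box of any editor that supports regular expressions, with an empty replacement.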

Related

Regex to add style attribute to a specific <table> tag

Please see the update near the bottom.
I am having to deal with a large amount of imported HTML code that is poorly formatted.
I have around 200 similar (but not identical) instances of the code, and each instance includes a specific set of <img> tags. In some instances, the <img> tags run from one to the next, with no line breaks in between. In other instances there are line breaks in the code, and these result in <br> tags being inserted into the final code sent to the browser.
This will make more sense once I illustrate what I mean:
Example #1: There are no breaks between the <img> tags...
<table align="center" border="0px"> <tbody><tr> <td> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/CustomerSatisfaction.png" alt="100% Customer Satisfaction" height="60" align="middle" width="140"> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/PaypalVerified.png" alt="Paypal Verified" height="60" align="middle" width="140"> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/FastDelivery.png" alt="Fast Delivery" height="60" align="middle" width="140"> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/Recycled.png" alt="100% Recyled Pre-owned Products" height="60" align="middle" width="140"> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/TopSellerRated.png" alt="Top Seller Rated" height="60" align="middle" width="140"> <img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/PhoneSupport.png" alt="Phone Support" height="60" align="middle" width="140"> </td> </tr> </tbody></table>
Example #2: There are breaks between the <img> tags...
<table align="center" border="0px">
<tbody><tr>
<td>
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/CustomerSatisfaction.png" alt="100% Customer Satisfaction" align="middle" height="60" width="140">
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/PaypalVerified.png" alt="Paypal Verified" align="middle" height="60" width="140">
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/FastDelivery.png" alt="Fast Delivery" align="middle" height="60" width="140">
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/Recycled.png" alt="100% Recyled Pre-owned Products" align="middle" height="60" width="140">
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/TopSellerRated.png" alt="Top Seller Rated" align="middle" height="60" width="140">
<img src="http://simplicitywebsitedesign.com/iOutlet/images/buttons/PhoneSupport.png" alt="Phone Support" align="middle" height="60" width="140">
</td>
</tr>
</tbody></table>
As mentioned, for reasons unknown to me, the WordPress site on which this code is used inserts <br> tags when code example #2 is sent to the browser.
That results in the images displaying as follows (on Firefox):
Code sample #1 displays like this:
I am thinking the best way to resolve this is to do a search/replace via MySQL on the DB, using a regular expression that will identify instances of code example #2 and make them like code example #1. In other words, the line breaks will be removed from between the relevant <img> tags.
Two questions:
1) Is that in fact the best way to go about this, or is there a potentially better way?
2) If that is a valid and suitable way to do it, could you suggest a suitable regular expression?
(With question 2, I am not sure what to name as the correct regex engine. This regex will be run within MySQL, using the Mac app Sequel Pro.app (http://www.sequelpro.com/).)
My take on the possible Regex logic
My guess is that we need to:
1) Find instance of <table...> ... </table>
2) Find instances of </img> (soft line break) <img ...> within code identified by #1 above
3) Remove (soft line break)
There is one other <table> ... </table> set within the code that will be searched. There is only one <img> within that instance. There are exactly 6 <img> instances within the <table> ... </table>
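The three steps above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the HTML has been pulled out of the database into a string, the img tag contents are shortened placeholders, and the pattern collapses line breaks between any two adjacent <img> tags rather than first isolating the <table> (which is harmless here, since the other table holds a single <img>).

```python
import re

# An <img ...> tag, then whitespace containing at least one line break,
# but only when another <img ...> tag follows immediately.
BREAK_BETWEEN_IMGS = re.compile(r'(<img\b[^>]*>)\s*\n\s*(?=<img\b)')

def join_img_lines(html: str) -> str:
    """Replace line breaks between adjacent <img> tags with a single space."""
    return BREAK_BETWEEN_IMGS.sub(r'\1 ', html)

# Shortened, hypothetical version of code example #2.
sample = '<td>\n<img src="a.png">\n<img src="b.png">\n<img src="c.png">\n</td>'
joined = join_img_lines(sample)
print(joined)
```

Line breaks that are not between two <img> tags (for example after <td>) are left alone.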
Update, taking comments into account
It has been suggested that I use the flex CSS display attribute, and apply it to the table row. I've done that, and it works well. I am a little concerned about compatibility on older browsers, as I gather it's a relatively recent CSS addition.
I do, however, still need to do a search/replace to locate the correct <table> in the HTML.
In most of the HTML instances, there are two instances of <table> ... </table>. So I suspect the regex would need a negative lookahead for something like /stars/, which exists in a URL inside the <table> instance I don't want modified. Then it would be a matter of replacing <table> with <table id="green-icons">
Thanks.
Jonathan
P.S. I am aware there is a LOT of contention around whether or not regex is a valid way to make changes to HTML. As this is a relatively fixed and known set of HTML, I suspect it'll be okay. But I am also open to other suggestions.
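One way the negative-lookahead idea from the update might look is sketched below in Python. It is a hedged illustration, not a tested fix for the real data: it assumes tables are not nested, that /stars/ appears only inside the table to be skipped, and that the HTML is processed outside the database. The sample markup and example.com URLs are hypothetical.

```python
import re

TABLE_OPEN = re.compile(
    r'<table\b(?P<attrs>[^>]*)>'        # an opening <table ...> tag
    r'(?!(?:(?!</table>).)*?/stars/)',  # with no /stars/ before its </table>
    re.DOTALL,
)

def tag_icon_table(html: str) -> str:
    """Add id="green-icons" to every table that does not contain /stars/."""
    return TABLE_OPEN.sub(r'<table id="green-icons"\g<attrs>>', html)

# Hypothetical sample: the first table holds the /stars/ image, the second the icons.
sample = (
    '<table><tr><td><img src="http://example.com/stars/5.png"></td></tr></table>\n'
    '<table align="center" border="0px"><tr><td>'
    '<img src="http://example.com/buttons/FastDelivery.png"></td></tr></table>'
)
tagged = tag_icon_table(sample)
print(tagged)
```

The inner (?!</table>) check stops the lookahead from scanning past the end of the current table, so a /stars/ URL in a later table does not block the match. Running the function twice would add the id twice, so it should be applied once per document.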

How to skip a particular tag and crawl other tag's text in Beautifulsoup

I am crawling a webpage using BeautifulSoup. In one case I want to skip the content of one particular tag and get the contents of the other tags. In the code below I don't want the contents of the div tag, but I couldn't work out how to exclude them. Please help me.
HTML code,
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I have tried like this,
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it is fetching unwanted text inside div tag also.
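Use the clear() method like this:
from bs4 import BeautifulSoup

# html_doc holds the HTML shown in the question
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find('blockquote', attrs={'class': 'messagetext'})
# Empty every <div> inside the blockquote so its text is not returned
for tag in content.find_all('div'):
    tag.clear()
print(content.text)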
This yields:
Text..............
text
text
text
,text

JAWS screen reader adding tabIndex of -1 to anchors with images

I have three anchor tags, one with text inside and two with images with valid alt text. The anchor tag with text inside works fine with JAWS and is read properly. However, for some reason, with the anchors with the image inside, a tabIndex of -1 is being applied, which means they are being skipped over.
This is being tested in IE 9. Is there any reason why this should be occurring? Is there a way to prevent it?
I had a similar issue with JAWS setting the tabindex of links to -1. This was with IE9 and JAWS 14.0.
The problem ended up being caused by a setting in JAWS under "Web / HTML / PDFs" -> "Links" called "Filter Consecutive Duplicate Links". JAWS describes the feature as follows:
This option controls whether consecutive links that point to the same location, one graphical and one text, are filtered. When selected, only the text link is announced. This check box is selected by default.
For example, let's say you have an icon / text link pair that both do the same thing:
<a href="javascript:void(0)" onclick="test();">
<img src="untitled.png" title="Test" alt="Test">
</a>
TEST
With the setting checked JAWS will remove the image from the tab order leaving only the text link like this:
<a tabindex="-1" href="javascript:void(0)" onclick="test();">
<img src="untitled.png" title="Test" alt="Test">
</a>
TEST
From my experience and some basic tests I believe this only applies when an image link is followed by a duplicate text link and not vice versa. Also it applies to any duplicate image link following the image / text pair.
The problem I ran into was that JAWS only seemed to compare the href attribute and did not take into account other attributes such as onclick or onkeydown. Combine this with the duplicate removal applying to any image link that follows the initial image / text link pair, and you can end up with a case where an image link following an image / text link pair gets removed from the tab order when it should not. Example:
<a href="javascript:void(0)" onclick="test();">
<img src="untitled.png" title="Test" alt="Test">
</a>
TEST
<a href='javascript:void(0)' onclick="dontTest();">
<img src="untitled2.png" title="Test" alt="Test">
</a>
Result:
<a tabindex="-1" href="javascript:void(0)" onclick="test();">
<img src="untitled.png" title="Test" alt="Test">
</a>
TEST
<a tabindex="-1" href='javascript:void(0)' onclick="dontTest();">
<img src="untitled2.png" title="Test" alt="Test">
</a>
Note: the fact that the href is set to javascript:void(0) is purely coincidental. This behavior should be reproducible using any value for the href as long as the value is the same for all the links.
Hope this helps someone.
JAWS automatically adds tabindex="-1" to anchor tags that have href="javascript:void(0)". I used href="#" to solve the same problem as yours.

Finding quote marks within alt tags

I'm wondering how I would approach the following problem with regular expressions. We run into the occasional problem of quote marks (") inside alt attributes, which can cause rendering issues. Would it be possible to write a regular expression to find <img> tags, but only when the alt value contains quotes?
For example these would be found
<img src="theImage.gif" width="81" height="24" border="0" style="display:block;" alt="Check "it" out">
<img src="theImage.gif" width="81" height="24" alt="Check "it" out" style="display:block;">
But not these
<img src="theImage.gif" width="81" height="24" border="0" style="display:block;" alt="Check 'it' out">
<img src="theImage.gif" width="81" height="24" border="0" style="display:block;" alt="">
<img src="theImage.gif" width="81" height="24" border="0" style="display:block;">
Thanks in advance!
This problem is intractable because you might end up with something like:
<img src="theImage.gif" width="81" height="24" alt="foo" border="bar">
Would you interpret that as an alt value of foo and a border of bar, or as an alt value of foo" border="bar?
This is why you must properly escape your data before rendering it into HTML. You can't unstir a cup of tea.
The problem is likely that the attribute value needs to be HTML encoded when rendered.
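As a small illustration of escaping at render time, here is a sketch in Python using the standard library's html.escape. The language is an assumption, since the question does not say what generates the markup; the point is only that the alt text is encoded before it is placed inside the attribute.

```python
from html import escape

# Raw alt text containing double quotes, as in the question's example.
alt_text = 'Check "it" out'

# quote=True also encodes " and ', so the value cannot end the attribute early.
img_tag = '<img src="theImage.gif" alt="{}">'.format(escape(alt_text, quote=True))

print(img_tag)
```

This prints <img src="theImage.gif" alt="Check &quot;it&quot; out">, which a browser displays as Check "it" out.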

Remove content from a wordpress.com feed using yahoo pipes

I am using Yahoo Pipes to get content matching a certain category from my WordPress.com blog. Everything is working fine, but WordPress adds "share" links to the bottom of the feed that I would like to remove.
Here is what's being added:
<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/1.0/gocomments/bandonrandon.wordpress.com/87/">
<img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bandonrandon.wordpress.com/87/"/></a>
<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bandonrandon.wordpress.com&blog=1046814&post=87&subd=bandonrandon&ref=&feed=1" width="1" height="1"/>
I edited out some of the services, but you get the idea. I tried to use regex to remove this content. What I tried was this:
<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/.*?><img alt="" border="0" src="http://feeds.wordpress.com.*?></a>
and
<img alt="" border="0" src="http://stats.wordpress.com.*?>
However, it didn't filter the results at all.
Using this would filter ALL images and works fine:
<a.*?><img.*?></a>
<a[^>]+href="http://feeds.wordpress.com[^"]*"[^>]*>\s*<img[^>]+src="http://feeds.wordpress.com/[^"]*"[^>]*>\s*</a>\s*<img[^>]+src="http://stats.wordpress.com/[^"]*"[^>]*>
Regex updated, try that to match the whole lot.
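To check the pattern outside Yahoo Pipes, here is a minimal Python sketch run against the markup from the question. The leading <p> element is a hypothetical stand-in for the post body. One likely reason the original attempts matched nothing is that they required ><img with no whitespace in between, while the feed has a line break there; the \s* in the updated pattern allows for that.

```python
import re

SHARE_BLOCK = re.compile(
    r'<a[^>]+href="http://feeds.wordpress.com[^"]*"[^>]*>\s*'
    r'<img[^>]+src="http://feeds.wordpress.com/[^"]*"[^>]*>\s*</a>\s*'
    r'<img[^>]+src="http://stats.wordpress.com/[^"]*"[^>]*>'
)

# "<p>Post body</p>" is a placeholder; the rest is the markup quoted in the question.
feed_html = (
    '<p>Post body</p>\n'
    '<a rel="nofollow" target="_blank" href="http://feeds.wordpress.com/1.0/gocomments/bandonrandon.wordpress.com/87/">\n'
    '<img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/bandonrandon.wordpress.com/87/"/></a>\n'
    '<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bandonrandon.wordpress.com&blog=1046814&post=87&subd=bandonrandon&ref=&feed=1" width="1" height="1"/>'
)

# Strip the share link and the tracking pixel, leaving the post body.
cleaned_feed = SHARE_BLOCK.sub('', feed_html)
print(cleaned_feed)
```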