I would like to know how I can use a regex for removing everything EXCEPT all image tags.
I already tried these:
(?s)^[^<]*(.*) -> removes all the text before the img tag
(?s)^([^>]+>).* -> removes all the text after the img tag
Does anybody know how to combine these 2 for multiple images?
Here is an example of the content I want to apply it to:
Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />
My desired result should be:
<img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" />
Your example expressions do not work for me. However, turn "removing everything except all image tags" into "extract only image tags" and you can easily get what you want:
<img [^>]*\/> <!-- EDIT: XHTML only -->
<img [^>]*\/?> <!-- covers HTML and XHTML -->
Try: http://www.regexr.com/3eq09
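For anyone who wants to try the "extract instead of remove" idea in a script, here is a minimal Python sketch. It uses the second pattern above; the function name keep_only_img_tags and the shortened sample string are only for illustration.

```python
import re

# Pattern from the answer: "<img ", any run of characters other than ">",
# an optional "/", then the closing ">".
IMG_TAG = re.compile(r"<img [^>]*/?>")


def keep_only_img_tags(html: str) -> str:
    """Return every <img ...> tag found in html, joined with nothing in between."""
    return "".join(IMG_TAG.findall(html))


sample = (
    'Text text text. <img alt="alt text" src="path/to-image.png" />'
    'Text text text. <img alt="alt text" src="path/to-image.png" />'
)
result = keep_only_img_tags(sample)
print(result)
```

Like any regex approach to HTML, this would break on a ">" inside an attribute value, so treat it as a sketch for simple input.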
Hello, I have an HTML file with several img tags:
<img src="https://www.pokeyplay.com/imagenes/backend/publicidad.gif" alt="Publicidad" align="left" />
<img src="https://www.pokeyplay.com/imagenes/backend/spacer.gif" alt="sp" />
<img src="imagenes/backend/etiqueta-pyp-pokedex.gif" alt="P&P PokéDex" width="184" height="100" />
<img src="imagenes/backend/spacer.gif" alt="sp" />
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
To extract all img tags I am using the following regexp:
'<img[^>]* src=\"([^\"]*)\"[^>]*>'
But I want to extract only the img tags whose source is on urpgstatic.com.
How can I do this?
I made several attempts like this:
<img.*?src="(http[s]?:\/\/)urpgstatic.com?([^\/\s]+\/)(.*)[png]$"[^\>]+>
Thanks
Try this
<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>
Demo
Also, this will work with grep
grep -iP '<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>' index.html
You may use this grep command:
grep -ioE '<img [^>]*src="https?://(www\.)?urpgstatic\.com/[^>]*>' file.html
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
Please remember, though, that parsing HTML with a regex is error-prone, and using an HTML parser such as DOM in PHP is more reliable.
RegEx Details:
<img [^>]*src=: Match <img, then any run of characters other than >, then src=
"https?://: Match "http:// or "https://
(www\.)?urpgstatic\.com/: Match optional www. followed by urpgstatic.com/
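The same pattern can be used outside grep. Below is a rough Python sketch, assuming the HTML is already in a string; re.IGNORECASE stands in for grep's -i, and the group is made non-capturing because only the whole match is needed. The variable names are illustrative.

```python
import re

# Same expression as the grep -E command, with (?:...) instead of (...)
URPG_IMG = re.compile(
    r'<img [^>]*src="https?://(?:www\.)?urpgstatic\.com/[^>]*>',
    re.IGNORECASE,
)

html = '''<img src="https://www.pokeyplay.com/imagenes/backend/spacer.gif" alt="sp" />
<img src="imagenes/backend/spacer.gif" alt="sp" />
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />'''

# group(0) is the full matched tag, which is what grep -o prints
matches = [m.group(0) for m in URPG_IMG.finditer(html)]
for tag in matches:
    print(tag)
```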
I have a requirement where I need to modify html 'img' tags in an html string that do not end with a '/>'
ex: <img src=""> needs to be changed to <img src=""/>
I am using following regex: <img(.*[^/])> to replace with <img$1/>
This works fine however for cases like: <center><img src=""/></center> the regex returns: <center><img src=""></center/>
Any suggestions on how to limit this regex so that it stops at the end of the img tag? Thanks.
You may use this:
<\s*img\s+([^>]*=(?:\".*?\"|\'.*?\'))[\s\w\-]*>
with the following replacement:
<img $1/>
this will match these simple and complex cases:
<img src="images/a.jpg" title="test"><br/>
<img src="a/b.jpg" >
<span><img src="a.jpg"></span>
<img src="" title="">
<img src="" data-val>
<img src="a.jpg" title="a'>b">
<img src="a.jpg" title='a">b'>
<img src="a.jpg" title='a>=b"=>' >
but not the following:
<img src="a.jpg" />
<imgXTag src="b.jpg" >
<img src="a.jpg" / >
Sample Demo
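To check the expression locally, here is a small Python sketch that applies it with re.sub. The pattern is the one given above; the helper name close_img_tags is made up for the example, and \1 is Python's spelling of the $1 backreference.

```python
import re

# The answer's pattern, written in a triple-quoted raw string so both quote
# characters can appear unescaped.
OPEN_IMG = re.compile(r"""<\s*img\s+([^>]*=(?:".*?"|'.*?'))[\s\w\-]*>""")


def close_img_tags(html: str) -> str:
    """Rewrite <img ...> tags that are not self-closed as <img .../>."""
    return OPEN_IMG.sub(r"<img \1/>", html)


print(close_img_tags('<center><img src=""></center>'))
print(close_img_tags('<center><img src=""/></center>'))
print(close_img_tags('<img src="images/a.jpg" title="test"><br/>'))
```

I have only tried this on one tag per string. Because .*? may run past the end of a tag to reach a later quote, input with several tags on one line deserves a closer look before relying on it.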
I use Photobucket to host the images for my eBay ads when I sell things, so I copy the HTML out of Photobucket into Notepad. I'm always left with the <img> tag wrapped in Photobucket's <a> tag, and I have to go through each line and manually delete each <a></a>, which on 26 lines across multiple items can soon add up to hundreds of "highlight and delete" actions.
I already search for the closing tag </a> and replace it with nothing, thus removing it. What I cannot work out how to remove is the opening tag, because the image file name is different on every line, as the following example demonstrates:
So it's essentially the section of the anchor tag up to and including the > I need to be able to remove on a mass scale - Any help would be greatly appreciated!
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC02424c_zpslt9m0cuu.jpg" border="0" alt=" photo DSC05653_zpslt9m0cuu.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC04444_zpspkgjw6vf.jpg" border="0" alt=" photo DSC05654_zpspkgjw6vf.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC05655_zpsxuev7czs.jpg" border="0" alt=" photo DSC05655_zpsxuev7czs.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC06624_zpsifjidypy.jpg" border="0" alt=" photo DSC05656_zpsifjidypy.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC07777_zpsacyjrnnr.jpg" border="0" alt=" photo DSC05663_zpsacyjrnnr.jpg"/>
<a href="[^"]+?" target="_blank">
would do what you want, or even more general:
<a href=[^>]+?>
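If the editor's search-and-replace is not convenient, the same idea can be scripted. This is only a sketch in Python: it uses the more general pattern above for the opening tag and a plain string replace for </a>. The input line is invented for illustration (the URLs are placeholders, not real Photobucket links), and unwrap_images is a made-up name.

```python
import re

# Opening anchor tag: "<a href=" followed by everything up to the first ">"
A_OPEN = re.compile(r"<a href=[^>]+?>")


def unwrap_images(html: str) -> str:
    """Remove opening <a href=...> tags and closing </a> tags, keeping what they wrap."""
    return A_OPEN.sub("", html).replace("</a>", "")


# Hypothetical input line; the href and src values are placeholders
line = (
    '<a href="http://example.com/page" target="_blank">'
    '<img src="http://example.com/photo.jpg" border="0" alt="photo"/></a>'
)
cleaned = unwrap_images(line)
print(cleaned)
```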
I have a string with HTML tags. I have to write a PowerShell script to split this string using a regular expression for HTML tags, both opening and closing. I have tried many times but with no luck.
<([A-Z][A-Z0-9]*)[^>]*>
I have tried this for opening tags. But it only removes the '<' and '>' from the string, not the whole tag.
My string is something like this:
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
My desired output is: some text inside. This is text inside font. this is h1 text. This is a new paragraph.
Not sure how you're doing your split, but it shouldn't be that difficult:
$Text =
@'
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
'@
$text -split '<.+?>' -match '\S'
some text inside.
this is text inside font.
this is h1 text.
This is a new paragraph.
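For comparison, roughly the same split-then-filter idea in Python, as a sketch rather than a drop-in replacement for the PowerShell line: re.split plays the role of -split, and the list comprehension plays the role of -match '\S'. It additionally strips the newlines around each piece.

```python
import re

text = """<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>"""

# Split on every tag, then keep only the pieces that contain non-whitespace
pieces = [p.strip() for p in re.split(r"<.+?>", text) if p.strip()]
print(" ".join(pieces))
```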
I am crawling a webpage using BeautifulSoup. In one case I want to skip the content of one particular tag and get the contents of the other tags. In the HTML below I don't want the contents of the div tag, but I couldn't solve this. Please help me.
HTML code:
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I have tried this:
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it also fetches the unwanted text inside the div tag.
Use the clear function like this:
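from bs4 import BeautifulSoup

# html_doc holds the HTML shown in the question
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find('blockquote', attrs={'class': 'messagetext'})
# Empty every div nested inside the blockquote so its text is dropped
for tag in content.findChildren():
    if tag.name == 'div':
        tag.clear()
print(content.text)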
This yields:
Text..............
text
text
text
,text
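If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's html.parser. This is a different technique from the clear() approach above, not what the answer uses: it walks the whole document (it does not look for the blockquote specifically) and collects text only while it is outside any div. The class name TextOutsideDiv and the shortened html_doc are assumptions for the example.

```python
from html.parser import HTMLParser


class TextOutsideDiv(HTMLParser):
    """Collect text chunks, ignoring anything nested inside a <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # how many <div> elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only when no <div> is open
        if self.div_depth == 0 and data.strip():
            self.chunks.append(data.strip())


html_doc = '''<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
,text
</blockquote>'''

parser = TextOutsideDiv()
parser.feed(html_doc)
parser.close()
print(parser.chunks)
```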