REGEX: Remove everything except ALL images

I would like to know how to use a regex to remove everything EXCEPT the image tags.
I already tried these:
(?s)^[^<](.) -> removes all the text before the img tag
(?s)^([^>]+>).* -> removes all the text after the img tag
Does anybody know how to combine these two so that they work for multiple images?
Here is an example of the content I want to apply it to:
Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />Text text text. <img alt="alt text" src="path/to-image.png" />
My desired result should be:
<img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" /><img alt="alt text" src="path/to-image.png" />

Your example expressions do not work for me. However, if you turn "removing everything except all image tags" into "extracting only the image tags", you can easily get what you want:
<img [^>]*\/> <!-- EDIT: XHTML only -->
<img [^>]*\/?> <!-- covers HTML and XHTML -->
Try: http://www.regexr.com/3eq09
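To make the "extract instead of remove" idea concrete, here is a minimal Python sketch. It assumes simple, well-formed markup like the sample in the question; the function name keep_only_images is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as above: matches <img ...> as well as <img ... />.
# This is a sketch for simple markup, not a general HTML parser.
IMG_TAG = re.compile(r"<img [^>]*/?>")


def keep_only_images(html: str) -> str:
    """Return every <img> tag found in html, concatenated in order."""
    return "".join(IMG_TAG.findall(html))


sample = (
    'Text text text. <img alt="alt text" src="path/to-image.png" />'
    'Text text text. <img alt="alt text" src="path/to-image.png" />'
)
print(keep_only_images(sample))
```

Collecting the matches and joining them gives the same result as deleting everything between the tags, without needing a single "remove everything else" expression.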

Related

How can I fix this regex in order to get html tag only from a particular url?

Hello, I have an HTML file with several img tags:
<img src="https://www.pokeyplay.com/imagenes/backend/publicidad.gif" alt="Publicidad" align="left" />
<img src="https://www.pokeyplay.com/imagenes/backend/spacer.gif" alt="sp" />
<img src="imagenes/backend/etiqueta-pyp-pokedex.gif" alt="P&P PokéDex" width="184" height="100" />
<img src="imagenes/backend/spacer.gif" alt="sp" />
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
To extract all img tags I am using the following regexp:
'<img[^>]* src=\"([^\"]*)\"[^>]*>'
But I want to extract only the img tags whose source is on urpgstatic.com.
How can I do this?
I made several attempts like this one:
<img.*?src="(http[s]?:\/\/)urpgstatic.com?([^\/\s]+\/)(.*)[png]$"[^\>]+>
Thanks

Try this
<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>
This will also work with grep:
grep -iP '<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>' index.html

You may use this grep command:
grep -ioE '<img [^>]*src="https?://(www\.)?urpgstatic\.com/[^>]*>' file.html
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
Please remember, though, that parsing HTML with a regex is error-prone, and that using an HTML parser such as DOM in PHP is more reliable.
RegEx Details:
<img [^>]*src=: Match <img, a space, any characters other than >, and then src=
"https?://: Match "http:// or "https://
(www\.)?urpgstatic\.com/: Match an optional www. followed by urpgstatic.com/
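For readers not using grep, the same pattern can be sketched in Python. This is only an illustration under the same caveat about regex and HTML; the function name img_tags_from_urpgstatic is hypothetical, and the optional www. group is written as non-capturing so that the whole tag is returned.

```python
import re

# Same idea as the grep -ioE pattern above, case-insensitive.
URPG_IMG = re.compile(
    r'<img [^>]*src="https?://(?:www\.)?urpgstatic\.com/[^>]*>',
    re.IGNORECASE,
)


def img_tags_from_urpgstatic(html: str) -> list:
    """Return the full text of each <img> tag whose src is on urpgstatic.com."""
    return [match.group(0) for match in URPG_IMG.finditer(html)]


html = '''<img src="https://www.pokeyplay.com/imagenes/backend/spacer.gif" alt="sp" />
<img src="imagenes/backend/spacer.gif" alt="sp" />
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />'''

for tag in img_tags_from_urpgstatic(html):
    print(tag)
```

Because [^>]* cannot cross a closing >, a match never runs from one tag into the next.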

Regex for modifying html 'img' tag

I need to modify the 'img' tags in an HTML string that do not end with '/>'.
For example, <img src=""> needs to be changed to <img src=""/>
I am using the following regex: <img(.*[^/])> and replacing with <img$1/>
This works fine in simple cases. However, for input like <center><img src=""/></center> the regex returns <center><img src=""></center/>
Any suggestions on how to limit this regex so that it only reaches the end of the img tag? Thanks.

You may use this:
<\s*img\s+([^>]*=(?:\".*?\"|\'.*?\'))[\s\w\-]*>
with the following replacement:
<img $1/>
This will match these simple and complex cases:
<img src="images/a.jpg" title="test"><br/>
<img src="a/b.jpg" >
<span><img src="a.jpg"></span>
<img src="" title="">
<img src="" data-val>
<img src="a.jpg" title="a'>b">
<img src="a.jpg" title='a">b'>
<img src="a.jpg" title='a>=b"=>' >
but not the following:
<img src="a.jpg" />
<imgXTag src="b.jpg" >
<img src="a.jpg" / >
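As a rough illustration, the pattern and replacement above can be applied from Python with re.sub. The function name self_close_img is hypothetical, and Python writes the backreference as \1 rather than $1. Treat this as a sketch for short inputs: the lazy .*? inside the quotes can still run across tag boundaries in longer documents.

```python
import re

# The pattern from the answer, in a triple-quoted raw string so that both
# quote characters can appear unescaped.
IMG_OPEN = re.compile(r"""<\s*img\s+([^>]*=(?:".*?"|'.*?'))[\s\w\-]*>""")


def self_close_img(html: str) -> str:
    """Rewrite <img ...> as <img .../>, leaving already closed tags alone."""
    return IMG_OPEN.sub(r"<img \1/>", html)


print(self_close_img('<center><img src=""></center>'))
print(self_close_img('<img src="a.jpg" />'))
```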

Need help to write a regular expression statement (Newbie alert!)

I use Photobucket to host the images for my eBay ads. I copy the HTML out of Photobucket into Notepad, and the <img> tag is always wrapped in Photobucket's <a> tag. I have to go through each line and manually delete each <a></a>, which on 26 lines across multiple items can soon add up to hundreds of "highlight and delete" actions.
I already search for the closing tag </a> and replace it with nothing, which removes it. What I cannot work out is how to remove the opening tag, because the image file name is different on every line.
So it is essentially the section of the anchor tag up to and including the > that I need to remove on a mass scale, as the following example demonstrates. Any help would be greatly appreciated!
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC02424c_zpslt9m0cuu.jpg" border="0" alt=" photo DSC05653_zpslt9m0cuu.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC04444_zpspkgjw6vf.jpg" border="0" alt=" photo DSC05654_zpspkgjw6vf.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC05655_zpsxuev7czs.jpg" border="0" alt=" photo DSC05655_zpsxuev7czs.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC06624_zpsifjidypy.jpg" border="0" alt=" photo DSC05656_zpsifjidypy.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC07777_zpsacyjrnnr.jpg" border="0" alt=" photo DSC05663_zpsacyjrnnr.jpg"/>

<a href="[^"]+?" target="_blank">
would do what you want, or even more general:
<a href=[^>]+?>
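If Python is available, the same search-and-replace can be scripted instead of being run by hand in an editor. The sketch below uses the more general pattern; the function name strip_anchor_wrappers and the example.com URLs are made up for illustration and are not taken from the question.

```python
import re

# Opening anchor tag: "<a href=" up to and including the first ">".
OPEN_A = re.compile(r"<a href=[^>]+?>")


def strip_anchor_wrappers(html: str) -> str:
    """Remove opening <a href=...> tags and closing </a> tags, keeping their contents."""
    without_open = OPEN_A.sub("", html)
    return without_open.replace("</a>", "")


# Hypothetical input line; the real href and src values will differ.
line = (
    '<a href="http://example.com/page" target="_blank">'
    '<img src="http://example.com/a.jpg" border="0" alt="photo"/></a>'
)
print(strip_anchor_wrappers(line))
```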

PowerShell regular expression to get all the HTML tags

I have a string with HTML tags. I have to write a PowerShell script that splits this string on both opening and closing HTML tags, using a regular expression. I have tried many times but with no luck.
<([A-Z][A-Z0-9])[^>]>
I tried this for opening tags, but it only removes the '<' and '>' from the string, not the whole tag.
My string is something like this:
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
My desired output is: some text inside. This is text inside font. this is h1 text. This is a new paragraph.

Not sure how you're doing your split, but it shouldn't be that difficult:
$Text = @'
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
'@
$Text -split '<.+?>' -match '\S'
some text inside.
this is text inside font.
this is h1 text.
This is a new paragraph.
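For comparison only, here is the same split-then-filter idea as a Python sketch. It is not a translation of any PowerShell API; the function name text_between_tags is hypothetical, and stripping the surrounding whitespace from each piece is an added step.

```python
import re


def text_between_tags(markup: str) -> list:
    """Split on anything that looks like a tag and keep the non-blank pieces."""
    pieces = re.split(r"<.+?>", markup)
    return [piece.strip() for piece in pieces if piece.strip()]


text = """<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>"""

for piece in text_between_tags(text):
    print(piece)
```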

How to skip a particular tag and crawl other tag's text in Beautifulsoup

I am crawling a webpage using BeautifulSoup. In one case I want to skip the content of one particular tag and get the contents of the other tags. In the HTML below, I don't want the contents of the div tag, but I couldn't work out how to exclude them. Please help me.
HTML code:
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I have tried this:
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it also fetches the unwanted text inside the div tag.

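Use the clear() method like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc holds the HTML from the question
content = soup.find('blockquote', attrs={'class': 'messagetext'})
for tag in content.find_all('div'):
    tag.clear()  # empty the div so its text no longer shows up in content.text
print(content.text)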
This yields:
Text..............
text
text
text
,text

We Keep Coding
