I need to process HTML content and replace each IMG SRC value with the actual data. I have chosen regular expressions for this.
As a first step I need to find the IMG tags. For this I am using the following expression:
<img.*src.*=\s*".*"
Then, within the IMG tag, I look for SRC="..." and replace it with the new SRC value. I am using the following expression to get the SRC:
src\s*=\s*".*"\s*
The second expression has a problem.
For the following text it works:
<img alt="3D""" hspace=
"3D0" src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline" border="3d0" />
But for the following it does not:
<img alt="3D""" hspace="3D0" src=
"3D"cid:UHYNUEWHVTSH.lilies.jpg"" align="3dbaseline"
border="3d0" />
What happens is the expression returns
src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline"
It does not return only the src part as expected.
I am using C++ Boost regex library.
Please help me to figure out the problem.
Thanks,
Hilmi.
The problem is that .* is a "greedy" match - it will grab as much text as it possibly can while still allowing the regex to match. What you probably want is something like this:
src\s*=\s*"[^"]*"\s*
which will only match non-doublequote characters inside the src string, and thus not go past the ending doublequote.
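To make the difference concrete, here is a minimal sketch in Python (used only for illustration; the `.*` versus `[^"]*` behaviour shown is the same in Perl-style engines). The input tag is a made-up, simplified example, not the asker's exact data.

```python
import re

# Hypothetical, simplified tag with more quoted attributes after src.
tag = '<img alt="" src="cid:hills.jpg" align="baseline" border="0" />'

greedy = re.compile(r'src\s*=\s*".*"\s*')        # .* runs to the LAST quote it can reach
negated = re.compile(r'src\s*=\s*"[^"]*"\s*')    # [^"]* stops at the first closing quote

greedy_match = greedy.search(tag).group(0)
negated_match = negated.search(tag).group(0)

print(repr(greedy_match))
print(repr(negated_match))
```

The greedy pattern swallows the align and border attributes as well, while the negated character class ends at the quote that closes the src value.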
Your first regex doesn't work on your sample text for me. I usually use this instead, when looking for specific HTML tags:
<img[^>]*>
Also, try this for your second expression:
src\s*=\s*"[^"]*"\s*
Does that help?
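Putting the two expressions together, a two-step replacement could look roughly like the Python sketch below. The function name `replace_img_src`, the sample HTML and the data URI are assumptions made for the example; the same idea should carry over to Boost with a tag search followed by a replace inside each match.

```python
import re

IMG_TAG = re.compile(r'<img[^>]*>', re.IGNORECASE)
# Group 1 keeps 'src="' (including any whitespace or line break), group 2 the closing quote.
SRC_ATTR = re.compile(r'(src\s*=\s*")[^"]*(")', re.IGNORECASE)

def replace_img_src(html, new_src):
    """Replace the src value of every <img> tag in html with new_src."""
    def rewrite_tag(tag_match):
        # A function replacement inserts new_src literally (no backslash escapes interpreted).
        return SRC_ATTR.sub(lambda m: m.group(1) + new_src + m.group(2),
                            tag_match.group(0), count=1)
    return IMG_TAG.sub(rewrite_tag, html)

# Hypothetical input: the src attribute is wrapped onto the next line.
html = '<p><img alt="" src=\n"cid:hills.jpg" border="0" /></p><script src="app.js"></script>'
result = replace_img_src(html, "data:image/png;base64,AAAA")
print(result)
```

Because the src pattern is only applied inside text already matched as an img tag, the src of the script tag is left alone.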
I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a:
\<(?!img|a)[^\>]+\>
This works well, but I also want it to match the closing tags. I've tried the following, but it doesn't work:
\</?(?!img|a)[^\>]+\>
What would be the best way to do this?
(Also, before there is a plethora of comments saying not to use regexes to parse HTML, I'd just like to say that this HTML is generated by a tool and is very uniform.)
EDIT:
<p>So in this</p>
<p>HTML <strong>with nested tags</strong></p>
<p>It should remove <i>everything</i> except This link
and this <img src="#" alt="image" /> but it also needs to kep the textual content</p>
I think that the simplest solution would be the following:
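<(?!\/?(?:img|a)\b)[^>]+>
It simply matches:
a <,
asserts that what follows is neither img nor a, with or without a leading / (the \/? has to sit inside the lookahead, otherwise the engine can backtrack, let it match nothing, and still match </a>; the \b keeps tags such as <abbr> from being skipped),
a sequence of anything but > ([^>]+) and
a >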
See it working here on regex101.
Ok here is a pretty wasteful solution:
<(?!img|a|\/img|\/a)[^>]+>
It would be great if someone could find a better one.
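For comparison, here is a small Python sketch (Python's re is used only for illustration, and the sample HTML is made up) of one way to cover closing tags: keep the optional slash inside the lookahead. When the slash sits outside, as in </?(?!img|a)[^>]+>, the engine can backtrack so that /? matches nothing, and </a> is still matched.

```python
import re

# Optional slash inside the lookahead; \b stops "a" from also excluding <abbr>, <article>, ...
STRIP_TAGS = re.compile(r'<(?!/?(?:img|a)\b)[^>]+>')
# Variant with the slash outside the lookahead, shown only for contrast.
NAIVE = re.compile(r'</?(?!img|a)[^>]+>')

html = ('<p>HTML <strong>with nested tags</strong></p>'
        '<p>Keep <a href="#">this link</a> and <img src="#" alt="image" /> '
        'but drop <abbr>other</abbr> tags</p>')

cleaned = STRIP_TAGS.sub('', html)
naive_matches_closing_a = NAIVE.search('</a>') is not None

print(cleaned)
print(naive_matches_closing_a)
```

The text content is kept, the a and img tags survive, and every other opening or closing tag is removed.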
I'm using the following expression in classic asp that successfully grabs any image tag with a .jpg and .png suffix.
re.Pattern = " ]*src=[""'][^ >]*(jpg|png)[""']"
The problem I've found is that many of the sites I need to work with do not actually use a suffix. So I need a new regex that finds an image tag and grabs whatever is in the src attribute.
As simple as this sounds, finding a regular expression to accomplish this in Classic ASP seems impossible without writing it myself (which IS impossible).
Please advise.
To match plainly on the img src you can do:
\<img src\=\"(\w+\.(gif|jpg|png)\")
And then if you only want the value that's in the img src, you can do a match for anything in quotes ending in a picture extension (but this may get you false positives depending on what you want):
\w+\.(gif|jpg|png)
But to match just the value while ensuring that it follows img src, you need a negative lookahead to do this (note that I added a matching group there):
(?!.*\<img src\=\")(\w+\.(gif|jpg|png))
Now to include the possibility of having image links in your image source:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)?[\?\w+\%]+)
And then let's remove the false positives we get by fixing that optional quantifier (?) after (gif|jpg|png): move it to after the next set (which matches data you may get in a JS link, etc.) and make sure we have an end quote:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)([\?\w+\%]+)?)(?=\")
Note: This will match this data, but regular expressions don't parse HTML, and I personally don't recommend using regular expressions to look through HTML data unless you're doing it on a case-by-case basis. If you're wanting to do some URL/Image scraping via a script, look into an XML/HTML parser.
Sample data:
<img src="picture.gif">
<img src="pic859.jpg">
<img src="859.png">
<img id="test1" class="answer1" src="text.jpg">
<img src="http://media.site.com/media/img/staff/2013/ROTHBARD-350_s90x126.jpg?e3e29f4a7131cd3bc7c4bf334be801215db5e3c2%22%3E">
<img src="yahoo.com/images/imagename.gif">
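If the goal is simply "whatever is in the src attribute", regardless of file extension, a capture group after src= is one option. The Python sketch below is only an illustration under that assumption; it runs over a subset of the sample data above, and the pattern is not taken from the answer.

```python
import re

# Capture everything between the quotes that follow src=, wherever src sits in the tag.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)

sample = '''<img src="picture.gif">
<img src="pic859.jpg">
<img src="859.png">
<img id="test1" class="answer1" src="text.jpg">
<img src="yahoo.com/images/imagename.gif">'''

sources = IMG_SRC.findall(sample)
for src in sources:
    print(src)
```

Since nothing in the pattern depends on .gif, .jpg or .png, it also picks up image URLs that have no suffix at all.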
HTML Source
I've tried some solutions found on the web, but they didn't help.
Given:
<p><img alt="" src="images/img2.jpg" style="float:left; height:300px; width:600px" /></p><p>bla-bla-bla</p>
I need to get:
images/img2.jpg.
I am currently using preg_match('$<img.*src="(.*)"$', $text, $matches); and it does not give a result.
Use the regex: <img.*?src="(.*?)".*?/>
This will match your image tags, and the first capture group will give you your path (the lazy .*? keeps the group from running on into the style attribute). Your specific language may require some massaging of the regex.
In general, parsing tags with regex is not a good idea, however (if your tag spans lines it won't hit it, for instance).
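On the line-spanning point: negated character classes such as [^>] and [^"] do match newlines, whereas . does not by default, so a pattern built from them is one way around that particular limitation. A hedged Python sketch follows (the wrapped variant of the tag is hypothetical):

```python
import re

IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]*)"')

one_line = ('<p><img alt="" src="images/img2.jpg" '
            'style="float:left; height:300px; width:600px" /></p><p>bla-bla-bla</p>')
# The same kind of tag, hypothetically wrapped onto a second line.
wrapped = '<p><img alt=""\n   src="images/img2.jpg" style="float:left" /></p>'

path_one_line = IMG_SRC.search(one_line).group(1)
path_wrapped = IMG_SRC.search(wrapped).group(1)

print(path_one_line)
print(path_wrapped)
```

This only addresses line breaks and greedy captures; it does not make the regex a general HTML parser.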
I want to grab an img tag from text returned as JSON data. I want to grab this from a string:
<img class="img" src="https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/1239478_598075296936250_1910331324_s.jpg" alt="" />
What is the regular expression I must use to match it?
I used the following, but it is not working.
"<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>"
You could simply use this expression to match an img tag as in the example :
<img([\w\W]+?)/>
Your regex doesn't match the string, because it's missing the closing /.
Edit - No, the / is not necessary, so your regex should have worked. But you can relax it a bit like below.
Slightly modified:
<img\s[^>]*?src\s*=\s*['\"]([^'\"]*?)['\"][^>]*?>
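One possible refinement, sketched here in Python purely as an illustration: capture the opening quote and require the same character as the closing quote via a backreference, so that an apostrophe inside a double-quoted value does not end the match early. The second test tag is a made-up example.

```python
import re

# Group 1 is the opening quote, group 2 the value; \1 demands the same quote to close it.
IMG_SRC = re.compile(r'''<img\s[^>]*?src\s*=\s*(['"])(.*?)\1[^>]*?>''', re.DOTALL)
# Character-class version, for contrast: either quote character ends the value.
CLASS_BASED = re.compile(r'''<img\s[^>]*?src\s*=\s*['"]([^'"]*?)['"][^>]*?>''')

tag = ('<img class="img" src="https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/'
       '1239478_598075296936250_1910331324_s.jpg" alt="" />')
src = IMG_SRC.search(tag).group(2)

tricky = '<img src="it\'s.png">'   # hypothetical value containing an apostrophe
with_backref = IMG_SRC.search(tricky).group(2)
with_class = CLASS_BASED.search(tricky).group(1)

print(src)
print(with_backref, with_class)
```

For the tag in the question both variants return the same URL; they only differ on values that contain the other quote character.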
Please note you shouldn't use regular expressions to parse HTML, for various reasons.
<img\s+[^>]*src="([^"]*)"[^>]*>
Or use Jsoup...
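import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Keep the literal on one line: a Java string literal cannot span lines.
String html = "<img class=\"img\" src=\"https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/1239478_598075296936250_1910331324_s.jpg\" alt=\"\" />";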
Document doc = Jsoup.parse(html);
Element img = doc.select("img").first();
String src = img.attr("src");
System.out.println(src);
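The same parser-based idea can be sketched without a third-party library. The Python example below uses the standard library's html.parser; the class name `ImgSrcCollector` is made up for the illustration, and this is a minimal sketch rather than a replacement for a full HTML library such as Jsoup.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

html = ('<img class="img" src="https://fbcdn-photos-c-a.akamaihd.net/hphotos-ak-frc3/'
        '1239478_598075296936250_1910331324_s.jpg" alt="" />')

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```

The parser takes care of quoting styles and attribute order, which are the details the regex versions have to spell out by hand.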
I faced the same situation, tried this, and it worked for me.
(<img)[^/>]*(/>|>)
Here is the explanation (originally an image, taken from the website https://extendsclass.com/regex-tester.html).
I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More importantly, I need a regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.
You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
You can use a capturing group to get just the value instead of tag + value:
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
As far as I know, a regex is applied line by line by default, but you could write an expression that works around that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want with \1
In the case of <span class="xpColumn"> it would simply not match, and \1 would be empty.
Cheers :)
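As a rough illustration of the capture-group approach on the multi-line snippet from the question, here is a Python sketch. It swaps (.*) for ([^<]*) so that a match cannot run across a nested tag; that change is an assumption of this example, not something the answers above prescribe.

```python
import re

html = '''<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>'''

# Only bare <span>...</span> pairs with plain text inside are matched;
# the spans with a class attribute contain further tags and are skipped.
values = re.findall(r'<span>([^<]*)</span>', html)
print(values)
```

findall returns only the contents of the capture group, so the tags themselves are not part of the result.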