I'm using the following expression in Classic ASP, which successfully grabs any image tag with a .jpg or .png suffix.
re.Pattern = "<img[^>]*src=[""'][^ >]*(jpg|png)[""']"
The problem I've found is that many sites I need to use do not actually use a suffix. So, I need a new regex that finds an image tag and grabs whatever is in the src attribute.
As simple as this sounds, finding a regular expression to accomplish this in Classic ASP seems impossible without writing it myself (which IS impossible).
Please advise.
To match plainly on the img src you can do:
\<img src\=\"(\w+\.(gif|jpg|png)\")
And then if you only want the value that's in the img src, you can do a match for anything in quotes ending in a picture extension (but this may get you false positives depending on what you want):
\w+\.(gif|jpg|png)
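But to match just the value while trying to tie it to an img src, the pattern below uses a negative lookahead (note that I added a matching group there). Strictly speaking, the lookahead only asserts that no further <img src=" appears later on the same line, so treat it as a heuristic rather than a real check that the value sits inside an img tag: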
(?!.*\<img src\=\")(\w+\.(gif|jpg|png))
Now to include the possibility of having image links in your image source:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)?[\?\w+\%]+)
And then let's remove the false positives we get by fixing that optional quantifier (?) after (gif|jpg|png), moving it to after the next set (which matches data you may get in a JS link, etc.), and making sure we have an end quote:
(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)([\?\w+\%]+)?)(?=\")
Note: This will match this data, but regular expressions don't parse HTML, and I personally don't recommend using regular expressions to look through HTML data unless you're doing it on a case-by-case basis. If you're wanting to do some URL/Image scraping via a script, look into an XML/HTML parser.
Sample data:
<img src="picture.gif">
<img src="pic859.jpg">
<img src="859.png">
<img id="test1" class="answer1" src="text.jpg">
<img src="http://media.site.com/media/img/staff/2013/ROTHBARD-350_s90x126.jpg?e3e29f4a7131cd3bc7c4bf334be801215db5e3c2%22%3E">
<img src="yahoo.com/images/imagename.gif">
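Since the question asks for whatever is in the src attribute, with or without a suffix, a pattern that never mentions an extension may be closer to what is needed. The following is a minimal sketch in Python rather than Classic ASP; IMG_SRC and extract_img_srcs are hypothetical names, the example.com URL is made up, and the pattern assumes quoted attribute values with no literal ">" inside any attribute.

```python
import re

# Hypothetical pattern: find an <img ...> tag, then capture the quoted src value.
# Group 1 remembers which quote character opened the value; group 2 is the value.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

def extract_img_srcs(html):
    """Return the src value of every <img> tag found in html."""
    return [match.group(2) for match in IMG_SRC.finditer(html)]

sample = '''
<img src="picture.gif">
<img id="test1" class="answer1" src="text.jpg">
<img src="http://example.com/image?id=42">
<img src='yahoo.com/images/imagename.gif'>
'''
for src in extract_img_srcs(sample):
    print(src)
```

The same pattern text should carry over to other regex engines that support lazy quantifiers and backreferences, but that is untested here.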
I'm struggling here, trying to figure out how to replace all double slashes that come after a specific word.
Example:
<img alt="" src="/pt/webf//2015//47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
I want the string above to look like this:
<img alt="" src="/pt/webf/2015/47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
Notice the double slashes have been replaced with just one slash in the img tag but left unscathed in the div tag. I only want to replace the double slashes IF they come after the word: pt.
I tried something like this:
(?=pt)((.*?)\/\/)+
However, the first thing wrong with it is (?=) does not do pattern backtracking, as far as I'm aware. That is, it'll only look for the first matching pattern. The second thing wrong with it is it doesn't work as I intended it to.
https://regex101.com/r/kC4tA5/1
Or maybe I'm going about this the wrong way, since regular expression support is not expansive in VBScript/Classic ASP and I should try to break up the string and process, instead of trying to do everything in one regular expression???
Any help would be appreciated.
Thank you.
I am interpreting your issue as "Removing repeated slashes in all <img src> attributes."
As I said in the comments, working with HTML requires a parser. HTML is too complex for regular expressions; all kinds of things can go wrong.
Luckily, there is a parser available to VBScript: The htmlfile object. It creates a standard DOM from your HTML string. So the solution becomes exactly as described:
Function FixHtml(htmlString)
    Dim doc, img, slashes

    ' Collapse any run of slashes into a single slash
    Set slashes = New RegExp
    slashes.Pattern = "/+"
    slashes.Global = True

    ' Let the htmlfile object build a DOM from the HTML string
    Set doc = CreateObject("htmlfile")
    doc.Write htmlString

    For Each img In doc.getElementsByTagName("IMG")
        img.src = slashes.Replace(img.src, "/")
        ' Strip the "about:" / "about:blank" prefix that htmlfile prepends
        img.src = Replace(Replace(img.src, "about:blank", ""), "about:", "")
    Next

    FixHtml = doc.body.innerHTML
End Function
Unfortunately, htmlfile is not the most advanced HTML parser in the world, but rest assured that it will still do way better than any regex.
There are two minor issues:
I found in my tests that for some reason it insists on prepending the img.src with about: or about:blank. This should not happen, but it does. The second line of Replace() calls gets rid of the unwanted additions.
The .innerHTML will produce tag names in upper case, so <img> becomes <IMG> in the output. Also, insignificant line breaks in the HTML source might be removed. This is a minor annoyance; I recommend you don't obsess over it.(*)
But there are two big plus sides as well:
The DOM puts you in a position where you can work with the input in a structured way. You can put in any number of complex fixes now that would have been impossible to do with regex.
The return value of .innerHTML is sane HTML. It will fix any gross blunder in the input and turn it into something that is well-nested, well-escaped and otherwise well-behaved.
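For readers outside VBScript, the parse-first idea can be sketched in Python as well. This is only an illustration, assuming the standard-library html.parser module as a stand-in for htmlfile; ImgSrcCollector and fix_img_slashes are hypothetical names. Unlike the .innerHTML approach, it uses the parser only to locate img src values and then patches the original string, so the rest of the markup is not normalized.

```python
import re
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src value of every <img> tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def fix_img_slashes(html):
    """Collapse repeated slashes inside img src values only."""
    collector = ImgSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        # Leave the "//" that follows a scheme such as "http:" untouched.
        fixed = re.sub(r"(?<!:)/{2,}", "/", src)
        # Assumption: the value is double-quoted and appears verbatim in the
        # source. Any other tag with an identical src="..." is patched too.
        html = html.replace('src="%s"' % src, 'src="%s"' % fixed)
    return html

page = ('<img alt="" src="/pt/webf//2015//47384_1.JPG" height="235" width="378" />\n'
        "<div>Don't remove this // or this//</div>")
print(fix_img_slashes(page))
```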
(*) If you do find yourself obsessing over it, you can use the wisdom from this blog post to create a function that replaces all uppercase tags that come out of .innerHTML with lowercase versions of themselves. This actually is something you can use regex for ("(</?[A-Z]+)", to be exact), because we know that there will be no stray < not belonging to a tag anywhere in the string, because that's .innerHTML's guarantee. While it would be a nice exercise (and it introduces you to the little-known fact that VBScript has function pointers), I would say it's not really worth it.
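As a rough illustration of that lowercase-the-tags idea, here is a sketch in Python rather than VBScript. It reuses the pattern quoted above and passes a callback as the replacement; lowercase_tags is a hypothetical name, and the sketch relies on the same assumption that no stray < appears outside a tag.

```python
import re

# Same pattern as quoted above: an opening "<" or "</" followed by uppercase letters.
UPPER_TAG = re.compile(r"(</?[A-Z]+)")

def lowercase_tags(html):
    """Lowercase tag names only; attribute values and text are left alone."""
    return UPPER_TAG.sub(lambda m: m.group(1).lower(), html)

print(lowercase_tags('<P><IMG src="/pt/webf/2015/47384_1.JPG"></P>'))
# <p><img src="/pt/webf/2015/47384_1.JPG"></p>
```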
I need to batch change a folder full of files, changing all image links to lower case and replacing underscores with dashes. Thus, <img src="/images/Maps/South_America.png"> would become <img src="images/maps/south-america.png">
I already performed similar operations on all local links in the same files. I used this regex to change them to lower case:
(?<=(?i)href=")((?:<\?php(?:(?!\?>).)+\?>)?)((?:'[^']+')?)([^"]+)(?=")
\1\2\L\3
And I used this one to replace underscores with dashes:
(href="(?!http)[^_"]+)\_([^"]+")
$1-$2
I'm not even sure if they're the same "language"; I think one only works in Dreamweaver, the other in TextWrangler. Anyway, I haven't figured out how to modify them to match images rather than links. I should emphasize that I only want to change the image paths and names, not any classes, IDs or alt tags.
For example, <img src="Buffalo_Bill.jpg" alt="Buffalo Bill" class="People"> would become <img src="buffalo-bill.jpg" alt="Buffalo Bill" class="People">
Also, I think this covers all the bases if defining image extensions is necessary...
(?:jpe?g|gif|png|svg|swf)
The regexes I posted above are just examples. If you have a regex that's totally different, that's fine - just as long as it will work in a common text editor like Dreamweaver or TextWrangler. (I'm on a Mac.)
With an input like this:
<img id="BoringSnowDay" class="FunkySmellsFromGarden" src="/images/Maps/South_America.png" alt="Powerball Winner!" /> <img id="ExcitingSunNight" class="SmoothTasteInKitchen" src="/images/Flags/Antartica.jpg" alt="Racecar racecaR!" />
This regex in TextWrangler:
(<img [^>]+)(src="[^"]+")
Replace:
\1\L\2
Gives me something that ONLY affects the src="..." portion and nothing else.
Unfortunately, combining that with a "...and replace _ with -" step tends to get a little tricky.
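If running a small script over the folder is an option, both changes can be made in one pass by handing the matched src value to a callback. This is a sketch in Python, not a TextWrangler or Dreamweaver search; normalize_src and normalize_img_paths are hypothetical names, and it assumes double-quoted src values and that external URLs should be left untouched.

```python
import re

# Group 1: everything up to the opening quote of src; group 2: the path; group 3: the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)

def normalize_src(match):
    """Lowercase the path and turn underscores into dashes."""
    path = match.group(2)
    # Assumption: absolute and protocol-relative URLs are not ours to rename.
    if path.lower().startswith(("http:", "https:", "//")):
        return match.group(0)
    return match.group(1) + path.lower().replace("_", "-") + match.group(3)

def normalize_img_paths(html):
    return IMG_SRC.sub(normalize_src, html)

print(normalize_img_paths('<img src="Buffalo_Bill.jpg" alt="Buffalo Bill" class="People">'))
# <img src="buffalo-bill.jpg" alt="Buffalo Bill" class="People">
```

Because only group 2 is rewritten, the id, class and alt attributes come through unchanged.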
I've tried some solutions found on the web, but they didn't help.
Given:
<p><img alt="" src="images/img2.jpg" style="float:left; height:300px; width:600px" /></p><p>bla-bla-bla</p>
I need to get:
images/img2.jpg.
Using now: preg_match('$<img.*src="(.*)"$', $text, $matches); and it does not give a result.
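Use the regex: <img.*?src="(.*?)".*?/> (the quantifiers need to be lazy; a greedy .* inside the quotes runs on to the last quote in the tag and captures the style attribute as well).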
This will match your image tags and the first capture group will give you your path. Your specific language may require some massaging of the regex.
In general, parsing tags with regex is not a good idea, however (if your tag spans lines it won't hit it, for instance).
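A minimal Python sketch of the capture-group idea, run against the input from the question. It assumes lazy quantifiers so the capture stops at the first closing quote, and re.DOTALL so that a tag spanning several lines is still matched; first_img_src is a hypothetical helper name.

```python
import re

# Lazy quantifiers: stop at the first 'src="' and at the first closing quote.
# re.DOTALL lets "." cross line breaks inside the tag.
IMG_SRC = re.compile(r'<img.*?src="(.*?)"', re.DOTALL)

def first_img_src(text):
    """Return the src of the first <img> tag, or None if there is none."""
    match = IMG_SRC.search(text)
    return match.group(1) if match else None

text = ('<p><img alt="" src="images/img2.jpg" '
        'style="float:left; height:300px; width:600px" /></p><p>bla-bla-bla</p>')
print(first_img_src(text))  # images/img2.jpg
```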
I'm rubbish with regex, so if someone could help I'd be very appreciative.
It's going to be a bit of a tough one, I imagine, so my hat's off to anyone who can solve it!
So say we have a file that contains 2 HTML tags in the following formats:
abc1234
Some Text <P>
Some Text
abc1234
I'm trying to remove everything in those tags except the URL (and leave the other text), so the output of the regex on this document would be
abc1234
http://google.com <P>
http://www.google.com
abc1234
Can any guru figure this one out? I'd prefer one regex to handle both cases, but two separate ones would be fine too.
Thanks in advance.
ScottStevens, it is well known that trying to parse html with regex is difficult, in fact, there is quite a verbose post on this issue. However, if those are the only two formats the <a> ever takes, here is the approach to the problem:
Your first clue on how to approach this problem is that both tags start with <a href=", and you want to take that out, and for that, a simple remove on '<a href="' will do, no regex required.
Your next clue is that your end tag sometimes has ">...</a> and sometimes has " rel=...</a> (what goes between rel= and </a> doesn't matter from a regex point of view). Now notice that " rel="...</a> contains within it somewhere a ">...</a>. This means you can remove " rel="...</a> in two steps: remove " rel="... up to the ">, and then remove ">...</a>. Additionally, to make sure you remove only within one <a...>...</a> tag, add the additional constraint that in the ... of ">...</a>, there cannot be any <a.
That and a regex cheat sheet can help you get started.
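The removal steps described above can be sketched as follows, in Python for illustration. strip_anchor_tags is a hypothetical name and the rel value in the sample is made up, since the question's original markup is not shown. The steps run in a slightly different order from the description: the opening <a href=" is removed last, because the "no <a inside" constraint can only work while the opening tags are still present.

```python
import re

def strip_anchor_tags(html):
    """Reduce <a href="URL" ...>text</a> to just URL, per the steps above."""
    # Drop ' rel=...' and anything after it, up to but not including '">'.
    html = re.sub(r'" rel="[^>]*(?=">)', '', html)
    # Drop '">link text</a>', refusing to run across another '<a'.
    html = re.sub(r'">(?:(?!<a).)*?</a>', '', html)
    # Drop the opening '<a href="' (plain string removal, no regex required).
    return html.replace('<a href="', '')

doc = ('abc1234\n'
       '<a href="http://google.com">Some Text</a> <P>\n'
       '<a href="http://www.google.com" rel="nofollow">Some Text</a>\n'
       'abc1234')
print(strip_anchor_tags(doc))
```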
That said, you should really use an html parser. Robust and Mature HTML Parser for PHP
I'm a Rubyist, so my example is going to be in Ruby. I'd recommend using two regexes, just to keep things straight:
url_reg = /<a href="(.*?)"/ # Matches first string within <a href=""> tag
tag_reg = /(<a href=.*?a>)/ # Matches entire <a href>...</a> tag
You'll want to pull the URL with the first regex out and store it temporarily, then replace the entire contents of the tag (matched with the tag_reg) with the stored URL.
You might be able to combine it, but it doesn't seem like a good idea. You're fundamentally altering (by deleting) the original tag, and replacing it with something inside itself. Less chance of things going wrong if you separate those two steps as much as possible.
Example in Ruby
def replace_tag(input)
  url_reg = /<a href="(.*?)"/   # Match URLs within an <a href> tag
  tag_reg = /(<a href=.*?a>)/   # Match an entire <a href></a> tag

  while (input =~ tag_reg)                # While the input has matching <a href> tags
    url = input.scan(url_reg).flatten[0]  # Retrieve the first URL match
    input = input.sub(tag_reg, url)       # Replace first tag contents with URL
  end

  return input
end

File.open("test.html", "r") do |html_input|        # Open original HTML file
  File.open("output.html", "w") do |html_output|   # Open an output file
    while line = html_input.gets                   # Read each line
      output = replace_tag(line)                   # Perform necessary substitutions
      html_output.puts(output)                     # Write output lines to file
    end
  end
end
Even if you don't use Ruby, I hope the example makes sense. I tested this on your given input file, and it produces the expected output.
I need to process HTML content and replace the IMG SRC value with the actual data. For this I have chosen regular expressions.
In my first attempt I need to find the IMG tags. For this I am using the following expression:
<img.*src.*=\s*".*"
Then within the IMG tag I am looking for SRC="..." and replace it with the new SRC value. I am using the following expression to get the SRC:
src\s*=\s*".*"\s*
The second expression has issues:
For the following text it works:
<img alt="3D""" hspace=
"3D0" src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline" border="3d0" />
But for the following it does not:
<img alt="3D""" hspace="3D0" src=
"3D"cid:UHYNUEWHVTSH.lilies.jpg"" align="3dbaseline"
border="3d0" />
What happens is the expression returns
src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline"
It does not return only the src part as expected.
I am using C++ Boost regex library.
Please help me to figure out the problem.
Thanks,
Hilmi.
The problem is that .* is a "greedy" match - it will grab as much text as it possibly can while still allowing the regex to match. What you probably want is something like this:
src\s*=\s*"[^"]*"\s*
which will only match non-doublequote characters inside the src string, and thus not go past the ending doublequote.
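The difference is easy to see side by side. The sketch below uses Python's re module rather than Boost, and a simplified, well-formed tag rather than the encoded sample from the question; both choices are assumptions made for illustration.

```python
import re

# Simplified, hypothetical tag with ordinary double-quoted attributes.
tag = '<img alt="" hspace="0" src="cid:hills.jpg" align="baseline" border="0" />'

# Greedy: ".*" runs to the last double quote it can find.
greedy = re.search(r'src\s*=\s*".*"\s*', tag)
# Bounded: [^"]* cannot step over the closing quote of the src value.
bounded = re.search(r'src\s*=\s*"[^"]*"\s*', tag)

print(greedy.group(0).strip())   # src="cid:hills.jpg" align="baseline" border="0"
print(bounded.group(0).strip())  # src="cid:hills.jpg"
```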
Your first regex doesn't work on your sample text for me. I usually use this instead, when looking for specific HTML tags:
<img[^>]*>
Also, try this for your second expression:
src\s*=\s*"[^"]*"\s*
Does that help?