Regex to Match HTML Style Properties

In need of a regex master here!
<img src="\img.gif" style="float:left; border:0" />
<img src="\img.gif" style="border:0; float:right" />
Given the above HTML, I need a regex pattern that will match "float:right" or "float:left" but only on an img tag.
Thanks in advance!

/<img\s[^>]*style\s*=\s*"[^"]*\bfloat\s*:\s*(left|right)[^"]*"/i
Have to advise you, though: in my experience, no matter what regex you write, someone will be able to come up with valid HTML that breaks it. If you really want to do this in a general, reliable way, you need to parse the HTML, not throw regexes at it.
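As a rough illustration of the "parse it instead" advice, here is a minimal sketch using Python's standard-library html.parser. The class name ImgFloatFinder is made up for this example, and splitting the style value on semicolons is a simplifying assumption (it would misread a semicolon inside a quoted string or url()).

```python
from html.parser import HTMLParser


class ImgFloatFinder(HTMLParser):
    """Collect the float value of every <img> whose style sets one."""

    def __init__(self):
        super().__init__()
        self.floats = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag != "img":
            return
        style = dict(attrs).get("style") or ""
        # Naive split into "property: value" declarations (assumption).
        for declaration in style.split(";"):
            name, _, value = declaration.partition(":")
            if name.strip().lower() == "float":
                self.floats.append(value.strip().lower())


finder = ImgFloatFinder()
finder.feed(
    '<img src="\\img.gif" style="float:left; border:0" />'
    '<img src="\\img.gif" style="border:0; float:right" />'
    '<div style="float:left">not an image</div>'
)
print(finder.floats)
```

The div is skipped because only img start tags are inspected, which is the "only on an img tag" requirement from the question.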

You really shouldn't use regex to parse HTML or XML; it's impossible to design a foolproof regex that handles all corner cases. Instead, I would suggest finding an HTML-parsing library for your language of choice.
That said, here's a possible solution using regex.
<img\s[^>]*?style\s*=\s*".*?(?<="|;)\s*(float:.*?)(?=;|").*?"
The float declaration (for example float:left) is captured in the only capturing group, which is number 1.
The regex matches the start of an img tag, followed by any number of characters that aren't a closing angle bracket, followed by the style attribute. The float: can appear anywhere within the attribute's value, but it only matches the actual float style: it must be preceded by the start of the attribute or a semicolon (plus optional whitespace) and followed by a semicolon or the end of the attribute. Note that the lookbehind (?<=...) is not supported in every regex flavor.
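A quick way to check this kind of pattern is to run it against both sample tags. The sketch below assumes the lookbehind is spelled (?<="|;) and that optional whitespace is allowed after the semicolon; the function name float_declaration is just for illustration.

```python
import re

# Lookbehind: float must follow the opening quote or a semicolon.
# Lookahead: the declaration ends at a semicolon or the closing quote.
STYLE_FLOAT = re.compile(
    r'<img\s[^>]*?style\s*=\s*".*?(?<="|;)\s*(float:.*?)(?=;|").*?"',
    re.IGNORECASE,
)


def float_declaration(tag):
    """Return the captured float declaration, or None if there is none."""
    match = STYLE_FLOAT.search(tag)
    return match.group(1) if match else None


first = float_declaration('<img src="\\img.gif" style="float:left; border:0" />')
second = float_declaration('<img src="\\img.gif" style="border:0; float:right" />')
other = float_declaration('<img src="\\img.gif" style="border:0" />')
print(first, second, other)
```

Because .*? is free to run past a closing quote, a later attribute whose value starts with float: could still produce a false match, which is one of the corner cases a parser avoids.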

I agree with Sean Nyman, it's best not to use a regex (at least not for anything permanent). For something ad-hoc and a bit more durable, you might try:
/<img\s(?:\s*\w+\s*=\s*(?:'[^']*'|"[^"]*"))*?\s*\bstyle\s*=\s*(?:"[^"]*?\bfloat\s*:\s*(\w+)|'[^']*?\bfloat\s*:\s*(\w+))/i
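Here is a small sketch of how that pattern could be used, assuming the final non-capturing group is closed before the flags. The value lands in group 1 for a double-quoted style and group 2 for a single-quoted one, so the hypothetical helper img_float checks both.

```python
import re

IMG_FLOAT = re.compile(
    r"""<img\s(?:\s*\w+\s*=\s*(?:'[^']*'|"[^"]*"))*?\s*\bstyle\s*=\s*"""
    r"""(?:"[^"]*?\bfloat\s*:\s*(\w+)|'[^']*?\bfloat\s*:\s*(\w+))""",
    re.IGNORECASE,
)


def img_float(tag):
    """Return the float value from either quoting style, or None."""
    match = IMG_FLOAT.search(tag)
    if match is None:
        return None
    # Only one of the two groups takes part in any given match.
    return match.group(1) or match.group(2)


double_quoted = img_float('<img src="\\img.gif" style="float:left; border:0" />')
single_quoted = img_float("<img style='border:0; float:right'>")
no_float = img_float('<img src="\\img.gif" style="border:0" />')
print(double_quoted, single_quoted, no_float)
```

Attribute names are matched with \w+, so a tag with a hyphenated attribute (such as a data- attribute) before style would not match.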

Related

How do I select src between <> if img exists?

I need to select src=" using a regular expression in the form: //, but only if it is within an image tag.
This should return true:
<img alt="Alt text" src="/directory/Images/my-image.jpg" />
This to return false:
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
The end result will be replacing the src=", which the application I am using performs; I just need the regex for the find.
First, the standard disclaimer: if you are using regexes to parse an HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose and query it using its own querying system.
That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.
In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:
/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i ... maybe replacing the leading / and trailing /i with whatever other thing your language uses to denote a case-insensitive regular expression.
Note that this makes assumptions: that the URL is quoted with double quotes, that the tag is correctly formed, that there are no extraneous <img strings in the document, that there are no double quotes in the URL, and countless others that I didn't think of but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.
<img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
[^<>]+ - one or more characters that are not a closing angle bracket and, for safety, not an opening angle bracket either.
\bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
"([^"]+)" - some URL consisting of non-quote characters, within quotes.
Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
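The pattern broken down above can be tried on the two tags from the question. This is only a sketch: the /.../i delimiters become re.IGNORECASE, and the data-src replacement is a made-up example of the find-and-replace step, not something the question specifies.

```python
import re

IMG_SRC = re.compile(r'<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"', re.IGNORECASE)

html = (
    '<img alt="Alt text" src="/directory/Images/my-image.jpg" />\n'
    '<script type="text/javascript" async="" '
    'src="https://www.google-analytics.com/analytics.js"></script>'
)

# Find: only the src inside the img tag is reported.
sources = IMG_SRC.findall(html)
print(sources)

# Replace (hypothetical): rename src to data-src inside img tags only.
renamed = re.sub(
    r'(<img\b[^<>]+\b)src(\s*=\s*")',
    r'\1data-src\2',
    html,
    flags=re.IGNORECASE,
)
print(renamed)
```

The script tag is left alone in both steps because the match has to start at <img and cannot cross an angle bracket.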
Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from a HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"
When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D
I mean something more like: "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex." And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as @LGSon suggests in a comment.
People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.
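For comparison, here is roughly what the parser route looks like outside a browser, sketched with Python's standard-library html.parser. The class name ImgSrcCollector and the extra other.png tag are invented for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag, ignoring other tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names and strips quotes.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


collector = ImgSrcCollector()
collector.feed(
    '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
    '<script type="text/javascript" async="" '
    'src="https://www.google-analytics.com/analytics.js"></script>'
    "<IMG SRC='other.png'>"
)
print(collector.sources)
```

Upper-case tags and single-quoted values are handled without any extra work, which is the kind of edge case the regex version has to be patched for.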
There are several ways to match that. This regex is just an example and is certainly not the best expression:
(src=")(.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG)"
You can wrap your target image URLs in a capturing group (), maybe similar to this expression:
(src=")((.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG))"
and simply refer to it using $2 (group #2).
You can also simplify it as you wish by adding the ignore-case flag, as in this expression:
src="((.+)(\.[a-rt-z]+))"

Removing empty bbcode tags using regex

Using regex I'm trying to remove empty bbcode tags. By empty I mean nothing in between them:
[tag][/tag]
If there is something between them then it should be kept.
I've searched a lot and played around with a regex tester but haven't come up with anything that works right.
Edit: I realize now why I was having a hard time with this. In addition to the example above, I also have ones like:
[url=http://www.somedomain.com/][/url]
I'm trying to clean up bbcode when a form is submitted so that empty tags aren't stored, since they're unneeded.
In JavaScript, you could do:
str.replace(/\[([^\[\]]*)\]\[\/\1\]/g, '');
The operative aspect of the regex in this case is the use of internal backreferences. I'm not sure, off the top of my head, whether these are universally supported, but .NET, in any case, supports them (its regex engine is its own implementation rather than PCRE, though the syntax is broadly similar).
The pattern, then, is [, a word, ][/, the same word, ]. If we assume the word has simply the quality of "does not contain ]", then an appropriate regex to match an empty tag is \[([^\]]+)\]\[/\1\], escaped as necessary in context.
For the second case, if we assume the form [tag=arg][/tag], and that tag and arg each don't contain any ']' (not a reasonable assumption! but dealing with it is left as an exercise for the reader -- and I'm quite sure most bbcode implementations don't actually deal with that problem, either), one could use a regex \[([^\]=]+)(=[^\]]*)?\]\[/\1\].
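A minimal sketch of the second pattern in use, under the same assumption that neither the tag name nor its argument contains ']'. The loop that repeats the substitution is an addition for this example, so that an outer tag left empty by removing an inner one is cleaned up too.

```python
import re

# \1 requires the closing tag to repeat the opening tag's name.
EMPTY_TAG = re.compile(r'\[([^\]=]+)(=[^\]]*)?\]\[/\1\]')


def strip_empty_tags(text):
    """Remove empty bbcode pairs such as [b][/b] or [url=...][/url]."""
    previous = None
    # Repeat until a pass changes nothing, to catch nested empty pairs.
    while previous != text:
        previous = text
        text = EMPTY_TAG.sub('', text)
    return text


print(strip_empty_tags('x[tag][/tag]y'))
print(strip_empty_tags('[url=http://www.somedomain.com/][/url]'))
print(strip_empty_tags('[b]bold[/b]'))
```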

How can I parse <img src> with a regex?

I need a clever regex to match ... in these:
<img src="..."
<img src='...'
<img src=...
I want to match the inner content of src, but only if it is surrounded by ", ' or none. This means that <img src=..." or <img src='... must not be accepted.
Any ideas how to match these 3 cases with one regex?
So far I use something like ("|'|[\s\S])(.*?)\1, and the part I want to get rid of is the hacky [\s\S], which I use to match the "missing symbol" at the beginning and the end of the ....
Wow, second one I'm answering today.
Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML code for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries will perform parsing in addition to sanitizing, so you may not even have to run it through another parser.
Java, for example has JTidy and PHP has PHP Tidy.
UPDATE
Against my better judgement, I'm giving you this:
/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/
Which works only for your specific case. Even so, it will not take into account escaped " or ' in your image-source names, or the > character. There are probably a bunch of other limitations as well. The capturing group gives you your image names (in the case of names surrounded by single or double quotes, it gives you those as well, but you can strip those out).
Depending on what scripting or programming language you are using to solve this, it can be done with either multiple regex, or simply one regex that checks groups.
<img[^s]+src=("(.+)"|'(.+)'|(.+))[^/<]+(/>|</img>)
If all you want is the image src attribute, you don't have to parse using a parser. In fact, if you're wanting other attributes, just use a different regex. You will run into issues with multiple matches of the image tag, but in that case just match image tags, and for each one perform your desired regex.
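For the three quoting cases in the question, one possible single-pattern sketch uses one alternative per case, with a lookahead so that an unquoted value followed by a stray quote is rejected. This pattern is not taken from either answer above; it is an illustration with the same limitations (src must directly follow img, and no escaped quotes).

```python
import re

IMG_SRC = re.compile(
    r"""<img\s+src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)(?=[\s>]|$))""",
    re.IGNORECASE,
)


def img_src(tag):
    """Return the src value for "...", '...' or bare forms, else None."""
    match = IMG_SRC.search(tag)
    if match is None:
        return None
    # Exactly one of the three groups takes part in a match.
    return next(group for group in match.groups() if group is not None)


for sample in ('<img src="a.png"', "<img src='a.png'", '<img src=a.png',
               '<img src=a.png"', "<img src='a.png"):
    print(sample, '->', img_src(sample))
```

The last two samples have an unbalanced quote and produce None, which is the behavior the question asks for.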

Regex match an attribute value

What would the regular expression be to return 'details.jsp' (without quotes!) from this original tag? I can quite easily match all of value="details.jsp" but am having trouble matching just the contents of the attribute.
<s:include value="details.jsp" />
Any help greatly appreciated!
Thanks
Lawrence
/value=["']([^'"]+)/ would place "details.jsp" in the first capture group.
Edit:
In response to ircmaxell's comment, if you really need it, the following expression is more flexible:
/value=(['"])(.+?)\1/
It will match things like <s:include value="something['else']">, but just note that the value will be placed in the second capture group.
But as mentioned before, regex is not what you want to use for parsing XML (unless it's a really simple case), so don't invest too much time into complex regexes when you should be using a parser.
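A short sketch of the backreference version in use. It assumes a lazy quantifier (.+?) so the match stops at the first matching closing quote; the second sample string is the one mentioned in the answer.

```python
import re

# Group 1 is the opening quote, \1 demands the same quote to close,
# and group 2 holds the attribute value itself.
VALUE_ATTR = re.compile(r'''value=(['"])(.+?)\1''')

simple = VALUE_ATTR.search('<s:include value="details.jsp" />').group(2)
nested = VALUE_ATTR.search('''<s:include value="something['else']">''').group(2)
print(simple)
print(nested)
```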

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
This captures the piece before "archive" in both cases. Depending on your regex flavor, you may need to escape the /s for it to work. If you don't want the match to include the archive part, you could use a lookahead instead, but I don't like lookaheads; in my opinion it's easier to match a lot and capture only the parts you need. If you prefer a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses but won't capture the value, so the only value captured is the one you want. Support for (?:) can vary depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
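As a quick check, here is the last pattern run against all four URLs from the question, sketched in Python (where the / characters need no escaping). The variable names are only for illustration.

```python
import re

# (?:blogs/)? skips an optional "blogs" field without capturing it,
# so group 1 is always the field we want.
FIELD = re.compile(r'http://[^/]+/(?:blogs/)?([^/]+)/')

urls = [
    'http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/'
    'talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx',
    'http://weblogs.asp.net/scottgu/archive/2009/08/25/'
    'clean-web-config-files-vs-2010-and-net-4-0-series.aspx',
    'http://codebetter.com/blogs/jeremy.miller/',
    'http://weblogs.asp.net/scottgu/',
]

names = [FIELD.match(url).group(1) for url in urls]
print(names)
```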
Wow. That's a lot of talking about regexes. I need to shut up and post already.
Try this one:
/\/([\w\.]+)\/archive/