How do I select src between <> if img exists? - regex

I need to select src=" using a regular expression in the form: //, but only if it is within an image tag.
This should return true:
<img alt="Alt text" src="/directory/Images/my-image.jpg" />
This should return false:
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
The end result will be replacing the src=", which the application I am using performs; I just need the regex for the find.

First, the standard disclaimer: if you are using regexes to parse an HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.
That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.
In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:
/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i
... maybe replacing the leading / and trailing /i with whatever your language uses to denote a case-insensitive regular expression.
Note that this makes assumptions: that the URL is quoted with double quotes, the tag is correctly formed, there are no extraneous <img strings in the document, there are no double quotes in the URL, and countless others that I didn't think of, but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.
<img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
[^<>]+ - one or more characters, none of which is a closing angle bracket (so the match stays inside the tag) and, for safety, none of which is an opening angle bracket either.
\bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
"([^"]+)" - some URL consisting of non-quote characters, within quotes.
Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
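As a quick check of the pattern broken down above, here is a minimal Python sketch that runs it against the two sample tags from the question. The helper name find_img_src is made up for illustration, and everything said above about regexes and HTML still applies.

```python
import re

# The pattern from this answer, written as a Python raw string.
# re.IGNORECASE plays the role of the trailing /i.
IMG_SRC = re.compile(r'<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"', re.IGNORECASE)

def find_img_src(html):
    """Return the src value of the first <img> tag found, or None."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None

img = '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
script = ('<script type="text/javascript" async="" '
          'src="https://www.google-analytics.com/analytics.js"></script>')

print(find_img_src(img))     # the image path
print(find_img_src(script))  # None: src is not inside an img tag
```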
Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from an HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"
When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D
I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex. And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as @LGSon suggests in a comment."
People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.
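For comparison, the parser-based route recommended above might look like the following in Python, using only the standard library's html.parser module. This is a sketch, not the one true way: the class name ImgSrcCollector and the helper img_sources are invented for this example.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

def img_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources

page = (
    '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
    '<script type="text/javascript" async="" '
    'src="https://www.google-analytics.com/analytics.js"></script>'
)
print(img_sources(page))
```

Only the img tag's src is collected; the script tag is seen by the parser but ignored.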

There are several ways to match that. This regex is just an example and is certainly not the best expression:
(src=")(.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG)"
You can wrap your target image URLs with a capturing group (), maybe similar to this expression:
(src=")((.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG))"
and simply call it using $2 (group #2).
You can also simplify it as you wish by adding the case-insensitive (ignore-case) flag, as in this expression:
src="((.+)(\.[a-rt-z]+))"

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know it will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per @phils' question, all I care about is extracting the tags value. However, this is via a package setting that asks for a regex string, and I don't have much control over how it gets used. It seems to expect a regex to strip what I don't need from the string rather than matching what I do want, which is slightly annoying. Also, since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code; see note 1.)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
1 In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
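To see why the long alternation works, here is a small Python sketch of the same idea. Python is used only to check the logic; the pattern is a translation of the Emacs regexp (Emacs writes \( \| \) where Python writes ( | )), and the helper names are made up for this example.

```python
import re

# Python spelling of the Emacs regexp above. Each alternative accepts a line
# that diverges from the prefix "tags:" at a different character position.
NOT_TAGS_LINE = re.compile(r'^(?:$|[^t]|t[^a]|ta[^g]|tag[^s]|tags[^:]).*')

def lines_matched(text):
    """Lines the pattern matches, i.e. lines that do not start with 'tags:'."""
    return [line for line in text.splitlines() if NOT_TAGS_LINE.match(line)]

def lines_skipped(text):
    """Lines the pattern cannot match, i.e. lines that start with 'tags:'."""
    return [line for line in text.splitlines() if not NOT_TAGS_LINE.match(line)]

sample = "\n".join([
    "title: Some title",
    "date: 2022-08-15",
    "tags: <value to extract>",
    "identifier: 1234567",
])
print(lines_skipped(sample))
```

Note that "title:" is matched through the t[^a] alternative, since its second character is not an a.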
After playing around with it for a bit and taking inspiration from @phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Removing empty bbcode tags using regex

Using regex I'm trying to remove empty bbcode tags. By empty I mean nothing in between them:
[tag][/tag]
If there is something between them then it should be kept.
I've searched a lot and played around with a regex tester but haven't come up with anything that works right.
Edit: I realize now why I was having a hard time with this. In addition to the example above, I also have one's like:
[url=http://www.somedomain.com/][/url]
I'm trying to cleanup bbcode when a form is submitted so it's not stored since it's unneeded.
In Javascript, you could do :
str.replace(/\[([^\[\]]*)\]\[\/\1\]/g, '');
The operative aspect of regex in this case is the use of internal backrefs; I'm not sure, off the top of my head, whether this is universally supported, but .NET, in any case, supports them (its regex engine is its own implementation rather than PCRE).
The pattern, then, is [, a word, ][/, the same word, ]. If we assume the word has simply the quality of "does not contain ]", then an appropriate regex to match an empty tag is \[([^\]]+)\]\[/\1\], escaped as necessary in context.
For the second case, if we assume the form [tag=arg][/tag], and that tag and arg each don't contain any ']' (not a reasonable assumption! but dealing with it is left as an exercise for the reader -- and I'm quite sure most bbcode implementations don't actually deal with that problem, either), one could use a regex \[([^\]=]+)(=[^\]]*)?\]\[/\1\].
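A minimal Python sketch of that last regex, under the same assumption that neither the tag nor its argument contains a ']'. The function name strip_empty_tags is hypothetical, and the loop that repeats the substitution (so that nested empty tags also disappear) is an addition of this sketch, not something the answer above proposes.

```python
import re

# [tag] or [tag=arg], immediately followed by the matching [/tag].
EMPTY_TAG = re.compile(r'\[([^\]=]+)(=[^\]]*)?\]\[/\1\]')

def strip_empty_tags(text):
    """Remove empty bbcode pairs, repeating until nothing more changes."""
    previous = None
    while previous != text:
        previous = text
        text = EMPTY_TAG.sub("", text)
    return text

print(strip_empty_tags("[tag][/tag]"))
print(strip_empty_tags("[url=http://www.somedomain.com/][/url]"))
print(strip_empty_tags("[b]kept[/b]"))
```

The first two calls print empty lines; the third leaves its input untouched because there is text between the tags.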

How can I parse <img src> with a regex?

I need a clever regex to match ... in these:
<img src="..."
<img src='...'
<img src=...
I want to match the inner content of src, but only if it is surrounded by ", ' or none. This means that <img src=..." or <img src='... must not be accepted.
Any ideas how to match these 3 cases with one regex.
So far I use something like this ("|'|[\s\S])(.*?)\1 and the part that I want to get loose is the hacky [\S\s] which I use to match "missing symbol" on the beginning and the end of the ....
Wow, second one I'm answering today.
Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML code for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries will perform parsing in addition to sanitizing, and so you may not even have to run it through another parser.
Java, for example has JTidy and PHP has PHP Tidy.
UPDATE
Against my better judgement, I'm giving you this:
/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/
Which works only for your specific case. Even so, it will not take into account escaped " or ' in your image-source names, or the > character. There are probably a bunch of other limitations as well. The capturing group gives you your image names (in the case of names surrounded by single or double quotes, it gives you those as well, but you can strip those out).
Depending on what scripting or programming language you are using to solve this, it can be done with either multiple regex, or simply one regex that checks groups.
<img[^s]+src=("(.+)"|'(.+)'|(.+))[^/<]+(/>|</img>)
If all you want is the image src attribute, you don't have to parse using a parser. In fact, if you're wanting other attributes, just use a different regex. You will run into issues with multiple matches of the image tag, but in that case just match image tags, and for each one perform your desired regex.
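The "one regex that checks groups" idea can be sketched in Python like this. The pattern below is an illustrative variant, not either of the regexes given in the answers: it gives each quoting style its own capturing group, and it assumes src is the first attribute and that an unquoted value ends at whitespace or '>'. The helper name src_value is invented.

```python
import re

# One alternative per quoting style. The lookahead on the unquoted branch
# rejects values such as abc" that end in a stray quote.
IMG_SRC = re.compile(
    r"""<img\s+src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)(?=[\s>]))""",
    re.IGNORECASE,
)

def src_value(tag):
    """Return the src value, whichever group captured it, or None."""
    match = IMG_SRC.search(tag)
    if match is None:
        return None
    # Exactly one of the three groups is set on a successful match.
    return next(group for group in match.groups() if group is not None)

for tag in ('<img src="a.png">', "<img src='a.png'>", "<img src=a.png>",
            '<img src=a.png">', "<img src='a.png>"):
    print(tag, "->", src_value(tag))
```

The first three tags yield a.png; the last two, with mismatched quotes, yield None.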

Regex to find bad URLs in a database field

We had an issue with the text editor on our website that was doubling up the URL. So for example, the text field may contain:
This is a description for a media item, and here in a link.
So pretty much I need a regex to detect any string that begins with http and has another http before a closing quote, as in "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"
"http[^"]+http
http://www.example.com/apage.htmlhttp://www.example.com/apage.html
This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:
"(https?://[^"]*)\1"
(This is a non-standard regex feature, but most modern implementations have it.)
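A short Python sketch of the backreference approach, assuming the doubled URL sits inside double quotes as in the example. The function name collapse_doubled_urls is made up; the replacement simply keeps one copy of the captured URL.

```python
import re

# A quoted URL followed immediately by an exact copy of itself.
DOUBLED_URL = re.compile(r'"(https?://[^"]*)\1"')

def collapse_doubled_urls(html):
    """Replace "XX" with "X" wherever X is an http(s) URL."""
    return DOUBLED_URL.sub(r'"\1"', html)

bad = '<a href="http://www.example.com/apage.htmlhttp://www.example.com/apage.html">'
ok = '<a href="http://a.example/?next=http://b.example/">'

print(collapse_doubled_urls(bad))
print(collapse_doubled_urls(ok))  # unchanged: the two halves are not identical
```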
Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.
If you can use the .*? syntax, you can just look for the following:
http(.*?)http
and if it's present, reject the URL.
The string that begins with http and has another http before a quote is:
^http[^"]*http
But, although this answers exactly your question I suspect you may want Uh Clem's answer instead ;-)
You will probably want something like this:
("http[^"]+)(http)
Then compare the two and if \1 === " + \2 then replace them.
One thought: do you have any query strings in any of your URLs? If you do, are any of them like this "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?
If so, you will want something far more complicated.

Regex to Match HTML Style Properties

In need of a regex master here!
<img src="\img.gif" style="float:left; border:0" />
<img src="\img.gif" style="border:0; float:right" />
Given the above HTML, I need a regex pattern that will match "float:right" or "float:left" but only on an img tag.
Thanks in advance!
/<img\s[^>]*style\s*=\s*"[^"]*\bfloat\s*:\s*(left|right)[^"]*"/i
Have to advise you, though: in my experience, no matter what regex you write, someone will be able to come up with valid HTML that breaks it. If you really want to do this in a general, reliable way, you need to parse the HTML, not throw regexes at it.
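For a quick test of the pattern above against the two sample tags, here is a hedged Python sketch. The helper name img_float is invented, and the caveat about HTML that breaks any regex still stands.

```python
import re

# The pattern from this answer; re.IGNORECASE stands in for the trailing /i.
IMG_FLOAT = re.compile(
    r'<img\s[^>]*style\s*=\s*"[^"]*\bfloat\s*:\s*(left|right)[^"]*"',
    re.IGNORECASE,
)

def img_float(html):
    """Return 'left' or 'right' for the first floated <img>, else None."""
    match = IMG_FLOAT.search(html)
    return match.group(1) if match else None

# Raw strings keep the backslash in the sample src values literal.
print(img_float(r'<img src="\img.gif" style="float:left; border:0" />'))
print(img_float(r'<img src="\img.gif" style="border:0; float:right" />'))
print(img_float('<div style="float:left"></div>'))  # None: not an img tag
```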
You really shouldn't use regex to parse html or xml, it's impossible to design a foolproof regex that will handle all corner cases. Instead, I would suggest finding an html-parsing library for your language of choice.
That said, here's a possible solution using regex.
<img\s[^>]*?style\s*=\s*".*?(?<="|;)(float:.*?)(?=;|").*?"
The "float:" will be captured in the only capturing group there, which should be number 1.
The regex basically matches the start of an img tag, followed by any type of character that isn't a close bracket any number of times, followed by the style attribute. Within the style attribute's value, the float: can be anywhere within the attribute, but it should only match the actual float style (i.e. it's preceded by the start of the attribute or a semicolon and followed by a semicolon or the end of the attribute).
I agree with Sean Nyman, it's best not to use a regex (at least not for anything permanent). For something ad-hoc and a bit more durable, you might try:
/<img\s(?:\s*\w+\s*=\s*(?:'[^']*'|"[^"]*"))*?\s*\bstyle\s*=\s*(?:"[^"]*?\bfloat\s*:\s*(\w+)|'[^']*?float\s*:\s*(\w+))/i