Regex to find bad URLs in a database field - regex

We had an issue with the text editor on our website that was doubling up the URL. So, for example, the text field may contain:
This is a description for a media item, and here in a link.
So pretty much I need a regex to detect any string that begins with http and has another http before a closing quote, as in "http://www.example.com/apage.htmlhttp://www.example.com/apage.html"

"http[^"]+http

http://www.example.com/apage.htmlhttp://www.example.com/apage.html
This is actually a valid URL! So you'd want to be a bit careful not to munge any other URLs that happen to have ‘http://’ in the middle of them. To detect only a ‘doubled’ URL you could use backreferences:
"(https?://[^"]*)\1"
(This is a non-standard regex feature, but most modern implementations have it.)
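As a minimal sketch of how the backreference pattern could be used for the actual cleanup, here is a Python version. Python's re module is an assumption (the question does not name a language), and fix_doubled_urls is a hypothetical helper name.

```python
import re

# A quoted URL whose first half is repeated verbatim as its second half.
DOUBLED_URL = re.compile(r'"(https?://[^"]*)\1"')

def fix_doubled_urls(text: str) -> str:
    """Collapse "URLURL" into "URL"; leave every other quoted URL alone."""
    return DOUBLED_URL.sub(r'"\1"', text)

broken = '<a href="http://www.example.com/apage.htmlhttp://www.example.com/apage.html">a link</a>'
print(fix_doubled_urls(broken))
```

A URL that merely contains a second http:// (for example in a query string) is not touched, because its two halves are not identical.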
Using regex to process HTML is a bad idea. HTML cannot reliably be parsed by regex.

If you can use the .*? syntax, you can just look for the following:
http(.*?)http
and if it's present, reject the URL.

The string that begins with http and has another http before a quote is:
^http[^"]*http
But, although this answers exactly your question I suspect you may want Uh Clem's answer instead ;-)

You will probably want something like this:
("http[^"]+)(http)
Then compare the two and if \1 === " + \2 then replace them.
One thought: do you have any query strings in any of your URLs? If you do, are any of them like this: "http://someurl.com?http=somemoredatahttp://someurl.com?http=somemoredata"?
If so, you will want something far more complicated.

Related

How do I select src between <> if img exists?

I need to select src=" using a regular expression in the form: //, but only if it is within an image tag.
This should return true:
<img alt="Alt text" src="/directory/Images/my-image.jpg" />
This to return false:
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
The end result will be replacing the src=", which the application I am using performs; I only need the regex for the find.
First, the standard disclaimer: if you are using regexes to parse a HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.
That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.
In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:
/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i ... maybe replacing the leading / and trailing /i with whatever other thing your language uses to denote a case-insensitive regular expression.
Note that this makes assumptions, that the url is quoted with doublequotes, the tag is correctly formed, there are no extraneous <img strings in the document, there are no doublequotes in the URL, and countless others that I didn't think of, but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.
<img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
[^<>]+ - one or more characters, with no closing tag, and for safety, no opening tags either.
\bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
"([^"]+)" - some URL consisting of non-quote characters, within quotes.
Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
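For the quick-hack case, the expression broken down above could be tried out like this. This is only a sketch in Python (an assumption, since no language was named), and find_img_src is a hypothetical helper name. It inherits every limitation listed above.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the trailing /i.
IMG_SRC = re.compile(r'<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"', re.IGNORECASE)

def find_img_src(html: str):
    """Return the src value of the first <img> tag, or None if there is none."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None

img = '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
script = '<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>'
print(find_img_src(img))     # the image path
print(find_img_src(script))  # None, because the src is not inside an img tag
```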
Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from a HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"
When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D
I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex." And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as #LGSon suggests in a comment.
People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.
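If the application happened to be written in Python (an assumption; the question does not say), the parser route might look roughly like the sketch below, using the standard library's html.parser. ImgSrcCollector and img_sources are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; self-closing tags
        # such as <img ... /> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

def img_sources(html: str):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources

page = (
    '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
    '<script type="text/javascript" async="" '
    'src="https://www.google-analytics.com/analytics.js"></script>'
)
print(img_sources(page))
```

Only the image source is collected; the script's src is ignored because the check is on the tag name, not on the surrounding text.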
There are several ways to match that. This RegEx is just an example and is certainly not the best expression (note that the dots before the extensions need escaping):
(src=")(.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG)"
You can wrap your target image URLs with a capturing group (), maybe similar to this expression:
(src=")((.+)(\.jpg|\.JPG|\.PNG|\.png|\.JPEG))"
and simply call it using $2 (group #2).
You can also simplify it as you wish by adding the ignore-case flag, such as in this expression:
src="((.+)(\.[a-rt-z]+))"

How to exclude the last part of a variable string using regex

I am currently making a bunch of landing pages that use similar URL structure, but each URL varies in number of words.
So it's something like:
http://landingpage.xyz/page-number-five
http://landingpage.xyz/page-number-fifty-four
http://landingpage.xyz/page-for-a-different-topic
and for the sent page I just postfix -sent like this. The reason I am not adding it as /sent is because the platform I am using handles URLs this way.
http://landingpage.xyz/page-number-five-sent
http://landingpage.xyz/page-number-fifty-four-sent
http://landingpage.xyz/page-for-a-different-topic-sent
Now I found it easy to make a regular expression that identifies all the sent pages which is let's say:
\/([a-z0-9\-]*)-sent
The thing is that I am not sure how to identify the ones that are not sent. I tried using a similar regular expression using something like this, but it's not working as expected:
\/([a-z0-9\-]*)(?!-sent)
What's the best way to design the regex for this? Or I am approaching it in the wrong way?
A lookahead is only useful where there are still characters left to match, so one placed at the very end of a regex, after everything has already been consumed, checks nothing. Since I'm not sure whether your environment supports lookbehinds, this should work as a workaround:
\/(?!.*-sent\b)([a-z0-9\-]*)
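To see the idea in action, here is a small Python sketch (Python is an assumption, as the platform is not named). It is a slight variation on the expression above: the pattern is applied to the path only and anchored with fullmatch, so the slashes in http:// cannot produce an empty match. is_unsent_page is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit

# The negative lookahead runs right after the slash, while the whole
# slug is still ahead of it, and rejects anything ending in "-sent".
NOT_SENT_PATH = re.compile(r"/(?!.*-sent$)[a-z0-9-]+")

def is_unsent_page(url: str) -> bool:
    path = urlsplit(url).path
    return NOT_SENT_PATH.fullmatch(path) is not None

for url in (
    "http://landingpage.xyz/page-number-five",
    "http://landingpage.xyz/page-number-five-sent",
):
    print(url, is_unsent_page(url))
```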

Regex to look for url start value and end value

I'm using regex to look for URLs that start with http or https and contain a specific value.
^http|https\:\/\/www
This regex looks at the http/https in a URL and this works.
/[\/]\bvalue?\b[\/]/g
This regex looks for "value" in a url and this currently matches with
http://www.test.co.uk/value/
http://www.test.co.uk/folder/value/
Is it possible to put those two regexes together? Basically I need to display URLs that don't contain http/https or /value/ in the URL path.
You're looking to do this: /(?=^(https|http))|(\bvalue\b)/g
First half: (?=^(https|http)) which will look first for https and then for http. My personal opinion however is to reduce the code to look only for http, since by matching for http you can also match for https. You may think this behavior is not going to work, but logically it does. You can try that if you like and see what happens.
Second half: (\bvalue\b). You can be more specific such as it being between forward and back slashes, or not. I used the \b delimiter to avoid it being part of another string and it worked quite well.
The important part here is to unite them, so use the | operator and it yields the above result.
Test strings:
http://www.helloworldvalue/value/values/
https://www.helloworldvalue/values/svalue/value/value/vaaluevalue/
Try it and let me know if you have any questions in the comments below.
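If keeping the two checks separate is acceptable, the filtering could also be sketched like this in Python (an assumption; the question looks like JavaScript). It follows one reading of the question, namely "keep only URLs that have neither an http/https scheme nor a /value/ path segment". The function names are hypothetical.

```python
import re

SCHEME = re.compile(r"^https?://")       # http:// or https:// at the start
VALUE_SEGMENT = re.compile(r"/value/")   # "value" as a complete path segment

def is_excluded(url: str) -> bool:
    """True if the URL has a scheme or contains a /value/ segment."""
    return bool(SCHEME.search(url)) or bool(VALUE_SEGMENT.search(url))

def urls_to_display(urls):
    return [u for u in urls if not is_excluded(u)]

print(urls_to_display([
    "http://www.test.co.uk/value/",
    "www.test.co.uk/folder/value/",
    "www.test.co.uk/values/",
]))
```

Note that the question's first pattern, ^http|https\:\/\/www, is parsed as "^http" or "https://www", because alternation binds more loosely than concatenation. Writing ^https?:// avoids that.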

regex, find last part of a url

Let's take a URL like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension" will [^/\n]+$ work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And is it safe if the URL contains any arbitrary combination of characters apart from "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
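For comparison, a Python equivalent might look like the sketch below, using the standard library's urllib.parse. Python is an assumption here, since the language was never stated, and last_path_segment is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit

def last_path_segment(url: str) -> str:
    """Return the final path component, ignoring any query string or fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)

print(last_path_segment("http://www.url.com/some_thing/abc123/file_name.randomextension?x=1"))
```

The query string is split off before the path is examined, which the bare [^/\n]+$ expression would not do.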
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match any character besides slash and newline, one or more times". People insert \r\n into negated character classes because in some programs a negated character class will match anything besides what has been inserted into it. So [^/] would in that case match newlines.
For example, if there was a line break in your text, you would not get the data after the linebreak.
This does not matter in your case, though, because a single URL contains no line breaks. Note that the s-flag (PCRE_DOTALL) only changes what . matches; it has no effect on negated character classes.
TL;DR: You can leave it or remove it, it won't matter.
Ask away if anything is unclear or I've explained it a little sloppily.
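A quick way to see what the \n changes is to run both variants on a string that does contain a line break. The sketch below uses Python's re module (an assumption; other engines may differ in detail), and the sample text is made up for illustration.

```python
import re

# Hypothetical input with a line break after the last slash.
text = "www.url.com/some_thing/first line\nsecond_line.ext"

# Without \n in the class, the match runs across the line break.
without_newline = re.search(r"[^/]+$", text).group()

# With \n in the class, only the part after the line break can match.
with_newline = re.search(r"[^/\n]+$", text).group()

print(repr(without_newline))
print(repr(with_newline))
```

For a string with no line break in it, both expressions return the same result.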

Writing Regular Expression for URL in Google Analytics

I have a huge list of URL's, in the format:
http://www.example.com/dest/uk/bath/
http://www.example.com/dest/aus/sydney/
http://www.example.com/dest/aus/
http://www.example.com/dest/uk/
http://www.example.com/dest/nor/
What RegEx could I use to get the last three URL's, but miss the first two, so that every URL without a city attached is given, but the ones with cities are denied?
Note: I am using Google Analytics, so I need to use RegEx's to monitor my URL's with their advanced feature. As of right now Google is rejecting each regular expression.
Generally, the best suggestion I can make for parsing URL's with a Regex is don't.
Your time is much, much better spent finding a library that exists for your language dedicated to the task of processing URLs.
It will have worked out all the edge cases, be fully RFC compliant, be bug free, secure, and have a great user interface so you can just suck out the bits you really want.
In your case, the suggested way to process it would be to use your URL library to extract the elements and then work explicitly on them.
That way, at most you'll have to deal with the path on its own, and not have to worry so much whether it's
http://site.com/
https://site.com/
http://site.com:80/
http://www.site.com/
Unless you really want to.
For the "Path" you might even wish to use a splitter ( or a dedicated path parser ) to tokenise the path into elements first just to be sure.
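Outside Google Analytics (which only accepts a regular expression), that library-plus-splitter approach could be sketched as follows in Python. The language is an assumption, and is_country_page is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def is_country_page(url: str) -> bool:
    """True for /dest/<country>/ URLs, False once a city segment is attached."""
    # Tokenise the path and drop the empty pieces produced by
    # leading and trailing slashes.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) == 2 and segments[0] == "dest"

urls = [
    "http://www.example.com/dest/uk/bath/",
    "http://www.example.com/dest/aus/sydney/",
    "http://www.example.com/dest/aus/",
    "http://www.example.com/dest/uk/",
    "http://www.example.com/dest/nor/",
]
print([u for u in urls if is_country_page(u)])
```

Because only the path is inspected, the scheme, port and host variations listed above make no difference to the result.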
tj111's current solution doesn't work - it matches all your urls.
Here's one that works (and I checked with your values). It also matches, no matter if there is a trailing slash or not:
http:\/\/.*dest\/\w+/?$
/http:\/\/www\.site\.com\/dest\/\w+\/?$/i
matches if they're all the same site with the "dest" there. you could also do this:
/\w+:\/\/[^/]+\/dest\/\w+\/?$/i
which will match any site with any protocol (http, ftp) and any site with the /dest/country at the end, and an optional /
Note, that this will only work with a subset of what the urls could legitimately be.
Try this regular expression:
^http://www\.example\.com/dest/[^/]+/$
This would only match the last three URLs.
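Before pasting an expression into Google Analytics, it can be checked against the sample list locally. A minimal sketch in Python follows (an assumption; Google Analytics uses its own regex engine, which may not behave identically in every detail).

```python
import re

COUNTRY_ONLY = re.compile(r"^http://www\.example\.com/dest/[^/]+/$")

urls = [
    "http://www.example.com/dest/uk/bath/",
    "http://www.example.com/dest/aus/sydney/",
    "http://www.example.com/dest/aus/",
    "http://www.example.com/dest/uk/",
    "http://www.example.com/dest/nor/",
]

# Keep only the URLs the expression matches.
matching = [u for u in urls if COUNTRY_ONLY.search(u)]
print(matching)
```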