Regex with URLs - syntax - regex

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?

Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether it matches your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only word characters (letters, digits, underscore) and hyphens are legal for name and value.
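A minimal sketch of the difference between the unanchored rule from the question and the anchored suggestions, using Python's re module. The sample paths beyond the ones quoted in the question and the helper name matches are made up for illustration, and the tracking system's regex flavor may behave differently.

```python
import re

# Rule from the question (unanchored) and the two anchored suggestions.
LOOSE = re.compile(r".*\/products\/18\/indoor-posters.*")
STRICT = re.compile(r"^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$")
PRODUCTS = re.compile(r"^.*\/products$")


def matches(pattern, url):
    """Return True if the pattern is found anywhere in url."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    samples = [
        "/products/18/indoor-posters",
        "/products/18/indoor-posters?someParam=someValue",
        "/products/18/indoor-postersssssss",  # hypothetical near miss
        "/products",
        "/products/18",                        # hypothetical subpage
    ]
    for s in samples:
        print(s, matches(LOOSE, s), matches(STRICT, s), matches(PRODUCTS, s))
```

The loose rule accepts the near miss with trailing characters, while the anchored version only accepts the exact path, optionally followed by a single name=value pair.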

Related

301 redirection, matching urls through regex. Matching dashes

I'm trying to match urls for a migration, however I can't seem to have a regex which matches it.
I've tried different expressions and using regex checkers to determine where exactly it's broken, but it's not clear to me
This is my regex
https:\/\/blog\.xyz\.ca\/EN\/post\/201[0-9]\/[0-9][0-9]\/[0-9][0-9]\/*\).aspx
I'm trying to match these kinds of urls (hundreds)
https://blog.xyz.ca/EN/post/2019/05/14/how-test-higher-education-test-can-test-more-test-students-and-test-sdf-the-test.aspx
https://blog.xyz.ca/EN/post/2019/05/14/how-test.aspx
https://blog.xyz.ca/EN/post/2019/05/14/how-test-higher-the-test.aspx
And remap them to something like this
https://blog.xyz.ca/2017/12/21/test-how-the-testaspx
I thought that I could match the dash section using the wildcard, but it seems to not be working and none of the generators are giving me a clear warning. I've tried https://regexr.com/ and https://www.regextester.com/
If I understand the problem correctly, a simple expression that captures the URL components needed for the redirect rules should be enough. We can likely start with:
(.+\.ca)\/EN\/post(\/[0-9]{4}\/[0-9]{2}\/[0-9]{2})(\/.+)\.aspx
If necessary, we can tighten or loosen the constraints from there; I'm guessing that no validation is required.
or:
(.+\.ca)\/EN\/post(\/[0-9]{4}\/[0-9]{2}\/[0-9]{2})(\/.+)(\.aspx)
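As a rough illustration of how the captured groups could drive the remapping, here is a sketch in Python. In the original attempt, \/* matches zero or more slashes rather than "any text", and \) matches a literal parenthesis, which is likely why it never matched. The target layout (dropping /EN/post and the .aspx extension) is an assumption based on the example in the question, and remap is a hypothetical helper name; a real 301 rule would live in the server's rewrite configuration.

```python
import re

# First suggested pattern: group 1 = scheme and host, group 2 = /yyyy/mm/dd,
# group 3 = /slug (the .aspx extension is matched but not captured).
PATTERN = re.compile(
    r"(.+\.ca)\/EN\/post(\/[0-9]{4}\/[0-9]{2}\/[0-9]{2})(\/.+)\.aspx"
)


def remap(url):
    """Rewrite an old blog URL; URLs that do not match are returned unchanged."""
    return PATTERN.sub(r"\1\2\3", url)


if __name__ == "__main__":
    print(remap("https://blog.xyz.ca/EN/post/2019/05/14/how-test.aspx"))
```

To keep the extension instead, the second pattern with its fourth group could be used with a replacement of \1\2\3\4.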

RegEx for URL ending with a query string

I've been using Ensighten for tag management. The way it manages conditions (which pages the tracking scripts are deployed onto) is by using RegEx for the protocol, host and path separately.
Right now, my current condition looks a bit like this:
Protocol: ^(https?)$
Host: ^((www|www-qa)\.example\.com)$
Path: ^(/section-one/page/?|/section-two/page/?|/section-three/page/?)$
This works fine. However, I've been asked to add a URL ending with a query string, and that's where I'm having an issue.
Essentially, I need to also target a URL with the following format:
http://www.example.com/section-one/page?&var123=456
How do I edit my RegEx for the URL path to include this path?
/section-one/page?&var[any numbers, letters, symbols]
Note that for this /section-one/, I only want to target /page or /page + a query string, no subpages. I don't want to target a specific query string. I also want the other pages already in my RegEx to remain included.
How do I write this expression? I have to stick to the "must match this RegEx" single-expression format.
Thanks!
Improvement of what you've got:
^/section-(one|two|three)/page/?$
To also allow a query string, escape the question mark with a backslash so that it loses its special meaning (this generally works, though it depends on the flavor):
^/section-(one|two|three)/page/?($|\?)
This assumes that the $ above is more than just a formality, i.e. that it matches the end of the URL.
If you need capturing parentheses to actually store the query string, add .* after the question mark; it is greedy, so the engine gives you the longest possible match:
^/section-(one|two|three)/page/?($|\?.*)
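A quick way to check the final expression is to run it against a handful of paths. This sketch uses Python's re module and assumes that the tag manager passes the query string as part of the value tested against the path rule, which the question does not confirm; path_matches and the extra sample paths are hypothetical.

```python
import re

# Final expression from the answer: optional trailing slash, then either
# end of input or a question mark followed by anything.
PATH = re.compile(r"^/section-(one|two|three)/page/?($|\?.*)")


def path_matches(path):
    """Return True if the path (with optional query string) is targeted."""
    return PATH.search(path) is not None


if __name__ == "__main__":
    for p in [
        "/section-one/page",
        "/section-one/page/",
        "/section-one/page?&var123=456",
        "/section-one/page/subpage",  # should not be targeted
        "/section-four/page",         # not in the alternation
    ]:
        print(p, path_matches(p))
```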

regex, find last part of a url

Let's take an url like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension", will [^/\n]+$ work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And is it safe if the url contains any random combination of characters apart from "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match any character besides slash and newline, one or more times". People insert \r\n into negated char classes because a negated character class matches anything besides what has been inserted into it. So [^/] would in that case match newlines.
For example, if there was a line break in your text, you would not get the data after the linebreak.
This does not matter in your case, though: a single URL contains no line breaks. Note that the s-flag (PCRE_DOTALL) only changes what . matches; it has no effect on negated character classes.
TL;DR: You can leave it or remove it, it won't matter.
Ask away if anything is unclear or I've explained it a little sloppily.
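For comparison, here is a sketch of both approaches in Python: the regex from the question and a URL parser (urllib.parse from the standard library, standing in for the PHP and Perl modules mentioned above). The function names and the sample URL are made up for illustration.

```python
import posixpath
import re
from urllib.parse import urlsplit


def last_segment(url):
    """Last path component, using a URL parser so the query string is ignored."""
    return posixpath.basename(urlsplit(url).path)


def last_segment_regex(text):
    """Same idea with the regex from the question; returns None if nothing matches."""
    m = re.search(r"[^/\n]+$", text)
    return m.group(0) if m else None


if __name__ == "__main__":
    url = "http://www.url.com/some_thing/abc/file.ext?x=1/2"  # hypothetical
    print(last_segment(url))
    print(last_segment_regex(url))
```

The two only disagree when a slash appears after the path, for example inside a query string, which is one reason to prefer the parser.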

Fixing incorrect server scheme in URL input

I'm accepting web address input from a form, however some URLs are formatted incorrectly like this (don't ask):
http:/www.foo.com
ftp:/ftp.bar.com
scp:/meh.foobar.com
What I want to do is detect if only one forward slash is present and add a second one.
I've never been any good at regular expressions, so I tried it with a combination of substr and parse_url to pick out the scheme, then strip it. But it was a bit of a mess: it mangled valid URLs, adding a triple /// after the scheme and in some cases taking letters out of the hostname.
Help please :)
Use this regular expression: '.*:/[a-zA-Z].*'
It matches:
http:/www.foo.com
ftp:/ftp.bar.com
scp:/meh.foobar.com
and does not match:
http://www.foo.com
ftp://ftp.bar.com
scp://meh.foobar.com
Then fix the input with your language's string replace, applying it only to URLs that matched the pattern above (I don't know what language you are using):
url = url.replace(':/', '://')
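The detect-and-replace steps can also be folded into a single substitution. The sketch below is a variation on the answer rather than its exact method: it anchors on the scheme at the start of the string and uses a negative lookahead so that an existing double slash is left alone. It is written in Python, and fix_scheme is a hypothetical helper name.

```python
import re

# A scheme (letter, then letters/digits/+/./-), a colon, and exactly one slash.
# The lookahead (?!/) refuses to match when a second slash is already there.
SINGLE_SLASH = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*):/(?!/)")


def fix_scheme(url):
    """Insert the missing slash after the scheme; valid URLs pass through unchanged."""
    return SINGLE_SLASH.sub(r"\1://", url)


if __name__ == "__main__":
    for u in ["http:/www.foo.com", "ftp:/ftp.bar.com", "http://www.foo.com"]:
        print(u, "->", fix_scheme(u))
```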

Regex for checking a body of text for a URL?

I have a regex pattern for URLs that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
This makes sense, because my pattern doesn't strictly check for .com, .co.uk, .com.au, etc.
I want it to match stackoverflow.com and not the latter.
As I'm no Regex expert, does anyone know of any good Regex patterns for checking for all types of URLs in a body of text, while not matching sentences like the one above?
If I have to strictly check the domain extension, I suppose I'll have to settle for that.
Here's my pattern, though I don't think it will help.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strong check on the domain extension), but here is one possible way to just modify your existing regex.
It requires that you make the assumption that links usually will not mix case in the domain extension. For example, you might see .COM or .com but probably not .Com. If you only match domain extensions that don't mix case, you avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).
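To see the effect of that change in isolation, here is a sketch in Python that keeps only the domain portion of the pattern (a simplification for illustration; the scheme, port, path and query parts of the full regex are left out). The constant names are hypothetical.

```python
import re

# Domain labels as in the original pattern, each followed by a dot.
DOMAIN = r"(?:[\d\w][-\d\w]{0,253}[\d\w]\.)+"
ORIGINAL_TLD = r"[\w]{2,4}"
SAME_CASE_TLD = r"(?:[A-Z]{2,4}|[a-z]{2,4})"  # all upper or all lower case

before = re.compile(DOMAIN + ORIGINAL_TLD)
after = re.compile(DOMAIN + SAME_CASE_TLD)

if __name__ == "__main__":
    for text in ["stackoverflow.com", "I'm a sentence.Next Sentence."]:
        print(repr(text), bool(before.search(text)), bool(after.search(text)))
```

This is only a heuristic: a sentence whose next word starts in lower case right after the period would still be matched.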