RegEx for URL ending with a query string

I've been using Ensighten for tag management. The way it manages conditions (which pages the tracking scripts are deployed onto) is by using RegEx for the protocol, host and path separately.
My current condition looks a bit like this:
Protocol: ^(https?)$
Host: ^((www|www-qa)\.example\.com)$
Path: ^(/section-one/page/?|/section-two/page/?|/section-three/page/?)$
This works fine. However, I've been asked to add a URL ending with a query string, and that's where I'm having an issue.
Essentially, I need to also target a URL with the following format:
http://www.example.com/section-one/page?&var123=456
How do I edit my RegEx for the URL path to include this path?
/section-one/page?&var[any numbers, letters, symbols]
Note that for this /section-one/, I only want to target /page or /page + a query string, no subpages. I don't want to target a specific query string. I also want the other pages already in my RegEx to remain included.
How do I write this expression? I have to stick to the "must match this RegEx" single-expression format.
Thanks!

First, a simplification of what you've got:
^/section-(one|two|three)/page/?$
To also allow a query string, match a literal question mark. A backslash generally escapes the special meaning of the question mark, though this depends on the regex flavor:
^/section-(one|two|three)/page/?($|\?)
This assumes that the $ above is more than just a formality, i.e. that it matches the end of the URL.
If you also want the match to cover the query string itself, for example to store it with the capturing parentheses, add .* after the escaped question mark. It is greedy, so the engine gives you the longest possible match, i.e. the rest of the URL:
^/section-(one|two|three)/page/?($|\?.*)
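As a rough check of the last pattern, here is a minimal sketch using Python's re module as a stand-in for the tag manager's engine. It assumes the query string is part of the text tested against the path pattern, which may not hold in Ensighten; the function name path_matches and the sample paths are hypothetical.

```python
import re

# Path pattern from the answer above: one of three sections, an optional
# trailing slash, then either the end of the string or "?" plus anything.
PATH_RE = re.compile(r"^/section-(one|two|three)/page/?($|\?.*)")

def path_matches(path: str) -> bool:
    """Return True if the path (with optional query string) is targeted."""
    return PATH_RE.search(path) is not None

# Hypothetical sample inputs, not taken from a real site.
samples = [
    "/section-one/page",
    "/section-one/page/",
    "/section-one/page?&var123=456",
    "/section-two/page",
    "/section-one/page/subpage",
    "/section-four/page",
]

for sample in samples:
    print(sample, path_matches(sample))
```

The subpage and the unknown section are rejected, while the bare page, the trailing-slash form, and the query-string form are accepted.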

Related

Using regex to pull domain from URL

I'm currently working on a regex query to pull out the domain name of a URL. The catch is, I only want to pull the domain if it has the following format:
www.miami-dade.com
If the domain is www.miamidade.com, I don't want to detect it. The part I'm struggling with is that I also don't want to detect the domain if it has more than one dash in it. For example:
www.florida-miami-dade.com shouldn't be detected.
This is my regex, anyone have any idea on what I should do?
(?<=.)[a-zA-Z0-9]+-[a-zA-Z0-9-_]+
Try this:
^\w+\.\w+-\w+\.\w+$
The regex \w means "word character", which includes letters, numbers and the underscore.
If you strictly want only letters, use the following (add the case-insensitive flag if uppercase letters should match too):
^[a-z]+\.[a-z]+-[a-z]+\.[a-z]+$
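The behaviour of the \w version can be checked with a short sketch in Python, used here only as one possible regex flavor. The helper name has_single_hyphen_domain is made up for illustration; the test hosts are the ones from the question.

```python
import re

# Pattern from the answer: label, dot, label-with-exactly-one-hyphen, dot, label.
ONE_HYPHEN_RE = re.compile(r"^\w+\.\w+-\w+\.\w+$")

def has_single_hyphen_domain(host: str) -> bool:
    """Return True for hosts shaped like www.miami-dade.com."""
    return ONE_HYPHEN_RE.search(host) is not None

for host in ["www.miami-dade.com", "www.miamidade.com", "www.florida-miami-dade.com"]:
    print(host, has_single_hyphen_domain(host))
```

Because \w cannot match a hyphen, a second hyphen in the middle label leaves nothing for the following \. to match, so the two-hyphen host is rejected.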

regex for url without protocol

I need a regex to check a url with this format: "www.stackoverflow.com".
I do not want to allow http or https: "http://www.stackoverflow.com"
I've been looking for a good 45 minutes and can't find anything. I only find regexes that allow both or that require "http".
The closest I've seen is "^([a-zA-Z0-9]+(.[a-zA-Z0-9]+)+.*)$" but this allows anything as long as it includes "."
Acceptable expression: www.example.com
Unacceptable expression: http://www.example.com, example.com etc.
Basically something that makes sure it starts with "www.". If possible I also want to make sure it ends with ".something". And all the other URL regex attributes like not allowing "!" etc.
Try this pattern:
^(?!https?).*$
Use it with the i modifier for case-insensitive matching.
Per the comment below, use this pattern, which also requires the leading www.:
^(?!https?)www\..*$
or simply (the lookahead is redundant once the pattern must start with www.):
^www\..*$
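A quick sketch of the last pattern in Python, assuming the i modifier maps to re.IGNORECASE; the function name starts_with_www is hypothetical. Note that this pattern only checks the prefix, so it does not reject characters such as "!" later in the string.

```python
import re

# Last pattern from the answer, with the "i" modifier expressed as a flag.
WWW_RE = re.compile(r"^www\..*$", re.IGNORECASE)

def starts_with_www(text: str) -> bool:
    """Return True if the text begins with "www." (any letter case)."""
    return WWW_RE.search(text) is not None

for candidate in ["www.example.com", "http://www.example.com", "example.com"]:
    print(candidate, starts_with_www(candidate))
```

If the ending and the allowed characters also need checking, the .* part would have to be replaced by something stricter; that is outside what the answer above covers.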

Finding a URL within two strings regex

I have a long HTML file that contains the names of organizations and their URLs. Each organization's "section" in the code is demarcated by the word "organization" followed by a lot of code, with the URL located inside that code, and ends with the word "organization".
For example:
organization -- a lot of code (with the URL located somewhere inside) -- organization
I have tried to use regex to search and extract the URL, but to no avail.
organization(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*organization
I suspect my problem lies somewhere in my trying to demarcate the search for URLs by just using the word "organization", but I am not sure.
Try group 1 from this:
organization.*?\b(\w+://[\w.?%&=#/$,-]+).*?organization
Your current regex only matches a URL sandwiched immediately between two instances of "organization". If any characters can appear between "organization" and your URL, you need a non-greedy match for anything (.*?) on each side. If newlines are in the mix, . will not match them by default, so use (?:.|\n)*? instead.
So your regex becomes:
organization(?:.|\n)*?(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*(?:.|\n)*?organization
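The idea of a non-greedy gap on both sides of the URL can be sketched in Python as follows. This is an illustration under assumptions: it uses the shorter capture-group pattern with .*? on each side, relies on re.DOTALL so that . also spans newlines (similar in effect to (?:.|\n)*?), and runs on made-up HTML with hypothetical example domains.

```python
import re

# Non-greedy gaps on both sides; DOTALL lets "." cross line breaks.
SECTION_RE = re.compile(
    r"organization.*?\b(\w+://[\w.?%&=#/$,-]+).*?organization",
    re.DOTALL,
)

# Hypothetical input: two sections, the first one spanning several lines.
html = (
    "organization <div>Acme</div>\n"
    '<a href="http://acme.example/about">site</a>\n'
    "organization\n"
    'organization <p>Globex</p> <a href="https://globex.example/">x</a> organization'
)

# With a single capturing group, findall returns the captured URLs.
urls = SECTION_RE.findall(html)
print(urls)
```

Each section contributes its first URL, because the leading gap stops at the first place where a protocol followed by :// can be matched.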

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether it matches your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only word characters (letters, digits, underscore) and hyphens are legal for name and value.
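Both anchored patterns can be tried out with a small Python sketch. The forward slashes are left unescaped here because Python does not use / as a pattern delimiter; the constant names are hypothetical and the inputs come from the question.

```python
import re

# Anchored versions suggested above.
PRODUCTS_RE = re.compile(r"^.*/products$")
POSTERS_RE = re.compile(r"^.*/products/18/indoor-posters(?:\?[\w-]+=[\w-]+)?$")

checks = [
    (PRODUCTS_RE, "/products"),
    (PRODUCTS_RE, "/products/18/indoor-posters"),
    (POSTERS_RE, "/products/18/indoor-posters"),
    (POSTERS_RE, "/products/18/indoor-posters?someParam=someValue"),
    (POSTERS_RE, "/products/18/indoor-postersssssss"),
]

for pattern, url in checks:
    print(pattern.pattern, url, pattern.search(url) is not None)
```

Note that the optional group allows exactly one name=value pair, so a URL with several parameters joined by & would not match the second pattern.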

Regex for excluding URL

I'm working with an email company that has a feature where they spider your site in order to provide custom content. I have the ability to have the spider ignore URLs based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work since it does not have a trailing slash but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/
Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
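The effect of the word boundary plus negative lookahead can be sketched in Python. The surrounding / delimiters required by the spider are dropped because Python does not use them, and it is an assumption that the spider's engine supports lookahead the way Python does. The function name should_ignore is hypothetical.

```python
import re

# Year/month archive URL that is not followed by a slash or a third digit.
ARCHIVE_RE = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)")

def should_ignore(url: str) -> bool:
    """Return True if the spider should skip this URL."""
    return ARCHIVE_RE.search(url) is not None

for url in [
    "http://www.website.com/2011/10",
    "http://www.website.com/2011/10/title-of-page.html",
    "http://www.website.com/2011/100",
]:
    print(url, should_ignore(url))
```

Only the bare archive URL is ignored; the article URL fails the lookahead and the three-digit variant fails the word boundary.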
It depends on the regexp engine, but you can probably use either $ (if the URL is tokenised beforehand) or a match for whitespace and delimiters after the pattern.