Validate URL query string with regex - regex

I'm trying to validate a query string with regex. Note that I'm not trying to extract the values, only to validate the syntax. I'm doing this to practice regex, so I'd appreciate help rather than "use this lib", although seeing how a library does it would help me, so show me if you've got one.
So, these are the requirements:
It must start with a question mark.
It may contain keys with or without values separated by an equals-sign, pairs separated by ampersand.
I've got pretty far, but I'm having trouble matching in regex that the equals-sign and ampersand must be in a certain order without having to repeat match groups. This is what I've got so far:
#^\?([\w\-]+((&|=)([\w\-]+)*)*)?$#
It correctly matches ?abc=123&def=345, but it also incorrectly matches for example ?abc=123=456.
I could go overkill and do something like...
/^\?([\w\-]+=?([\w\-]+)?(&[\w\-]+(=?[\w\-]*)?)*)?$/
... but I don't want to repeat the match groups which are the same anyway.
How can I tell regex that the separators between values must alternate between & and = without repeating match groups or causing catastrophic backtracking?
Thank you.
Edit:
I'd like to clarify that this is not meant for a real-world implementation; for that, the built-in library in your language, which is most likely available, should be used. I'm asking this question because I want to improve my regex skills, and parsing a query string seemed like a rewarding challenge.

This seems to be what you want:
^\?([\w-]+(=[\w-]*)?(&[\w-]+(=[\w-]*)?)*)?$
See live demo
This considers each "pair" as a key followed by an optional value (which may be blank). It has a first pair, followed by zero or more repetitions of & and another pair, and the whole expression (except for the leading ?) is optional. Doing it this way prevents matching ?&abc=def
Also note that hyphen doesn't need escaping when last in the character class, allowing a slight simplification.
You seem to want to allow hyphens anywhere in keys or values. If keys need to be hyphen free:
^\?(\w+(=[\w-]*)?(&\w+(=[\w-]*)?)*)?$
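A quick way to sanity-check the first pattern is to run it against a few inputs. The sketch below uses Python's re module; the helper name is_valid_query and the sample strings are illustrative assumptions, not part of the answer.

```python
import re

# The first pattern from the answer above (hyphens allowed in keys and values).
QUERY_RE = re.compile(r'^\?([\w-]+(=[\w-]*)?(&[\w-]+(=[\w-]*)?)*)?$')

def is_valid_query(query: str) -> bool:
    """Hypothetical helper: True if the string has valid query-string syntax."""
    return QUERY_RE.match(query) is not None

# Sample inputs chosen for illustration.
for sample in ["?abc=123&def=345", "?abc", "?abc=", "?",
               "?abc=123=456", "?&abc=def", "abc=123"]:
    print(f"{sample!r}: {is_valid_query(sample)}")
```

The first four samples are accepted; the last three are rejected because of the doubled equals sign, the leading ampersand, and the missing question mark respectively.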

You can use this regex:
^\?([^=]+=[^=]+&)+[^=]+(=[^=]+)?$
What it does is:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
( group and capture to \1 (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^=]+ any character except: '=' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
[^=]+ any character except: '=' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
)+ end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
[^=]+ any character except: '=' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \2 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
[^=]+ any character except: '=' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)? end of \2 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \2)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
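Trying this pattern in Python (an illustrative check with sample inputs chosen here) shows one consequence of the `(...)+` group: at least one complete key=value& pair is required, so a query string with a single pair is rejected.

```python
import re

# The pattern from the answer above, unchanged.
PAIRS_RE = re.compile(r'^\?([^=]+=[^=]+&)+[^=]+(=[^=]+)?$')

samples = ["?abc=123&def=345", "?abc=123", "?abc=123=456"]
results = {s: bool(PAIRS_RE.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

Only the first sample matches. If single-pair queries should be valid, the `+` on the first group would need to become `*`.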

I agree with Andy Lester, but a possible regex solution is
#^\?([\w-]+=[\w-]*(&[\w-]+=[\w-]*)*)?$#
which is very much like what you posted.
I haven't tested it and you didn't say what language you're using so it may need a little tweaking.

This might not be a job for regexes, but for existing tools in your language of choice. Regexes are not a magic wand you wave at every problem that happens to involve strings. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class
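Python is not in the list above, but the same idea applies there: the standard library's urllib.parse module handles query strings. A minimal sketch, where the URL and the helper name has_strict_syntax are assumptions made for illustration:

```python
from urllib.parse import urlsplit, parse_qsl

url = "https://example.com/search?abc=123&def=345&flag"  # hypothetical URL

# Split off the query part, then parse it into (key, value) pairs.
query = urlsplit(url).query
pairs = parse_qsl(query, keep_blank_values=True)
print(pairs)

def has_strict_syntax(query_string: str) -> bool:
    """Hypothetical helper: False if any field lacks an '=' sign."""
    try:
        parse_qsl(query_string, keep_blank_values=True, strict_parsing=True)
    except ValueError:
        return False
    return True

print(has_strict_syntax("abc=123&def=345"))
print(has_strict_syntax("abc=123&def"))
```

Note that strict_parsing is stricter than the question's rules, since it rejects keys without an equals sign.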

I made this.
function isValidURL(url) {
// based off https://mathiasbynens.be/demo/url-regex. testing https://regex101.com/r/pyrDTK/2
var pattern = /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/?)(?:(?:\?(?:(?!&|\?)(?:\S))+=(?:(?!&|\?)(?:\S))+)(?:&(?:(?!&|\?)(?:\S))+=(?:(?!&|\?)(?:\S))+)*)?$/iuS;
return pattern.test(url);
}
Base: https://mathiasbynens.be/demo/url-regex
Testing: https://regex101.com/r/pyrDTK/4/

When you need to validate a very complex url, you may use this regex
`^(https|ftp|http|ftps):\/\/([a-z\d_]+\.)?(([a-zA-Z\d_]+)(\.[a-zA-Z]{2,6}))(\/[a-zA-Z\d_\%\-=\+]+)*(\?)?([a-zA-Z\d=_\+\%\-&\{\}\:]+)?`

/^\?([\w-]+(=[\w.\-:%+]*)?(&[\w-]+(=[\w.\-:%+]*)?)*)?$/
\w = [a-zA-Z0-9_]
\? = a literal '?'
The regex above allows a-z A-Z 0-9 _ . - : % + in parameter values.
You can test this regex here

Related

Cannot join two regex into one to produce a code snippet to use in VSCode

I am struggling with regular expressions and how to use them in snippets in VSCode and I could really use some help (I am a beginner in that area).
I have two regexp:
the first one is here to remove the extension from a file path: (.*)\.[^.]+$
the second one would output all the occurrences of a backslash in the file path: /(\\)/g
I would like to group them into one regex to use in a user code snippet (html) in VSCode.
From the following input: C:\folder0\myhtml.html
I would like to get the following output from the code snippet using transformations: C:/folder0/myhtml (backslashes are replaced with forward slashes and the extension is removed).
I know how to write snippets that do it independently:
${TM_FILEPATH/(.*)\\..+$/$1/} would produce C:\folder0\myhtml
${TM_FILEPATH/(\\\\)/\\//g} would produce C:/folder0/myhtml.html
TM_FILEPATH being C:\folder0\myhtml.html in my example. But I cannot combine them.
I have first tried to combine the regex in https://regex101.com/ like this:
(\\)(.*)\.[^.]+$ but the result is not what I expect.
In your case you can make it much simpler by using other variables (see snippet variables documentation).
"${TM_DIRECTORY/(\\\\)/\\//g}/$TM_FILENAME_BASE"
TM_DIRECTORY gets the full path up to the fileName. C:\folder0 in your example.
$TM_FILENAME_BASE gets the fileName without the extension. myhtml in your example.
So all you really need to do is swap those backslashes with forward slashes with the transform: ${TM_DIRECTORY/(\\\\)/\\//g} and concatenate the parts.
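Outside VSCode, the same "directory plus base name" decomposition can be sketched in Python with pathlib. This is only an analogy to the snippet variables, not VSCode code; the variable names are chosen here for illustration.

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths on any operating system.
path = PureWindowsPath(r"C:\folder0\myhtml.html")

directory = path.parent.as_posix()  # plays the role of TM_DIRECTORY with slashes swapped
base = path.stem                    # plays the role of TM_FILENAME_BASE

result = f"{directory}/{base}"
print(result)  # C:/folder0/myhtml
```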
Use
${TM_FILEPATH/\\\\([^\\\\.]*)(?:\\..*)?/\\/$1/g}
See regex proof.
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
\\ '\'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\\.]* any character except: '\\', '.' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of grouping
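To see what this single regex does to the example path, it can be tried in Python, where re.sub replaces every match (like a transform with the g option). The snippet-level escaping is reduced to plain regex escaping here; this is a sketch for checking the pattern, not VSCode code.

```python
import re

path = r"C:\folder0\myhtml.html"

# A backslash, then a run of non-backslash/non-dot characters (captured),
# then an optional extension starting at the first dot.
pattern = re.compile(r'\\([^\\.]*)(?:\..*)?')

# Each match is replaced by a forward slash plus the captured name.
result = pattern.sub(r'/\1', path)
print(result)  # C:/folder0/myhtml

# Replacing only the first match leaves the rest of the path untouched.
first_only = pattern.sub(r'/\1', path, count=1)
print(first_only)
```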

GTM- Regular expression to match child pages but not the parent that share the same page path

Looking for a regular expression for Google Tag Manager that will use the path in a URL but not return it.
For example: (\/fruit-and-veg\/).*$ will match all of the following:
https://www.the-example.co.uk/fruit-and-veg/
https://www.the-example.co.uk/fruit-and-veg/bunch-of-grapes/
https://www.the-example.co.uk/fruit-and-veg/red-pepper/
https://www.the-example.co.uk/fruit-and-veg/green-tomato/
I do not want to match on https://www.the-example.co.uk/fruit-and-veg/, but I do want to match on the others.
Or another way in GTM to achieve this.
.* matches any text, the empty string included; hence, you allow /fruit-and-veg/ at the end of the string.
Require at least one more character:
\/fruit-and-veg\/.+
EXPLANATION
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
fruit-and-veg 'fruit-and-veg'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
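The effect of switching from .* to .+ can be checked against the URLs from the question. The sketch below uses Python's re.search as a stand-in for GTM's "matches RegEx" condition, which is an assumption about how the trigger evaluates the pattern.

```python
import re

CHILD_RE = re.compile(r'/fruit-and-veg/.+')

urls = [
    "https://www.the-example.co.uk/fruit-and-veg/",
    "https://www.the-example.co.uk/fruit-and-veg/bunch-of-grapes/",
    "https://www.the-example.co.uk/fruit-and-veg/red-pepper/",
    "https://www.the-example.co.uk/fruit-and-veg/green-tomato/",
]

# Keep only URLs with at least one character after /fruit-and-veg/.
matched = [u for u in urls if CHILD_RE.search(u)]
for u in matched:
    print(u)
```

Only the three child pages are printed; the parent page has nothing after the trailing slash for .+ to consume.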

RegEx expression to keep first appearance of word grouping

I have the following RegEx (\[[^]]+])(?=.*\1)
which identifies the first set of appearances of a duplicate word group inside a string (each word group is enclosed between [ ] brackets). However, I am trying to come up with a RegEx that identifies the last set of appearances of a duplicate word group. Reason being, I need to remove duplicate word groups while retaining the order in which each group appears in the overall string.
Using the following string as an example whereby only [John Smith] and [Jane Doe] are duplicate word groups:
[John Smith][John Smith][Mr. Smith][Jane Doe][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
After using my RegEx in a RegEx Replace formula, I get the below:
[Mr. Smith][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
However, I need my RegEx Replace formula to give me:
[John Smith][Mr. Smith][Jane Doe][Mrs. Doe][Doe][John][Smith John][John Smith Sr]
I have tried many ways to achieve the latter with no luck. Thanks in advance.
Considering infinite-width lookbehinds:
(\[[^\][]+])(?<=\1.*\1)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
[^\][]+ any character except: '\]', '[' (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
] ']'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
) end of look-behind
Many regex engines do not support variable-length lookbehinds, but most support variable-length lookaheads. When working with an engine that supports variable-length lookaheads, but not variable-length lookbehinds, one approach that often works is to reverse the string, modify the (reversed) string with a regex and then reverse the resulting string. That approach could be used here.
Suppose, for example, the string were
"[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
Reversing the string produces
"]eoD[]eoD enaJ[]htimS nhoJ[]eoD enaJ[]htimS .rM[]htimS nhoJ[]htimS nhoJ["
We now convert matches of the following expression to empty strings:
(\][^[]*\[)(?=.*\1)
which produces
"]eoD[]eoD enaJ[]htimS .rM[]htimS nhoJ["
Demo
Lastly, we reverse that string to obtain
"[John Smith][Mr. Smith][Jane Doe][Doe]"
The regular expression can be written in free-spacing mode to make it self-documenting:
( # begin capture group 1
\] # match ']'
[^[]* # match zero or more (as many as possible) chars other than '['
\[ # match '['
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more (as many as possible) chars
\1 # match the content of capture group 1
) # end the positive lookahead
At first glance this may seem a kludge, but since reversing a string is so easy in any language it does provide a useful work-around in some cases. Mind you, doing what you want to do here in code is pretty easy in most languages. In Ruby, for example, you could write (str being a variable holding the string)
str.scan(/\[[^\]]*\]/).uniq.join
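For comparison, both ideas above can be sketched in Python, whose built-in re module does not support variable-length lookbehinds. The function names are made up for this example; the second function is a Python counterpart of the Ruby one-liner and relies on dict keys preserving insertion order.

```python
import re

def dedupe_keep_first(text: str) -> str:
    """Reverse trick: drop repeated [..] groups, keeping each first occurrence."""
    reversed_text = text[::-1]
    # In the reversed string every group reads ']...[', so a lookahead can find
    # an identical group later in the reversed text, i.e. earlier in the original.
    cleaned = re.sub(r'(\][^\[]*\[)(?=.*\1)', '', reversed_text)
    return cleaned[::-1]

def dedupe_with_dict(text: str) -> str:
    """Same result without lookarounds: dict keys keep first-insertion order."""
    groups = re.findall(r'\[[^\]]*\]', text)
    return ''.join(dict.fromkeys(groups))

sample = "[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
print(dedupe_keep_first(sample))
print(dedupe_with_dict(sample))
```

Both calls print [John Smith][Mr. Smith][Jane Doe][Doe] for this sample.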

Setting up correct REGEX for conversion tracking target-URL in Google Analytics

Greetings from Norway!
I am new with setting up REGEX in Google analytics, so I would really appreciate your help! :)
I want to track submitted campaign forms from my website (like receipt URLs), but I need to set up a regular expression in order to track the responses for all campaigns on my website.
These strings need to be tracked (and similar ones in the future):
https://www.domain.no/**Campaigns**/MYcampaignname?**mode=received&formid**=thisand093that123
https://www.domain.no/**Campaigns**/Another-campaign-name?**mode=received&formid**=76280&goback=https%3a%2f%2fwww.domain.no%2fCampaign
https://www.domain.no/**Campaigns**/Name-of-This-campaign?**mode=received&formid**=76283
I have tried several different regular expressions in GA, but I cannot get them to work.
Some I have tried:
/Campaigns/.?mode=received&formid=.
/Campaigns/([A-Z]+[a-z]+[0-9]+)?mode=received&formid=[^/]
I would really appreciate your help!
Use
^/Campaigns/([\w-]+)\?mode=received&formid=([^/]+)
See proof
Explanation
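NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
/Campaigns/ '/Campaigns/'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[\w-]+ any character of: word characters (a-z, A-Z, 0-9, _), '-' (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
mode=received&formid= 'mode=received&formid='
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
) end of \2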
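The pattern can be checked against the three example URLs before it is entered in Google Analytics. The sketch below assumes the expression is applied to the page path plus query string (without the scheme and host), which is why the ^ anchor sits directly before /Campaigns/; the variable names are illustrative.

```python
import re

GOAL_RE = re.compile(r'^/Campaigns/([\w-]+)\?mode=received&formid=([^/]+)')

# Page paths taken from the question, with the host removed.
paths = [
    "/Campaigns/MYcampaignname?mode=received&formid=thisand093that123",
    "/Campaigns/Another-campaign-name?mode=received&formid=76280"
    "&goback=https%3a%2f%2fwww.domain.no%2fCampaign",
    "/Campaigns/Name-of-This-campaign?mode=received&formid=76283",
    "/Campaigns/Name-of-This-campaign",  # no form submission, should not match
]

# Collect the campaign name (group 1) for every path that matches.
campaigns = [m.group(1) for p in paths if (m := GOAL_RE.match(p))]
print(campaigns)
```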

Trying to match what is before /../ but after / with regular expressions

I am trying to match what is before /../ but after / with a regular expression, but I want it to look back and stop at the first /
I feel like I am close but it just looks at the first slash and then takes everything after it like... input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
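The difference between the original pattern and the two suggestions above can be seen side by side in Python. The PHP-style # delimiters are dropped and the variable names are chosen for this illustration.

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# Original pattern: .* is greedy and happily crosses slashes.
greedy = re.findall(r'/(.*)\.\./', path)

# Character class: the capture can no longer contain a slash.
no_slash = re.findall(r'([^/]*)/\.\./', path)

# Lookahead variant: match word characters only if /../ follows.
lookahead = re.findall(r'(\w+)(?=/\.\./)', path)

print(greedy)     # ['is/a/./path/that/../includes/face/./stuff/']
print(no_slash)   # ['that', 'stuff']
print(lookahead)  # ['that', 'stuff']
```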
I think you're essentially right; you just need to change the (.*) to not allow slashes: #/([^/]*)/\.\./# (Making the match non-greedy on its own is not enough, because the match would still start at the first slash.)
In your favourite language, do a few splits and string manipulation eg Python
>>> s="this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a=s.split("/../")[:-1] # the last item is not required.
>>> for item in a:
...     print(item.split("/")[-1])
...
that
stuff
In python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here will only capture the last occurrence of the match (stuff), but it is easily changed to return that and then stuff if used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'