Regular expression syntax to match first segment only

I have a number of URLs where I need to match the first segment, without the "/", using a regex.
This segment can be either xx or xx-xx.
I've tried lookahead and lookbehind, but sometimes the URL contains another two-letter segment (/ts/, /ca/), and I don't want those to match.
I only want the first segment to match. Any suggestions? Thanks.
https://regex101.com/r/Qy3nyI/1
(?<=\/)\w{2}(-\w{2})?(?=\/)
Test URLs:
/en/home.aspx
/en-gb/ts/tc/home.aspx
/en-gb/home.aspx
/en-de/home.aspx
/de-de/home.aspx
/en/home.aspx
/en-fb/afspfas.aspx
/en-gb/ts/ca/anotherPage.aspx

Try adding a starting ^ anchor to the initial lookbehind in your current regex pattern:
(?<=^/)\w{2}(-\w{2})?(?=/)
    ^^ change is here
This pattern says to:
(?<=^/) lookbehind and assert that what precedes is a leading /
\w{2}(-\w{2})? then match the country abbreviation text
(?=/) lookahead and assert that what follows is another /
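As a quick check of the anchored pattern, here is a minimal sketch in Python's re module. The choice of Python and the helper name first_segment are assumptions for illustration only; the pattern itself is the one from the answer.

```python
import re

# Pattern from the answer: the segment must directly follow a "/" that sits
# at the very start of the string, and must itself be followed by "/".
FIRST_SEGMENT = re.compile(r"(?<=^/)\w{2}(-\w{2})?(?=/)")


def first_segment(url_path):
    """Return the leading xx or xx-xx segment, or None if there is none."""
    match = FIRST_SEGMENT.search(url_path)
    return match.group(0) if match else None


test_urls = [
    "/en/home.aspx",
    "/en-gb/ts/tc/home.aspx",
    "/en-gb/ts/ca/anotherPage.aspx",
]
for url in test_urls:
    print(url, "->", first_segment(url))
```

Because of the ^ inside the lookbehind, later two-letter segments such as ts or ca are never candidates.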

Related

Regular expression to exactly match the last path segment of an URL without parameters, except if the path ends with a trailing slash

The goal of my regular expression adventure is to create a matcher for a mechanism that could add a trailing slash to URLs, even in the presence of parameters denoted by # or ? at the end of the URL.
For any of the following URLs, I'm looking for a match for segment as follows:
https://example.com/what-not/segment matches segment
https://example.com/what-not/segment?a=b matches segment
https://example.com/what-not/segment#a matches segment
In case there is a match for segment, I'm going to replace it with segment/.
For any of the following URLs, there should be no match:
https://example.com/what-not/segment/ no match
https://example.com/what-not/segment/?a=b no match
https://example.com/what-not/segment/#a no match
because here, there is already a trailing slash.
I've tried:
This primitive regex and its variants: .*\/([^?#\/]+). However, with this approach, I could not stop it from matching when there is already a trailing slash.
I experimented with negative lookaheads as follows: ([^\/\#\?]+)(?!(.*[\#\?].*))$. In this case, I could not get rid of any ? or # parts properly.
Thank you for your kind help!
Lookahead and lookbehind conditionals are so powerful!
(?<=\/)[\w]+(?(?=[\?\#])|$)
P.S. I used [\w]+, which is equivalent to [a-zA-Z0-9_]+. The (?(?=...)|...) conditional is flavor-specific (PCRE supports it).
Of course URLs can contain many other characters, like - or ~, but for the examples provided it works nicely.
If you want to match URLs, you might use
\b(https?://\S+/)[^\s?#/]+(?![^\s?#])
Explanation
\b A word boundary to prevent a partial word match
( Capture group 1
https?://\S+/ Match the protocol, 1+ non whitespace chars and then the last occurrence of /
) Close group 1
[^\s?#/]+ Match 1+ chars other than a whitespace char ? # /
(?![^\s?#]) Negative lookahead, assert that directly to the right is not a non whitespace char other than ? or #
See a regex demo.
In the replacement use group 1 followed by segment/
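To get the segment as the match itself instead of using a capture group (this needs an engine that supports variable-length lookbehind, such as .NET or JavaScript ES2018+; PCRE does not):
(?<=\bhttps?://\S+/)[^\s?#/]+(?![^\s?#])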
See another regex demo.
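The capture-group variant can be tried out with a short Python sketch. The function name add_trailing_slash is hypothetical, and instead of "group 1 followed by segment/" the replacement reuses the whole match and appends a slash, which gives the same result.

```python
import re

# Capture-group variant from the answer above.
LAST_SEGMENT = re.compile(r"\b(https?://\S+/)[^\s?#/]+(?![^\s?#])")


def add_trailing_slash(url):
    """Append "/" after the last path segment unless one is already there."""
    # \g<0> is the whole match: everything up to and including the segment.
    return LAST_SEGMENT.sub(r"\g<0>/", url, count=1)


for url in [
    "https://example.com/what-not/segment",
    "https://example.com/what-not/segment?a=b",
    "https://example.com/what-not/segment/#a",
]:
    print(url, "->", add_trailing_slash(url))
```

URLs that already end the path with a slash produce no match, so they are returned unchanged.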

Regular expression to search for specific Referer in HTTP Header

I need to create a regular expression to match everything except a specific URL for a given Referer. I currently have it to match but can't reverse it and create the negative for it.
What I currently have:
Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?
In the list below:
Referer:http://www.test.online/
Referer:https://www.test.online/
Referer:https://www.test.tv/
Referer:https://www.blah.com/
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
It will match:
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
However, I would like it to match everything except for those.
This is for our WAF, so unfortunately we are restricted in what we can use: the rule can only be implemented by searching the HTTP header being passed back.
Try this regex:
^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?).*$
A good way to negate your regex is to use negative lookahead.
Explanation:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Working example: https://regex101.com/r/QJfeBB/1
You could use an anchor ^ to assert the start of the string and use a negative lookahead to assert what is on the right is not what you want to match.
Note that you have to escape the dot to match it literally and you could omit the last part (\/.*)?.
If you don't use the capturing groups for later use you might also turn those into non capturing groups (?:) instead.
^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$
About the pattern
^ Start of the string
(?! Negative lookahead to assert what is on the right does not match
Referer:(https?(:\/\/))?(www\.)?test\.com Match your pattern
) Close negative lookahead
.+ Match any char except a newline 1+ times
$ Assert end of the string
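To see which headers the negated pattern lets through, here is a small sketch using Python's re module. The WAF's own regex flavor is not stated in the question, so Python is an assumption here; the capturing groups are written as non-capturing ones, as suggested above.

```python
import re

# Negated pattern with non-capturing groups.
NOT_TEST_COM = re.compile(r"^(?!Referer:(?:https?://)?(?:www\.)?test\.com).+$")

headers = [
    "Referer:http://www.test.online/",
    "Referer:https://www.test.online/",
    "Referer:https://www.test.tv/",
    "Referer:https://www.blah.com/",
    "Referer:https://www.test.com/",
    "Referer:http://www.test.com/",
    "Referer:http://test.com/",
    "Referer:https://test.com/",
]

# Keep only the headers the pattern matches, i.e. everything except test.com.
matched = [h for h in headers if NOT_TEST_COM.match(h)]
for header in matched:
    print(header)
```

Only the first four headers from the list should be printed; the four test.com variants are rejected by the lookahead.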

Negative Lookahead: trying to match one word and negate following words

I have a regex like
^.*\bfrost.*(?!flakes|snowman).*$
I am testing it against the following lines:
frosted flakes
frosty snowman
frost, jack
See this Regex.101 demo.
I only want the third expression to match, but all three are matching.
You should move the second .* into the lookahead, e.g.
^.*\bfrost(?!.*(?:flakes|snowman)).*$
Or
^.*\bfrost(?!.*flakes|.*snowman).*$
See the regex demo
In the original regex, the lookahead comes after a .*, so whenever the lookahead fails, the regex engine can backtrack and match the string another way, at a location that is not immediately followed by snowman or flakes (the end of the string always qualifies). When you put .* into the lookahead, the check runs right after frost, and the two words no longer have to appear immediately to the right of the current location: they cause a failure anywhere later in the line.
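The difference between the two placements can be checked with a short Python sketch. Python is an assumed flavor here, and the variable names are only illustrative.

```python
import re

lines = ["frosted flakes", "frosty snowman", "frost, jack"]

# Original pattern: the lookahead sits after a greedy .* and can always be
# satisfied at some later position, such as the end of the string.
original = re.compile(r"^.*\bfrost.*(?!flakes|snowman).*$")

# Fixed pattern: the .* lives inside the lookahead, so the check covers
# everything to the right of "frost".
fixed = re.compile(r"^.*\bfrost(?!.*(?:flakes|snowman)).*$")

original_hits = [line for line in lines if original.match(line)]
fixed_hits = [line for line in lines if fixed.match(line)]

print("original:", original_hits)
print("fixed:   ", fixed_hits)
```

The original pattern accepts all three lines, while the fixed one accepts only "frost, jack".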

I need to exclude word from regular expression

I have this regexp:
^[a-z0-9]+([.\-][a-z0-9]+)*$
I need to exclude just one word, "www", from the match.
I tried a negative lookahead, but without success.
Use a negative lookahead like this:
^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$
 ^^^^^^^^
This will not match a string equal to www.
See the regex demo
If you want to fail a match with strings that contain -www- or .www., use
^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$
See another regex demo. This pattern contains a (?!.*\bwww\b) lookahead that fails the whole match if www appears anywhere in the string as a whole word, that is, with no letters, digits or underscores directly around it (the \b word boundaries enforce this).
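Both variants can be compared side by side with a minimal Python sketch. The sample strings and the constant names are assumptions chosen for illustration.

```python
import re

# Rejects only the exact string "www".
NOT_EXACTLY_WWW = re.compile(r"^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$")

# Rejects any string that contains "www" as a whole word.
NO_WWW_WORD = re.compile(r"^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$")

samples = ["www", "example", "www.example", "a-www-b", "wwwx"]

for s in samples:
    print(
        s,
        "| not exactly www:", bool(NOT_EXACTLY_WWW.match(s)),
        "| no www word:", bool(NO_WWW_WORD.match(s)),
    )
```

Note that wwwx passes both patterns, because the www in it is not delimited by word boundaries.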

Regular expression for prefix exclusion

I am trying to extract gmail.com from a passage, matching only those occurrences that are not preceded by #.
Example: abc#gmail.com (don't match this); www.gmail.com (match this)
I tried the following: (?!#)gmail\.com but this did not work. This is matching both the cases highlighted in the example above. Any suggestions?
You want a negative lookbehind, if your regex flavor supports it:
(?<!#)gmail\.com
Add \b word boundaries to avoid matching foogmail.comz:
(?<!#)\bgmail\.com\b
[^#\s]*(?<!#)\bgmail\.com\b
assuming you want to find strings in a longer text body, not validate entire strings.
Explanation:
[^#\s]* # match any number of non-#, non-space characters
(?<!#) # assert that the previous character isn't an #
\b # match a word boundary (so we don't match hogmail.com)
gmail\.com # match gmail.com
\b # match a word boundary
At first glance, the (?<!#) lookbehind assertion appears unnecessary, but it isn't: without it, the gmail.com part of abc#gmail.com would match.
Use this regular expression using negative lookbehind:
/^.*?(?<!#)gmail\.com$/
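The following Python sketch contrasts the lookahead from the question with the lookbehind from the answers. Python's re is an assumed flavor, and the variable names are hypothetical.

```python
import re

# Attempt from the question: (?!#) only checks that the NEXT character is
# not "#", and that character is always the "g" of gmail, so it never fails.
lookahead_attempt = re.compile(r"(?!#)gmail\.com")

# Lookbehind version: (?<!#) checks the PREVIOUS character, and \b keeps
# gmail.com from matching inside a longer word.
lookbehind = re.compile(r"(?<!#)\bgmail\.com\b")

samples = ["abc#gmail.com", "www.gmail.com", "foogmail.comz"]

lookahead_hits = [s for s in samples if lookahead_attempt.search(s)]
lookbehind_hits = [s for s in samples if lookbehind.search(s)]

print("lookahead: ", lookahead_hits)
print("lookbehind:", lookbehind_hits)
```

The lookahead attempt finds gmail.com in every sample, while the lookbehind version finds it only in www.gmail.com.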