PCRE regex lookahead


I can't get a PCRE lookahead to work correctly, and any help would be appreciated. I find it difficult to explain, so take this pattern as an example:
/^\/page(\/?[a-z0-9\.\_\-]+)*\/?$/i
This matches:
/page
/page/
/page/abc/def/
which is good, but it also matches '/pageabcd', which should not match. So basically the "/" after /page should be optional, unless more [a-z0-9\.\-\_] characters follow, in which case it is required.
I have tried ^\/page(?=(\/)(\/?[a-z0-9\.\_\-]+)*|(\/)?)$ without success, and I know I am doing something wrong.

You can require a / before each run of letters, digits, ., _ or - by removing the ? quantifier after / inside the quantified capturing group. Also, you may use \w instead of [a-zA-Z0-9_] to shorten the pattern a bit. A non-capturing group is also more appropriate, since you do not need to capture the value of a repeated capturing group (replace ( with (?:).
Use
/^\/page(?:\/[\w.-]+)*\/?$/i
See the regex demo.
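
A quick way to check the corrected pattern against the URIs from the question is a small Python script. This is only a sketch: Python's re module stands in for PCRE, the /.../ delimiters are dropped so / needs no escaping, and re.ASCII is set so that \w means [a-zA-Z0-9_] as assumed above. The names PAGE_RE and is_page_uri are made up for the example.

```python
import re

# Same pattern as above, without the /.../i delimiters.
# re.IGNORECASE replaces the i flag; re.ASCII keeps \w limited to [a-zA-Z0-9_].
PAGE_RE = re.compile(r"^/page(?:/[\w.-]+)*/?$", re.IGNORECASE | re.ASCII)


def is_page_uri(uri):
    """Return True for /page, optionally followed by /segment parts and a trailing /."""
    return PAGE_RE.match(uri) is not None


for uri in ["/page", "/page/", "/page/abc/def/", "/pageabcd"]:
    print(uri, is_page_uri(uri))
```

Because the slash inside the group is no longer optional, /pageabcd has nothing to attach "abcd" to and is rejected.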

Related

Regular expression to exactly match the last path segment of an URL without parameters, except if the path ends with a trailing slash

The goal of my regular expression adventure is to create a matcher for a mechanism that could add a trailing slash to URLs, even in the presence of parameters denoted by # or ? at the end of the URL.
For any of the following URLs, I'm looking for a match for segment as follows:
https://example.com/what-not/segment matches segment
https://example.com/what-not/segment?a=b matches segment
https://example.com/what-not/segment#a matches segment
In case there is a match for segment, I'm going to replace it with segment/.
For any of the following URLs, there should be no match:
https://example.com/what-not/segment/ no match
https://example.com/what-not/segment/?a=b no match
https://example.com/what-not/segment/#a no match
because here, there is already a trailing slash.
I've tried:
This primitive regex and its variants: .*\/([^?#\/]+). However, with this approach, I could not make it not match when there is already a trailing slash.
I experimented with negative lookaheads as follows: ([^\/\#\?]+)(?!(.*[\#\?].*))$. In this case, I could not get rid of any ? or # parts properly.
Thank you for your kind help!

Lookahead and lookbehind conditionals are so powerful!
(?<=\/)[\w]+(?(?=[\?\#])|$)
P.S.: I used [\w]+, which means [a-zA-Z0-9_]+.
Of course URLs can contain many other characters, such as - or ~, but for the examples provided it works nicely.
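
Python's re module does not support lookaround conditionals such as (?(?=...)...), so the sketch below swaps in an ordinary lookahead, (?=[?#]|$), which accepts the same two cases: a following ? or #, or the end of the string. It is a rewrite that behaves the same on the sample URLs, not the pattern above verbatim. LAST_SEGMENT and add_trailing_slash are hypothetical names.

```python
import re

# (?<=/)      the segment must directly follow a slash
# \w+         the segment itself (letters, digits, underscore)
# (?=[?#]|$)  it must be followed by ?, # or the end of the string
LAST_SEGMENT = re.compile(r"(?<=/)\w+(?=[?#]|$)")


def add_trailing_slash(url):
    """Append / to the last path segment; URLs without a match come back unchanged."""
    return LAST_SEGMENT.sub(lambda m: m.group(0) + "/", url, count=1)


print(add_trailing_slash("https://example.com/what-not/segment?a=b"))
print(add_trailing_slash("https://example.com/what-not/segment/?a=b"))
```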

If you want to match urls, you might use
\b(https?://\S+/)[^\s?#/]+(?![^\s?#])
Explanation
\b A word boundary to prevent a partial word match
( Capture group 1
https?://\S+/ Match the protocol, 1+ non whitespace chars and then the last occurrence of /
) Close group 1
[^\s?#/]+ Match 1+ chars other than a whitespace char ? # /
(?![^\s?#]) Negative lookahead, assert that directly to the right is not a non whitespace char other than ? or #
See a regex demo.
In the replacement use group 1 followed by segment/
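For a match only instead of a capture group (this relies on a variable-length lookbehind, which engines such as .NET and modern JavaScript support but PCRE does not):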
(?<=\bhttps?://\S+/)[^\s?#/]+(?![^\s?#])
See another regex demo.
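
The capture-group variant can be tried in Python roughly as follows. This is a sketch, not a definitive implementation: the replacement appends / to the whole match (group 1 plus the segment), and the names URL_SEGMENT and append_slash_to_urls are invented for the example. The lookbehind variant is left out because Python's re module only accepts fixed-width lookbehinds.

```python
import re

# Group 1 runs up to the last slash; the segment after it must not be
# followed by another non-whitespace character other than ? or #.
URL_SEGMENT = re.compile(r"\b(https?://\S+/)[^\s?#/]+(?![^\s?#])")


def append_slash_to_urls(text):
    """Add a trailing slash to every URL in text whose last segment lacks one."""
    return URL_SEGMENT.sub(lambda m: m.group(0) + "/", text)


print(append_slash_to_urls("see https://example.com/what-not/segment?a=b now"))
print(append_slash_to_urls("https://example.com/what-not/segment/#a"))
```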

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.

There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, diZ, for instance, would not match what is before the | and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
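
To make the "honeypot" idea concrete, here is a minimal Python sketch. It plugs the regex from the question, [a-zA-Z][a-z], into the unanchored form and treats a string as accepted only when group 1 took part in the match. USER_REGEX, HONEYPOT and accepts are hypothetical names, and matching from the start of the string with re.match is an assumption about how the pattern would be applied.

```python
import re

USER_REGEX = r"[a-zA-Z][a-z]"  # the regex from the question
# The first alternative swallows anything starting with de/DE/dE/De,
# so group 1 is only filled for strings that do not start that way.
HONEYPOT = re.compile(rf"[dD][eE].*|({USER_REGEX})")


def accepts(text):
    """True only if the real regex (group 1) matched, not the honeypot."""
    m = HONEYPOT.match(text)
    return m is not None and m.group(1) is not None


for sample in ["diZ", "dexter", "DEal", "Ab", "12"]:
    print(sample, accepts(sample))
```

The calling code has to inspect group 1; a bare "did it match?" check is not enough, because the honeypot branch also produces a match.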

Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+

This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.

You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo
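
The anchored alternation can be checked against the four forbidden prefixes with a short Python sketch. NOT_DE and is_allowed are names chosen for illustration.

```python
import re

# Either the first char is not d/D, or the second char is not e.
# The second char must be lowercase in both branches, as in [a-zA-Z][a-z].
NOT_DE = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")


def is_allowed(text):
    """True if text matches [a-zA-Z][a-z] at the start and does not begin with de/DE/dE/De."""
    return NOT_DE.match(text) is not None


for sample in ["de", "DE", "dE", "De", "dexter", "di", "Ed", "abc"]:
    print(sample, is_allowed(sample))
```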

Regex to block ? parameter in url

I have a pretty basic regex for a soundcloud link as follows:
/^(https?:\/\/)?((www)?\.)?soundcloud\.com\/[a-zA-Z0-9]*/
It doesn't do much, and I would like to block more complicated links that contain something other than valid SoundCloud information (such as ?fbclid=):
Invalid:
https://soundcloud.com/beat-one/done-lam/w-avlwV?fbclid=IwAR8TyDI-1ezXHrP1BfaBG1rLP8gTlEQeGJzKO0DGZl7h1i63VWqenV_yDI8
Valid:
https://soundcloud.com/beat-one/done-lam/w-avlwV
How can I do it?
Thanks!

Using a pattern like ((www)?\.)? matches optional www but could also match an optional .
I think you meant to use optional www. as a whole.
For the question mark part, you could match the beginning of the URL and then use a negative lookahead (?!.*\?fbclid=) to assert that what is on the right does not contain ?fbclid=
If that is the case, you could match for example 0+ times a non whitespace char \S*
If you are not using the capturing groups for further processing, you could also make them non capturing (?:
^(https?:\/\/)?(www\.)?soundcloud\.com\/(?!.*\?fbclid=)\S*$
Regex demo
If the link should not contain a question mark, you could simplify the lookahead to (?!.*\?)
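
As a rough check, the pattern can be run against the two links from the question in Python. The sketch uses the non-capturing form suggested above and drops the \/ escapes, since Python has no regex delimiters. SOUNDCLOUD and is_clean_link are made-up names.

```python
import re

# The negative lookahead right after the host rejects any link
# that contains ?fbclid= anywhere to its right.
SOUNDCLOUD = re.compile(
    r"^(?:https?://)?(?:www\.)?soundcloud\.com/(?!.*\?fbclid=)\S*$"
)


def is_clean_link(url):
    """True for a SoundCloud link without an ?fbclid= parameter."""
    return SOUNDCLOUD.match(url) is not None


print(is_clean_link("https://soundcloud.com/beat-one/done-lam/w-avlwV"))
print(is_clean_link(
    "https://soundcloud.com/beat-one/done-lam/w-avlwV"
    "?fbclid=IwAR8TyDI-1ezXHrP1BfaBG1rLP8gTlEQeGJzKO0DGZl7h1i63VWqenV_yDI8"
))
```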

Regex to match ISO languages ISO

I have the following language or language-locale codes in a URL and I am trying to identify them with a regex. I was partially successful, but it fails in some scenarios.
Languages that I am testing with:
en-us -- Passes
us -- Fails
Here is the REGEX that i have
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?

The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
Test here.

The trailing \/ keeps the last group from matching, since your URLs don't end with a slash. In any case, I would rewrite your regex like this: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?.
This way it always looks for the first part (en) and treats the second (-us) as optional.
Alternatively, use (\w{2})(-\w{2})?\/c\/(deals-and-tips)? if you don't mind also matching digits and underscores.

The reason your pattern does not match us is that the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in the second part of the alternation.
Also it does not match the last group with deals-and-tips because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
Regex demo
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data you could match the leading \/ in front of the capturing group to get a more efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z
) Close group
\/c\/ Match /c/
(deals-and-tips)? Optional capture group 2 match deals-and-tips
Regex demo
Note that if you use another delimiter than / you don't have to escape the forward slash.
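
Here is a small Python sketch of the shortened pattern applied to the two URLs from the question. Python has no regex delimiters, so the slashes are left unescaped. LOCALE and locale_of are illustrative names.

```python
import re

# Group 1: two letters, optionally followed by - and two more letters.
# Group 2: the optional deals-and-tips part after /c/.
LOCALE = re.compile(r"/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)/c/(deals-and-tips)?")


def locale_of(url):
    """Return the language or language-locale code before /c/, or None."""
    m = LOCALE.search(url)
    return m.group(1) if m else None


print(locale_of("https://forum.leasehackr.com/en-us/c/deals-and-tips"))
print(locale_of("https://forum.leasehackr.com/us/c/deals-and-tips"))
```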

Non capturing group included in capture?

This text
"dhdhd89(dd)"
Matched against this regex
.+?(?:\()
..returns "dhdhd89(".
Why is the start parenthesis included in the capture?
Two different tools, as well as the .NET Regex class, return the same result, so I gather there is something I don't understand about this.
The way I read my regex is.
Match any character, at least one occurrence. As few as possible.
The matched string should be followed by a start parenthesis, but not to be included in the capture.
I can find a workaround, but I still want to know what is going on.

Just turn the non-capturing group to positive lookahead assertion.
.+?(?=\()
.+? is a non-greedy match of one or more characters, here followed by an opening parenthesis. An assertion does not consume any characters; it only checks whether a match is possible at that position. A non-capturing group, by contrast, still consumes what it matches, so the ( ends up in the overall match.
DEMO

You can just use this negation based regex to capture only text before a literal (:
^([^(]+)
When you use:
.+?(?:\()
The regex engine does match ( after the initial text; a non-capturing group just doesn't return it to you as a separate captured group, but it is still part of the overall match.

You haven't defined any capture groups, so I guess you are displaying the whole match (group 0). You can do:
(.+?)(?:\()
and the string you want is in group 1,
or use a lookahead as @AvinashRaj said.
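
The difference between the variants discussed in these answers can be seen side by side in a short Python sketch. Python's re module is used here in place of .NET, which is an assumption of the example, but group 0 and group 1 behave the same way for these patterns.

```python
import re

text = "dhdhd89(dd)"

# A non-capturing group still consumes the "(", so it is part of group 0.
whole_match = re.search(r".+?(?:\()", text).group(0)

# A lookahead only asserts that "(" follows; nothing is consumed.
with_lookahead = re.search(r".+?(?=\()", text).group(0)

# A capture group around the wanted part leaves "(" out of group 1.
with_group = re.search(r"(.+?)(?:\()", text).group(1)

# A negated character class stops right before the first "(".
with_negation = re.search(r"^([^(]+)", text).group(1)

print(whole_match)     # dhdhd89(
print(with_lookahead)  # dhdhd89
print(with_group)      # dhdhd89
print(with_negation)   # dhdhd89
```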
