Abort regex execution when pattern found in negative lookahead syntax - regex

While struggling trying to validate SQL Server's connection string pattern using regex I've achieved the following result:
^(?!.*?(?<=^|\;)[a-zA-Z]+( [a-zA-Z]+)*(\=[^\;]+?\=[^\;]*)?(\;|$))+([a-zA-Z]+( [a-zA-Z]+)*\=[^\;]+\;?)+$
Sample string used was:
option=value;missingvalue;multiple assignment=123=456
* (hosted and tested in regex101)
And, as expected, the string didn't match. The issue is that I think this may not be standard, recommended nor optimal regex implementation — especially at the negative lookahead part, considering it's just going through the whole string even after a successful match.
I'll try to break down how it works below:
Negative Lookahead
1. ^(?!.*?(?<=^|;)
Negative lookahead pattern starting either at the beginning of the string or recursively throughout just after the semi colon character
2. [a-zA-Z]+( [a-zA-Z]+)*(=[^;]+?=[^;]*)?(;|$))+
Matching the simple or composite option names — that is, just [a-zA-Z]+ (mandatory) or, additionally, ( [a-zA-Z]+)* any number of times; afterwards there's an optional group that tries to match when there's more than one consecutive value assignment for any given option; finally it ends with either ; or $ (end of string) — in case of the first one, the lookahead pattern restarts from the beginning (recursion)
Regular Pattern Matching
([a-zA-Z]+( [a-zA-Z]+)*=[^;]+;?)+$
Not much new to say here other than that this is the pattern which should actually match the string after the initial Negative Lookahead thorough scan/validation.
I can't deny that it's kinda working for what I intended, but I can't hold back the feeling that I'm misunderstanding something about regex's workings.
Is there an easier way to do this while avoiding having to recursively look ahead using the pattern described above multiple times?
EDIT: As requested, some closer to real life examples would be the following — for both valid and invalid formatting:
VALID
Database=somedb;Username=admin;Password=P#ssword!23;Port=1433
INVALID
missing delimiter between Username and Password options
Database=somedb;Username=adminPassword=P#ssword!23;Port=1433
missing value for Port option
Database=somedb;Port;Username=admin;Password=P#ssword!23

The following string accepts only letters for the names. for the purposes of testing it accepts any character except equals and semi colon in the values. This would need to be defined as characters like line ending and tab would need to be excluded.
We have a negative lookahead to forbid a second equals sign in the values and a negative lookback to forbid a semi-colon before the end. Please note that your "correct" example is found to be wrong because there is no semi-colon at the end
If we try to block the otherway round it becomes impossible to match the regex.
I've added an optional single space in the name to match "Connection Timeout" and similar
/^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$/gm
I have also allowed spaces before the name.
Our string is made up of
^beginning of line
( start group
\s* optional whitespace before name
[a-zA-Z]+ ?[a-zA-Z]+name containing at least one letter before and after an optional space. This means at least two letters
=an equals sign
(start inner group
(?!\=) negative look ahead for equals sign
[^=;] any character except equals and semi-colon at least once
; a literal semi-colon.
){4,}close the outer group and repeat it at least 4 times
$ end of line
Thank you Casimir et Hippolyte for the improvement. I was using look-aheads and look-backs following the question but your syntax is much cleaner.

Related

Catastrophic backtracking issue with regular expression on long names

I am currently trying to validate a regex pattern for a names list among other things.
It actually works so far except when I try to test the limits. If the name is quite long, a maximum of 128 characters is allowed and then at the end a character which is defined in an inner group, such as:. a separator e.g. Space or a puncture, catastrophic backtracking occurs. Somehow I don't quite understand that because I would assume that group one (?:[\p{L}\p{Nd}\p{Ps}])+ 1 x must be there, group (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])? is optional and if the group has to be valid at the end (?:[\p{L}\p{Nd}\p{Pe}.]). The rear 2 groups can occur more often.
Full pattern
^(?!.{129})(?!.["])(?:[\p{L}\p{Nd}\p{Ps}])+(?:(?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])?(?:[\p{L}\p{Nd}\p{Pe}.]))*$
Tests & Samples
https://regex101.com/r/6E0Khd/1
You need to re-phrase the pattern in such a way so that the consequent regex parts could not match at the same location inside the string.
You can use
^(?!.{129})(?!.")[\p{L}\p{Nd}\p{Ps}][\p{L}\p{Nd}\p{Pe}.]*(?:(?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)[\p{L}\p{Nd}\p{Pe}.]+)*$
See the regex demo.
Your regex was ^<Lookahead_1><Lookahead_2><P_I>+(?:<OPT_SEP>?<P_II>)*$. You need to make sure your string only starts with a char that matches <P_I> pattern, the rest of the chars can match <P_II> pattern. So, it should look like ^<Lookahead_1><Lookahead_2><P_I><P_II>*(?:<SEP><P_II>+)*$. Note the P_I pattern is used to match the first char only, P_II pattern is added right after P_I to match zero or more chars matching that pattern, SEP pattern is now obligatory and P_II pattern is quantified with +.
I also shrunk the (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}]) pattern into (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?) (it matches either a horizontal whitespace and an optional punctuation proper symbol, or an optional punctuation proper symbol followed with an optional horizontal whitespace.
Note that \p{Zs} does not match a TAB char, you may want to use [\p{Zs}\t] instead.

Regex check for name Initials

I am trying to create a regex that checks if one or more middle-name initials have the following stucture:
INITIAL.[BLANK]INITIAL.[BLANK]INITIAL.
There can be multiple Initials as long as they are followed by a dot (.) - blank spaces are only allowed between two initials (e.g. L. B.)
It should not be possible to have a space after an initial if there's no other initial following.
At the moment, I have the following Regex which doesn't work perfectly as of now:
([A-Z]\. (?=[A-Z]|$))+
Using regex101, this is an example:
As you can see, it still matches the string even though there's a blank space at the end, without having another Initial following.
I am not sure why this is happening. I am just learning regex and would be glad if anyone could provide me with a solution to my problem :)
The error you're seeing is because at the last step, your expression reads in [A-Z]\. looks ahead for $ (and finds it). I would express the pattern this way: (?:[A-Z]\. )*[A-Z]\.$. Treat the last initial specially because it does not have a final space.
The pattern you tried ([A-Z]\. (?=[A-Z]|$))+ uses a repeated capturing group which will give you the value of the last iteration.
In that repetition you match a space <code>[A-Z]\. </code> effectively meaning that it should be present in the match.
You could repeat 0+ times matching a char [A-Z] followed by a space to match multiple occurrences.
Then match a char [A-Z] asserting what is on the right is not a non whitespace char.
\b(?:[A-Z]\. )*[A-Z]\.(?!\S)
Regex demo
If there can be multiple spaces but it should not match a newline:
\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)
Regex demo

What would be the Regex expression to get the first letter after a group of character and some integers?

I have a string that the following structure:
ABCD123456EFGHIJ78 but sometimes it's missing a number or a character like:
ABC123456EFGHIJ78 or
ABCD123456E or
ABCD12345EFGHIJ78
etc.
That's why I need regular expressions.
What I want to extract is the first letter of the third group, in this case 'E'.
I have the following regex:
(\D+)+(\d+)+(\D{1})\3
but I don't get the letter E.
This seems to work for the example cases you provided.
^(?:[A-Za-z]+)(?:\d+)(.)
It assumes that the first group is only letters and that the second group is only digits.
There's already a nice answer.
But for the records, your initial proposal was very close to work. You just needed to say that the character matching the 3rd group can repeat several times by adding a star:
^(\D+)(\d+)(\D{1})\3*
The main weakness is that \D matches any char except digits, so also spaces. Making it more robust leads us to explicit the range of chars accepted:
^([A-Za-z]+)(\d+)([A-Za-z]{1})\3*
It's much better, but my favourite uses \w to match at the end of the pattern any non white character:
([A-Za-z]+)(\d+)([A-Za-z]{1})\w*

Username cannot contain repeating underscore or period

I have always struggled with these darn things. I recall a lecturer telling us all once that if you have a problem which requires you use regular expressions to solve it, you in fact now have 2 problems.
Well, I certainly agree with this. Regex is something we don't use very often but when we do its like reading some alien language (well for me anyway)... I think I will resolve to getting the book and reading further.
The challenge I have is this, I need to validate a username based on the following criteria:
can contain letters, upper and lower
can contain numbers
can contain periods (.) and underscores (_)
periods and underscores cannot be consecutive i.e. __ .. are not allowed but ._._ would be valid.
a maximum of 20 characters in total
So far I have the following : ^[a-zA-Z_.]{0,20}$ but of course it allows repeat underscores and periods.
Now, I am probably doing this all wrong starting out with the set of valid characters and max length. I have been trying (unsuccessfully) to create some look-around or look-behind or whatever to search for invalid repetitions of period (.) and underscore (_) not sure what the approach or methodology to break down this requirement into a regex solution is.
Can anyone assist with a recommendation / alternative approach or point me in the right direction?
This one is the one you need:
^(?:[a-zA-Z0-9]|([._])(?!\1)){5,20}$
Edit live on Debuggex
You can have a demo of what it matches here.
"Either an alphanum char ([a-zA-Z0-9]), or (|) a dot or an underscore ([._]), but that isn't followed by itself ((?!\1)), and that from 5 to 20 times ({5,20})."
(?:X) simply is a non-capturing group, i.e. you can't refer to it afterwards using \1, $1 or ?1 syntaxes.
(?!X) is called a negative lookahead, i.e. literally "which is not followed by X".
\1 refers to the first capturing group. Since the first group (?:...){5,20} has been set as non-capturing (see #1), the first capturing group is ([._]).
{X,Y} means from X to Y times, you may change it as you need.
Don't try to shove this into a single regex. Your single regex works fine for all criteria except #4. To do #4, just do a regex that matches invalid usernames and reject the username if it matches. For example (in pseudocode):
if username.matches("^[a-zA-Z_.]{0,20}$") and !username.matches("__|\\.\\.") {
/* accept username */
}
You can use two negative lookahead assertions for this:
^(?!.*__)(?!.*\.\.)[0-9a-zA-Z_.]{0,20}$
Explanation:
(?! # Assert that it's impossible to match the following regex here:
.* # Any number of characters
__ # followed by two underscores in a row
) # End of lookahead
Depending on your requirements and on your regex engine, you may replace [0-9A-Za-z_.] with [\w.].
#sp00n raised a good point: You can combine the lookahead assertions into one:
^(?!.*(?:__|\.\.))[0-9a-zA-Z_.]{0,20}$
which might be a bit more efficient, but is a little harder to read.
For your answer above
I've tried to do what it you says on the account but it still says
The account name shall be a combination of letter, number or underscore
then after i am try do that then app reject that account
So write me a sample of the correct registration data according to the name I want to register is PACIFIC CONCORD INTERNATIONAL
And put signs and underscores on this name correctly so that the site accepts it
Thank you

Need a simple RegEx to find a number in a single word

I've got the following url route and i'm wanting to make sure that a segment of the route will only accept numbers. as such, i can provide some regex which checks the word.
/page/{currentPage}
so.. can someone give me a regex which matches when the word is a number (any int) greater than 0 (ie. 1 <-> int.max).
/^[1-9][0-9]*$/
Problems with other answers:
/([1-9][0-9]*)/ // Will match -1 and foo1bar
#[1-9]+# // Will not match 10, same problems as the first
[1-9] // Will only match one digit, same problems as first
If you want it greater than 0, use this regex:
/([1-9][0-9]*)/
This'll work as long as the number doesn't have leading zeros (like '03').
However, I recommend just using a simple [0-9]+ regex, and validating the number in your actual site code.
This one would address your specific problem. This expression
/\/page\/(0*[1-9][0-9]*)/ or "Perl-compatible" /\/page\/(0*[1-9]\d*)/
should capture any non-zero number, even 0-filled. And because it doesn't even look for a sign, - after the slash will not fit the pattern.
The problem that I have with eyelidlessness' expression is that, likely you do not already have the number isolated so that ^ and $ would work. You're going to have to do some work to isolate it. But a general solution would not be to assume that the number is all that a string contains, as below.
/(^|[^0-9-])(0*[1-9][0-9]*)([^0-9]|$)/
And the two tail-end groups, you could replace with word boundary marks (\b), if the RE language had those. Failing that you would put them into non-capturing groups, if the language had them, or even lookarounds if it had those--but it would more likely have word boundaries before lookarounds.
Full Perl-compatible version:
/(?<![\d-])(0*[1-9]\d*)\b/
I chose a negative lookbehind instead of a word boundary, because '-' is not a word-character, and so -1 will have a "word boundary" between the '-' and the '1'. And a negative lookbehind will match the beginning of the string--there just can't be a digit character or '-' in front.
You could say that the zero-width assumption ^ is just one of the cases that satisfies the zero-width assumption (?<![\d-]).
string testString = #"/page/100";
string pageNumber = Regex.Match(testString, "/page/([1-9][0-9]*)").Groups[1].Value;
If not matched pageNumber will be ""
While Jeremy's regex isn't perfect (should be tested in context, against leading characters and such), his advice is good: go for a generic, simple regex (eg. if you must use it in Apache's mod_rewrite) but by any means, handle the final redirect in server's code (if you can) and do a real check of parameter's validity there.
Otherwise, I would improve Jeremy's expression with bounds: /\b([1-9][0-9]*)$/
Of course, a regex cannot provide a check against any max int, at best you can control the number of digits: /\b([1-9][0-9]{0,2})$/ for example.
This will match any string such that, if it contains /page/, it must be followed by a number, not consisting of only zeros.
^(?!.*?/page/([0-9]*[^0-9/]|0*/))
(?! ) is a negative look-ahead. It will match an empty string, only if it's contained pattern does not match from the current position.