validate a path structure ending in a filename.xml - regex

I'm having trouble with a regular expression. I hope someone can help or point me in the right direction. Essentially, I have to validate a path structure.
The rules for valid input to my method are:
it starts with /usersname/ (including the forward slashes)
there must be only one occurrence of /usersname/
after the one occurrence of /usersname/ there must be exactly one [alphanumeric-_ space].xml
So for example, the following are valid input into my method:
/norrisc/thesf6457.xml
/norrisc/thess63-57.xml
/norrisc/thqsf64-57 gdhy.xml
/norrisc/ase45tg_3.xml
.. and the following are *in*valid input into method:
/norrisc/anotherFolder/thesf6457.xml
/norrisc/norrisc/thess63-57.xml
/norrisc/norrisc/thess63-57.txt
/norrisc/norrisc/thess63-57
/norrisc/thqsf64-57 gdhy.xml/kjhfsd.xml
My efforts (to no avail) so far are..
\b[/username/]{1}^[a-zA-Z0-9_\\s-]+$\.xml
^[/username/]{1}[a-zA-Z0-9_\\s-]+$\.xml{1}
\b/username/{1}[a-zA-Z0-9_\\s-]+$\.xml{1}
Hope someone can help.. ?
Thanks v much

This worked for me on your test cases:
^\/username\/(?!.*\/)(\w|\s|-)+\.xml$
where username, obviously, is the literal username or a variable containing it.
Breaking that down ...
^ - start of string
\/username\/ - literal username enclosed by /
(?!.*\/) - negative lookahead: ensures the rest of the string does not contain another /
(\w|\s|-)+ - one or more letters, digits, spaces, _, or -
\.xml - literal .xml
$ - end of string
If you're unfamiliar with lookaheads, the (?=...) structure lets you match using a zero-width assertion. For example, (?=a) checks that the next character is an a but does NOT include it as part of the match (that's what "zero-width" means; ^ and $ are other examples of zero-width assertions). This is called a positive lookahead, and it lets you "peek ahead" at characters without consuming them.
(?!...) does the same thing, but checks that the specified pattern does not match. It's called a negative lookahead. So in the regex above, (?!.*\/) looks for the .*\/ pattern, which means "zero or more characters followed by a slash". If it finds this, such as in the string /username/another_username/whatever.xml, the match will NOT succeed (because the lookahead is negative).
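To try the pattern against the test cases from the question, here is a minimal Python sketch. The function name is_valid_path and the use of re.escape for the username are my own assumptions, not part of the answer above; Python does not need the forward slashes escaped.

```python
import re

def is_valid_path(path: str, username: str) -> bool:
    """Return True if path looks like /<username>/<file>.xml with no extra slashes."""
    # re.escape guards against regex metacharacters in the username (an assumption
    # added here; the pattern itself is the one from the answer).
    pattern = rf"^/{re.escape(username)}/(?!.*/)(\w|\s|-)+\.xml$"
    return re.match(pattern, path) is not None

valid = [
    "/norrisc/thesf6457.xml",
    "/norrisc/thess63-57.xml",
    "/norrisc/thqsf64-57 gdhy.xml",
    "/norrisc/ase45tg_3.xml",
]
invalid = [
    "/norrisc/anotherFolder/thesf6457.xml",
    "/norrisc/norrisc/thess63-57.xml",
    "/norrisc/norrisc/thess63-57.txt",
    "/norrisc/norrisc/thess63-57",
    "/norrisc/thqsf64-57 gdhy.xml/kjhfsd.xml",
]

for p in valid + invalid:
    print(f"{p!r}: {is_valid_path(p, 'norrisc')}")
```

All four valid paths print True and all five invalid paths print False.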

Related

Abort regex execution when pattern found in negative lookahead syntax

While struggling trying to validate SQL Server's connection string pattern using regex I've achieved the following result:
^(?!.*?(?<=^|\;)[a-zA-Z]+( [a-zA-Z]+)*(\=[^\;]+?\=[^\;]*)?(\;|$))+([a-zA-Z]+( [a-zA-Z]+)*\=[^\;]+\;?)+$
Sample string used was:
option=value;missingvalue;multiple assignment=123=456
* (hosted and tested in regex101)
And, as expected, the string didn't match. The issue is that I think this may not be standard, recommended nor optimal regex implementation — especially at the negative lookahead part, considering it's just going through the whole string even after a successful match.
I'll try to break down how it works below:
Negative Lookahead
1. ^(?!.*?(?<=^|;)
Negative lookahead pattern starting either at the beginning of the string or recursively throughout just after the semi colon character
2. [a-zA-Z]+( [a-zA-Z]+)*(=[^;]+?=[^;]*)?(;|$))+
Matching the simple or composite option names — that is, just [a-zA-Z]+ (mandatory) or, additionally, ( [a-zA-Z]+)* any number of times; afterwards there's an optional group that tries to match when there's more than one consecutive value assignment for any given option; finally it ends with either ; or $ (end of string) — in case of the first one, the lookahead pattern restarts from the beginning (recursion)
Regular Pattern Matching
([a-zA-Z]+( [a-zA-Z]+)*=[^;]+;?)+$
Not much new to say here other than that this is the pattern which should actually match the string after the initial Negative Lookahead thorough scan/validation.
I can't deny that it's kinda working for what I intended, but I can't hold back the feeling that I'm misunderstanding something about regex's workings.
Is there an easier way to do this while avoiding having to recursively look ahead using the pattern described above multiple times?
EDIT: As requested, some closer to real life examples would be the following — for both valid and invalid formatting:
VALID
Database=somedb;Username=admin;Password=P#ssword!23;Port=1433
INVALID
missing delimiter between Username and Password options
Database=somedb;Username=adminPassword=P#ssword!23;Port=1433
missing value for Port option
Database=somedb;Port;Username=admin;Password=P#ssword!23
The following regex accepts only letters in the names. For testing purposes, it accepts any character except equals and semi-colon in the values. This would need to be tightened, since characters like line endings and tabs should be excluded.
The negated character class [^=;] forbids a second equals sign in the values, and each name=value pair must be terminated by a semi-colon. Please note that your "correct" example is found to be wrong because there is no semi-colon at the end.
If we try to block the other way round it becomes impossible to match the regex.
I've added an optional single space in the name to match "Connection Timeout" and similar
/^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$/gm
I have also allowed spaces before the name.
Our regex is made up of
^ beginning of line
( start group
\s* optional whitespace before the name
[a-zA-Z]+ ?[a-zA-Z]+ name containing at least one letter before and after an optional space. This means at least two letters
= an equals sign
[^=;]+ any character except equals and semi-colon, at least once
; a literal semi-colon
)+ close the group and repeat it at least once
$ end of line
Thank you Casimir et Hippolyte for the improvement. I was using look-aheads and look-backs following the question but your syntax is much cleaner.
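For anyone who wants to check the behaviour, here is a small Python sketch of the regex above applied to the examples from the question. The function name is_valid_connection_string is hypothetical, and I dropped the g and m flags since each string is tested on its own. Note that the valid example only passes once a trailing semi-colon is added.

```python
import re

# Same pattern as in the answer, without the /.../gm delimiters and flags.
PAIRS = re.compile(r"^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$")

def is_valid_connection_string(s: str) -> bool:
    """Return True if s is a sequence of name=value; pairs accepted by the pattern."""
    return PAIRS.match(s) is not None

samples = [
    "Database=somedb;Username=admin;Password=P#ssword!23;Port=1433;",  # valid
    "Database=somedb;Username=admin;Password=P#ssword!23;Port=1433",   # no trailing ;
    "Database=somedb;Username=adminPassword=P#ssword!23;Port=1433;",   # missing delimiter
    "Database=somedb;Port;Username=admin;Password=P#ssword!23;",       # missing value
    "Connection Timeout=30;",                                          # name with a space
]

for s in samples:
    print(f"{is_valid_connection_string(s)!s:5}  {s}")
```

Only the first and last samples are accepted.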

Regex only solution for matching the single occurrence without a prefix within a string

Suppose I have a string that can have parameters
--true config --false
All parameters can be prefixed by a predetermined prefix like -- or ! and they can occur at any place in the string, before or after the "non parameter".
Using only regex, how can I match all the "non-parameters" in the string, e.g. in the example above it would be config.
I've looked at other answers suggesting negative lookaheads such as
^(?!--).* but those only work for the whole string itself.
You may use
(?<!\S)(?!--|!)\S+
See the regex demo.
Details
(?<!\S) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a non-whitespace
(?!--|!) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a -- or ! substring
\S+ - 1+ non-whitespace chars.
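As a quick check, here is a minimal Python sketch using re.findall with that pattern. The helper name non_parameters and the second sample string are made up for illustration.

```python
import re

# A token must start at a whitespace boundary and must not begin with -- or !
NON_PARAM = re.compile(r"(?<!\S)(?!--|!)\S+")

def non_parameters(s: str) -> list[str]:
    """Return every whitespace-delimited token that has no -- or ! prefix."""
    return NON_PARAM.findall(s)

print(non_parameters("--true config --false"))   # ['config']
print(non_parameters("!debug main --x app"))     # ['main', 'app']
```

The lookbehind is what stops the engine from matching the tail of a parameter, such as the true in --true.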

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
I've struggled with my next addition, though! I'm trying to extend the regex so that it does not match on the following:
Just entering the same character repeatedly, i.e. aaa, bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, that would give:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows:
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows:
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it: no need to repeat that we match the start of the string in its clauses
Each alternative matches the end of the string. We can put the alternatives in a group, which will be followed by the end-of-string anchor
We have created a new group, which we have to take into account in our back-reference: to reference the same group, it now must address \2 rather than \1. An alternative in certain regex flavours would have been to use a non-capturing group (?:...) for the alternation, which would leave the back-reference as \1
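Here is a rough Python sketch of the simplified pattern, with every title from the original regex (including mrs) kept in the alternation. The function name is_valid_name is an assumption for illustration. One thing to be aware of: with the case-insensitive flag, the back-reference also compares case-insensitively.

```python
import re

# Group 1 is the alternation inside the lookahead, group 2 is the (.) it contains,
# so the back-reference for "same character repeated" is \2.
NAME = re.compile(
    r"^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$",
    re.IGNORECASE,
)

def is_valid_name(name: str) -> bool:
    """Return True if name passes the title, repeated-character and charset checks."""
    return NAME.match(name) is not None

for candidate in ["O'Brian", "Anne-Marie", "Smith", "mr", "Mrs", "aaa", "bbbbb", "x"]:
    print(f"{candidate!r}: {is_valid_name(candidate)}")
```

The first three names are accepted; the titles, the repeated-character strings and the single letter are rejected.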

Match pattern anywhere in string?

I want to match the following pattern:
Exxxx49 (where x is a digit 0-9)
For example, E123449abcdefgh, abcdefE123449987654321 are both valid. I.e., I need to match the pattern anywhere in a string.
I am using:
^*E[0-9]{4}49*$
But it only matches E123449.
How can I allow any amount of characters in front or after the pattern?
Remove the ^ and $ to search anywhere in the string.
In your case the * are probably not what you intended; E[0-9]{4}49 should suffice. This will find an E, followed by four digits, followed by a 4 and a 9, anywhere in the string.
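A minimal sketch in Python, assuming a search-style API is available (here re.search, which scans the whole string for the first occurrence). The helper name contains_code is made up.

```python
import re

PATTERN = re.compile(r"E[0-9]{4}49")

def contains_code(s: str) -> bool:
    """Return True if E, four digits, then 49 occurs anywhere in s."""
    return PATTERN.search(s) is not None

print(contains_code("E123449abcdefgh"))          # True
print(contains_code("abcdefE123449987654321"))   # True
print(contains_code("E12349"))                   # False: only one digit left where 49 is needed
```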
I would go for
^.*E[0-9]{4}49.*$
EDIT: since it fulfills all requirements stated by the OP:
"[match] Exxxx49 (where x is digit 0-9)"
"allow for any amount of characters in front or after pattern"
It will match
^.* everything before the pattern, including the beginning of the line
E[0-9]{4}49 the requested pattern
.*$ everything after the pattern, including the end of the line
Your original regex has a syntax error at the first *, which has nothing to repeat. Fix it and change it to this:
.*E\d{4}49.*
This pattern is for APIs that implicitly anchor the match to the whole string, like Java's String.matches(), since you didn't specify a language.
.* matches any sequence of zero or more characters. As it surrounds the pattern, the regex will match the entire string as long as the pattern occurs somewhere in it.
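To illustrate the difference, here is a short sketch using Python's re.fullmatch, which I'm using as a stand-in for a whole-string API such as Java's String.matches(). The variable names are made up.

```python
import re

text = "abcdefE123449987654321"

# A whole-string match fails without the surrounding .* ...
bare = re.fullmatch(r"E\d{4}49", text)
# ... and succeeds once .* absorbs the text on both sides.
padded = re.fullmatch(r".*E\d{4}49.*", text)

print(bare)                 # None
print(padded is not None)   # True
```

Keep in mind that . does not match a newline by default, so multi-line input would need the DOTALL flag or a search-style call instead.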
Here is a regex demo!
Just simply use this:
E[0-9]{4}49
How do I allow for any amount of characters in front or after pattern? but it only matches E123449
Use global flag /E\d{4}49/g if supported by the language
OR
Try a capturing group, (E\d{4}49)+, formed by enclosing the pattern in parentheses (...)
Here is online demo

Regular expression to split a string but consider multi-digit escape sequences

I could use some help with the following regular expression problem and would appreciate any input. Thanks in advance.
I have to split a string by another string, let me call it separatorString. However, if an escape sequence precedes separatorString, the string should not be split at this point. The escape sequence is also a string, let me call it escapeSequence.
Maybe it is better to start with some examples
separatorString = "§§";
escapeSequence = "###";
inputString = "Part1§§Part2" ==> Desired output: "Part1", "Part2"
inputString = "Part1§§Part2§§ThisIs###§§AllPart3" ==> Desired output: "Part1", "Part2", "ThisIs###§§AllPart3"
Searching stackoverflow, I found Splitting a string that has escape sequence using regular expression in Java and came up with the regular expression
"(?<!(###))§§".
This is basically saying, match if you find "§§", unless it is preceded by "###".
This works fine with Regex.Split for the examples above, however, if inputString is "Part1###§§§§Part2" I receive "Part1###§", "§Part2" instead of "Part1###§§", "Part2".
I understand why: the second "§" gives a match, because the preceding chars are "##§" and not "###". I tried for several hours to modify the regex, but the result only got worse. Does someone have an idea?
Let's call the things that appear between the separators, tokens. Your regex needs to stipulate what the beginning and end of a token looks like.
In the absence of any stipulation, in other words, using the regex you have now, the regex engine is happy to say that the first token is Part1###§ and the second is §Part2.
The syntax you used, (?<!foo), is called a zero-width negative look-behind assertion. In other words, it looks behind the current position and asserts that the preceding text must not match foo. Zero-width just indicates that the assertion does not advance the pointer or cursor in the subject string when the assertion is evaluated.
If you require that a new token start with something specific (say, an alphanumeric character), you can specify that with a zero-width positive lookahead assertion. It's similar to your lookbehind, but it says "the next bit has to match the following pattern", again without advancing the cursor or pointer.
To use it, put (?=[A-Z]) following the §§. The entire regex for the separator is then
(?<!###)§§(?=[A-Z])
This would assert that the character following a separator sequence needs to be an uppercase alpha, while the characters preceding the separator sequence must not be ###. In your example, it would force the match on the §§ separator to be the pair of chars before Part2. Then you would get Part1###§§ and Part2 as the tokens, or group captures.
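A small Python sketch comparing the two separators on the problem input (the question uses .NET's Regex.Split; re.split is used here as an assumed equivalent for this case, and the variable names are made up):

```python
import re

text = "Part1###§§§§Part2"

# Lookbehind only: the engine splits on the 2nd and 3rd § characters.
naive = re.split(r"(?<!###)§§", text)

# Lookbehind plus lookahead: the separator must be followed by an uppercase letter.
refined = re.split(r"(?<!###)§§(?=[A-Z])", text)

print(naive)     # ['Part1###§', '§Part2']
print(refined)   # ['Part1###§§', 'Part2']
```

This only works as long as every token really does start with an uppercase letter, which holds for the examples in the question.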
If you want to stipulate what a token is in the negative, in other words to stipulate that a token begins with anything except a certain pattern, you can use a negative lookahead assertion. The syntax for this is (?!foo). It works just as you would expect: like your negative lookbehind, only looking forward.
The regular-expressions.info website has good explanations for all things regex, including for the lookahead and lookbehind constructs.
ps: it's "Hello All", not "Hello Together".
How about doing the opposite: Instead of splitting the string at the separators match non-separator parts and separator parts:
/(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§/
Then you just have to remove every matched separator part to get the non-separator parts.
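A minimal sketch of that idea in Python, assuming re.findall to collect the matches and a list comprehension to drop the separators. The function name split_escaped is hypothetical.

```python
import re

SEPARATOR = "§§"

# First alternative: a run of non-separator text, where ###§§ counts as text.
# Second alternative: a bare separator.
TOKENIZER = re.compile(r"(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§")

def split_escaped(s: str) -> list[str]:
    """Match text parts and separator parts, then keep only the text parts."""
    return [part for part in TOKENIZER.findall(s) if part != SEPARATOR]

print(split_escaped("Part1§§Part2"))
print(split_escaped("Part1§§Part2§§ThisIs###§§AllPart3"))
print(split_escaped("Part1###§§§§Part2"))   # ['Part1###§§', 'Part2']
```

Unlike a real split, this drops empty tokens between two adjacent separators, so it is only a fit if empty parts don't matter.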