RegEx - Finding multiple dashes in entire string... and a little more - regex

So I am HORRIBLE with RegEx... can you help me with the following?
only allows 52 total characters
only has "a-z", "A-Z", "0-9", & "-" (letters, numbers, and dashes)
does not start with "-" (dash)
does not end with "-" (dash)
does not have "--" (two consecutive dashes together)
does not have more than 2 "-" (dashes) in the entire string (this is what I'm having problems with)
So here is a helpful list (I guess):
Pass:
abc-123
abc-123-abc
Fail:
-abc-123 (fails due to starting with a dash)
abc-123- (fails due to ending with a dash)
abc-123-abc-123 (fails due to 3 dashes)
abc-12#-abc (failed due to having a character that is not a-z, A-Z, 0-9, or a dash)
This is what I currently have, but feel free to change it however you would like:
(?!.*--)^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,50}[a-zA-Z0-9])?$
I'm sure there's a better way to do this, but as mentioned above, I'm horrible with expressions. My expression works, it just doesn't find more than two dashes.
Thanks for your help.

You could assert a maximum of 52 chars in a positive lookahead.
Then match [a-zA-Z0-9]+ (1 or more times) and repeat 0, 1 or 2 times the same pattern preceded by a -
^(?=[a-zA-Z0-9-]{1,52}$)[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+){0,2}$
Explanation
^ Start of line
(?= Positive lookahead, assert what is on the right is
[a-zA-Z0-9-]{1,52}$ Match any of the listed 1-52 times and assert end of string
) Close lookahead
[a-zA-Z0-9]+ Match 1+ times any of the listed to prevent matching an empty string
(?: Non capture group
-[a-zA-Z0-9]+ Match - and 1+ times any of the listed without the -
){0,2} Close group and repeat 0-2 times
$ End of line
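As a quick sanity check, the pattern can be run against the pass/fail list from the question. This is a minimal Python sketch: the helper name is_valid and the extra test strings (the double dash and the 52/53-character cases) are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above; the lookahead caps the total length at 52.
PATTERN = re.compile(r"^(?=[a-zA-Z0-9-]{1,52}$)[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+){0,2}$")

def is_valid(s: str) -> bool:
    """Return True if s satisfies all of the rules listed in the question."""
    return PATTERN.match(s) is not None

should_pass = ["abc-123", "abc-123-abc", "a" * 52]
should_fail = [
    "-abc-123",         # starts with a dash
    "abc-123-",         # ends with a dash
    "abc-123-abc-123",  # three dashes
    "abc-12#-abc",      # illegal character
    "abc--123",         # two consecutive dashes
    "a" * 53,           # too long
]

for s in should_pass + should_fail:
    print(f"{s!r}: {is_valid(s)}")
```

Because every dash must be followed by at least one alphanumeric character, the "no leading dash", "no trailing dash" and "no double dash" rules all fall out of the main pattern without extra lookaheads.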

You can use the following PCRE-flavoured regex:
/^(?!.*\-\-)(?!.*\-.+\-.*\-)(?!-)[a-z0-9-]{0,52}(?<!-)$/gmi
The regex can be made self-documenting by writing it in free-spacing mode:
/
^ # match beginning of line
(?!.*\-\-.*$) # the line may not contain two consecutive hyphens
(?!.*\-.*\-.*\-.*$) # the line may not contain more than two hyphens
(?!-) # the first char cannot be a hyphen
[a-z0-9-]{0,52} # match 0-52 letters, digits and hyphens
(?<!-) # the last char cannot be a hyphen
$ # match end of the line
/xgmi # free-spacing, global, multiline, case indifferent modes
(?!.*\-\-.*$), (?!.*\-.*\-.*\-.*$) and (?!-) are negative lookaheads; (?<!-) is a negative lookbehind.
This matches each line of a string (convenient for showing test cases at the demo). If the string contains a single line the regex can be simplified somewhat:
\A(?!.*\-\-)(?!.*\-.+\-.*\-)(?!-)[a-z0-9-]{0,52}(?<!-)\z
Note that \A and \z are beginning and end of string anchors, whereas ^ and $ are beginning and end of line anchors. Compare the negative lookaheads in this regex with those in the earlier one.
Should it matter, this matches empty strings.
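For readers working outside PCRE, here is a hedged Python sketch of the single-string variant. Python's re module traditionally spells the end-of-string anchor \Z rather than \z, so the sketch drops the anchors and relies on re.fullmatch instead; the helper name is_valid is an assumption made for illustration.

```python
import re

# Single-string variant from the answer, minus the \A ... \z anchors:
# re.fullmatch() already requires the whole string to match.
PATTERN = re.compile(
    r"(?!.*--)(?!.*-.+-.*-)(?!-)[a-z0-9-]{0,52}(?<!-)",
    re.IGNORECASE,
)

def is_valid(s: str) -> bool:
    """Return True if the whole string satisfies the lookaround-based pattern."""
    return PATTERN.fullmatch(s) is not None

for s in ["abc-123", "abc-123-abc", "-abc-123", "abc-123-",
          "abc-123-abc-123", "abc-12#-abc", ""]:
    print(f"{s!r}: {is_valid(s)}")
```

As the answer points out, the empty string is accepted here because the quantifier is {0,52}; changing it to {1,52} would reject it.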

Related

RegEx: How to match a whole string with fixed-length region with negative look ahead conditions that are overriden afterwards?

The strings I parse with a regular expression contain a region of fixed length N where there can either be numbers or dashes. However, if a dash occurs, only dashes are allowed to follow for the rest of the region. After this region, numbers, dashes, and letters are allowed to occur.
Examples (N=5, starting at the beginning):
12345ABC
12345123
1234-1
1234--1
1----1AB
How can I correctly match this? I currently am stuck at something like (?:\d|-(?!\d)){5}[A-Z0-9\-]+ (for N=5), but I cannot make numbers work directly following my region if a dash is present, as the negative look ahead blocks the match.
Update
Strings that should not be matched (N=5)
1-2-3-A
----1AB
--1--1A
You could assert that the first 5 characters are either digits or - and make sure that there is no - before a digit in the first 5 chars.
^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$
^ Start of string
(?![\d-]{0,3}-\d) Make sure that in the first 5 chars there is no - before a digit
(?=[\d-]{5}) Assert that the first 5 chars are digits or -
[A-Z\d-]+ Match 1+ times any of the listed characters
$ End of string
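The pattern above is written for N=5. A possible generalization, sketched in Python under the assumption that N is at least 2, replaces {0,3} with {0,N-2} and {5} with {N}. The function name build_pattern is hypothetical and not from the answer.

```python
import re

def build_pattern(n: int) -> re.Pattern:
    """Build the lookahead-based pattern for a region of length n (assumes n >= 2)."""
    if n < 2:
        raise ValueError("this sketch assumes a region of at least 2 characters")
    # A dash at index 0..n-2 followed by a digit would violate the rule,
    # hence the {0,n-2} quantifier in the negative lookahead.
    return re.compile(rf"^(?![\d-]{{0,{n - 2}}}-\d)(?=[\d-]{{{n}}})[A-Z\d-]+$")

pattern = build_pattern(5)

should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(pattern.match(s))}")
```

For n=5 this produces exactly the pattern shown in the answer, so the examples from the question behave the same way.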
If atomic groups are available:
^(?=[\d-]{5})(?>\d+-*|-{5})[A-Z\d_]*$
^ Start of string
(?=[\d-]{5}) Assert that the first 5 chars are - or digits
(?> Atomic group
\d+-* Match 1+ digits and optional -
| or
-{5} match 5 times -
) Close atomic group
[A-Z\d_]* Match optional chars A-Z digit or _
$ End of string
Use a non-word-boundary assertion \B:
^[-\d](?:-|\B\d){4}[A-Z\d-]*$
A non-word boundary succeeds at a position between two word characters (from \w, i.e. [A-Za-z0-9_]) or between two non-word characters (from \W, i.e. [^A-Za-z0-9_]), and also between a non-word character and the limit of the string.
With it, each \B\d always follows a digit and can't follow a dash.
Other way (if lookbehinds are allowed):
^\d*-*(?<=^.{5})[A-Z\d-]*$
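To compare the \B variant and the lookbehind variant against the examples from the question, a small Python harness like the following could be used. It is only a sketch: the dictionary keys and the helper agrees_with_examples are names invented here, and the lookbehind is accepted by Python's re because ^.{5} has a fixed width.

```python
import re

# Patterns copied from the two answers above (N = 5).
CANDIDATES = {
    "non-word boundary": re.compile(r"^[-\d](?:-|\B\d){4}[A-Z\d-]*$"),
    "lookbehind": re.compile(r"^\d*-*(?<=^.{5})[A-Z\d-]*$"),
}

SHOULD_MATCH = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
SHOULD_NOT_MATCH = ["1-2-3-A", "----1AB", "--1--1A"]

def agrees_with_examples(pattern: re.Pattern) -> bool:
    """True if the pattern accepts and rejects exactly the listed examples."""
    return (all(pattern.match(s) for s in SHOULD_MATCH)
            and not any(pattern.match(s) for s in SHOULD_NOT_MATCH))

for name, pattern in CANDIDATES.items():
    print(name, agrees_with_examples(pattern))
```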

Regex for extracting digits in a string not in a word and not separated by a symbol?

I want to extract an ID from a search query but I don't know the length of the ID.
From this input I want to get the numbers that are not in the words and the numbers that are not separated by symbols.
12 11231390 good123e41 12he12o1 1391389 dajue1290a 12331 12-10 1.2 test12.0why 12+12 12*6 2d1139013 09`29 83919 1
Here I want to return
12 11231390 1391389 12331 83919 1
So far I've tried /\b[^\D]\d*[^\D]\b/gm but I get the numbers in between the symbols and I don't get the 1 at the end.
You could repeatedly match digits between whitespace boundaries. Using a word boundary \b would give you partial matches.
Note that [^\D] is the same as \d and would expect at least a single character.
Your pattern can be written as \b\d\d*\d\b and you can see that you don't get the 1 at the end as your pattern matches at least 2 digits.
(?<!\S)\d+(?:\s+\d+)*(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
\d+(?:\s+\d+)* Match 1+ digits and optionally repeat matching 1+ whitespace chars and 1+ digits.
(?!\S) Negative lookahead, assert a whitespace boundary to the right
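Since the pattern returns neighbouring numbers as a single run (for example "12 11231390"), the individual IDs still need to be split apart afterwards. A minimal Python sketch, using the input from the question (the variable names are illustrative):

```python
import re

text = ("12 11231390 good123e41 12he12o1 1391389 dajue1290a 12331 "
        "12-10 1.2 test12.0why 12+12 12*6 2d1139013 09`29 83919 1")

# Each match is a run of whitespace-separated, digits-only tokens.
runs = re.findall(r"(?<!\S)\d+(?:\s+\d+)*(?!\S)", text)

# Split every run into its individual numbers.
ids = [token for run in runs for token in run.split()]

print(runs)  # ['12 11231390', '1391389', '12331', '83919 1']
print(ids)   # ['12', '11231390', '1391389', '12331', '83919', '1']
```

If only the individual numbers are needed, the simpler pattern (?<!\S)\d+(?!\S) would yield them directly with the same whitespace-boundary idea.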
If lookarounds are not supported, you could use a match with a capture group
(?:^|\s)(\d+(?:\s+\d+)*)(?:$|\s)

Regex match string 3-6 characters long, at least one letter, no duplicate "-"

I have to match a string that is 3-6 characters long, contains at least one letter, but can have letters, numbers and only 1 "-".
The "-" must not be at the start or at the beginning.
Match:
string
str-ng
st-ng
s1-1g
st-1g
Do not match:
strings
-string
string-
st--ng
s-tn-g
1111
st
The closest I've gotten is this:
^((?!-.*-)[0-9A-Z]{3,6})$
But this divides the regex match with - So it matches s-tri but not st-ri because there aren't 3 chars at each end
Maybe you can use:
^(?=.*[a-z])(?!-|.*-$|.*-.*-)[a-z\d-]{3,6}$
^ - Start string anchor.
(?=.*[a-z]) - Positive lookahead to make sure there is at least one letter.
(?!-|.*-$|.*-.*-) - Negative lookahead to prevent a hyphen at the beginning, a hyphen at the end, or more than one hyphen.
[a-z\d-]{3,6} - Three to six times a character from the given class.
$ - End string anchor.
Note that I used the case-insensitive flag.
You can use
^(?=.{3,6}$)(?=[^a-zA-Z]*[A-Za-z])[0-9a-zA-Z]+(?:-[0-9a-zA-Z]+)?$
Details:
^ - start of string
(?=.{3,6}$) - string must contain three to six chars other than line break chars
(?=[^a-zA-Z]*[A-Za-z]) - there must be at least one ASCII letter in the string
[0-9a-zA-Z]+ - one or more alphanumeric ASCII chars
(?:-[0-9a-zA-Z]+)? - an optional sequence of - and then one or more alphanumeric ASCII chars
$ - end of string.
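The pattern can be checked against the two lists from the question with a few lines of Python. This is a sketch only; the list names and the failures variable are introduced here for illustration.

```python
import re

# Pattern from the answer above.
PATTERN = re.compile(
    r"^(?=.{3,6}$)(?=[^a-zA-Z]*[A-Za-z])[0-9a-zA-Z]+(?:-[0-9a-zA-Z]+)?$"
)

should_match = ["string", "str-ng", "st-ng", "s1-1g", "st-1g"]
should_not_match = ["strings", "-string", "string-", "st--ng", "s-tn-g", "1111", "st"]

# Collect every example whose result differs from what the question expects.
failures = [s for s in should_match if not PATTERN.match(s)]
failures += [s for s in should_not_match if PATTERN.match(s)]

print(failures)  # [] means every example behaves as expected
```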
Looking at the pattern that you tried, you meant to exclude the match when there are 2 hyphens present using the negative lookahead.
Also this part [0-9A-Z]{3,6} does not match a hyphen.
Reading "The "-" must not be at the start or at the beginning." literally, you might do that using
^(?![^\n-]*-[^\n-]*-)(?=[^a-zA-Z\n]*[a-zA-Z])[a-zA-Z0-9][a-zA-Z0-9-]{2,5}$
If you meant also no - at the end:
^(?![^\n-]*-[^\n-]*-)(?=[^a-zA-Z\n]*[a-zA-Z])[a-zA-Z0-9][a-zA-Z0-9-]{1,4}[a-zA-Z0-9]$
Explanation
^ Start of string
(?![^\n-]*-[^\n-]*-) Assert not 2 times -
(?=[^a-zA-Z\n]*[a-zA-Z]) Assert a char a-zA-Z
[a-zA-Z0-9] Match One of the listed without -
[a-zA-Z0-9-]{1,4} Repeat 1-4 times any of the listed including -
[a-zA-Z0-9] Match One of the listed without -
$ End of string

Regex for picking words with at least two underscores

I am trying to parse a text document which contains some words with underscores.
I was looking to regex match but am failing currently.
I want to fetch (line by line) words which have at least two underscores, or words with at least two underscores followed by a forward slash and at least three digits.
So far I have:
([a-zA-Z]+(?:_{2,}[a-zA-Z]+)*)
The correct match examples are
VOK17_05_530_526002 *(has at least two underscores)*
VIE_ROMS_002 *(has at least two underscores)*
VOK_OVSZ_001/002 *(has at least two underscores and a forward slash + three digits)*
Input sample
VOK17_05_530_526002 502 504 BACU VIE_ROMS_002 VIE_ROMS_001 VOK_OVSZ_001/002
VOK17_05_530_526002 401 401 LGCU VIE_ROMS_002 VIE_ROMS_001 VOK_OVSZ_001/002
VOK17_05_530_526002 510 513 BACU VIE_ROMS_002 VIE_ROMS_001 VOK_OVSZ_001/002
VOK17_05_530_526002 515 515 BACU VIE_ROMS_002 VIE_ROMS_001 VOK_OVSZ_001/002
VOK17_05_530_526003 503 506 BACU VIE_ROMS_002 VIE_ROMS_001 VOK_OVSZ_001/002
I am trying out my regex at https://regex101.com/r/yToVtc/1
If someone can help out here, I would be grateful.
You could use
\b[A-Za-z0-9]+(?:_[a-zA-Z0-9]+)+(?:_[0-9]{3,})+(?:_[a-zA-Z0-9]+)*(?:/[0-9]+)?\b
In parts
\b Word boundary
[A-Za-z0-9]+ Match 1+ times any of the listed
(?:_[a-zA-Z0-9]+)+ Repeat 1+ times an underscore and 1+ times any of the listed
(?:_[0-9]{3,})+ Repeat 1+ times an underscore and 3 or more digits (this guarantees at least a second underscore)
(?:_[a-zA-Z0-9]+)* Repeat 0+ times an underscore and 1+ times any of the listed
(?:/[0-9]+)? Match an optional / and 1+ digits
\b Word boundary
Use this one:
\b[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+){2,}(?:/\d{3})?\b
Explanation:
\b # word boundary
[a-zA-Z0-9]+ # 1 or more alphanum
(?: # non capture group
_ # underscore
[a-zA-Z0-9]+ # 1 or more alphanum
){2,} # end group, must appear 2 or more times
(?: # non capture group
/ # a slash
\d{3} # 3 digits
)? # end group, optional
\b # word boundary
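Applied to one line of the sample input, the pattern can be used with re.findall in Python. A minimal sketch (the variable names are illustrative; because the groups are non-capturing, findall returns the full matches):

```python
import re

PATTERN = re.compile(r"\b[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+){2,}(?:/\d{3})?\b")

line = ("VOK17_05_530_526002 502 504 BACU VIE_ROMS_002 "
        "VIE_ROMS_001 VOK_OVSZ_001/002")

matches = PATTERN.findall(line)
print(matches)
# ['VOK17_05_530_526002', 'VIE_ROMS_002', 'VIE_ROMS_001', 'VOK_OVSZ_001/002']
```

To process a whole file, the same call could be made once per line while iterating over the file object.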
Regex for a text string with at least two underscores
^[^_]*_[^_]*_.*$
^ - Start of match
[^_]* Then zero to any number of characters which are not an underscore
_ Then an underscore
[^_]* Then zero to any number of characters which are not an underscore
_ Then an underscore
.* Then zero to any number of characters
$ - End of match
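Because this pattern is anchored with ^ and $, it tests a whole string rather than finding words inside a line. One way to use it on the sample input, sketched in Python, is to split each line on whitespace first; that splitting step is an assumption added here, not part of the answer.

```python
import re

AT_LEAST_TWO_UNDERSCORES = re.compile(r"^[^_]*_[^_]*_.*$")

line = ("VOK17_05_530_526002 502 504 BACU VIE_ROMS_002 "
        "VIE_ROMS_001 VOK_OVSZ_001/002")

# Test each whitespace-separated token on its own.
words = [w for w in line.split() if AT_LEAST_TWO_UNDERSCORES.match(w)]
print(words)
# ['VOK17_05_530_526002', 'VIE_ROMS_002', 'VIE_ROMS_001', 'VOK_OVSZ_001/002']
```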
The regex below uses a lookahead assertion to simplify matters:
\b(?=(?:([A-Za-z0-9]*_){2}))[A-Za-z0-9_]+(?:\/\d{3,})?\b
\b(?=(?:([A-Za-z0-9]*_){2})) Matches a word boundary if and only if it is followed by ([A-Za-z0-9]*_){2}, which is to say, a string that contains 0 or more of the legal alphanumeric characters followed by a '_', all repeated twice. This is the lookahead assertion that ensures our match in step 2 below contains at least two underscore characters.
[A-Za-z0-9_]+ Matches the legal characters one or more times.
(?:\/\d{3,})? An optional forward slash followed by 3 or more digits.
\b Matches a word boundary.
([^\s]*_[^\s]*){2,} will match any word (any string without a space) that contains at least two underscores, by finding two or more consecutive groups that each consist of any number of non-space characters, an underscore, and again any number of non-space characters.
You can replace [^\s]* with any definition of "word" you like.
You can also add ?: after the first bracket to avoid capturing the groups.

what does this regular expression mean?

^(?!-)[a-z\d\-]{1,100}$
Here's an explanation using regex comment mode, so this expanded form can itself be used as a regex:
(?x) # flag to enable comment mode
^ # start of line/string.
(?!-) # negative lookahead for literal hyphen (-) character, so fails if the next position contains one.
[a-z\d\-] # character class matches a single alpha (a-z), digit (\d) or hyphen (\-).
{1,100} # match the above [class] up to 100 times, at least once.
$ # end of line/string.
In short, it's matching up to 100 lowercase alphanumerics or hyphens, but the first character must not be a hyphen.
Could be attempting to validate a serial number, or similar, but it's too general to say for sure.
Not all regex engines support negative lookaheads. If you're trying to figure out what it is doing in order to adapt for an engine without negative lookaheads, you can use:
^[a-z\d][a-z\d-]{0,99}$
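A quick way to convince yourself that the two forms behave the same is to run both over a handful of inputs. The following Python sketch does that; the sample strings are chosen here for illustration and are not an exhaustive proof of equivalence.

```python
import re

with_lookahead = re.compile(r"^(?!-)[a-z\d\-]{1,100}$")
without_lookahead = re.compile(r"^[a-z\d][a-z\d-]{0,99}$")

samples = ["abc-123", "-abc", "abc-", "", "a" * 100, "a" * 101, "ABC", "a_b"]

for s in samples:
    a = with_lookahead.match(s) is not None
    b = without_lookahead.match(s) is not None
    # Both columns should agree for every sample.
    print(f"{s[:12]!r:16} lookahead={a} plain={b}")
```

Note that a trailing hyphen ("abc-") is accepted by both forms, since only the first character is restricted.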
(?!-) == negative lookahead
Start of line, not followed by a -, then 1 to 100 characters that can each be a-z, 0-9 or a -, followed by the end of the line. Depending on the regex flavor, the \d in the character class may be wrong: where \d is not supported inside a class it would be read as a literal 'd' (which a-z already covers), so it would be safer to write 0-9.
A string of letters, digits and dashes. Between 1 and 100 characters. The first character is not a dash.