Regex catching adjacent characters with a single character set

I am trying to construct a regex statement that matches a string conforming to the following conditions:
3-63 lowercase alphanumeric characters, plus "." and "-"
May not start or end with . or -
Dashes and periods cannot be adjacent to each other.
abc-123.xyz <- should match
abc123-.xyz <- should not match
I have been able to put this regex together, but it does not catch the third requirement. I've tried to use another negative lookahead/lookbehind, i.e. (?!.-|-.), but it's still matching the strings with adjacent periods and dashes. Here's the regex statement I came up with that fulfills conditions 1 & 2:
^(?!\.|-)([a-z0-9]|\.|-){3,63}(?<!\.|-)$
FYI, this regex is for validating input when specifying an AWS S3 bucket name in a CloudFormation template.

How about:
^(?=.{3,63}$)[a-z0-9]+(?:[-.][a-z0-9]+)*$

Use this pattern: ^(?!.*[.-](?=[.-]))[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$
# ^(?!.*[.-](?=[.-]))[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$
^ # Start of string/line
(?! # Negative Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
[.-] # Character in [.-] Character Class
(?= # Look-Ahead
[.-] # Character in [.-] Character Class
) # End of Look-Ahead
) # End of Negative Look-Ahead
[a-z0-9] # Character in [a-z0-9] Character Class
[a-z0-9.-] # Character in [a-z0-9.-] Character Class
{1,61} # (repeated {1,61} times)
[a-z0-9] # Character in [a-z0-9] Character Class
$ # End of string/line

^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$
We match a lowercase alphanumeric character, followed by between 2 and 62 repetitions of either:
a lowercase alphanumeric character, or
a . or - (which must be followed by a lowercase alphanumeric character).
The last restriction makes sure that you can't have two ./- characters in a row, or a ./- at the end of the string.
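A minimal sketch of how the pattern above could be checked against the examples from the question. Python is an assumption here (the question only mentions a CloudFormation template), and `is_valid_bucket_name` is a hypothetical helper name:

```python
import re

# Pattern from the answer above: one alphanumeric character, then 2-62 more
# characters where every '.' or '-' must be followed by an alphanumeric.
BUCKET_NAME = re.compile(r"^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if name satisfies the three rules from the question."""
    # fullmatch avoids '$' accepting a trailing newline.
    return BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for sample in ["abc-123.xyz", "abc123-.xyz", "-abc", "abc.", "ab"]:
        print(sample, is_valid_bucket_name(sample))
```

Because the lookahead is only evaluated when a "." or "-" is consumed, the adjacency rule and the "may not end with . or -" rule are enforced by the same construct.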

Related

Regex to capture optional characters

I want to pull out a base string (Wax) or (noWax) from a longer string, along with potentially any data before and after if the string is Wax. I'm having trouble getting the last item in my list below (noWax) to match.
Can anyone flex their regex muscles? I'm fairly new to regex so advice on optimization is welcome as long as all matches below are found.
What I'm working with in Regex101:
/(?<Wax>Wax(?:Only|-?\d+))/mg
Original string -> need to extract in a capturing group
Loc3_341001_WaxOnly_S212 -> WaxOnly
Loc4_34412-a_Wax4_S231 -> Wax4
Loc3a_231121-a_Wax-4-S451 -> Wax-4
Loc3_34112_noWax_S311 -> noWax
Here is one way to do so, using a conditional:
(?<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))
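(no)?: Optional capture group (group 2; the named group Wax is group 1).
(?(2): If capture group 2 ((no)) matched, match nothing further.
|: Otherwise,
(?:Only|-?\d+): match 'Only', or an optional '-' followed by one or more digits.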
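A small sketch of the conditional in action, assuming Python's `re` module (which spells the named group `(?P<Wax>...)` instead of `(?<Wax>...)`). `extract_wax` is a hypothetical helper name:

```python
import re

# Same pattern as above with Python's named-group syntax.
# Group 1 is the named group "Wax"; group 2 is the optional "no".
# (?(2)|...) means: if group 2 matched, match nothing; otherwise match the suffix.
WAX = re.compile(r"(?P<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))")


def extract_wax(text):
    """Return the Wax/noWax token found in text, or None if there is none."""
    match = WAX.search(text)
    return match.group("Wax") if match else None


if __name__ == "__main__":
    for sample in [
        "Loc3_341001_WaxOnly_S212",
        "Loc4_34412-a_Wax4_S231",
        "Loc3a_231121-a_Wax-4-S451",
        "Loc3_34112_noWax_S311",
    ]:
        print(sample, "->", extract_wax(sample))
```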
I assume the following match is desired.
the match must include 'Wax'
'Wax' is to be preceded by '_' or by '_no'. If the latter, 'no' is included in the match.
'Wax' may be followed by:
'Only' followed by '_', in which case 'Only' is part of the match, or
one or more digits, followed by '_', in which case the digits are part of the match, or
'-' followed by one or more digits, followed by '-', in which case
'-' followed by one or more digits is part of the match.
If these assumptions are correct the string can be matched against the following regular expression:
(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))
The regular expression can be broken down as follows.
(?<=_) # positive lookbehind asserts previous character is '_'
(?: # begin non-capture group
(?:no)? # optionally match 'no'
Wax # match literal
(?: # begin non-capture group
(?:Only|\d+)? # optionally match 'Only' or >=1 digits
(?=_) # positive lookahead asserts next character is '_'
| # or
\-\d+ # match '-' followed by >= 1 digits
(?=-) # positive lookahead asserts next character is '-'
) # end non-capture group
) # end non-capture group
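The lookaround version can be tried the same way. This is a sketch assuming Python's `re` module, whose fixed-width lookbehind is enough for `(?<=_)`; `find_wax_token` is a hypothetical helper name:

```python
import re

# The lookbehind/lookahead pattern from above, unchanged.
# Nothing is captured: the whole match is the token.
WAX_TOKEN = re.compile(r"(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))")


def find_wax_token(text):
    """Return the matched token, or None when the delimiters do not fit."""
    match = WAX_TOKEN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in [
        "Loc3_341001_WaxOnly_S212",
        "Loc4_34412-a_Wax4_S231",
        "Loc3a_231121-a_Wax-4-S451",
        "Loc3_34112_noWax_S311",
        "Loc3_34112_Waxy_S311",  # 'Wax' is not followed by '_' here, so no match
    ]:
        print(sample, "->", find_wax_token(sample))
```

Unlike the conditional version, this one depends on the surrounding '_' and '-' delimiters, so it rejects tokens such as 'Waxy' outright.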

Modify this yup validation to change max length to 9 if the string does not include a dash

I'm trying to write a yup validator that validates a field's max length, depending on whether a dash is included in the string. If a dash is included, the max length is 10, if there is no dash, the max length should be 9.
For example:
'string-111' should have a max length of 10.
'string111' should have a max length of 9.
My current code looks like:
import * as Yup from 'yup';

export default Yup.object().shape({
  description: Yup.string()
    .matches(
      /^[a-zA-Z0-9-]*$/,
      'Invoice # can only contain letters, numbers and dashes'
    )
    .max(10, 'Invoice # has a max length of 10 characters'),
});
I see the yup documentation https://github.com/jquense/yup has a .when() method, but it seems to be used in very specific cases in their examples. Here, the user can place the dash anywhere in the string.
Any ideas on how to rewrite this validator, so that when there is no dash in the string, the maxlength should be 9?
You could either match exactly 10 chars, where a single hyphen can occur at any place, using a positive lookahead, or match exactly 9 chars consisting only of a-z0-9.
^(?:(?=[a-z0-9-]{10}$)[a-z0-9]*-[a-z0-9]*|[a-z0-9]{9})$
Explanation
^ Start of string
(?: Non capture group
(?= Positive lookahead, assert what is on the right is
[a-z0-9-]{10}$ Match 10 times either a-z0-9 or - till the end of the string
) Close lookahead
[a-z0-9]*-[a-z0-9]* Match a hyphen between chars a-z0-9
| Or
[a-z0-9]{9} Match 9 chars a-z0-9
) Close group
$ End of string
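A quick way to see what this pattern accepts, sketched in Python (an assumption; the question itself uses JavaScript with yup). `is_valid_invoice` is a hypothetical helper name. As written, the pattern accepts exactly 10 characters with one hyphen or exactly 9 without, and only lower-case letters:

```python
import re

# The pattern from above, unchanged.
INVOICE = re.compile(r"^(?:(?=[a-z0-9-]{10}$)[a-z0-9]*-[a-z0-9]*|[a-z0-9]{9})$")


def is_valid_invoice(value: str) -> bool:
    """Return True for 10 chars with one hyphen, or 9 chars with none."""
    return INVOICE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["string-111", "string111", "string1111", "str-ing-11", "string-11"]:
        print(sample, is_valid_invoice(sample))
```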
I worked up a solution I liked but found it had already been posted by @Thefourthbird, so I tried a different tack and came up with this:
/^(?=(?:-*[^-]-*){9}$)(?=[^-]*(?:-[^-]*)?$).*/gm
You can see that this regex contains two positive lookaheads, both beginning at the start of a line. The first ensures that the string contains exactly 9 non-hyphens; the second requires that there be at most one hyphen.
We can make the regex self-documenting by writing it in free-spacing mode:
/
^ # match beginning of line
(?= # begin a positive lookahead
(?:-*[^-]-*){9} # match 9 strings, each with one char that is
# not a hyphen, possibly preceded and/or
# followed by hyphens
$ # match the end of the line
) # end positive lookahead
(?= # begin a positive lookahead
[^-]* # match 0+ non-hyphens
(?:-[^-]*)? # optionally match one hyphen followed by
# 0+ non-hyphens
$ # match the end of the line
) # end positive lookahead
.* # match 0+ characters (the entire line)
/gmx # global, multiline and free-spacing regex
# definition modes
If desired, [^-] could be replaced with [a-zA-Z0-9], \p{Alnum} or something else, depending on requirements.
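The two conditions described for the lookaheads (exactly 9 non-hyphen characters, at most one hyphen) can also be checked without a regex. A minimal Python sketch, with `has_valid_shape` as a hypothetical helper name; like [^-], it places no restriction on which non-hyphen characters appear:

```python
def has_valid_shape(value: str) -> bool:
    """Check the two conditions directly instead of via lookaheads."""
    hyphens = value.count("-")
    non_hyphens = len(value) - hyphens
    # Condition 1: exactly 9 characters that are not hyphens.
    # Condition 2: at most one hyphen, anywhere in the string.
    return non_hyphens == 9 and hyphens <= 1


if __name__ == "__main__":
    for sample in ["string-111", "string111", "string1111", "abc-def-ghi"]:
        print(sample, has_valid_shape(sample))
```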

Issue matching with regex

I am trying to write a regex to match the following rules:
a word consists only of letters, digits, apostrophes, hyphens and underscores
start with a letter or apostrophe followed by letter
do not contain sequence of 2 or more apostrophes, underscores or hyphens
end with a letter, digit or apostrophe preceded by the letter s or apostrophe followed by s
So far I have a few regexes built:
For rule 2 I have built
^[']?[a-zA-Z][a-zA-Z0-9]+
For rule 3 I have built
(?!.*[-_'][-_'])(?=[a-z])[a-zA-Z0-9]*
but for a test string abc def''ghi it matches ghi not abc
For rule 4 I have built
.*[a-zA-Z0-9](?:'s)?(?:s')?$
but for a test string test's abc' it does not match anything but it should match test's
I am looking for some advice on rules 3 and 4, and on how to improve my regexes so they work.
(?:^|\s)\K(?!'')['a-z](?:['_-]?[a-z0-9])+['_-]?(?:(?<!')'s|s'|[a-z])(?=\s|$)
Explanation:
(?:^|\s) # non capture group, beginning of line OR space
\K # forget all we've seen until this position
(?!'') # negative lookahead, not two apos.
['a-z] # apos. or letter
(?: # start non capture group
['_-]? # apos, dash or underscore, optional
[a-z0-9] # a letter or digit
)+ # group may appear 1 or more times
['_-]? # apos, dash or underscore, optional
(?: # start non capture group
(?<!') # negative lookbehind, make sure we haven't apos before
's # apos and s
| # OR
s' # s and apos
| # OR
[a-z] # a letter
) # end group
(?=\s|$) # lookahead, make sure we have a space or end of line after
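The pattern above relies on \K, which Python's `re` module does not support. The sketch below is an adaptation, not the answer's exact regex: it swaps `(?:^|\s)\K` for the lookbehind `(?<!\S)` and adds `re.IGNORECASE` (an assumption, since the question allows upper-case letters). `find_words` is a hypothetical helper name:

```python
import re

# (?<!\S) asserts "start of string or preceded by whitespace" without
# consuming anything, which stands in for (?:^|\s)\K here.
WORD = re.compile(
    r"(?<!\S)(?!'')['a-z](?:['_-]?[a-z0-9])+['_-]?(?:(?<!')'s|s'|[a-z])(?=\s|$)",
    re.IGNORECASE,
)


def find_words(text):
    """Return every whitespace-delimited token accepted by WORD."""
    return WORD.findall(text)


if __name__ == "__main__":
    print(find_words("abc def''ghi"))  # ['abc']
    print(find_words("test's abc'"))   # ["test's"]
```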

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that:
it captures the term 'hello+' without the '+'; this term should not be captured at all
the last term 'hello^-1212121' is captured as 2 groups, 'hello' and '-1212121'; both should be ignored
The strings to capture are as follows:
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture:
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression, which effectively captures all these terms. It's not really optimized, but it works:
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to avoid capturing the unwanted terms entirely, you'll have to use lookarounds. I don't know of any other way to get to the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal caret
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both true non-negative numbers (e.g. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (e.g. .1, .12).
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
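So, in JavaScript, you'd write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    match;
// exec() advances RE.lastIndex on each call; match[1] is the first capture group
while ((match = RE.exec(src)) !== null) {
    console.log(match[1]);
}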
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
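For comparison, the same pattern can be run against the sample string from the question. This is a sketch in Python rather than .NET or JavaScript (an assumption for illustration; the syntax used is common to all three), with `extract_terms` as a hypothetical helper name:

```python
import re

# Final pattern from the answer: signed word, word^number, or plain word,
# each followed by a comma, whitespace, or end of input.
TERM = re.compile(
    r"([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)"
)


def extract_terms(text):
    """Return the first capture group of every match, in order."""
    return TERM.findall(text)


if __name__ == "__main__":
    sample = ("hello , hello^.555,hello^111, -hello,+hello, "
              "hello+, hello^.25, hello^-1212121")
    print(extract_terms(sample))
    # ['hello', 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25']
```

Both 'hello+' and 'hello^-1212121' are skipped entirely, because the trailing lookahead never sees a delimiter directly after a valid term.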

Regular Expression to match strings

I want to match all the strings satisfying the following rules:
should consist of lower-case letters and digits and dashes
should start with a letter or a number
should end with a letter or number
total string length should be at least 3 and at most 20 characters
dot . is optional, there shouldn't be two or more consecutive dots .
dash - is optional, there shouldn't be two or more consecutive dashes -
dot . and dash - shouldn't be consecutive // the string aaa.-aaabbb is invalid
underscore not allowed
I have come up with this regex:
^[a-z0-9]([a-z0-9]+\.?\-?[a-z0-9]+){1,18}[a-z0-9]$
[a-z0-9] //should start/end with a letter or a number
([a-z0-9]+\.?\-?[a-z0-9]+){1,18} //other rules
However it is failing in some scenarios like -
abcdefghijklmnopqrstuvwxyz //should fail total number of chars greater than 20
aaa.-aaabbb //should fail as dot '.' and dash '-' are consecutive
Can anyone please help me in correcting this regex?
You can achieve this with a lookahead assertion:
^(?!.*[.-]{2})[a-z0-9][a-z0-9.-]{1,18}[a-z0-9]$
Explanation:
^ # Start of string
(?! # Assert that the following can't be matched:
.* # Any number of characters
[.-]{2} # followed by .. or -- or .- or -.
) # End of lookahead
[a-z0-9] # Match lowercase letter/digit
[a-z0-9.-]{1,18} # Match 1-18 of the allowed characters
[a-z0-9] # Match lowercase letter/digit
$ # End of string
I came up with this, which uses a negative lookahead similar to Tim's solution but applies it differently. Because it only performs the lookahead when it sees a dot or a dash, it may need less backtracking, which may make it very slightly faster.
^[a-z0-9]([a-z0-9]|([-.](?![.-]))){1,18}[a-z0-9]$
Explanation:
^ # Start of string
[a-z0-9] # Must start with a letter or number
( # Begin Group
[a-z0-9] # Match a letter or number
| # OR
([-.](?![.-])) # Match a dot or dash that is not followed by a dot or dash
){1,18} # Match group 1 to 18 times
[a-z0-9] # Must end with a letter or number
$ # End of string
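Both answers can be checked side by side against the failing cases from the question. A minimal sketch, assuming Python's `re` module; `check` is a hypothetical helper that returns the verdict of each pattern:

```python
import re

# One lookahead at the start that scans the whole string for "..", "--", ".-", "-.".
LOOKAHEAD_UP_FRONT = re.compile(r"^(?!.*[.-]{2})[a-z0-9][a-z0-9.-]{1,18}[a-z0-9]$")
# A lookahead evaluated only after each dot or dash.
LOOKAHEAD_PER_SEPARATOR = re.compile(
    r"^[a-z0-9]([a-z0-9]|([-.](?![.-]))){1,18}[a-z0-9]$"
)


def check(name: str):
    """Return (first pattern accepts, second pattern accepts)."""
    return (
        LOOKAHEAD_UP_FRONT.fullmatch(name) is not None,
        LOOKAHEAD_PER_SEPARATOR.fullmatch(name) is not None,
    )


if __name__ == "__main__":
    for sample in [
        "aaa.aaa-bbb",
        "aaa.-aaabbb",
        "abcdefghijklmnopqrstuvwxyz",
        "abc_def",
        "ab",
    ]:
        print(sample, check(sample))
```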