RegEx for name: Any language but first letter must be capital - regex

I have a requirement to accept a first name as input and check that the first letter is caps and that there can be 1 space after the end of the string.
This RegEx works for 'Bob ':
^[A-Z][A-Za-z\p{L}]+[\s,.'\-]?[a-zA-Z\p{L}]*$
An extra requirement is then to allow any language / character which then involves allowing unicode.
This RegEx works for a Russian name: 'Афанасий'
^[A-Z\p{L}][A-Za-z\p{L}]+[\s,.'\-]?[a-zA-Z\p{L}]*$
... However, while it allows for unicode, it also allows me to enter 'bob' with a small first letter and the RegEx allows this through.
Is there any way to allow both unicode and still flag up the first letter when it is not capital? ( Using a RegEx)
I could make some code changes to get round this issue but it would be nice to be able to keep it all in the RegEx value without making code changes.

Any Unicode uppercase letter can be matched with \p{Lu}.
Use
^\p{Lu}\p{L}+[\s,.'\-]?\p{L}*$
or
^\p{Lu}\p{L}+(?:[\s,.'-]\p{L}+)?$
See the regex demo 1 and regex demo 2. The second regex is more precise as it won't allow trailing whitespace, comma, etc. (what is defined in the [\s,.'-] character class).
Note that there is no point in using [A-Za-z\p{L}] since \p{L} already matches [a-zA-Z].
Pattern details:
^ - start of string
\p{Lu} - an uppercase Unicode letter
\p{L}+ - one or more Unicode letters
(?:[\s,.'-]\p{L}+)? - one or zero (optional) sequence of
[\s,.'-] - a whitespace, ,, ., ' or a hyphen
\p{L}+ - 1 or more Unicode letters
$ - end of string.
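A quick way to sanity-check the two patterns is to run them against the names from the question. The sketch below assumes a JavaScript engine that supports Unicode property escapes (ES2018 or later, with the `u` flag); the variable names are illustrative only.

```javascript
// Both patterns from the answer; the `u` flag enables \p{...} in JavaScript.
const loosePattern = /^\p{Lu}\p{L}+[\s,.'\-]?\p{L}*$/u;
const strictPattern = /^\p{Lu}\p{L}+(?:[\s,.'-]\p{L}+)?$/u;

const sampleNames = ["Bob", "Bob ", "bob", "Афанасий", "Anne-Marie"];

for (const name of sampleNames) {
  // Prints the name followed by the result of each pattern.
  console.log(JSON.stringify(name), loosePattern.test(name), strictPattern.test(name));
}
```

The only input on which the two patterns disagree here is 'Bob ' with its trailing space: the first pattern accepts it and the second rejects it.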

Related

Regex allow only Uppercase Extended ASCII

I need a regex that allows only uppercase extended ASCII characters, up to a maximum word length (maxLength) that I set beforehand.
Regex for uppercase letters: \P{Ll}*
Regex for extended ASCII letters: [\x00-\xFF]*
Using ^[\p{Ll}] is not enough because I also need the characters to be extended ASCII (so that emoji and other special characters outside the extended ASCII range are rejected).
How can I combine these two requirements, together with the maxLength limit?
Thank you!!
Generally, you can use
^(?:(?=\p{Lu})\p{Latin}){1,10}$
See the regex demo. Details:
^ - start of string
(?: - start of a non-capturing group:
(?=\p{Lu})\p{Latin} - a char from the Latin Unicode script that is an uppercase letter
){1,10} - end of the group, repeat one to ten occurrences
$ - end of string.
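For readers who want to try the general pattern in JavaScript, note that the short form \p{Latin} is not accepted there; the script has to be written as \p{Script=Latin} (or \p{sc=Latin}). The following is a minimal sketch under that assumption, with an illustrative variable name and the same 1 to 10 length limit.

```javascript
// The lookahead requires an uppercase letter; \p{Script=Latin} then consumes it
// only if it belongs to the Latin script.
const upperLatin = /^(?:(?=\p{Lu})\p{Script=Latin}){1,10}$/u;

console.log(upperLatin.test("\u00C0\u00C9\u00CE")); // ÀÉÎ  -> true
console.log(upperLatin.test("\u00C0\u00E9"));       // Àé   -> false (lowercase é)
console.log(upperLatin.test("\u0391\u0392\u0393")); // ΑΒΓ  -> false (Greek, not Latin)
console.log(upperLatin.test("ABCDEFGHIJK"));        // false (11 characters)
```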
Since you are using the regex in a DevExpress masked input component you need to enumerate all these letters in a character class. Based on Regex Latin characters filter and non latin character filer, you need
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
All chars that are uppercase letters in these three ranges are the ones you want to allow:
var res = [];
for (var i = 128; i <= 591; i++) { // Get chars from \u0080 to \u024F
  if (/^\p{Lu}$/u.test(String.fromCharCode(i))) { // If it is an uppercase letter
    res.push(String.fromCharCode(i)); // Add it to the results
  }
}
console.log(res.join(""));
The code will look like
settings.MaskExpression = "[\\u00C0-\\u00D6\\u00D8-\\u00DE\\u0100\\u0102\\u0104\\u0106\\u0108\\u010A\\u010C\\u010E\\u0110\\u0112\\u0114\\u0116\\u0118\\u011A\\u011C\\u011E\\u0120\\u0122\\u0124\\u0126\\u0128\\u012A\\u012C\\u012E\\u0130\\u0132\\u0134\\u0136\\u0139\\u013B\\u013D\\u013F\\u0141\\u0143\\u0145\\u0147\\u014A\\u014C\\u014E\\u0150\\u0152\\u0154\\u0156\\u0158\\u015A\\u015C\\u015E\\u0160\\u0162\\u0164\\u0166\\u0168\\u016A\\u016C\\u016E\\u0170\\u0172\\u0174\\u0176\\u0178\\u0179\\u017B\\u017D\\u0181\\u0182\\u0184\\u0186\\u0187\\u0189-\\u018B\\u018E-\\u0191\\u0193\\u0194\\u0196-\\u0198\\u019C\\u019D\\u019F\\u01A0\\u01A2\\u01A4\\u01A6\\u01A7\\u01A9\\u01AC\\u01AE\\u01AF\\u01B1-\\u01B3\\u01B5\\u01B7\\u01B8\\u01BC\\u01C4\\u01C7\\u01CA\\u01CD\\u01CF\\u01D1\\u01D3\\u01D5\\u01D7\\u01D9\\u01DB\\u01DE\\u01E0\\u01E2\\u01E4\\u01E6\\u01E8\\u01EA\\u01EC\\u01EE\\u01F1\\u01F4\\u01F6-\\u01F8\\u01FA\\u01FC\\u01FE\\u0200\\u0202\\u0204\\u0206\\u0208\\u020A\\u020C\\u020E\\u0210\\u0212\\u0214\\u0216\\u0218\\u021A\\u021C\\u021E\\u0220\\u0222\\u0224\\u0226\\u0228\\u022A\\u022C\\u022E\\u0230\\u0232\\u023A\\u023B\\u023D\\u023E\\u0241\\u0243-\\u0246\\u0248\\u024A\\u024C\\u024E]{1,10}";
The \u... part matches any letters from the ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉƊƋƎƏƐƑƓƔƖƗƘƜƝƟƠƢƤƦƧƩƬƮƯƱƲƳƵƷƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎ set.
The {1,10} limiting quantifier matches one to ten occurrences. You may adjust it further.
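The enumeration script prints the matching characters themselves, while the mask expression lists them as \uXXXX escapes and ranges. One possible way to produce that form is sketched below; `buildUppercaseClass` is a hypothetical helper written for this illustration, not part of DevExpress or any library, and it only handles code points in the Basic Multilingual Plane.

```javascript
// Hypothetical helper: collects uppercase letters between two BMP code points
// and returns a character class that uses \uXXXX escapes and ranges.
function buildUppercaseClass(from, to) {
  const isUpper = /^\p{Lu}$/u;
  const toEscape = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");
  const parts = [];
  let start = null; // first code point of the current run, or null if no run is open
  let prev = null;  // last code point of the current run

  const flush = () => {
    if (start === null) return;
    if (prev === start) parts.push(toEscape(start));                     // single char
    else if (prev === start + 1) parts.push(toEscape(start) + toEscape(prev)); // pair
    else parts.push(toEscape(start) + "-" + toEscape(prev));             // range
  };

  for (let cp = from; cp <= to; cp++) {
    if (isUpper.test(String.fromCharCode(cp))) {
      if (start === null) start = cp;
      prev = cp;
    } else {
      flush();
      start = null;
    }
  }
  flush();
  return "[" + parts.join("") + "]";
}

// Latin-1 Supplement only; the full range from the answer would be (0x80, 0x24F).
console.log(buildUppercaseClass(0x80, 0xFF)); // [\u00C0-\u00D6\u00D8-\u00DE]
```

When the result is pasted into a C# string literal, as in the mask expression, each backslash still has to be doubled.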
Slight modification of @Wiktor's comment that I think is easier to read:
^[^\P{Lu}\P{Latin}]{0,10}$
should match a string of at most 10 uppercase Latin (inc. extended) characters. It uses a negated character class to match up to 10 characters that are neither non-uppercase nor non-Latin. It does match such beautiful and definitely not cursed strings as ĦꜴꝎꞂꜨⱠƎƢƔ.
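A hedged JavaScript rendering of the negated-class variant follows. It assumes the `u` flag and the \P{Script=Latin} spelling that JavaScript requires; the variable name is illustrative. Because of the {0,10} quantifier, this version also accepts the empty string.

```javascript
// [^\P{Lu}\P{Script=Latin}] excludes everything that is not uppercase and
// everything that is not Latin, leaving only uppercase Latin letters.
const upperLatinNegated = /^[^\P{Lu}\P{Script=Latin}]{0,10}$/u;

console.log(upperLatinNegated.test("\u00C0\u00C9\u00CE")); // ÀÉÎ -> true
console.log(upperLatinNegated.test(""));                   // true, since {0,10} allows zero chars
console.log(upperLatinNegated.test("\u00C0\u00E9"));       // Àé  -> false
console.log(upperLatinNegated.test("\u0391\u0392\u0393")); // ΑΒΓ -> false
```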

Regex and yup: Allow a certain special character, but it can't be repeated

I have this yup validator with a regex that allows all the characters below:
Yup.string()
.required(MESSAGES.requiredField)
.min(min, MESSAGES.minCharacters(min))
.matches(
/^([a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð '])+$/u,
'Insert only normal character'
),
The only special character allowed is the single quote ('). But I also need to check whether the user typed this character several times in a row; if so, I need to block the form.
In this situation the form allows: Vinicius Sant'anna
But if the user types this: "Vinicius Sant''''''anna", I need to block. How can I improve my regex to also validate this case?
You can use
/^'?\p{L}+(?:[' ]\p{L}+)*'?$/u
Details:
^ - start of string
'? - an optional leading '
\p{L}+ - one or more Unicode letters
(?:[' ]\p{L}+)* - zero or more occurrences of a ' or space and then one or more letters
'? - an optional ' char
$ - end of string
u - enables Unicode property classes in the regex.
If you need to also support diacritics use
/^'?(?:\p{L}\p{M}*)+(?:['\s](?:\p{L}\p{M}*)+)*'?$/u
where (?:\p{L}\p{M}*)+ matches one or more occurrences of a letter and then zero or more diacritics. \s matches a whitespace.
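To see how the two patterns differ, they can be tried outside yup on the inputs from the question plus one name written with a combining accent. This is only a sketch; the variable names are illustrative, and the last input ("Jose" followed by U+0301) is an added example, not taken from the question.

```javascript
const lettersOnly = /^'?\p{L}+(?:[' ]\p{L}+)*'?$/u;
const withMarks = /^'?(?:\p{L}\p{M}*)+(?:['\s](?:\p{L}\p{M}*)+)*'?$/u;

console.log(lettersOnly.test("Vinicius Sant'anna"));      // true
console.log(lettersOnly.test("Vinicius Sant''''''anna")); // false, repeated quotes
console.log(lettersOnly.test("Jose\u0301"));              // false, combining mark is not a letter
console.log(withMarks.test("Jose\u0301"));                // true, \p{M}* absorbs the mark
```

Either regex can then be passed to `.matches(...)` in place of the original one.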

Regex to detect preferred stock symbols

To start off, regex is probably the least talented aspect within my programming belt, this is what I have so far:
\D{1,5}(PR)\D+$
\D{1,5} because common stock symbols are always a maximum of 5 letters
(PR) because that is part of the pattern that needs to be searched (more below in the background info)
\D+$ because I'm trying to match any single letter at the end of the string
A small tidbit of background
Preferred stock symbols are not standardized and so every platform, exchange, etc has their own way to display them. Having said that, most display a special character in their name, which makes those guys easy to detect. The characters are
[] {'.', '/', '-', ' ', '+'};
The trickier ones all have a similar pattern:
{symbol}PR{0}
{symbol}p{0}
{symbol}P{0}
Where 0 is just any single letter A-Z
Here is a sample data set for the trickier ones:
PSAPRZ
PSApA
PSApZ
PSAPA
PSAPZ
My regex seems to be working for the first one, since I'm specifically looking for (PR) and matching any single letter character at the end, but I can't for the life of me figure out how to also detect the patterns that end in p{0} or P{0} in the same regex. I completely gave up trying to incorporate finding the special symbols because I can easily just do a string.Contains on the target string for any of those chars. The more important part is figuring out these trickier ones.
How do I get my regex statement to also detect the p{0} and P{0} matches within the same regex statement?
Edit 1
If you're curious at the madness of different possibilities, including the "easy to detect" versions, grab a popcorn, here you go :)
PSA.PA
PSA.PR.A
PSA/PA
PSAPRA
PSA-A
PSA PRA
PSA.PRA
PSA.PA
PSA+A
PSA/PRA
PSApA
PSAPA
PSA-PA
This should do it:
^[A-Z]{1,5}([Pp]|PR)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
([Pp]|PR) - capture group used for: uppercase P or lowercase p or uppercase PR
[A-Z] - one uppercase letter
$ - anchor at end
UPDATE after EDIT 1 in question. To support the odd formats with ., /, -, + use this:
^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
[.\/\s\+\-]? - optional single character: ., /, whitespace, + or -
([Pp]|PR\.?) - capture group used for: uppercase P, or lowercase p, or uppercase PR followed by optional .
[A-Z] - one uppercase letter
$ - anchor at end
Note on anchors: Use ^...$ anchors if you only have the stock symbol in the string. If you have text with a stock symbol anywhere within, use word boundaries \b...\b instead.
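As a rough check, both patterns can be run against the sample symbols from the question. The sketch below uses illustrative variable names. It suggests that every listed form is matched except PSA-A and PSA+A, which carry no P or PR marker; those two would still need the separate special-character check that the question mentions.

```javascript
const basicPattern = /^[A-Z]{1,5}([Pp]|PR)[A-Z]$/;
const extendedPattern = /^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$/;

// The "trickier" symbols and the Edit 1 list (duplicates removed).
const trickySymbols = ["PSAPRZ", "PSApA", "PSApZ", "PSAPA", "PSAPZ"];
const edit1Symbols = [
  "PSA.PA", "PSA.PR.A", "PSA/PA", "PSAPRA", "PSA-A", "PSA PRA",
  "PSA.PRA", "PSA+A", "PSA/PRA", "PSApA", "PSAPA", "PSA-PA",
];

const allTrickyMatch = trickySymbols.every((s) => basicPattern.test(s));
const unmatched = edit1Symbols.filter((s) => !extendedPattern.test(s));

console.log(allTrickyMatch); // true
console.log(unmatched);      // [ 'PSA-A', 'PSA+A' ]
```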
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Negating a complex regex containing three parts

I need a regex which is matched when the string doesn't have both lowercase and uppercase letters.
If the string has only lowercase letters -> should be matched
If the string has only uppercase letters -> should be matched
If the string has only digits or special characters -> should be matched
For example
abc, ABC, 123, abc123, ABC123&^ - should match
AbC, A12b, AB^%12c - should not match
Basically I need an inverse/negation of the following regex:
^(?=.*[a-z])(?=.*[A-Z]).+$
Does not sound like any lookarounds would be needed.
Either match only characters that are not a-z or only characters, that are not A-Z.
^(?:[^a-z]+|[^A-Z]+)$
See this demo at regex101 (used + for one or more)
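The alternation can be checked against the examples from the question. This is a small sketch with illustrative variable names.

```javascript
// Either no lowercase ASCII letters at all, or no uppercase ASCII letters at all.
const noMixedCase = /^(?:[^a-z]+|[^A-Z]+)$/;

const shouldMatch = ["abc", "ABC", "123", "abc123", "ABC123&^"];
const shouldNotMatch = ["AbC", "A12b", "AB^%12c"];

console.log(shouldMatch.every((s) => noMixedCase.test(s)));     // true
console.log(shouldNotMatch.some((s) => noMixedCase.test(s)));   // false
```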
You may use
^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$
Or
^(?=(?:[^a-z]+|[^A-Z]+)$).*$
See the regex demo #1 and regex demo #2
A lookaround solution like this can be used in more complex scenarios, when you need to apply more restrictions on the pattern. Else, consider a non-lookaround solution.
Details
^ - start of string
(?!.*[A-Z].*[a-z]) - no uppercase followed with a lowercase letter
(?!.*[a-z].*[A-Z]) - no lowercase letter followed with an uppercase one
(?=(?:[^a-z]+|[^A-Z]+)$) - a positive lookahead that requires 1 or more characters other than lowercase ASCII letters ([^a-z]+) to the end of the string, or 1 or more characters other than uppercase ASCII letters ([^A-Z]+) to the end of the string
\S+ - 1+ non-whitespace chars (first pattern), or .* - 0+ chars other than line break chars (second pattern)
$ - end of string.
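The two lookaround patterns agree on the question's examples but differ on input that contains whitespace, because the first one consumes the string with \S+. A minimal sketch with illustrative variable names:

```javascript
const viaNegativeLookaheads = /^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$/;
const viaPositiveLookahead = /^(?=(?:[^a-z]+|[^A-Z]+)$).*$/;

for (const s of ["abc123", "AbC", "abc def"]) {
  console.log(JSON.stringify(s), viaNegativeLookaheads.test(s), viaPositiveLookahead.test(s));
}
// "abc123"  true  true
// "AbC"     false false
// "abc def" false true   (only the \S+ version rejects the space)
```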
You can use this regex
^(([A-Z0-9?&%^](?![a-z]))+|([a-z0-9?&%^](?![A-Z]))+)$
You can test more cases here.
I've only added the characters ?&%^ as allowed special characters, but you could add whichever you like.
I would go with:
^(?:[^a-z]+?|[^A-Z]+?)$
It translates to "If the entire string is composed of non-lowercase letters or non-uppercase letters then match the string."
Lazy quantifiers +? are used so that the end-string $ anchor is obeyed when the multiline flag is enabled. If you're only validating a single-line string then you can simply use + without the question mark.
If you have a whitelist of specific allowed special chars then change [^a-z] into [A-Z0-9()_+=-] and [^A-Z] into [a-z0-9()_+=-], listing the allowed special chars.
https://regex101.com/r/Wg6tLn/1
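The effect of the lazy quantifiers under the multiline flag can be seen on a two-line input. In this sketch (illustrative variable names, `g` and `m` flags assumed), the negated classes also match the newline, so the greedy version returns both lines as a single match while the lazy version stops at the first line end.

```javascript
const lazyPerLine = /^(?:[^a-z]+?|[^A-Z]+?)$/gm;
const greedyPerLine = /^(?:[^a-z]+|[^A-Z]+)$/gm;

const twoLines = "abc\nxyz";

console.log(twoLines.match(lazyPerLine));   // [ 'abc', 'xyz' ]
console.log(twoLines.match(greedyPerLine)); // [ 'abc\nxyz' ]
```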

REGEX to find the first one or two capitalized words in a string

I am looking for a REGEX to find the first one or two capitalized words in a string. If the first two words are capitalized I want the first two words. A hyphen should be considered part of a word.
for "Madonna has a new album" I'm looking for "Madonna"
for "Paul Young has no new album" I'm looking for "Paul Young"
for "Emmerson Lake-palmer is not here" I'm looking for "Emmerson Lake-palmer"
I have been using ^[A-Z]+.*?\b( [A-Z]+.*?\b){0,1} which does great on the first two, but for the 3rd example I get Emmerson Lake, instead of Emmerson Lake-palmer.
What REGEX can I use to find the first one or two capitalized words in the above examples?
You may use
^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
See the regex demo
Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.
Details
^ - start of string
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
(?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:
\s+ - 1+ whitespace
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):
^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?
where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
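Applied to the sentences from the question, the patterns can be used as follows. This is a sketch only: `leadingCapitalizedWords` is a hypothetical helper name, the French sentence is an added example, and the Unicode variant assumes a JavaScript engine with the `u` flag and property escapes.

```javascript
const asciiLead = /^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?/;
const unicodeLead = /^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?/u;

// Hypothetical helper: returns the leading one or two capitalized words, or null.
function leadingCapitalizedWords(text, pattern = asciiLead) {
  const match = text.match(pattern);
  return match ? match[0] : null;
}

console.log(leadingCapitalizedWords("Madonna has a new album"));          // Madonna
console.log(leadingCapitalizedWords("Paul Young has no new album"));      // Paul Young
console.log(leadingCapitalizedWords("Emmerson Lake-palmer is not here")); // Emmerson Lake-palmer
console.log(leadingCapitalizedWords("\u00C9mile Zola \u00E9crit", unicodeLead)); // Émile Zola
```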
This is probably simpler:
^([A-Z][-A-Za-z]+)(\s[A-Z][-A-Za-z]+)?
Replace + with * if you expect single-letter words.
If you need a full name only (exactly two words, each starting with a capital letter), this is a simple example:
^([A-Z][a-z]*)(\s)([A-Z][a-z]+)$
Try it. Enjoy!