Google Sheet: custom formula to split words by uppercase - regex

I would like to write a script for a custom formula in Google Sheets. The idea is to split a string composed of multiple words. The formula should recognize where a word starts with a capital letter and separate the words there. The result would be a string where the words are separated by ",".
To clarify, this is an example of the string:
Nursing StudentStudentNurseNursing School
Desired Result:
Nursing Student,Student,Nurse,Nursing School
I have tried to use this formula in Google Sheets:
=split(regexreplace(A1,"[A-Z][^A-Z]*","$0"&char(9)),char(9))
However, it generates 6 cells with the below strings:
Nursing Student Student Nurse Nursing School
Can anybody help me or give me a hint?

=REGEXREPLACE(A1,"(\B)([A-Z])",",$2")
\B - not a word boundary.
[A-Z] - an uppercase letter.
If \B is followed by an uppercase letter, a comma is inserted at that position, before the letter.

If you plan to insert a comma in between a lowercase letter and an uppercase letter, you may use either of:
=REGEXREPLACE(A1,"([a-z])([A-Z])","$1,$2")
=REGEXREPLACE(A1,"([[:lower:]])([[:upper:]])","$1,$2")
where
([a-z]) / ([[:lower:]]) - Capturing group 1 (later referred to with $1 from the replacement pattern): any lowercase ASCII letter
([A-Z]) / ([[:upper:]]) - Capturing group 2 (later referred to with $2 from the replacement pattern): any uppercase ASCII letter
Note that another suggestion, based on a non-word boundary \B, that can be written as =REGEXREPLACE(A1,"\B[A-Z]",",$0"), will also match an uppercase letter after _ and any digit, so it might overfire if you do not expect that behavior.
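As a rough cross-check outside Sheets, the two patterns can be compared in JavaScript. This is only a sketch: Google Sheets uses the RE2 engine and $0 for the whole match, whereas JavaScript uses its own engine and $&. The helper names and the second sample string are made up for illustration.

```javascript
// Insert a comma only between a lowercase and an uppercase ASCII letter.
const splitLowerUpper = (s) => s.replace(/([a-z])([A-Z])/g, "$1,$2");

// Insert a comma before any uppercase letter that follows a word character
// (letter, digit, or underscore). "$&" is JavaScript's whole-match reference.
const splitNonBoundary = (s) => s.replace(/\B[A-Z]/g, ",$&");

const sample = "Nursing StudentStudentNurseNursing School";
console.log(splitLowerUpper(sample));
console.log(splitNonBoundary(sample));

// Hypothetical input with an underscore and a digit before uppercase letters.
console.log(splitLowerUpper("item_A1B"));  // left unchanged
console.log(splitNonBoundary("item_A1B")); // commas appear after "_" and "1"
```

On the question's sample both helpers give the same result; they only differ once digits or underscores precede an uppercase letter.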

Related

Regular Expressions - A word with only one capitalized letter and which doesn't contain numbers

I am new to RegExp. I have a sentence and I would like to pull out a word which satisfies the following -
It must contain only one capitalized letter
It must consist of only characters/letters without numbers
For instance -
"appLe", "warDrobe", "hUsh"
The words that do not fit - "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram" et cetera.
How would you resolve this problem?
The following regex should work:
will find words that have only one capital letter
will only find words with letters (no numbers or special characters)
will match the entire word
\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b
Matches:
appLe
hUsh
Harry
suSan
I
Rejects
HarrY - has TWO capital letters
warDrobeD - has TWO capital letters
sf_dsfsdF - has SPECIAL characters
331ffsF - has NUMBERS
Leopd1997 - has NUMBERS
mistram - does not have a CAPITAL LETTER
Note:
If the capital letter is OPTIONAL, then you will need to add a ? after each [A-Z] like this:
\b(?=[A-Z])[A-Z]?[a-z]*\b|\b(?=[a-z])[a-z]+[A-Z]?[a-z]*\b
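To see how the first pattern (the one that requires exactly one capital) behaves on running text, here is a small JavaScript sketch. The g flag and the test sentence are additions for illustration; the words are taken from the match and reject lists above.

```javascript
// Pattern from the answer above, with the g flag added to collect every match.
const oneCapital = /\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b/g;

// Illustrative sentence mixing words that should match and words that should not.
const sentence = "appLe HarrY hUsh sf_dsfsdF 331ffsF Leopd1997 Harry mistram suSan I";

// match() returns null when nothing is found, so fall back to an empty array.
const found = sentence.match(oneCapital) ?? [];
console.log(found);
```

Words containing digits or underscores are skipped because \b does not occur between two word characters, so the pattern cannot start or stop in the middle of such a token.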
You can do this by using character sets ([a-z] & [A-Z]) with appropriate quantifiers (use ? for one or zero capitals), wrapped in () to capture, surrounded by word breaks \b.
If the capital is optional and can appear anywhere use:
/\b([a-z]*[A-Z]?[a-z]*)\b/ // will still match an empty string, so check the match length
If you always want one capital appearing anywhere use:
/\b([a-z]*[A-Z][a-z]*)\b/ // does not match empty string
If you always want one capital that must not be the first or last character use:
/\b([a-z]+[A-Z][a-z]+)\b/ // does not match empty string
Here is a working snippet demonstrating the second regex from above in JavaScript:
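// Second regex from above: one capital letter, anywhere in the word.
const exp = /\b([a-z]*[A-Z][a-z]*)\b/;
const strings = ["appLe", "warDrobe", "hUsh", "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram", ""];
for (const str of strings) {
  // test() returns true when the pattern is found in str
  console.log(str, exp.test(str));
}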
Regex101 is great for dev & testing!
RegExp:
/\b[a-z\-]*[A-Z][a-z\-]*\b/g
Explanation
\b[a-z\-]* - Find a word boundary (a point where a non-word character is adjacent to a word character, \w, i.e. [A-Za-z0-9_]), then match zero or more lowercase letters and hyphens (the hyphen is escaped as \- so that it is read literally)
[A-Z] - Match a single uppercase letter
[a-z\-]*\b - Match zero or more lowercase letters and hyphens, then find a word boundary

Need RE for Picking up only UPPERCASE set of words before end of line

I want to create a regex that will pick up a set of UPPERCASE words (separated by spaces) on a line.
For example, in this text:
TOPIC ONE
Description of this topic, one CAPITAL word
TOPIC NUMBER TWO
Description of this topic two CAPITAL word
I need to pick only TOPIC ONE and TOPIC NUMBER TWO but not the word CAPITAL.
I tried the following RE
\b[A-Z]+\b
which is able to pick up CAPITAL WORDS individually
I also tried
\b[A-Z]+\ \b
but it picks all except last UPPERCASE WORD.
I want to make sure that the RE only ever picks up runs of more than one word.
Here is sample text to test:
CHIEF COMPLAINT Weakness inability to talk
HISTORY OF THE PRESENT ILLNESS This is a yearold
AfricanAmerican male with a history of hypertension who was
in his usual state of health
FAMILY HISTORY Unknown
SOCIAL HISTORY The patient lives
PHYSICAL EXAMINATION ON ADMISSION During the five minute
examination the patient became progressively less responsive
and then vomited requiring intubation and paralytics during
the examination
You may use
\b[A-Z]+(?:\s+[A-Z]+)+\b
\b[A-Z]+(?:[^\S\r\n]+[A-Z]+)+\b
\b\p{Lu}+(?:\h+\p{Lu}+)+\b
Details
\b - word boundary
[A-Z]+ - 1+ uppercase ASCII letters (\p{Lu} matches any Unicode uppercase letter)
(?:\s+[A-Z]+)+ - 1 or more consecutive occurrences of
    \s+ - 1+ whitespaces ([^\S\r\n]+, \h+, [\p{Zs}\t]+ will match 1 or more horizontal whitespaces)
    [A-Z]+ - 1+ uppercase ASCII letters
\b - word boundary
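A quick way to try this is the JavaScript sketch below, run against part of the sample text from the question. It uses the second variant because JavaScript has no \h, and \p{Lu} would additionally need the u flag. The variable names are illustrative.

```javascript
// Second variant from the answer: [^\S\r\n]+ is whitespace other than line
// breaks, so a match cannot run from one line onto the next.
const headingRe = /\b[A-Z]+(?:[^\S\r\n]+[A-Z]+)+\b/g;

// A few lines of the sample text from the question.
const report = [
  "CHIEF COMPLAINT Weakness inability to talk",
  "HISTORY OF THE PRESENT ILLNESS This is a yearold",
  "AfricanAmerican male with a history of hypertension who was",
  "FAMILY HISTORY Unknown",
].join("\n");

// match() returns null when nothing is found, so fall back to an empty array.
const headings = report.match(headingRe) ?? [];
console.log(headings);
```

The capital that starts "Weakness" or "This" is tried as one more repetition, but the closing \b fails before a lowercase letter, so the engine backtracks to the last fully uppercase word.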

Regex to match an unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but the last group ends up with only one digit, which is not what I want.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
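A short JavaScript sketch of that pattern on the two sample strings from the question. The function name and the returned field names are hypothetical.

```javascript
// Group 1: uppercase prefix, group 2: the repeating middle, group 3: last two digits.
const twoDigitTail = /([A-Z]+)([0-9a-z]+)([0-9]{2})/;

// Hypothetical helper that names the three captured parts.
function splitCode(code) {
  const m = code.match(twoDigitTail);
  if (m === null) return null; // the string does not have the expected shape
  return { prefix: m[1], middle: m[2], tail: m[3] };
}

console.log(splitCode("YM10a15b5c27"));
console.log(splitCode("YM1b5c17"));
```

Group 2 is greedy, so it first swallows the rest of the string and then gives back exactly as many characters as {2} needs.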
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^, the regex could also match starting from the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping capturing group 1, which matches one or more letters from 'a' to 'z'
(?=\1) tells the regex to match the text only as long as it is followed by the same code that was found at the start.
All you have to do is extract the code from group 2.
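The behaviour can be checked with a small JavaScript sketch using the aaaa1234aaaa example mentioned above. The function name and the other inputs are made up for illustration.

```javascript
// Group 1: the leading letters; group 2: the middle; (?=\1) requires that the
// same letters follow the middle, without consuming them.
const wrappedCode = /^(?:([a-z]+))([0-9a-z]+)(?=\1)/;

// Hypothetical helper: returns the middle part, or null when there is no match.
function middleOf(s) {
  const m = s.match(wrappedCode);
  return m === null ? null : m[2];
}

console.log(middleOf("aaaa1234aaaa"));
console.log(middleOf("ab99ab"));
console.log(middleOf("ab99cd")); // the start code never reappears
```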
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter (captured on its own as Group 2), exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
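A brief JavaScript sketch of the anchored form, with made-up inputs, shows which group holds the middle part. The function name is hypothetical.

```javascript
// Group 1: a doubled lowercase letter (group 2 is the single letter),
// group 3: the middle, group 4: the same doubled letter again.
const pairedCode = /^(([a-z])\2)([0-9a-z]+)(\1)$/;

// Hypothetical helper: returns the middle part, or null when there is no match.
function innerOf(s) {
  const m = s.match(pairedCode);
  return m === null ? null : m[3];
}

console.log(innerOf("aa123aa"));
console.log(innerOf("bb7xbb"));
console.log(innerOf("ab123ab")); // "ab" is not a doubled letter
console.log(innerOf("aa123bb")); // the closing pair differs from the opening pair
```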

REGEX to find the first one or two capitalized words in a string

I am looking for a REGEX to find the first one or two capitalized words in a string. If the first two words are capitalized, I want both of them. A hyphen should be considered part of a word.
For "Madonna has a new album" I'm looking for "Madonna"
For "Paul Young has no new album" I'm looking for "Paul Young"
For "Emmerson Lake-palmer is not here" I'm looking for "Emmerson Lake-palmer"
I have been using ^[A-Z]+.*?\b( [A-Z]+.*?\b){0,1} which does great on the first two, but for the 3rd example I get Emmerson Lake, instead of Emmerson Lake-palmer.
What REGEX can I use to find the first one or two capitalized words in the above examples?
You may use
^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.
Details
^ - start of string
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
(?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:
    \s+ - 1+ whitespace
    [A-Z] - an uppercase ASCII letter
    [-a-zA-Z]* - zero or more ASCII letters / hyphens
A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):
^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?
where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
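The ASCII pattern can be tried on the three examples from the question with the JavaScript sketch below. The function name is hypothetical, and the fourth input is an added negative case.

```javascript
// First word must start with a capital; a second capitalized word is optional.
const leadingNames = /^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?/;

// Hypothetical helper: returns the matched name, or null when there is none.
function firstCapitalized(s) {
  const m = s.match(leadingNames);
  return m === null ? null : m[0];
}

console.log(firstCapitalized("Madonna has a new album"));
console.log(firstCapitalized("Paul Young has no new album"));
console.log(firstCapitalized("Emmerson Lake-palmer is not here"));
console.log(firstCapitalized("no capital at the start"));
```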
This is probably simpler:
^([A-Z][-A-Za-z]+)(\s[A-Z][-A-Za-z]+)?
Replace + with * if you expect single-letter words.
If you need a full name only (exactly two words, each starting with a capital letter), this is a simple example:
^([A-Z][a-z]*)(\s)([A-Z][a-z]+)$
Try it. Enjoy!

Regex to find Upper case character at beginning of each word in a field

I created a function that will compare a field against a regex and return 0 if it doesn't match the pattern and 1 if it does. I've already created the class so I could create a UDF for the pattern matching.
function(expression,rexex) //If it matches it
I have been researching regex in SQL server for a bit this weekend and am at a bit of a crossroad.
I basically need to have the following pattern with 1 passing and 0 failing. Basically I want the first letter of every word to be capitalized:
the dog is bad - 0
The Dog Is Bad - 1
I'm ashamed to say that it's taken me all day just to figure out how to identify the first letter of each word and see if it's capital.
Here is what I have so far.
[\p{Lu}\p{Lt}]
Any help or nudge in the right direction would be appreciated.
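Start of match (^) followed by one or more groups ((...)+) of a capital letter ([A-Z]) followed by zero or more word characters (\w*) followed by one or more spaces, or the end ((\s+|$)). The final $ makes the whole string take part in the match; without it, a string such as "The dog" would pass on its first word alone.
/^([A-Z]\w*(\s+|$))+$/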
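A sketch of how such a check might be wrapped as a 1/0 function, written here in JavaScript rather than the SQL Server CLR setup from the question. The pattern is used in its end-anchored form (trailing $) so that the entire string has to match; isTitleCase is a hypothetical name.

```javascript
// Every word must start with an uppercase ASCII letter; the trailing $ forces
// the repeated group to cover the whole string.
const titleCase = /^([A-Z]\w*(\s+|$))+$/;

// Hypothetical helper returning 1 for a pass and 0 for a fail, as in the question.
const isTitleCase = (s) => (titleCase.test(s) ? 1 : 0);

console.log(isTitleCase("the dog is bad"));
console.log(isTitleCase("The Dog Is Bad"));
console.log(isTitleCase("The Dog is bad")); // fails on the third word
```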
This assumes letters only, and only one space per word:
^((?:\b[A-Z][a-z]*\b) {0,1})+$
Free spaced:
^ //Start of line
( //(Capture)
(?: //(Non-capture)
\b // Followed by word boundary
[A-Z] // Followed by a capital letter
[a-z]* // Followed by zero or more lowercase letters
\b // Followed by word boundary
) {0,1} // Followed by either no space, or one space
)+ // One or more times
$ //End of line
You can use a Negative Lookahead (?!) to validate the line/sentence:
/(?!.*?\b[a-z].*?\b)^.*?$/gm
This rejects any string or line that contains a word beginning with a lowercase letter.
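The lookahead variant can be exercised with the JavaScript sketch below. The g flag is left out here because test() on a global regex keeps state between calls; the helper name and the last input are illustrative.

```javascript
// The negative lookahead rejects the line if any word boundary is directly
// followed by a lowercase letter; ^.*?$ then accepts the rest of the line.
const noLowercaseStart = /(?!.*?\b[a-z].*?\b)^.*?$/m;

// Hypothetical helper returning 1 for a pass and 0 for a fail.
const passes = (line) => (noLowercaseStart.test(line) ? 1 : 0);

console.log(passes("The Dog Is Bad"));
console.log(passes("the dog is bad"));
console.log(passes("The Dog is Bad"));
// Caveat: the "t" after the apostrophe also sits right after a word boundary,
// so contractions are treated as containing a lowercase-initial word.
console.log(passes("Don't Panic"));
```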
As it seems you want to be Unicode compatible, I'd do:
(?:^|\s+)(\p{Lu}\p{Ll}*)