Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.
For example, how do I use regex to extract KENTUCKY from the following sentence:
There Are Many Options in KENTUCKY
I'm trying to do this using regexextract() in Google Sheets, which uses RE2.
Looking forward to hearing your thoughts.
Pretending that your text is in cell A2:
If there is only one instance in each text segment this will work:
=REGEXEXTRACT(A2,"([A-Z]{2,})")
If there are multiple instances in a single text segment, use this; it dynamically adjusts the regex to extract every occurrence for you:
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
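For anyone who wants to try the repeat-the-pattern idea outside Sheets, here is a rough Python sketch of it. The function name extract_all_caps is made up for illustration, and Python's re module stands in for RE2, so treat it as an approximation of the formula rather than an exact port. Like the formula, it assumes every ALL CAPS word is preceded by a space.

```python
import re

CAPS = r"[A-Z]{2,}"  # a run of two or more ASCII capitals

def extract_all_caps(text: str) -> list[str]:
    """Hypothetical helper: repeat one capture group per ALL CAPS run."""
    # Count the runs first, as the COUNTA(SPLIT(...)) part of the formula does.
    n = len(re.findall(CAPS, text))
    if n == 0:
        return []
    # Build ".* (CAPS)" repeated n times, like REPT(...) in the formula.
    pattern = (r".* (" + CAPS + r")") * n
    m = re.search(pattern, text)
    # No match happens when a run is not preceded by a space (e.g. at the start).
    return list(m.groups()) if m else []

print(extract_all_caps("There Are MANY Options in KENTUCKY"))
```

In Python itself, re.findall(CAPS, text) would return the same words directly; the repetition is only needed where the regex function returns capture groups of a single match.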
If you need to extract whole chunks of words in ALLCAPS, use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")
See this regex demo.
Details
\b - word boundary
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter, e.g. Cyrillic or Greek; caseless scripts such as Arabic are not matched)
(?:\s+[A-Z]+)* - zero or more repetitions of
\s+ - 1+ whitespaces
[A-Z]+ - 1+ ASCII uppercase letters (or \p{Lu}+ for any Unicode uppercase letters)
\b - word boundary.
Or, if you allow any punctuation or symbols between uppercase letters, you may use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")
See the regex demo.
Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.
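A quick way to check both ASCII patterns is a small Python sketch like the one below. The helper name caps_chunks and the sample sentences are assumptions for illustration. Python's built-in re module does not support \p{Lu}, so only the [A-Z] variants are shown.

```python
import re

# Chunks of ALL CAPS words separated by whitespace only.
WHITESPACE_CHUNKS = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")
# Chunks where any non-alphanumeric characters may sit between the capital runs.
PUNCT_CHUNKS = re.compile(r"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")

def caps_chunks(text: str, allow_punct: bool = False) -> list[str]:
    """Hypothetical helper: return every ALL CAPS chunk found in text."""
    pattern = PUNCT_CHUNKS if allow_punct else WHITESPACE_CHUNKS
    return pattern.findall(text)

print(caps_chunks("Visit NEW YORK or LOS ANGELES today"))
print(caps_chunks("AT&T and IBM-PC are here", allow_punct=True))
```

Note that [A-Z]+ also accepts a single capital standing alone, such as the word "I"; use [A-Z]{2,} for the first run if that is not wanted.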
This should work:
\b[A-Z]+\b
See demo
2nd EDIT ALL CAPS / UPPERCASE solution:
Finally got this simpler approach from other helpful solutions here and here:
=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
From this input:
isn'ter JOHN isn'tar DOE isn'ta or JANE
It returns this output:
JOHN DOE JANE
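The core trick in that formula (capture what you want, match everything else with a bare dot, and replace with a space plus the captured group) can be sketched in Python as follows. The function name keep_all_caps is made up, and the whitespace collapsing is my stand-in for TRIM, so this is only an approximation of what Sheets does.

```python
import re

def keep_all_caps(text: str) -> str:
    """Hypothetical helper: keep only runs of 2+ ASCII capitals."""
    # Group 1 holds an ALL CAPS run; the "." branch eats any other character.
    # Each match becomes a space plus whatever group 1 captured (possibly nothing).
    spaced = re.sub(
        r"([A-Z]{2,})|.",
        lambda m: " " + (m.group(1) or ""),
        text,
    )
    # Collapse whitespace runs and strip the ends, roughly what TRIM does.
    return " ".join(spaced.split())

print(keep_all_caps("isn'ter JOHN isn'tar DOE isn'ta or JANE"))  # JOHN DOE JANE
```

The second REGEXREPLACE from the formula is left out here because the whitespace collapsing already yields single spaces between the words.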
The same for Title Case (extracting all capitalized words, i.e. words with an uppercase first letter):
Formula:
=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
Input in C1:
The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.
Output in A1:
The Quick Brown Fox Jumps Over Lazy Dog
Previous, less efficient attempts:
I had to custom tailor it that way for my use case:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
"")))
I went through the exceptions one by one and added their respective regexes to the front of the pipe-separated alternatives in the formula.
@Wiktor Stribiżew, any simplifying suggestions would be very welcome.
Found some missing and fixed them.
1st EDIT:
A simpler version though still quite lengthy:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
" "), "-|\.", " "), "(^[A-Z]\s)", " "
), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
P3: P, "[a-z,]", " "), "-|\.", " "),
"(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" ")))
From this example:
Multiple regex matches in Google Sheets formula
I have the following phrases:
Mr "Smith"
MrS "Smith"
I need to retrieve only Smith from these phrases. I tried thousands of variants. I stopped on
(?!Mr|MrS)([^"]+).
Help, please.
The pattern (?!Mr|MrS)([^"]+) asserts from the current position that what is directly to the right is not Mr or MrS and then captures 1+ occurrences of any char except "
So it will not start the match at Mr, but it will at r, because at the position before the r the lookahead assertion is true.
Instead of using a lookaround, you could match either Mr or MrS and capture what is in between double quotes.
\mMrS? "([^"]+)"
\m A word boundary
MrS? Match Mr with an optional S
" Match a space and "
([^"]+) capture in group 1 what is between the "
" Match "
See a postgresql demo
For example
select REGEXP_MATCHES('Mr "Smith"', '\mMrS? "([^"]+)"');
select REGEXP_MATCHES('MrS "Smith"', '\mMrS? "([^"]+)"');
Output
regexp_matches
1 Smith
regexp_matches
1 Smith
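The same behaviour can be reproduced outside PostgreSQL. Below is a small Python sketch; the function names are made up, and since \m is PostgreSQL-specific I have assumed \b as the closest equivalent in Python's re module. It shows both why the lookahead attempt starts matching at r and how the capture-group version picks out the name.

```python
import re

# The original attempt: the lookahead only blocks a match that STARTS at "Mr".
ATTEMPT = re.compile(r'(?!Mr|MrS)([^"]+)')
# Match the title, then capture what sits between the double quotes.
FIXED = re.compile(r'\bMrS? "([^"]+)"')

def first_attempt(phrase: str) -> str:
    """Hypothetical helper: what the lookahead pattern actually captures."""
    return ATTEMPT.search(phrase).group(1)

def quoted_name(phrase: str) -> str:
    """Hypothetical helper: the text between the quotes after Mr or MrS."""
    return FIXED.search(phrase).group(1)

for phrase in ['Mr "Smith"', 'MrS "Smith"']:
    print(repr(first_attempt(phrase)), repr(quoted_name(phrase)))
```

The first column prints 'r ' and 'rS ', i.e. the match begins one character in, right after the position where the lookahead failed.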
I have a text that I need to split into subsentences, but if the text contains special cases such as domain.com or st. moris, it gets split at those points too.
Here is what I got:
val pattern = "(?<=[.](?<![s][t][.]))"
val text = "here is an axample with cases like st. moris and google.com here. second sentence."
val list = text.split(pattern)
list.foreach(println)
I want this code to return
List(
"here is an axample with cases like st. moris and google.com here.",
"second sentence."
)
but instead it returns:
List(
"here is an axample with cases like st.",
" moris and google.",
"com here.",
"second sentence."
)
How can I make it work?
If you want to split on 1+ whitespaces preceded by a dot that is not itself preceded by st as a whole word, you may use
val pattern = """(?i)(?<=(?<!\bst)\.)\s+"""
Or, if the number of whitespace chars after the dot can be 0, you may implement the logic to avoid matching a . if it is followed with com, org, etc. as whole words:
val pattern = """(?i)(?<=\.(?<!\bst\.)(?!(?:com|org)\b))\s*+(?!$)"""
See the regex #1 demo and regex #2 demo. Details:
(?i) - makes the pattern case insensitive
(?<=(?<!\bst)\.) - a location immediately preceded with a dot that is not immediately preceded with a whole word st
\s+ - 1 or more whitespaces
Or
(?i) - makes the pattern case insensitive
(?<=\.(?<!\bst\.)(?!(?:com|org)\b)) - a location immediately preceded with a dot that is not immediately preceded with a whole word st and not immediately followed with com or org as whole words (add more alternatives if needed after |)
\s*+ - 0 or more whitespaces matched possessively
(?!$) - not at the end of string.
See Scala demo #1 (Scala demo #2):
val pattern = """(?i)(?<=(?<!\bst)\.)\s+"""
// val pattern = """(?i)(?<=\.(?<!\bst\.)(?!(?:com|org)\b))\s*+(?!$)""" // Pattern #2
val text = "here is an axample with cases like st. moris and google.com here. second sentence."
val list = text.split(pattern)
list.foreach(println)
Output:
here is an axample with cases like st. moris and google.com here.
second sentence.
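For comparison, here is a minimal Python sketch of pattern #1. The function name split_sentences is hypothetical, and I have flattened the nested lookbehind into two separate lookbehinds, which I believe is equivalent for this purpose; it is an illustration of the idea rather than a port of the Scala code.

```python
import re

# Split on whitespace that follows a dot, unless that dot ends the whole word "st".
SPLIT_RE = re.compile(r"(?<!\bst\.)(?<=\.)\s+", re.IGNORECASE)

def split_sentences(text: str) -> list[str]:
    """Hypothetical helper: split text into sentences, keeping 'st.' intact."""
    return SPLIT_RE.split(text)

text = ("here is an axample with cases like st. moris and google.com here. "
        "second sentence.")
for sentence in split_sentences(text):
    print(sentence)
```

google.com survives simply because no whitespace follows its dot; more abbreviations would each need their own negative lookbehind, since Python requires lookbehinds to have a fixed width.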
Your code returns that result because the pattern tells split to break wherever the listed symbol occurs.
One of the symbols in your pattern is ".".
So the string is split at the "." that comes after st.
You have two options: either remove the "." after st and google, or put a different symbol before the word "second", use that symbol in the pattern, and remove "." from the pattern.
So this one works for me, and can be expanded with different exclusions in the text
((.+(st\.|mr\.|mrs\.))*.+?\.( |$))
Maybe there will be some sub-matches in the group, but you should look only for full matches. Here is the regex101.com example
As you see on the right, only two matches.
To add more exclusions, add the patterns you want to treat as exclusions to the (st\.|mr\.|mrs\.) group.
The domain names are excluded by this part: \.( |$). It says that a sentence must end with a dot followed by either a space or the end of the line.
Reply if it works in your environment.
Using regexpal.com to practice my regular expressions. I decided to start simply and ran into a problem.
Say you want to find all 3 letter words.
\s\w{3}\s
\s - space
\w - word characters
{3} - 3 and only 3 of the previous character
\s
If I have two three-letter words next to each other, for example " and the ", only the first is selected. I thought that after a regex found a match it would go back one character and start searching for the next matching string (in which case it would find both " and " and " the ").
(?<=\s)\w{3}(?=\s)
Overlapping spaces.
Use zero-width assertions instead. When you use \s\w{3}\s on " abc acd ", the regex engine consumes " abc " including the trailing space, so what is left is "acd ", which has no leading whitespace for your regex to match. Use lookarounds to assert the whitespace without consuming it.
EDIT:
\b\w{3}\b
Can also be used.
\b - asserts position at a word boundary (^\w|\w$|\W\w|\w\W)
or
(?:^|(?<=\s))\w{3}(?=\s|$)
This will find your 3 letter word even if it is at start or in middle or at end.
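The difference between consuming the spaces and merely asserting them is easy to see in a short Python sketch. The variable names are just for illustration, and Python's re module is assumed here in place of the regexpal.com engine.

```python
import re

sample = " and the "

# Consuming version: the space after "and" is used up by the first match,
# so "the" has no leading whitespace left to match against.
consuming = re.findall(r"\s\w{3}\s", sample)

# Lookaround version: the spaces are checked but not consumed.
lookaround = re.findall(r"(?<=\s)\w{3}(?=\s)", sample)

# Anchored version: also works at the very start and end of the string.
anchored = re.findall(r"(?:^|(?<=\s))\w{3}(?=\s|$)", "and the cat")

print(consuming)   # [' and ']
print(lookaround)  # ['and', 'the']
print(anchored)    # ['and', 'the', 'cat']
```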
Say I have the following string:
Something before _The brown "fox" jumped over_ Something after
I want to capture what's between _ and _, but only if there is an even number of quotes " between them. So the above case will be a match.
From the following, only the second and third underscore-delimited segments should be matched:
Some text _fir"st part_ and other text _seco"nd t"est_ and more _thir"d" "t"est_
Note that the second and third ones have 2 and 4 quotes, respectively.
I've tried to do it but I've not been very successful: _ (?= [^_]* " [^_]* " [^_]* _)* .*? _ The spaces are added for readability.
I'm using PHP if it's relevant.
You can use this regex:
_(([^"_]*"){2})+[^"_]*_
Online Demo: http://regex101.com/r/bN9pF1
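The question mentions PHP, but the pattern is easy to try in any engine; here is a minimal Python sketch of it. The function name even_quoted_segments is made up for illustration. Note that the pattern requires at least one pair of quotes, so a segment with zero quotes is not matched.

```python
import re

# "_", then one or more PAIRS of (non-quote text + quote), then trailing text, then "_".
EVEN_QUOTES = re.compile(r'_(?:(?:[^"_]*"){2})+[^"_]*_')

def even_quoted_segments(text: str) -> list[str]:
    """Hypothetical helper: underscore-delimited segments with 2, 4, 6... quotes."""
    return [m.group(0) for m in EVEN_QUOTES.finditer(text)]

text = ('Some text _fir"st part_ and other text _seco"nd t"est_ '
        'and more _thir"d" "t"est_')
for segment in even_quoted_segments(text):
    print(segment)
```

The groups are written as non-capturing here only to keep the result tidy; the matching logic is the same as in the regex above.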