I want to recognize the house number in a given string. Here you can find some sample inputs:
"My house number is 23"
"23"
"23a"
"23 a"
"The house number is 23 a and the street ist XY"
"The house number is 23 a"
I have the following regex:
\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)
But it is not able to capture the inputs which have the number followed by a letter at the end of the line (e.g. the house number is 23 a).
Any help would be appreciated.
PS: Ultimately I need the regex in TypeScript.
If I got your problem correctly, this should work:
(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))
Note: [\s]{0,1} is the same as \s?
https://regex101.com/r/r6WHFy/1
The issue in your regex was that, for The house number is 23 a, the first alternative ([\s]{0,1}[a-zA-Z]{0,1}[\s])* already succeeds (it consumes the space after 23), so the regex engine never needs to try the second alternative with the end-of-string anchor $.
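To make the difference concrete, here is a small sketch that runs both patterns on the failing input. It uses Python's re module purely for illustration (an assumption on my side; the question targets TypeScript, but these two patterns use only syntax that both engines share).

```python
import re

# Pattern from the question and the alternation-based fix from the answer.
original = re.compile(r"\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)")
fixed = re.compile(r"(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))")

text = "The house number is 23 a"

# The first alternative of the original succeeds after consuming only the
# space, so the trailing letter is never reached.
original_match = original.search(text).group(0)
fixed_match = fixed.search(text).group(0)

print(repr(original_match))  # '23 '
print(repr(fixed_match))     # '23 a'
```

Note that the fixed pattern can include a trailing space when more text follows the letter, so trimming the result may be needed.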
You could also write the pattern using word boundaries and without using an alternation |
\b\d+(?:\s*[a-zA-Z])?\b
\b A word boundary
\d+ Match 1+ digits
(?:\s*[a-zA-Z])? Optionally match zero or more whitespace chars followed by a single letter a-zA-Z
\b A word boundary
const regex = /\b\d+(?:\s*[a-zA-Z])?\b/;
[
"My house number is 23",
"23",
"23a",
"23 a",
"The house number is 23 a and the street ist XY",
"The house number is 23 a"
].forEach(s => console.log(s.match(regex)[0]));
This is the PCRE2 regexp:
(?<=hello )(?:[^_]\w++)++
Its intended use is against strings like the following:
Hello Bob (Marius) Smith. -> Match "Bob"
Hello Bob Jr. (Joseph) White -> Match "Bob Jr."
Hello Bob Jr. IInd (Paul) Jobs -> Match "Bob Jr. IInd"
You get the point.
Essentially there is a magic word, in this case "hello", followed by a first name, followed by a second name which is always between parens.
First names could be anything really: a single word, a list of words followed by punctuation, and so on. Heck, look at the name of Elon Musk's kid (X Æ A-Xii) to see how weird names can get :)
Let's assume ASCII only, though. Æ is not among my targets :)
I'm at a loss on how to convert this regexp to JS. The only viable solution I found was to use PCRE2-wasm on Node, which spins up a wasm virtual machine and eats up 1 GB of resources just for that. That's insane.
This would match your cases in ECMAscript.
(?<=[Hh]ello )(?:[^_][\w.]+)+
Your test cases start with a capital H, so you need to match [Hh] instead of h. ECMAScript also has no possessive quantifiers, so each ++ has to become a single +.
You also need to include a . alongside \w, since some of the names contain a dot.
https://regex101.com/r/lkZK7w/1
-- thanks "D M" for pointing out the missing . in the testcase.
@Nils has the correct answer.
If you do need to expand your acceptable character set, you can use the following regex. Check it out. The g, m, and i flags are set.
(?<=hello ).*(?=\([^\)]*?\))
Hello Bob (Marius) Smith.
Hello Bob Jr. (Joseph) White
Hello Bob Jr. IInd (Paul) Jobs
Hello X Æ A-Xii (Not Elon) Musk
Hello Bob ()) Jr. ( (Darrell) Black
Match Number   Characters   Matched Text
Match 1        6-10         Bob
Match 2        32-40        Bob Jr.
Match 3        61-74        Bob Jr. IInd
Match 4        92-102       X Æ A-Xii
Match 5        124-138      Bob ()) Jr. (
The idea is pretty simple:
Look behind for your keyword: (?<=hello ).
Look ahead for your middle name: (?=\([^\)]*?\)) (anything inside a set of parentheses that is not a closing parenthesis, lazily so you don't take part of the first name).
Take everything between as your first name: .*.
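The three steps above can be sketched as follows. This is an illustration in Python's re module rather than in JavaScript (my assumption; the pattern itself is unchanged), and the .strip() call is an addition of mine because the raw match keeps the space in front of the opening parenthesis.

```python
import re

# Lookbehind for the keyword, greedy .* for the name, lookahead for the
# parenthesised second name. IGNORECASE mirrors the "i" flag.
pattern = re.compile(r"(?<=hello ).*(?=\([^\)]*?\))", re.IGNORECASE)

lines = [
    "Hello Bob (Marius) Smith.",
    "Hello Bob Jr. (Joseph) White",
    "Hello Bob Jr. IInd (Paul) Jobs",
    "Hello Bob ()) Jr. ( (Darrell) Black",
]

# The match ends right before the last "(...)" group on the line,
# so it still carries a trailing space that we strip here.
first_names = [pattern.search(line).group(0).strip() for line in lines]
print(first_names)
```

Because .* is greedy, the engine backtracks from the end of the line and stops at the last position where a parenthesised group follows, which is why stray parentheses inside the first name are tolerated.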
The ++ does not work because JavaScript does not support possessive quantifiers.
Since the first name is always followed by a second name between parens, you could also match the whole construct and use a capture group for the first name instead of a lookbehind.
\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)
\b[Hh]ello Match hello or Hello followed by a space
( Capture group 1
\w+.*? Match 1+ word chars followed by any chars, as few as possible
) Close group 1
\s*\([^()\s]+\) Match optional whitespace chars, then ( followed by 1+ chars other than parentheses or whitespace, then )
const regex = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
["Hello Bob (Marius) Smith.",
"Hello Bob Jr. (Joseph) White",
"Hello Bob Jr. IInd (Paul) Jobs"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
})
With the lookbehind, you might also match word characters followed by an optionally repeated non-capturing group that matches whitespace chars followed by word characters or dots.
(?<=[Hh]ello )\w+(?:\s+[\w.]+)*
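As a quick check of this lookbehind variant, here is a sketch in Python's re module (chosen for illustration only; the fixed-width lookbehind used here is also valid there). The sample strings are the ones from the question.

```python
import re

# Word chars, then any number of " word-or-dot" continuations.
# The match stops before " (" because "(" is neither a word char nor a dot.
pattern = re.compile(r"(?<=[Hh]ello )\w+(?:\s+[\w.]+)*")

samples = [
    "Hello Bob (Marius) Smith.",
    "Hello Bob Jr. (Joseph) White",
    "Hello Bob Jr. IInd (Paul) Jobs",
]

names = [pattern.search(s).group(0) for s in samples]
print(names)
```

Unlike the lookbehind-plus-lookahead approach, this version never includes the trailing space, so no trimming is required.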
The example names that I am trying it on are here
O'Kefe,Shiley
Folenza,Mitchel V
Briscoe Jr.,Sanford Ray
Andade-Alarenga,Blnca
De La Cru,Feando
Carone,Letca Jo
O'Conor,Mole K
Daeron III,Lawence P
Randall,Jason L
Esquel Mendez,Mara D
Dinle III,Jams E
Coras Sr.,Cleybr E
Hsieh-Krnk,Caolyn E
Graves II,Theodore R
I am trying to capture everything before the comma except Roman numerals and the Sr./Jr. suffixes.
So if the name is like Andade-Alarenga,Blnca I want to capture Andade-Alarenga, but if the name is Briscoe Jr.,Sanford Ray I just want Briscoe.
The patterns I have tried are these:
^((?:(?![JjSs][rR]\.|\b(?:[IV]+))[^,]))
also this one - ^(?!\w+ \A[jr|sr|Jr|Sr].*)\w+| \w+ \w+|'\w+|-\w+$
Regex101 with my code and example sets: https://regex101.com/r/jX5cK6/2
One option is a capturing group with a non-greedy match up to the first comma, optionally followed (before the comma) by Jr., Sr., jr., sr. or a Roman numeral.
Then match the comma itself. The value is in capture group 1.
The character class [XVICMD]+ is a broad match for a Roman numeral that also allows invalid letter combinations; a stricter Roman numeral pattern can be substituted if that matters.
^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,
^ Start of string
( Capture group 1
\w.*? Match a word char and 0+ times any char except a newline non greedy
) close group
(?: Non capturing group
(?: Match a space and start non capturing group
[JjSs]r\. Match any of the listed followed by r.
| Or
[XVICMD]+\b Match 1+ times any of the listed and a word boundary
) Close group
)? Close group and make it optional
, Match the comma
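A minimal sketch of how capture group 1 would be read out, using Python's re module as an assumed host language (the question does not name one). The names are a subset of the question's list.

```python
import re

# Lazy group 1, then an optional " Jr." / " Sr." / Roman numeral, then the comma.
pattern = re.compile(r"^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,")

names = [
    "O'Kefe,Shiley",
    "Briscoe Jr.,Sanford Ray",
    "De La Cru,Feando",
    "Daeron III,Lawence P",
    "Coras Sr.,Cleybr E",
    "Esquel Mendez,Mara D",
]

# The surname is whatever ended up in capture group 1.
surnames = [pattern.match(name).group(1) for name in names]
print(surnames)
```

Surname parts such as "Cru" or "Mendez" survive because the \b after [XVICMD]+ fails when a lowercase letter follows the leading C or M.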
Because of your test on Regex101, I'm assuming your regex engine supports positive lookaheads (this is true for PCRE, JavaScript, or Python, for example).
A positive lookahead will enable you to match only what you want, without the need for capturing groups. The full match will be the string you're looking for.
^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
The part that matches the name is as simple as it gets:
^[\w'\- ]+?
All it does is match any of the characters in the class. The final ? is there to make it lazy: this way, the engine will match only as few characters as it needs to.
The important part is this one:
(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
It is divided in two parts by the pipe (this character: |). The first part matches Roman numerals (or nothing), and the second part matches titles (basically, any word that ends with a .). Finally, we need to match the comma, because of your requirement.
Here it is on Regex101
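For comparison with a capture-group approach, the lookahead version can be sketched like this in Python (an assumption, since no language was specified). Here the full match, group 0, is already the surname.

```python
import re

# Lazy name part, followed by a lookahead that must see an optional space,
# an optional Roman numeral or a dotted title, and then the comma.
pattern = re.compile(r"^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)")

names = [
    "O'Kefe,Shiley",
    "Briscoe Jr.,Sanford Ray",
    "De La Cru,Feando",
    "Daeron III,Lawence P",
    "Coras Sr.,Cleybr E",
    "Esquel Mendez,Mara D",
]

# No capture group needed: the lookahead consumes nothing.
surnames = [pattern.match(name).group(0) for name in names]
print(surnames)
```

One design consequence worth noting: because the title branch is \w+\., any dotted word right before the comma is treated as a suffix, not only Jr. and Sr.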
You didn't specify a language so I used a regex in the replaceAll() String method of Java.
String[] names = {
"O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
"Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
"O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
"Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
"Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"
};
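for (String name : names) {
    // The leading " ?" also drops the space before a suffix,
    // so "Briscoe Jr.,Sanford Ray" yields "Briscoe" without a trailing space.
    System.out.println(name + " -> "
        + name.replaceAll(" ?(I{1,3},|((Sr|Jr)\\.,)|,).*", ""));
}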
Here is a Python solution using re.sub
import re
names = ["O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
"Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
"O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
"Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
"Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"]
for name in names:
    # The leading " ?" also removes the space before a suffix such as "Jr." or "III"
    print(name, "->", re.sub(r" ?(I{1,3},|((Sr|Jr)\.,)|,).*", "", name))
You may use
^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)
See the regex demo
Details
^ - start of a string
(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+ - any char but , ([^,]), one or more occurrences (+), that does not start a Jr. or Sr. char sequence or a whole word consisting of 1 or more X, V, I, C, M,D chars
\b - a word boundary
(?<!\s) - no whitespace immediately to the left is allowed (it is trimming the match)
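The pattern above can be tried out as follows. This is a sketch in Python's re module (my choice for illustration; the lookbehind is fixed-width, so it is accepted there), run against a subset of the question's names.

```python
import re

# Consume non-comma chars one at a time, stopping before "Jr." / "Sr." or a
# whole-word Roman numeral; then require a word boundary that is not
# preceded by whitespace, which trims the trailing space.
pattern = re.compile(r"^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)")

names = [
    "O'Kefe,Shiley",
    "Briscoe Jr.,Sanford Ray",
    "De La Cru,Feando",
    "Daeron III,Lawence P",
    "Coras Sr.,Cleybr E",
    "Esquel Mendez,Mara D",
]

surnames = [pattern.match(name).group(0) for name in names]
print(surnames)
```

The per-character negative lookahead is what stops the match in front of the suffix; the final (?<!\s) then forces the engine to give back the space it consumed just before stopping.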
Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.
For example, how do I use regex to extract KENTUCKY from the following sentence:
There Are Many Options in KENTUCKY
I'm trying to do this using regexextract() in Google Sheets, which uses RE2.
Looking forward to hearing your thoughts.
Pretending that your text is in cell A2:
If there is only one instance in each text segment this will work:
=REGEXEXTRACT(A2,"([A-Z]{2,})")
If there are multiple instances in a single text segment, use this instead; it dynamically adjusts the regex to extract every occurrence for you:
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
If you need to extract whole chunks of words in ALLCAPS, use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")
See this regex demo.
Details
\b - word boundary
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter, e.g. Cyrillic or Greek capitals)
(?:\s+[A-Z]+)* - zero or more repetitions of
\s+ - 1+ whitespaces
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter, e.g. Cyrillic or Greek capitals)
\b - word boundary.
Or, if you allow any punctuations or symbols between uppercase letters you may use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")
See the regex demo.
Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.
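To see how the two ASCII patterns differ, here is a small sketch. It runs them in Python's re module rather than in Google Sheets (an assumption made only so the example is runnable; \p{Lu} is left out because Python's re does not support it), and the second sample sentence is made up for illustration.

```python
import re

# Chunks joined by whitespace only vs. chunks joined by any non-alphanumerics.
by_whitespace = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")
by_punctuation = re.compile(r"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")

sample = "There Are Many Options in KENTUCKY"
mixed = "Visit NEW-YORK and LOS ANGELES today"  # hypothetical extra input

# Capitalized words like "There" are rejected: after the "T" there is no
# word boundary, so the match cannot end there.
kentucky = by_whitespace.findall(sample)
chunks_ws = by_whitespace.findall(mixed)
chunks_punct = by_punctuation.findall(mixed)

print(kentucky)      # ['KENTUCKY']
print(chunks_ws)     # ['NEW', 'YORK', 'LOS ANGELES']
print(chunks_punct)  # ['NEW-YORK', 'LOS ANGELES']
```

Keep in mind that REGEXEXTRACT returns only the first match, whereas findall here lists all of them, and that a standalone one-letter word such as I or A would also match these patterns.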
This should work:
\b[A-Z]+\b
See demo
2nd EDIT ALL CAPS / UPPERCASE solution:
I finally arrived at this simpler approach, thanks to other helpful solutions here and here:
=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
From this input:
isn'ter JOHN isn'tar DOE isn'ta or JANE
It returns this output:
JOHN DOE JANE
The same for Title Case (extracting all capitalized words, i.e. words whose first letter is uppercase):
Formula:
=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
Input in C1:
The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.
Output in A1:
The Quick Brown Fox Jumps Over Lazy Dog
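The trick behind both formulas is "capture what you want, or else match any single character and throw it away". A rough Python equivalent is sketched below; keep_only is a hypothetical helper name of mine, and the split/join step stands in for the second REGEXREPLACE plus TRIM, so this is an analogy rather than a translation of the Sheets formula.

```python
import re

def keep_only(word_pattern: str, text: str) -> str:
    """Keep only the words matching word_pattern, separated by single spaces."""
    # Wanted words land in group 1 and are re-inserted after a space; every
    # other character matches the "." branch, where group 1 is empty, so it
    # is replaced by a lone space.
    spaced = re.sub("(" + word_pattern + ")|.", r" \1", text)
    # Collapse the resulting runs of spaces (what TRIM does in Sheets).
    return " ".join(spaced.split())

all_caps = keep_only(r"[A-Z]{2,}", "isn'ter JOHN isn'tar DOE isn'ta or JANE")
title_case = keep_only(
    r"[A-Z][a-z]+",
    "The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.",
)

print(all_caps)    # JOHN DOE JANE
print(title_case)  # The Quick Brown Fox Jumps Over Lazy Dog
```

The order of the alternatives matters: the capturing branch must come first so that a wanted word is tried before the catch-all dot consumes its first letter.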
Previous less efficient trials :
I had to custom tailor it that way for my use case:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
"")))
I went through the exceptions one by one and added a regex for each of them to the front of the pipe-separated alternatives in the formula.
@Wiktor Stribiżew, any simplifying suggestions would be very welcome.
I found some missing cases and fixed them.
1st EDIT:
A simpler version though still quite lengthy:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
" "), "-|\.", " "), "(^[A-Z]\s)", " "
), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
P3: P, "[a-z,]", " "), "-|\.", " "),
"(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" ")))
From this example:
Multiple regex matches in Google Sheets formula
I am using a wildcard find/replace involving the following find field:
([0-9]*)
(Please note that there should be a space at the end of the field even though I can't get it to stick on here on SO)
When I search on the text:
13 April Boon 87 155
(Just because it's not visually clear here, everything should be tab-separated except for the "87 155" and "April Boon", which have spaces.)
Since the * wildcard is (nominally) a lazy evaluator, I would expect this to match only "87 ". This is the result that I want!
But it is making 4 matches:
"13 April "
"3 April "
"87 "
"7 "
This is all the more mysterious to me because it is NOT matching "13 April Boon 87 " or "3 April Boon 87 "
What's going on here? How can I get the match that I seek?
Thanks in advance!
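Your wildcard pattern works as expected. Your pattern ([0-9]*) followed by a space matches:
([0-9] - (capture group 1, can be referenced with \1) a digit
*) - any characters, but as few as possible, up to the first...
(space) - space.
Matches are found from left to right, and every digit that is eventually followed by a space can start one, so you get 4 matches. Each match stops at the first space after its starting digit, which is why "13 April Boon 87 " is never matched.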
You can only capture 87 with a pattern like (<[0-9]@>) <[0-9]@>^13.
(<[0-9]@>) - a whole "word" containing one or more digits
(space) - a space
<[0-9]@> - a whole "word" containing one or more digits
^13 - carriage return
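The behaviour can be imitated with an ordinary regex. The sketch below is only an analogy in Python's re module, not Word wildcard syntax: the lookahead emulates a search that restarts one character after each match start (an assumption that happens to reproduce the four matches reported in the question), and the second pattern mirrors the idea of anchoring on the last two numbers before the end of the line.

```python
import re

# Tabs between the fields, spaces inside "April Boon" and "87 155".
text = "13\tApril Boon\t87 155"

# A digit, then as few characters as possible, then a space. Wrapping it in
# a lookahead lets matches overlap, one attempt per starting position.
lazy = re.compile(r"(?=([0-9].*? ))")
overlapping = [m.group(1) for m in lazy.finditer(text)]
print(overlapping)  # ['13\tApril ', '3\tApril ', '87 ', '7 ']

# A whole number, a space, and another whole number at the end of the text.
anchored = re.search(r"\b([0-9]+) [0-9]+$", text).group(1)
print(anchored)  # 87
```

The digits of 155 start no match because no space follows them, which is consistent with the list of matches in the question.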