Regex to match the end of string doesn't work

I want to recognize the house number in a given string. Here you can find some sample inputs:
"My house number is 23"
"23"
"23a"
"23 a"
"The house number is 23 a and the street ist XY"
"The house number is 23 a"
I have the following regex:
\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)
But it is not able to capture the inputs which have the number followed by a letter at the end of the line (e.g. the house number is 23 a).
Any help would be appreciated.
PS: Ultimately I need the regex in TypeScript.

If I got your problem correctly, this should work:
(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))
Note: [\s]{0,1} is the same as \s?
https://regex101.com/r/r6WHFy/1
The issue in your regex was that The house number is 23 a already satisfies the ([\s]{0,1}[a-zA-Z]{0,1}[\s])* alternative (it consumes the space after 23), so the engine never needs to try the alternative that contains the end-of-string anchor $.
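To make the difference concrete, here is a minimal TypeScript sketch that runs both patterns against the failing input. The helper name firstMatch is made up for this illustration; the two patterns are copied from the question and from this answer.

```typescript
// Pattern from the question and the corrected pattern from this answer.
const originalPattern = /\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)/;
const fixedPattern = /(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))/;

// Hypothetical helper: returns the full match text, or null when nothing matches.
function firstMatch(pattern: RegExp, input: string): string | null {
  const m = pattern.exec(input);
  return m ? m[0] : null;
}

const sample = "The house number is 23 a";

// JSON.stringify makes the trailing space visible in the output.
console.log(JSON.stringify(firstMatch(originalPattern, sample))); // "23 "
console.log(JSON.stringify(firstMatch(fixedPattern, sample)));    // "23 a"
```

Note that the corrected pattern can still include a trailing space when more text follows the letter, so trimming the result may be worthwhile.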

You could also write the pattern using word boundaries and without using an alternation |
\b\d+(?:\s*[a-zA-Z])?\b
\b A word boundary
\d+ Match 1+ digits
(?:\s*[a-zA-Z])? Optionally match zero or more whitespace chars followed by a single letter a-zA-Z
\b A word boundary
const regex = /\b\d+(?:\s*[a-zA-Z])?\b/;
[
"My house number is 23",
"23",
"23a",
"23 a",
"The house number is 23 a and the street ist XY",
"The house number is 23 a"
].forEach(s => console.log(s.match(regex)[0]));
Regex demo
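Since the question asks for TypeScript, the snippet above can be wrapped in a null-safe helper, because match returns null when there is no match and strict TypeScript rejects indexing it directly. This is only a sketch; the function name extractHouseNumber is an assumption, not something from the question.

```typescript
const houseNumberPattern = /\b\d+(?:\s*[a-zA-Z])?\b/;

// Hypothetical helper: returns the house number, or null when the input has none.
function extractHouseNumber(input: string): string | null {
  const m = input.match(houseNumberPattern);
  return m ? m[0] : null;
}

console.log(extractHouseNumber("The house number is 23 a")); // 23 a
console.log(extractHouseNumber("23 and counting"));          // 23
console.log(extractHouseNumber("no digits here"));           // null
```

The second call shows what the trailing \b buys: the a of "and" is not taken as a house-number suffix, because no word boundary follows it.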

Related

Exclude word and quotes from regexp

I have the following phrases:
Mr "Smith"
MrS "Smith"
I need to retrieve only Smith from these phrases. I tried thousands of variants and stopped at
(?!Mr|MrS)([^"]+).
Help, please.
The pattern (?!Mr|MrS)([^"]+) asserts from the current position that what is directly to the right is not Mr or MrS and then captures 1+ occurrences of any char except "
So it will not start the match at Mr, but it will at r, because at the position before the r the lookahead assertion is true.
Instead of using a lookaround, you could match either Mr or MrS and capture what is in between double quotes.
\mMrS? "([^"]+)"
\m Matches only at the beginning of a word (PostgreSQL syntax)
MrS? Match Mr with an optional S
" Match a space and "
([^"]+) capture in group 1 what is between the "
" Match "
See a postgresql demo
For example
select REGEXP_MATCHES('Mr "Smith"', '\mMrS? "([^"]+)"');
select REGEXP_MATCHES('MrS "Smith"', '\mMrS? "([^"]+)"');
Output
regexp_matches
1 Smith
regexp_matches
1 Smith
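The same behaviour can be reproduced outside PostgreSQL. The following TypeScript sketch is an adaptation, not the answer's own code: \m is PostgreSQL-specific, so \b is used here as the closest JavaScript equivalent, and the helper name quotedName is made up for illustration.

```typescript
// The pattern from the question: the lookahead only blocks a match that STARTS at "Mr".
const lookaheadAttempt = /(?!Mr|MrS)([^"]+)/;

// Match the title, then capture what is between the double quotes.
// \b stands in for PostgreSQL's \m (assumption for this sketch).
const titleThenQuoted = /\bMrS? "([^"]+)"/;

// Hypothetical helper: returns the quoted name, or null when the input does not fit.
function quotedName(input: string): string | null {
  const m = titleThenQuoted.exec(input);
  return m ? m[1] : null;
}

const attempt = lookaheadAttempt.exec('Mr "Smith"');
console.log(JSON.stringify(attempt ? attempt[0] : null)); // "r "
console.log(quotedName('Mr "Smith"'));  // Smith
console.log(quotedName('MrS "Smith"')); // Smith
```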

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed by an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
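As a quick check of the pattern details above, here is a small TypeScript sketch that applies the pattern with the global flag to a few of the sample lines. The helper name highlighted is an assumption for this example; the question does not say which language the regex runs in.

```typescript
// Same pattern as above, with the g flag so that match() returns every hit.
const highlightPattern = /\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b/g;

// Hypothetical helper: returns all tokens that would be highlighted in a line.
function highlighted(line: string): string[] {
  return line.match(highlightPattern) || [];
}

console.log(highlighted("JOHN went to LONDON one fine day."));    // [ 'LONDON' ]
console.log(highlighted("I don't want JOHN to be highlighted.")); // []
console.log(highlighted("But THIS1 should match the pattern."));  // [ 'THIS1' ]
console.log(highlighted("Also the other 70 times that the pattern should match.")); // [ '70' ]
```

Because the lookahead ends in \b, only the whole words JOHN and I are excluded; a longer word that merely starts with JOHN would still match.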

Extract only ALLCAPS words with regex

Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.
For example, how do I use regex to extract KENTUCKY from the following sentence:
There Are Many Options in KENTUCKY
I'm trying to do this using regexextract() in Google Sheets, which uses RE2.
Looking forward to hearing your thoughts.
Pretending that your text is in cell A2:
If there is only one instance in each text segment this will work:
=REGEXEXTRACT(A2,"([A-Z]{2,})")
If there are multiple instances in a single text segment, then use this; it will dynamically adjust the regex to extract every occurrence for you:
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
If you need to extract whole chunks of words in ALLCAPS, use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")
See this regex demo.
Details
\b - word boundary
[A-Z]+ - 1+ uppercase ASCII letters (\p{Lu} matches any Unicode uppercase letter, not only A-Z)
(?:\s+[A-Z]+)* - zero or more repetitions of
\s+ - 1+ whitespaces
[A-Z]+ - 1+ uppercase ASCII letters (\p{Lu} matches any Unicode uppercase letter, not only A-Z)
\b - word boundary.
Or, if you allow any punctuation or symbols between uppercase letters, you may use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")
See the regex demo.
Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.
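To see how the "whole chunks" pattern behaves outside Google Sheets, here is a TypeScript sketch of the ASCII variant. It is an illustration only: the helper name extractAllCaps is made up, and unlike REGEXEXTRACT, which as far as I know returns just the first match, the g flag here collects every match.

```typescript
// ASCII variant of the chunk pattern from above, with the g flag to collect all matches.
const allCapsChunks = /\b[A-Z]+(?:\s+[A-Z]+)*\b/g;

// Hypothetical helper: returns every run of ALLCAPS words found in the text.
function extractAllCaps(text: string): string[] {
  return text.match(allCapsChunks) || [];
}

console.log(extractAllCaps("There Are Many Options in KENTUCKY")); // [ 'KENTUCKY' ]
console.log(extractAllCaps("Visit NEW YORK CITY today"));          // [ 'NEW YORK CITY' ]
```

Mixed-case words such as There are skipped because the word boundary after the capital letter fails. A standalone single capital such as I would still count as a match, so a {2,} quantifier may be preferable if that matters.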
This should work:
\b[A-Z]+\b
See demo
2nd EDIT ALL CAPS / UPPERCASE solution:
I finally arrived at this simpler approach with help from other great solutions here and here:
=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
From this input:
isn'ter JOHN isn'tar DOE isn'ta or JANE
It returns this output:
JOHN DOE JANE
The same for Title Case (extracting all capitalized words, i.e. words whose first letter is uppercase):
Formula:
=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
Input in C1:
The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.
Output in A1:
The Quick Brown Fox Jumps Over Lazy Dog
Previous, less efficient attempts:
I had to custom tailor it that way for my use case:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
"")))
This goes one by one over all the exceptions and adds their respective regex formulations to the front of the pipe-separated alternatives in the regexextract function.
@Wiktor Stribiżew Any simplifying suggestions would be very welcome.
Found some missing and fixed them.
1st EDIT:
A simpler version though still quite lengthy:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
" "), "-|\.", " "), "(^[A-Z]\s)", " "
), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
P3: P, "[a-z,]", " "), "-|\.", " "),
"(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" ")))
From this example:
Multiple regex matches in Google Sheets formula

Word find/replace not being fully lazy

I am using a wildcard find/replace involving the following find field:
([0-9]*)
(Please note that there should be a space at the end of the field even though I can't get it to stick on here on SO)
When I search on the text:
13 April Boon 87 155
(Just because it's not visually clear here, everything should be tab-separated except for the "87 155" and "April Boon", which have spaces.)
Since post-star is (nominally) a lazy evaluator, I would expect this to match only "87 ". This is the result that I want!
But it is making 4 matches:
"13 April "
"3 April "
"87 "
"7 "
This is all the more mysterious to me because it is NOT matching "13 April Boon 87 " or "3 April Boon 87 "
What's going on here? How can I get the match that I seek?
Thanks in advance!
Your wildcard pattern works as expected. Your pattern ([0-9]*) matches:
([0-9] - (Capture group 1, can be referenced with \1) a digit
*) - any characters but as few as possible up to the first...
- space.
Since matches are found from left to right, you have 4 matches. [0-9] matches a digit.
You can capture only 87 with a wildcard pattern like (<[0-9]@>) <[0-9]@>^13.
(<[0-9]@>) - a whole "word" containing one or more digits
- a space
<[0-9]@> - a whole "word" containing one or more digits
^13 - carriage return

match groups of words and digits regex in any order

suppose you have the following string:
"7 apples and 13 oranges"
/(\d+).*?(apples)/i
The above regex will match 7 apples, but if you swap the order and numbers to "45 oranges and 9 apples", it will match the first number, 45, rather than the number corresponding to apples, which is what I want.
How can I write a regex to match and return match groups of digits + apples if I write the sentence in the following two orders:
"7 apples and 13 oranges"
"13 oranges 52 apples"
ie, I'd like to match 7 apples, with the match groups of 7 and apples AND 52 apples with the match groups 52 and apples.
Where does /(\d+).*?(apples)/i go wrong?
Even though .*? is lazy, it matches everything from the digit to the next apples,
which means that for the string
"13 oranges 52 apples"
it matches from 13 to the apples at the end of the string, since . matches any character.
See the link for an illustration: http://regex101.com/r/uL5eX0/2
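The effect is easy to reproduce. This short TypeScript sketch (variable names are made up for the illustration) runs the question's pattern on the second sentence and prints what each group captured.

```typescript
// The pattern from the question.
const lazyDotPattern = /(\d+).*?(apples)/i;

const lazyResult = lazyDotPattern.exec("13 oranges 52 apples");

if (lazyResult) {
  console.log(lazyResult[0]); // 13 oranges 52 apples
  console.log(lazyResult[1]); // 13
  console.log(lazyResult[2]); // apples
}
```

The match starts at the leftmost position where it can succeed, which is 13; laziness only controls how far .*? extends from there, not where the match begins.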
How to correct it?
Since the character separating your digit and apples is a space, you can use \s instead of . as in
(\d+)\s(apples)
which matches 7 and 52, as seen in http://regex101.com/r/uL5eX0/3
To be safe, you can use
(\d+)\s+(apples)
which allows one or more whitespace characters between the digit and apples.
A word boundary \b can also be used for extra safety.
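Putting the \s+ version and the trailing \b together, a TypeScript sketch could look as follows. The interface FruitCount and the function findApples are hypothetical names introduced for this example.

```typescript
// Hypothetical result shape for this sketch.
interface FruitCount {
  count: number;
  fruit: string;
}

// Digits, one or more whitespace characters, then the whole word "apples".
const countThenApples = /(\d+)\s+(apples)\b/i;

// Hypothetical helper: returns the number directly in front of "apples", or null.
function findApples(sentence: string): FruitCount | null {
  const m = countThenApples.exec(sentence);
  return m ? { count: Number(m[1]), fruit: m[2] } : null;
}

console.log(findApples("7 apples and 13 oranges")); // { count: 7, fruit: 'apples' }
console.log(findApples("13 oranges 52 apples"));    // { count: 52, fruit: 'apples' }
console.log(findApples("3 applesauce jars"));       // null
```

The last call shows the effect of \b: without it, the pattern would also accept the apples inside applesauce.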
(\d+)(?=\s*(apples))
Try this. Use a positive lookahead. See demo.
http://regex101.com/r/yG7zB9/17
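One difference from the other answers is worth seeing in code: with the lookahead, apples is captured in group 2 but is not part of the matched text itself. A minimal TypeScript sketch, with variable names chosen only for this illustration:

```typescript
// Digits that are followed (after optional whitespace) by "apples".
// The lookahead checks and captures "apples" without consuming it.
const digitsBeforeApples = /(\d+)(?=\s*(apples))/;

const lookaheadMatch = digitsBeforeApples.exec("13 oranges 52 apples");

if (lookaheadMatch) {
  console.log(lookaheadMatch[0]); // 52
  console.log(lookaheadMatch[1]); // 52
  console.log(lookaheadMatch[2]); // apples
}
```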
Use this pattern:
(\d+)\s++(apples\b)
By popular demand from the crowd:
(\d+)\s+(apples\b)
Demo
You could simply add \D*? instead of .*? where . would match the in-between digit but \D wouldn't.
(\d+)\D*?(apples)
\D*? Non-greedy match of any non-digit character, zero or more times.
DEMO
What's wrong with your regex?
(\d+).*?(apples)
The regex engine tries to match the pattern from left to right. So \d+ matches the first number, and .*?(apples) forces the engine to consume all the characters up to the string apples. Use \D*? instead of .*? to make the engine match any non-digit character, zero or more times.
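A short TypeScript sketch of the \D*? variant, with a hypothetical helper name numberBeforeApples, shows both the fix and one caveat of this approach:

```typescript
// \D*? can cross letters and spaces, but never another digit.
const nonDigitGap = /(\d+)\D*?(apples)/;

// Hypothetical helper: returns the captured number as a string, or null.
function numberBeforeApples(sentence: string): string | null {
  const m = nonDigitGap.exec(sentence);
  return m ? m[1] : null;
}

console.log(numberBeforeApples("7 apples and 13 oranges")); // 7
console.log(numberBeforeApples("13 oranges 52 apples"));    // 52
console.log(numberBeforeApples("13 oranges and apples"));   // 13
```

The last call is the caveat: when no second number sits in between, \D*? still skips over other words, so 13 gets paired with apples. If the number must directly precede apples, the \s+ form from the earlier answer is the stricter choice.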