I need a regex to split a name into first name, family name (surname) and everything in between as (possibly empty) middle names. Several questions on Stack Overflow handle this, but they don't handle the following names, which use common European layouts:
Gloria VanderBilt
Gloria van der Bilt
Gloria v.d. Bilt
G. v.d. Bilt
We humans have no problem recognizing the first name, the middle names and the family name. However, a regular expression for this is not so simple.
After trying, I've got the following RegEx:
^\b(\w+)\b(.*)\b(\w+)\b
Select three items:
A word in the beginning,
then as many characters as possible,
finally a word at the end.
The first three names are correct. I even get "Gloria", "v.d.", "Bilt" as three separate items, including correct punctuation.
Alas, the last name gives problems with the punctuation:
"G" without the dot!
". v.d." too many dots
"Bilt"
So as a nice puzzle: what should be the regex?
You could go for
^ # match beginning of the line/string
(?P<first>[\w.-]+) # match word characters (letters, digits, underscore), dashes and dots
\h* # horizontal whitespaces, zero or more
(?P<middle>.+) # at least one character (can be a whitespace)
\h* # horizontal whitespaces, zero or more
\b(?P<last>\w+) # a word boundary, followed by word characters
$ # the end of the line / string
See a demo on regex101.com.
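For readers who want to try this from Python, here is a minimal sketch of the same idea. It rests on two assumptions that are not part of the answer above: \h is replaced by [ \t] because Python's re module has no \h, and the dash sits at the end of the character class so that it is read as a literal. The names NAME_RE and split_name are made up for this example.

```python
import re

# Port of the pattern above. Assumptions: [ \t] stands in for \h, and the
# dash is placed last in the character class so it is taken literally.
NAME_RE = re.compile(
    r"^(?P<first>[\w.-]+)[ \t]*(?P<middle>.+)[ \t]*\b(?P<last>\w+)$"
)

def split_name(name):
    """Return (first, middle, last), or None if the name does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    # The greedy middle group keeps the surrounding spaces, so trim them.
    return m.group("first"), m.group("middle").strip(), m.group("last")

names = ["Gloria VanderBilt", "Gloria van der Bilt", "Gloria v.d. Bilt", "G. v.d. Bilt"]
for name in names:
    print(split_name(name))
```

Because the middle group needs at least one character, a name without middle names ends up with a single space in that group, which is why the sketch strips it.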
Related
/[\w|A-Z]{1,3}[a-z]/g
but I want to match only the first 3 char of words.
For example:
I WANt THE FIRst 3 CHAr OF WORds ONLy.
It's for a speed-reading tool: only the beginning of each word is uppercased.
The best could be: (First 3 char)(Rest of the word or space)
https://regex101.com/r/PCi8Dn/2
Thank you!
Original answer
Use a positive lookahead, (?=pattern), to check what follows without including it in the match.
[A-Z]{1,3}(?=[a-z])
appears to do what you want (if I've understood your spec correctly).
You can see it in action here.
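As a quick illustration of the lookahead, here is a small Python sketch run against the example line from the question. The function name leading_caps is only chosen for this example.

```python
import re

def leading_caps(text):
    # Up to three uppercase letters, but only when a lowercase letter
    # follows; the lowercase letter itself is not part of the match.
    return re.findall(r"[A-Z]{1,3}(?=[a-z])", text)

print(leading_caps("I WANt THE FIRst 3 CHAr OF WORds ONLy."))
# ['WAN', 'FIR', 'CHA', 'WOR', 'ONL']
```

Words that are entirely uppercase, such as THE and OF, are skipped because no lowercase letter follows them.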
New answer following clarification on spec
I think this does what you want:
(\S{1,3})(\S*[\s\.]+)
The breakdown is:
1st capturing group: (\S{1,3})
Matches a maximum of 3 non-space characters (\S is used instead of \w because I think you want to match characters with diacritics like à, and punctuation in the middle of words like ').
2nd capturing group: (\S*[\s\.]+)
Matches zero or more non-space characters (the remaining characters in each word) followed by one or more delimiter characters (space or period). I included period as a delimiter to match the last word. You might want to adjust that part depending on your exact needs.
See it in action here.
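One way to use the two capturing groups is a substitution that uppercases group 1 and leaves group 2 alone. The following Python sketch assumes that is the intended use; the names WORD_RE and emphasize are invented for this example.

```python
import re

# Group 1: the first one to three non-space characters of a word.
# Group 2: the rest of the word plus the delimiters that follow it.
WORD_RE = re.compile(r"(\S{1,3})(\S*[\s.]+)")

def emphasize(text):
    # Uppercase group 1 only and keep group 2 unchanged.
    return WORD_RE.sub(lambda m: m.group(1).upper() + m.group(2), text)

print(emphasize("I want the first 3 char of words only."))
# I WANt THE FIRst 3 CHAr OF WORds ONLy.
```

A final word with no trailing space or period is not matched by this pattern, so it would be left unchanged.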
I am creating regexes that get the whole sentence if a piece of specific information exists. Right now I am working on my name regex, so if there is any composed name (example: "Jorge Martel", "Jorge Martel del Arnold Albuquerque") the regex should get the whole sentence that has the name.
If I have these two sentences:
(1) - "A hardworking guy is working at the supermarket. They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
The regex should return these two results from the sentences above:
(1) - "They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
This is my regex:
(?:(?(?<=[\.!?]\s([A-Z]))(.+?[^.])|))?((?:(?:[A-Z][A-zÀ-ÿ']+\s(?:(?:(?:[A-zÀ-ÿ']{1,3}\s)?(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']*\s?))+))\b)(.+?[\.!?](?:\s|\n|\Z)))
Basically, it checks whether there is a period, exclamation mark, or question mark followed by a blank space and an uppercase character, and if so tells the regex that everything after that must be selected; otherwise it should get the whole sentence.
My else case (|) right now is empty, because using (.+?) avoids my first condition...
Regex without the else case:
Validates until the dot, but doesn't get the second sentence.
Regex with the else case:
Validates the second sentence, but overrides the first condition that appears in the first sentence.
I expect my regex to return correctly the sentences:
"They call him Jorge Horizon, but that's not his real name."
"He has an identity document that contains the name, Jorge Martel Arnold."
I have also created a text to validate the regex operations as I will be using it a lot in texts. I added a lot of conditions in this text, which will probably appear in my daily work.
Check my regex, sentence, and text here:
Does anyone know what I should change in my regex? I have tried many variations and still cannot find the solution.
P.S.: I intend to use it in my Python code, but I need to fix it in the regex and not in the Python code.
You can try this:
[\w\ \,\']+\.\ ?([\w\ \,\']+\.)|^([\w\ \,\']+\.)$
Print $1$2. That is, if group 1 is empty it contributes nothing, since there was no match, and group 2 is printed. Vice versa, group 1 is printed when group 2 is not there.
[\w\ \,\']+\.\ ?([\w\ \,\']+\.) - matches anything of the form XXX. XXX.
then
^([\w\ \,\']+\.)$ - the text must start and end with exactly one sentence.
Though honestly this can easily be done with a tokenizer that splits on (.) and checks for a length of 1 or 2. Using a regex here is really like using a sledgehammer to hammer a nail.
Matching names can be a very hard job using a regex, but if you want to match at least 2 consecutive capitalized words using the specified ranges, you can try the pattern below.
Assuming the names start with an uppercase char A-Z (else you can extend that character class as well with the allowed chars or if supported use \p{Lu} to match an uppercase char that has a lowercase variant):
(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)
(?<!\S) Assert a whitespace boundary to the left
[A-Z][A-Za-zÀ-ÿ]* Match an uppercase char A-Z optionally followed by matching the defined ranges
(?:\s+[a-zÀ-ÿ,]+)* Optionally repeat matching 1+ whitespace chars and 1 or more of the ranges
\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]* Match 2 times whitespace chars followed by an uppercase A-Z and optional chars defined in the character class
.*?[.!?] Match as few chars as possible, followed by one of . ! or ?
(?!\S) Assert a whitespace boundary to the right
Regex demo
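Since the question mentions Python, here is a small sketch that applies the pattern to the two example sentences from the question. The variable names are chosen for this example only. The pattern contains no capturing groups, so re.findall returns the full matches.

```python
import re

SENTENCE_RE = re.compile(
    r"(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*"
    r"\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)"
)

text = (
    "A hardworking guy is working at the supermarket. "
    "They call him Jorge Horizon, but that's not his real name. "
    "He has an identity document that contains the name, Jorge Martel Arnold."
)

sentences = SENTENCE_RE.findall(text)
for sentence in sentences:
    print(sentence)
```

The first sentence is skipped because no two capitalized words follow its lowercase run, while the other two sentences are returned whole.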
Try this:
((?:^|(?:[^\.!?]*))[^\.!?\n]*(?:(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']+\s?){2,}[^\.!?]*[\.!?]))
It will capture sentences where name has at least two words, e.g. His name is John Smith.
It won't capture sentences like: John went to a concert.
I am trying to capture n consecutive capitalized words. My current code is
import re
n = 5
a = 'This is a Five Gram With Five Caps and it also contains a Two Gram'
re.findall(' ([A-Z]+[a-z|A-Z]* ){%d}' % n, a)
Which returns the following:
['Caps ']
It's identifying the fifth consecutive capitalized word, but I would like it to return the entire string of capitalized words. In other words:
[' Five Gram With Five Caps ']
Note that | doesn't act as an OR inside a character class; it matches | literally. The other issue here is that findall returns the whole match only when the pattern contains no groups (although Python's documentation doesn't really make this clear):
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups
So this is why you're getting the result of the first capture group, which holds the last capitalized word it matched, Caps.
The simple solution is to change your capturing group to a non-capturing group. I've also changed the space at the start to \b so as to not match an additional whitespace (which I presume you were planning on trimming anyway).
See code in use here
import re
r = re.compile(r"\b(?:[A-Z][a-zA-Z]* ){5}")
s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(r.findall(s))
See regex in use here
\b(?:[A-Z][a-zA-Z]* ){5}
\b Assert position as a word boundary
(?:[A-Z][a-zA-Z]* ){5} Match the following exactly 5 times
[A-Z] Match an uppercase ASCII letter once
[a-zA-Z]* Match any ASCII letter any number of times
" " Match a literal space
Result: ['Five Gram With Five Caps ']
Additionally, you may use the regex \b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}\b instead. This will allow matches at the start/end of the string as well as anywhere in the middle without grabbing extra whitespace. Another alternative is (?:^|(?<= ))[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}(?= |$)
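To keep n as a variable, as in the question, the word-boundary variant can be built from n at run time. This is only a sketch: the function name capitalized_runs is made up here, and the repeat count is n - 1 because the first word is matched separately.

```python
import re

def capitalized_runs(text, n):
    word = r"[A-Z][a-zA-Z]*"
    # One capitalized word followed by n - 1 more, each preceded by a space.
    # The doubled braces produce a literal {..} quantifier in the f-string.
    pattern = rf"\b{word}(?: {word}){{{n - 1}}}\b"
    return re.findall(pattern, text)

s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(capitalized_runs(s, 5))  # ['Five Gram With Five Caps']
print(capitalized_runs(s, 2))  # ['Five Gram', 'With Five', 'Two Gram']
```

Matches do not overlap, so a longer run is cut into consecutive chunks of n words, as the second call shows.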
Wrap the whole pattern in a capturing group:
(([A-Z]+[a-z|A-Z]* ){5})
Demo
I want to match specific strings, but only between the beginning and the 5th word of an article title.
Input string:
The 14 best US colleges in the West are dominated by California — here's who makes the cut.
regex:
/^.*(\bbest\b|\btop\b|\bhot\b).*$/
Currently it matches against the whole article title, but I want to search only up to "colleges".
I also need to ignore strings like laptop, hot-spot, etc.
You can use this expression
^((?:\w+\s?){1,5}).*
Explanation:
^ assert position at start of the string
\w+ match one or more word characters
\s? match an optional whitespace character
{1,5} Quantifier - Between 1 and 5 times, as many times as possible
.* matches any character (except newline)
This matches the first 5 words (and spaces).
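A short Python sketch of how the first capturing group could be read out. The helper name first_five_words is hypothetical, and the trailing space that \s? picks up is stripped here.

```python
import re

def first_five_words(title):
    m = re.match(r"^((?:\w+\s?){1,5}).*", title)
    # Group 1 holds up to five words, including a possible trailing space.
    return m.group(1).strip() if m else None

print(first_five_words("The 14 best US colleges in the West"))
# The 14 best US colleges
```

Searching for best, top or hot can then be limited to the returned string.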
^(\w+\s){0,4}\b(best|top|hot)(\s|$)
You want to match a string within the first five words of the input sentence. Counting from the start of the sentence, there must then be 0-4 words before the word you want to match, so you need ^(\w+\s){0,4} before the specific words. See https://regex101.com/r/nS0dU6/4
regex101 comes to help again.
^(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-))(\w+(?:\s\w+){0,4})
(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-)) checks that the keyword is within the first 5 words (note that (?!-) is added to cater for words such as hot-spot)
(\w+(?:\s\w+){0,4}) then matches up to the first 5 words
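The behaviour of this pattern can be checked with a few titles in Python. This is a sketch only: headline_prefix is a name invented for this example, and the extra test titles are made up to exercise the laptop and hot-spot cases from the question.

```python
import re

KEYWORD_RE = re.compile(
    r"^(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-))(\w+(?:\s\w+){0,4})"
)

def headline_prefix(title):
    """Return the first (up to) five words if a keyword is among them."""
    m = KEYWORD_RE.match(title)
    return m.group(1) if m else None

print(headline_prefix("The 14 best US colleges in the West"))      # The 14 best US colleges
print(headline_prefix("A new laptop for every student in class"))  # None
print(headline_prefix("Find a hot-spot near you"))                 # None
```

A keyword that only appears as the sixth word or later is rejected as well, because the lookahead allows at most four words before it.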
I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
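For comparison, the same regex can be used from Python. This is a direct translation of the Ruby example rather than a different technique; the names ADDRESS_RE and street_name are chosen for this sketch.

```python
import re

# Digits and hyphens, a space, the street name (group 1), a space,
# then a final word with an optional period.
ADDRESS_RE = re.compile(r"^[-\d]+ ([\w ']+) \w+\.?$")

def street_name(address):
    m = ADDRESS_RE.match(address)
    return m.group(1) if m else None

for address in ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]:
    print(street_name(address))
```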
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
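Combined with the lookbehind from the question, the lookahead could be tried out as follows in Python. This is a sketch under two assumptions: the question's (\w|\s|\')+ is written as the equivalent character class [\w\s']+, and the variant that tolerates trailing whitespace is used. STREET_RE and street_name are names made up for this example.

```python
import re

# Lookbehind: a digit and a whitespace character come right before the name.
# Lookahead: only one final word (optional period, optional trailing
# whitespace) remains after the name.
STREET_RE = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?\s*$)")

def street_name(address):
    m = STREET_RE.search(address)
    return m.group(0) if m else None

for address in ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]:
    print(street_name(address))
```

Here the street name is the whole match rather than a capture group, since both assertions are zero-width.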
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+) Note: the pattern ends with a space after the closing parenthesis.
That will give you what you want in \1.
If you want the match itself to be exactly that text, you may use \K (not supported by every regex engine). Example:
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = ' '.join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)