I have this flat file and I want to make sure I only display records where the first name is Andrew:
00012 Andrew Carter
02349 John Smith
20089 Charlotte Andrew
Each line contains, in order, three fields: five-digit employee number, first name, and last name. Each field is delimited by a space.
I think you want to find all lines that match the following pattern:
\d+\sAndrew.+
Or, as @Sam Sullivan points out, you could also specify the number of digits:
\d{5}\sAndrew.+
If you have set your regex options to allow the dot to match newline characters, you could use [^\n]+ instead of the final .+, as @Sam Sullivan also points out. But as @Casimir et Hippolyte notes, by default the dot will not match newline characters.
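As a rough illustration, here is a minimal Python sketch that applies the five-digit pattern to the three sample lines. The variable names and the choice of re.match are assumptions made for this example, not part of the original answer.

```python
import re

# Sample records from the question: employee number, first name, last name.
lines = [
    "00012 Andrew Carter",
    "02349 John Smith",
    "20089 Charlotte Andrew",
]

# Five digits, one whitespace character, the literal first name, then the rest.
pattern = re.compile(r"\d{5}\sAndrew.+")

# re.match only tries the pattern at the start of each line, so a last name
# of "Andrew" (third record) is not picked up.
andrews = [line for line in lines if pattern.match(line)]
print(andrews)  # ['00012 Andrew Carter']
```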
([0-9]{5})\s(Andrew)\s([A-Za-z\s.,-]{1,})
Each pair of parentheses captures one of the three pieces of information:
00012, Andrew, Carter
This requires five digits, a space, the name Andrew (case sensitive), another space, and then whatever the last name is, including compound surnames and suffixes.
The third capture group accepts an uppercase letter A-Z, a lowercase letter a-z, whitespace, a period, a comma, or a dash.
So "Carter-Smith, Jr. M.D." is a valid last name.
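To see the three captures in practice, here is a small Python sketch. Using re.fullmatch (so the whole line must fit the pattern) is an assumption for the example; the pattern itself is the one given above.

```python
import re

# Three capture groups: employee number, first name, last name.
record = re.compile(r"([0-9]{5})\s(Andrew)\s([A-Za-z\s.,-]{1,})")

simple = record.fullmatch("00012 Andrew Carter")
print(simple.groups())  # ('00012', 'Andrew', 'Carter')

# A longer surname with a hyphen, comma, and periods still fits group 3.
long_name = record.fullmatch("00012 Andrew Carter-Smith, Jr. M.D.")
print(long_name.group(3))  # Carter-Smith, Jr. M.D.
```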
First off, this is a case where tools can be your friend.
Check out
expresso: "http://www.ultrapico.com/expresso.htm"
It's great for design and testing of regexes
Also there is
RegexCoach: http://www.weitz.de/regex-coach/
Which actually lets you step through regexes like normal code
On to your question though:
Begin line: ^
Five digits: \d\d\d\d\d
Space: \s
Name: Andrew
Space: \s
LastName: [A-Za-z]+
End: $
So:
^\d\d\d\d\d\sAndrew\s[A-Za-z]+$
Disclaimer - not tested, but pretty confident :)
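A quick way to check a pattern like this is to run it against the sample lines. The Python sketch below is only an illustration with assumed variable names: it uses \s for the two spaces and also tries \w in the same positions, to show that \w (a word character) does not match a space.

```python
import re

# Anchored pattern with \s (whitespace) between the fields.
with_s = re.compile(r"^\d\d\d\d\d\sAndrew\s[A-Za-z]+$")
# Same pattern with \w (word character) in place of \s, for comparison.
with_w = re.compile(r"^\d\d\d\d\d\wAndrew\w[A-Za-z]+$")

line = "00012 Andrew Carter"
s_result = bool(with_s.search(line))
w_result = bool(with_w.search(line))
other = bool(with_s.search("20089 Charlotte Andrew"))

print(s_result)  # True
print(w_result)  # False, because \w cannot match the space
print(other)     # False, Andrew is the last name here
```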
I have an input address list like this:
St. Washington, 80
7-th mill B.O., 34
Pr. Lakeview, 17
Pr. Harrison, 15 k.1
St. Hillside Avenue, 26
How can I match only the words from these addresses and get a result like this:
Washington
mill
Lakeview
Harrison
Hillside Avenue
The pattern (\w+) doesn't help in my case.
It's difficult to know what a "perfect" solution here looks like, as such input might encounter all sorts of unexpected edge cases. However, here's my initial attempt which does at least correctly handle all five examples you have given:
(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )
Explanation:
(?<= ) is a look-behind for a space. I chose this rather than the more standard \b "word boundary" because, for example, you don't want the th in 7-th or the O in B.O. to be counted as a "word".
[a-zA-Z][a-zA-Z ]* is matching letters and spaces only, where the first matched character must be a letter. (You could also equivalently make the regex case-insensitive with the /i option, and just use a-z here.)
(?=,| ) is a look-ahead for a comma or space. Again I chose this rather than the more standard \b "word boundary" because, for example, you don't want the B in B.O. to be counted as a "word".
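The explanation above can be checked with a short Python sketch. The list name and the use of re.findall are assumptions for the example; the pattern and the addresses come from the question and answer.

```python
import re

addresses = [
    "St. Washington, 80",
    "7-th mill B.O., 34",
    "Pr. Lakeview, 17",
    "Pr. Harrison, 15 k.1",
    "St. Hillside Avenue, 26",
]

# Look-behind for a space, letters and spaces, look-ahead for a comma or space.
word_pattern = re.compile(r"(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )")

results = [word_pattern.findall(address) for address in addresses]
for address, words in zip(addresses, results):
    print(address, "->", words)
```

Each address yields exactly one match here, but as the answer says, other inputs may hit edge cases these five examples do not cover.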
I am creating regexes that get the whole sentence if a piece of specific information exists. Right now I am working on my name regex, so if there is any composed name (example: "Jorge Martel", "Jorge Martel del Arnold Albuquerque") the regex should get the whole sentence that has the name.
If I have these two sentences:
(1) - "A hardworking guy is working at the supermarket. They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
The regex should return these two results from the sentences above:
(1) - "They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
This is my regex:
(?:(?(?<=[\.!?]\s([A-Z]))(.+?[^.])|))?((?:(?:[A-Z][A-zÀ-ÿ']+\s(?:(?:(?:[A-zÀ-ÿ']{1,3}\s)?(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']*\s?))+))\b)(.+?[\.!?](?:\s|\n|\Z)))
Basically, it checks whether there is a period, exclamation mark, or question mark followed by a blank space and an uppercase character, and if so tells the regex to select everything from there; otherwise it should get the whole sentence.
My else case (|) right now is empty, because using (.+?) avoids my first condition...
Regex without the else case: validates until the dot, but doesn't get the second sentence.
Regex with the else case: validates the second sentence, but overrides the first condition that appears in the first sentence.
I expect my regex to return correctly the sentences:
"They call him Jorge Horizon, but that's not his real name."
"He has an identity document that contains the name, Jorge Martel Arnold."
I have also created a text to validate the regex operations as I will be using it a lot in texts. I added a lot of conditions in this text, which will probably appear in my daily work.
Check my regex, sentence, and text here:
Does anyone know what I should change in my regex? I have tried many variations and still cannot find the solution.
P.S.: I intend to use it in my Python code, but I need to fix it in the regex itself and not in the Python code.
You can try this:
[\w\ \,\']+\.\ ?([\w\ \,\']+\.)|^([\w\ \,\']+\.)$
This prints $1$2. That is, if group 1 is empty it contributes nothing, since there is no match, and group 2 is printed. Vice versa, group 1 is printed when group 2 is empty.
[\w\ \,\']+\.\ ?([\w\ \,\']+\.) - matches anything of the form XXX. XXX.
then
^([\w\ \,\']+\.)$ - the text must start and end with only one sentence.
Honestly, though, this can easily be done with a tokenizer that splits on (.) and checks for a length of 1 or 2. Using a regex here is really like using a sledgehammer to hammer a nail.
Matching names with a regex can be very hard, but the pattern below matches sentences that contain at least 2 consecutive capitalized words, using the specified ranges.
Assuming the names start with an uppercase char A-Z (else you can extend that character class as well with the allowed chars or if supported use \p{Lu} to match an uppercase char that has a lowercase variant):
(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)
(?<!\S) Assert a whitespace boundary to the left
[A-Z][A-Za-zÀ-ÿ]* Match an uppercase char A-Z optionally followed by matching the defined ranges
(?:\s+[a-zÀ-ÿ,]+)* Optionally repeat matching 1+ whitespace chars and 1 or more chars from the ranges
\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]* Match 2 times whitespace chars followed by an uppercase A-Z and optional chars defined in the character class
.*?[.!?] Match as few chars as possible, followed by one of . ! or ?
(?!\S) Assert a whitespace boundary to the right
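Since the question mentions Python, here is a minimal sketch that runs the pattern above against the two example texts with Python's re module. The variable names are assumptions for the example, and only these two inputs were considered; real text may behave differently.

```python
import re

# The pattern from this answer, split across two string literals for readability.
name_sentence = re.compile(
    r"(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*"
    r"\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)"
)

texts = [
    "A hardworking guy is working at the supermarket. "
    "They call him Jorge Horizon, but that's not his real name.",
    "He has an identity document that contains the name, Jorge Martel Arnold.",
]

# Take the first matching sentence in each text.
found = [name_sentence.search(text).group(0) for text in texts]
for sentence in found:
    print(sentence)
```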
Try this:
((?:^|(?:[^\.!?]*))[^\.!?\n]*(?:(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']+\s?){2,}[^\.!?]*[\.!?]))
It will capture sentences where name has at least two words, e.g. His name is John Smith.
It won't capture sentences like: John went to a concert.
I need a regex to split a name into first name, family name (surname) and everything in between as (possibly empty) middle names. Several questions on Stack Overflow handle this, but they don't handle the following names, with common European layouts:
Gloria VanderBilt
Gloria van der Bilt
Gloria v.d. Bilt
G. v.d. Bilt
We humans have no problem recognizing the first name, the middle names and the family name. However, a regular expression for this is not so simple.
After trying, I've got the following RegEx:
^\b(\w+)\b(.*)\b(\w+)\b
Select three items:
A word in the beginning,
then as many characters as possible,
finally a word at the end.
The first three names are correct; I even get "Gloria", "v.d.", "Bilt" as three separate items, including correct punctuation.
Alas, the last name gives problems with the punctuation:
"G" without the dot!
". v.d." too many dots
"Bilt"
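The behaviour described above can be reproduced with a short Python sketch. The names in the code are assumptions for the example; the middle part is stripped of surrounding whitespace before printing, since (.*) also captures the spaces around it.

```python
import re

# The regex from the question: a word, as much as possible, a final word.
three_parts = re.compile(r"^\b(\w+)\b(.*)\b(\w+)\b")

names = [
    "Gloria VanderBilt",
    "Gloria van der Bilt",
    "Gloria v.d. Bilt",
    "G. v.d. Bilt",
]

parts = {}
for name in names:
    first, middle, last = three_parts.match(name).groups()
    parts[name] = (first, middle.strip(), last)
    print(parts[name])
```

For the last name the dot after "G" ends up in the middle part, because \w does not match a period.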
So as a nice puzzle: what should be the regex?
You could go for
^ # match beginning of the line/string
(?P<first>[\w-.]+) # match word characters (letters, digits, underscore), dashes and dots
\h* # horizontal whitespaces, zero or more
(?P<middle>.+) # at least one character (can be a whitespace)
\h* # horizontal whitespaces, zero or more
\b(?P<last>\w+) # a word boundary, followed by word characters
$ # the end of the line / string
See a demo on regex101.com.
I tried to find someone with the same problem, but I didn't find anything.
I need two separate regular expressions, one for names, and one for last names.
Here are the rules for names:
A name can't start with spaces, numbers or any other symbol.
A name must be at least 3 characters long, with a maximum of 15 characters.
No symbols allowed.
Some allowed name examples:
Malcolm Walter Bob
giovanni francesco
Steven
Here are the rules for last names:
A last name can't start with spaces, numbers or any other symbol.
A last name must be at least 3 characters long, with a maximum of 15 characters.
A last name can contain apostrophes, dots and dashes.
Some allowed last names examples:
D'addario
berners Lee
berners lee
O'Riley.
Thanks in advance for your help!
First name:
[a-zA-Z.\-']+
Last name:
[a-zA-Z.\-' ]{3,15}
Combined:
[a-zA-Z.\-' ]+ [a-zA-Z.\-']{3,15}
Keep in mind that your restriction on last names is very strict - you'll rule out many Asian last names like Li or Wu.
Also, I'm not sure whether you wanted to make the last name optional. If so:
[a-zA-Z.\-' ]+ (?:[a-zA-Z.\-']{3,15})?
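To illustrate the point about the length restriction, here is a small Python sketch that tests the last-name pattern against the examples from the question plus two short surnames. Using re.fullmatch, so that the whole string must fit the pattern, is an assumption for this example.

```python
import re

# Last-name pattern from this answer: letters, dot, dash, apostrophe, space; 3 to 15 chars.
last_name = re.compile(r"[a-zA-Z.\-' ]{3,15}")

samples = ["D'addario", "berners Lee", "O'Riley.", "Li", "Wu"]

# fullmatch requires the entire string to satisfy the pattern.
accepted = [s for s in samples if last_name.fullmatch(s)]
rejected = [s for s in samples if not last_name.fullmatch(s)]

print(accepted)  # ["D'addario", 'berners Lee', "O'Riley."]
print(rejected)  # ['Li', 'Wu']
```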
For names: [A-Z]{1}[a-z]{2,14}[^ ]
[A-Z]{1}
One capital letter, A-Z.
[a-z]{2,14}
2 to 14 lowercase letters, a-z.
[^ ]
One more character that is not a space.
For last names: [A-Za-z\.\-\']{3,15}[^ ]
[A-Za-z\.\-\']{3,15}
3 to 15 characters from A-Z, a-z, period, dash, and apostrophe.
[^ ]
One more character that is not a space.
Try it here: http://gskinner.com/RegExr/
I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
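As a sketch of how this fits together, the Python snippet below combines the lookbehind expression from the question with the lookahead suggested here. Combining them this way and the variable names are assumptions for the example; the whole match (group 0) is the street name.

```python
import re

# Lookbehind from the question + lookahead from this answer.
street = re.compile(r"(?<=\d\s)(\w|\s|')+(?=\s\w+\.?$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]

# group(0) is the full match: everything between the number and the last word.
names = [street.search(address).group(0) for address in addresses]
print(names)  # ['Barrel', 'Old Mill', "Howard's"]
```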
Why not just match what you want? If I have understood correctly, you need to get all the words after the numbers, excluding the last word. Words are separated by spaces, so just get everything between the numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+) Note: there's a space at the end.
That will leave what you want in \1.
If you want the overall match to be just that text, you can use \K (not supported by every regex engine):
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)
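A runnable sketch of the split() approach on the three sample addresses follows. The helper name street_name is hypothetical, and joining the slice back into a single string is an assumption about the desired result (the slice on its own is a list of words).

```python
addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]

def street_name(address):
    # Drop the first token (the number) and the last token (e.g. "Rd."),
    # then rejoin whatever is left with single spaces.
    return " ".join(address.split()[1:-1])

names = [street_name(address) for address in addresses]
print(names)  # ['Barrel', 'Old Mill', "Howard's"]
```

This assumes every address has exactly one leading number token and one trailing word to discard.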