Regular expression for matching different name formats in Python - regex

I need a regular expression in Python that can match different name formats.
I have 4 different name formats for the same person, like:
R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal
What single regular expression would pick up all of these names from a list of thousands?
PS: My list has thousands of such names, so I need a generic solution that lets me combine these names together. In the above example, R and Goyal can be used to write the RE.
Thanks

"R(\.|aj)? (K(\.|umar)? )?Goyal" will match those four cases (plus a few close variants such as "Raj Goyal" or "R K Goyal"). You can modify this for other names as well.

Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.
If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.
ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David) you should be able to grab the first letter of the string and call that the first initial.
Next, you need to grab the last name. If you have no Jr's or Sr's or such, you can just take the last word (find the last occurrence of ' ', then take everything after that).
From there, "<firstInitial>.* <lastName>" is a good start. If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)* <lastName>" as in joon's answer.
If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
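The steps above can be sketched roughly as follows. This is only an illustration under the same assumptions (the first name is never dropped, the last word is the last name); build_name_pattern is a hypothetical helper name, and the optional " .*" part that tolerates middle names or initials is my own addition rather than part of the answer.

```python
import re

def build_name_pattern(full_name):
    """Build a regex for one person from their fullest known name.

    Assumes the first word is the first name and the last word is the
    last name (no "Jr"/"Sr" suffixes).
    """
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    initial, rest = first[0], first[1:]
    # initial, then optionally "." or the rest of the first name,
    # then optionally anything in the middle, then the last name.
    return re.compile(
        r"^%s(?:\.|%s)?(?: .*)? %s$"
        % (re.escape(initial), re.escape(rest), re.escape(last))
    )

goyal_rx = build_name_pattern("Raj Kumar Goyal")
variants = ["R. K. Goyal", "Raj K. Goyal", "Raj Kumar Goyal", "R. Goyal", "S. Goyal"]
print([v for v in variants if goyal_rx.match(v)])
```

Because the middle part accepts anything, this still produces false positives for different people who share an initial and a last name; checking the middle initial as well would tighten it.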

I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:
import re
names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']
regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        # First and middle names may each be missing, an initial, or spelled out
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (elements[0][0],
                                                            elements[0][1:],
                                                            elements[1][0],
                                                            elements[1][1:],
                                                            elements[2])
    elif len(elements) == 2:
        pattern = r'%s(\.|%s)? %s$' % (elements[0][0],
                                       elements[0][1:],
                                       elements[1])
    else:
        continue
    regexps[name] = re.compile(pattern)
jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))
print(bool(jksmith_regexp.match('John Smith')))
print(bool(jksmith_regexp.match('John K. Smith')))
print(bool(jksmith_regexp.match('J. Smith')))
This way you can easily keep track of which regexp will find which name in your text.
And you can also do handy things like this:
if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")
Where you check to see if some of the names in your text are ambiguous.

Related

How to write expression in expression builder in data flow of ADF

I am transforming data, and while doing so I have to apply a transformation to the customer name.
I need an expression in the expression builder to transform the customer name as follows:
Take the first character of each word in the name, followed by *. The customer name may contain one or more words.
The name can be Tim, or Tim John, or Tim John Zac, or Tim John Mike Zac.
I reproduced the above using a derived column.
I used the same data that you gave, in a single column, with the dataflow expression below in the derived column.
dropLeft(toString(reduce(map(split(Name, ' '),regexReplace(#item, concat('[^',left(#item,1),']'), '*')), '', #acc + ' ' + #item, #result)), 2)
Some general regular expressions gave me errors in dataflow, which is why I used the approach above.
First, I used split() by space to get an array of strings. Then I applied the regular expression to every item of the array, as above.
As we do not have a join in dataflow expressions, I used the code from this SO answer by @Jarred Jobe to convert the array to a string separated by spaces.
NOTE:
Make sure you give two spaces in toString() of the above code to get the required result. With only one space, the output does not come out as required.
Update:
Thank you so much for sharing this. I have tried your solution but I got a few names wrong. Also, I want to replace the rest of the characters with just 5 '*' irrespective of how many characters the name has. Also, the name Mia hellah came as M* h****h instead of M***** h*****. Another one, SAM & JOHN TIBEH, should be S***** &***** J***** T*****. I tried to update your expression but I couldn't get it right.
If you want to do it like above, you can directly use the concat function in the dataflow expression.
dropLeft(toString(reduce(map(split(Name, ' '),concat(left(#item,1), '*****')), '', #acc + ' ' + #item, #result)), 2)

Regex to get the [nth] name following a specific set of text

I don't have a great grasp on Regex; but I am attempting to grab names following the word "sortname", but only after the nth time that word appears.
I have (thanks to Wikipedia's API) a list of governors in the United States, listed alphabetically by state name. (https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&page=List_of_current_United_States_governors&section=1&format=json)
If you do Ctrl+F you will see that each name follows the word "sortname" and there are 50 of them. So if I wanted to see who the Governor of Texas is, I would get the name that follows the 43rd instance of the word "sortname". Furthermore, the first and last name of each governor is formatted as "sortname|Kay|Ivey" or "sortname|Michelle|Lujan Grisham".
Thanks for the help!
After some more testing I have ended up with the following pattern: sortname([^;]*)[^}|]}
It collects more than necessary, but it's going in the right direction. I can use Python to sort it out from there.
Assuming a string text contains the whole text, would you please try:
import re
m = re.findall(r'sortname\|[^|]+\|[^}]+', text, re.DOTALL)
print(m[42])
Output:
sortname|Greg|Abbott
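If the first and last names are wanted separately, a variation with two capture groups might look like the sketch below. The sample string is made up to mimic the "sortname|First|Last" shape described in the question (it is not the real API response), and nth_sortname is a hypothetical helper name.

```python
import re

# Made-up sample in the "sortname|First|Last" shape; not real API output.
sample_text = (
    "| {{sortname|Kay|Ivey}} || ... "
    "| {{sortname|Greg|Abbott}} || ... "
    "| {{sortname|Michelle|Lujan Grisham}} || ..."
)

# Two groups: first name and last name, each stopping at "|" or "}".
SORTNAME_RX = re.compile(r"sortname\|([^|}]+)\|([^|}]+)")

def nth_sortname(text, n):
    """Return (first, last) for the n-th "sortname" entry, counting from 1."""
    found = SORTNAME_RX.findall(text)
    if not 1 <= n <= len(found):
        return None
    return found[n - 1]

print(nth_sortname(sample_text, 3))
```

With the real wikitext, nth_sortname(text, 43) would correspond to m[42] above, assuming the page still lists the states in the same order.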

Reserved Keyword search, but in reverse. Regex

I'm writing code that looks through a string and picks out the words that are not considered "reserved keywords". I am new to regex, but have spent quite some time learning what kind of structure I need to look for reserved words. So far, I've written something along the lines of this:
\b(import|false|int|etc)\b
I am going to use an array list to feed in all of the reserved words into the string above, but I need it to work opposite of how it works now. I've figured out how to get it to search for the specific words with the code above, but how can I get it to look for the words that are NOT listed above. I've tried incorporating the ^ symbol, but I'm not having any luck there. Any regex veterans out there who see what I'm doing wrong?
There are two obvious possibilities, depending on what (else) you are doing.
Possibility 1: Use a dict or set:
You could just match words and then test for membership in a set or dictionary:
Reserved_words = set('import false true int ...'.split())
word_rx = r'\b\w+\b'  # Or whatever rule you like for "words"
for m in re.finditer(...):
    word = m.group(0)
    if word in Reserved_words:
        print("Found reserved word:", word)
    else:
        print("Found unreserved word:", word)
This approach is frequently used in lexers, where it is easier to just write a catch-all "match a word" rule, and then check a matched word against a list of keywords, than it is to write a fairly complex rule for each keyword and a catch-all to deal with the rest.
You can use a dict if you want to associate some kind of payload with the keyword (such as a class handle for instantiating a particular AST node type, etc.).
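A runnable version of this first possibility could look like the following sketch. The keyword set is just an illustrative subset and split_words is a hypothetical helper name.

```python
import re

RESERVED_WORDS = {"import", "false", "true", "int"}  # illustrative subset only
WORD_RX = re.compile(r"\b\w+\b")  # catch-all "word" rule

def split_words(source):
    """Return (reserved, unreserved) word lists in order of appearance."""
    reserved, unreserved = [], []
    for m in WORD_RX.finditer(source):
        word = m.group(0)
        if word in RESERVED_WORDS:
            reserved.append(word)
        else:
            unreserved.append(word)
    return reserved, unreserved

print(split_words("import integral int counter"))
```

The second list is what the question asks for: the words that are not reserved.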
Possibility 2: Use named groups:
Another possibility is that you could use named groups in your regex to capture keyword/nonkeyword values:
word_rx = r'\b(?:(?P<keyword>import|int|true|false|\.\.\.)|(?P<nonkeyword>\w+))\b'
for m in re.finditer(...):
    word = m.group('keyword')
    if word:
        print("Found keyword:", word)
    else:
        word = m.group('nonkeyword')
        print("Found nonkeyword:", word)
This is going to be slower than the previous approach, because of prefixes: "int" matches a keyword, but "integral" starts to match an int, then fails, then backtracks to the other branch, then matches a nonkeyword. :-(
However, if you are strongly tied to a mostly-regex implementation, for example, if you have many other regex-based rules, and you are processing them in a loop, then go for it!
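A small self-contained sketch of the named-group variant is below. Note that Python's re module spells named groups (?P<name>...), and that both branches are wrapped in one non-capturing group so the trailing \b applies to keywords too. The keyword list and the classify helper are illustrative assumptions.

```python
import re

# Both alternatives share the surrounding \b ... \b, so "integral" cannot
# be cut short as the keyword "int".
WORD_RX = re.compile(
    r"\b(?:(?P<keyword>import|int|true|false)|(?P<nonkeyword>\w+))\b"
)

def classify(source):
    """Return (group_name, word) pairs for every word in source."""
    # lastgroup is the name of the named group that took part in the match.
    return [(m.lastgroup, m.group(0)) for m in WORD_RX.finditer(source)]

print(classify("int integral true"))
```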

Match both of these with regex, not just one of

I'm looking in a sql table through a bunch of names and I want to get a list of all the different titles used. e.g. SNR, MRS, MR, JNR etc
Sometimes there is an entry that might have 2 titles, e.g: MR NAME NAME JNR. I want both of these titles 'MR' & 'JNR'
I thought a good way to do this would be with regex and find any names that have 2 or 3 characters. A title at the front would be followed by a space, while a title at the end would be preceded by one. So I have:
/(^[A-Z]{2,3})\s|\s(^[A-Z]{2,3}$)/
a regex101 example here.
As you can see I've used 'match either A or B' thing. If I throw at it a name with a title at either the start or finish, I end up getting what I want, but I don't know how to tell it to get both. i.e. strings with 2 titles will only give me back one match.
How do I get both?
Instead of an "OR", you could just match any character in between:
(^[A-Z]{2,3})\s.*\s([A-Z]{2,3}$)
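Tried out in Python, the idea might look like the sketch below. It uses {2,3}, as in the question, so that two-letter titles such as MR are accepted; both_titles is a hypothetical helper name.

```python
import re

# Leading title, anything in between, trailing title.
TITLE_RX = re.compile(r"^([A-Z]{2,3})\s.*\s([A-Z]{2,3})$")

def both_titles(name):
    """Return (leading_title, trailing_title), or None if either is missing."""
    m = TITLE_RX.match(name)
    return m.groups() if m else None

print(both_titles("MR NAME NAME JNR"))
```

This only matches names that carry a title at both ends; names with a single title still need the original either/or pattern. A short final surname (two or three letters) will also be picked up as a trailing title, so the collected values are worth reviewing by hand.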

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings (the first group), though, are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines whether it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must match according to word position, e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also handle the match-anything placeholder, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on it. This will not match according to word position, but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave, as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position: search for the first word; if found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the match-anything case (%s). This should be blindingly fast in memory.
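A minimal Python sketch of such a word-position index follows. The tokenizer, the END marker and the function names are my own assumptions for illustration. Unlike the object above, it keeps the apostrophe in "isn't" and stores the original pattern at the node where it ends, so the lookup can report which pattern matched.

```python
import re

PATTERNS = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]

END = None  # hypothetical marker key: holds the pattern that ends at a node

def tokenize(text):
    # Keep "%s" as a wildcard token; otherwise keep words, drop punctuation.
    return re.findall(r"%s|[\w']+", text)

def build_index(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for token in tokenize(pattern):
            node = node.setdefault(token, {})
        node[END] = pattern
    return root

def lookup(index, text):
    """Return the pattern that text matches, or None."""
    node = index
    for word in tokenize(text):
        if word in node:        # literal word at this position
            node = node[word]
        elif "%s" in node:      # wildcard at this position
            node = node["%s"]
        else:
            return None         # short-circuit on the first mismatch
    return node.get(END)

index = build_index(PATTERNS)
print(lookup(index, "Will you be meeting Linda today, John?"))
```

This sketch treats each %s as exactly one word and never backtracks, so it can miss a match when a literal word and a wildcard compete at the same position; a fuller version would try both branches.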
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re
def create_random_sentence():
    # Build a filler "pattern" of 4-10 random lowercase words
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
    patterns.append(create_random_sentence())
# One alternation with a capturing group per pattern
monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)
inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000
for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created roughly a hundred patterns. That was the maximum number of groups the Python regexp library allowed at the time (the 100-group limit was removed in Python 3.5). 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with one group per pattern: (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against all the patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success it will be comparatively quick. Failures will have to run through the whole regexp, so they will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re
patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]
def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them differently.
    # re.escape may also escape % (it did before Python 3.7), so swap %s
    # for a placeholder first.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")
# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))
def match(s):
    """Given an input string, returns the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]
    return None
# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf; there is an implementation in JS: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Many regex engines (Perl, PCRE and Java, for example) have a "quoted string" option (\Q \E). So you could take the string and make it
^\QHi there, \E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables. In engines without \Q \E (Python's re module and JavaScript, for instance), escape the literal parts instead.
If you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell.