Regex to get the [nth] name following a specific set of text - regex

I don't have a great grasp on Regex; but I am attempting to grab names following the word "sortname", but only after the nth time that word appears.
I have (thanks to Wikipedia's API) a list of governors in the United States, listed in order of their states name alphabetically. (https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&page=List_of_current_United_States_governors&section=1&format=json)
If you do ctrl+f you will see that each name follows the word "sortname" and there are 50 of them. So if I wanted to see who the Governor of Texas is, I would get the name that follows the 43rd instance of the word "sortname". furthermore the first and last name of each governor is formatted as "sortname|Kay|Ivey" or "sortname|Michelle|Lujan Grisham".
Thanks for the help!

After some more testing I have ended up with the following pattern sortname([^;]*)[^}|]}
It collects more than necessary but its going in the right direction. I can use python to sort it out from there.

Assuming a string str contains the whole text, would you please try:
m = re.findall(r'sortname\|[^|]+\|[^}]+', str, re.DOTALL)
print(m[42])
Output:
sortname|Greg|Abbott

Related

Parse a log file to fetch some values in a line

I am reading a log file where i am trying to fetch some values from lines which contains a substring "edited by:" and ending with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So i am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
I am assuming the fetched lines are in a list name line2.
like
set splitList[$line2 ' ']
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to tcl. And don't have much idea using regex in tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, on this extracted string I don't know how I can further drill to get my variables assigned values. Any suggestion on this approach or any other functions which are present in tcl library which can help me with this task please let me know.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier as we can assume that agent and role will not have spaces in. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary.)
regexp {\yedited by:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.

Need to use regex to extract a part of a string

I'm a regex noob that's trying to use the regexp_extract() function in data studio to extract part of a string. Could you help me out?
I need to extract the part of the string that comes after 'May'. Everything before 'May' is exactly the same across all campaigns.
I've tried googling the solution and killed a lot of time on regexer.com but i can't figure it out
Current Campaign Name:
Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24
Expected Campaign Names:
Comedy Movie Fans18-24
South Asian Film Fans18-24
Cricket Enthusiasts18-24
Motorcycle Enthusiasts18-24
EDIT: I'm trying to use this in data studio in the REGEXP_EXTRACT(Campaign,"regex_code_here") function. I think the acceptable syntax is re2.
You may actually use REGEXP_REPLACE here to remove all before and including May:
REGEXP_REPLACE(Campaign, '.*May', '')
See the regex demo:
The regex you need is this:
(?<=May).*$
Test it here.
You can use replace
^.*?May - Match everything up-to first occurrence of May
"$`" - replace with portion that follows substring Ref
let arr = ["Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24"]
let op = arr.map(str=> str.replace(/^.*?May/g, "$`"))
console.log(op)

Regex String for Restructuring Author Firstname, Lastname, Title

I want to convert strings in the format
The European Union - A Very Short Introduction - Pinder, John
to
John Pinder - The European Union - A Very Short Introduction
I am having trouble matching on "Pinder" and "John" to reformat in the desired way.
You can use:
^(.*?)(?:-\s+(\w+),\s+(\w+))$
Demo
If you may have authors with multiple names (such as 'von Clausewitz, Carl') this won't work. Instead, maybe:
^(.*)(?:-\s+([^,]+?),\s+(\w+))$
Demo 2
There are many ways to approach the problem, all requiring some assumptions not specified in your question. Here is one solution...
^.+-\s+(.+),\s+(.+)$
regexper diagram
It is working by consuming as many characters as possible (up to first capture group, using hyphen and whitespace as delimiter) then it assumes there is a comma followed by whitespace separating first name from last name, which it assumes is the end of the string.
Depending on what you know about the uniformity of the data, this may or may not work for you, but I thought it would be nice to have a solution which does not try to restrict characters in name, but rather the rest of the format.
Use this code:
$code = preg_match_all('/(?:.*?) - (?:.*?) -(.*?),(.*)/', $string,$matches);
This will give you an array and $matches[1] will give you the last name (in this case "Pinder") and $matches[2] will give you the first name ("John"). You can then turn it back into a string if you want to using $lastname = implode('',$matches[1]);.

Regular expression for matching diffrent name format in python

I need a regular expression in python that will be able to match different name formats like
I have 4 different names format for same person.like
R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal
What will be the regular expression to get all these names from a single regular expression in a list of thousands.
PS: My list have thousands of such name so I need some generic solution for it so that I can combine these names together.In the above example R and Goyal can be used to write RE.
Thanks
"R(\.|aj)? (K(\.|umar)? )?Goyal" will only match those four cases. You can modify this for other names as well.
Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.
If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.
ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David) you should be able to grab the first letter of the string and call that the first initial.
Next, you need to grab the last name- if you have no Jr's or Sr's or such, you can just take the last word (find the last occurrence of ' ', then take everything after that).
From there, "<firstInitial>* <lastName>" is a good start. If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)* <lastName>" as in joon's answer.
If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:
import re
names = [ 'John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith' ]
regexps={}
for name in names:
elements=name.split()
if len(elements) == 3:
pattern = '(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (elements[0][0], \
elements[0][1:], \
elements[1][0], \
elements[1][1:], \
elements[2])
elif len(elements) == 2:
pattern = '%s(\.|%s)? %s$' % (elements[0][0], \
elements[0][1:], \
elements[1])
else:
continue
regexps[name]=re.compile(pattern)
jksmith_regexp = regexps['John Kelly Smith']
print bool(jksmith_regexp.match('K. Smith'))
print bool(jksmith_regexp.match('John Smith'))
print bool(jksmith_regexp.match('John K. Smith'))
print bool(jksmith_regexp.match('J. Smith'))
This way you can easily keep track of which regexp will find which name in your text.
And you can also do handy things like this:
if( sum([bool(reg.match('K. Smith')) for reg in regexps.values()]) > 1 ):
print "This string matches multiple names!"
Where you check to see if some of the names in your text are ambiguous.

How can I use regexextract function in Google Docs spreadsheets to get "all" occurrences of a string?

My text string is in cell D2:
Decision, ERC Case No. 2009-094 MC, In the Matter of the Application for Authority to Secure Loan from the National Electrification Administration (NEA), with Prayer for Issuance of Provisional Authority, Dinagat Island Electric Cooperative, Inc. (DIELCO) applicant(12/29/2011)
This function:
=regexextract(D2,"\([A-Z]*\)")
will grab (NEA) but not (DIELCO)
I would like it to extract both (NEA) and (DIELCO)
You can use capture groups, which will cause regexextract() to return an array. You can use this as the cell result, in which case you will get a range of results, or you can feed the array to another function to reformat it to your purpose. For example:
regexextract( "abracadabra" ; "(bra).*(bra)" )
will return the array:
{bra,bra}
Another approach would be to use regexreplace(). This has the advantage that the replace is global (like s/pattern/replacement/g), so you do not need to know the number of results in advance. For example:
regexreplace( "aBRAcadaBRA" ; "[a-z]+" ; "..." )
will return the string:
...BRA...BRA
Here are two solutions, one using the specific terms in the author's example, the other one expanding on the author's sample regex pattern which appears to match all ALLCAPS terms. I'm not sure which is wanted, so I gave both.
(Put the block of text in A1)
Generic solution for all words in ALLCAPS
=regexreplace(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b","|"),"\W+","|"),"^\||\|$","")
Result:
ERC|MC|NEA|DIELCO
NB: The brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
If you want space separation, the formula is a little simpler:
=trim(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b"," "),"\W+"," "))
Result:
ERC MC NEA DIELCO
(One way I like playing with regex in google spreadsheets is to read the regex pattern from another cell so I can change it without having to edit or re-paste into all the cells using that pattern. This looks so:
Cell A1:
Block of text
Cell B1 (no quote marks):
\b\w[^A-Z]*\b
Formula, in any cell:
=trim(regexreplace(REGEXREPLACE(A1,B$1," "),"\W+"," "))
By anchoring it to B$1, I can fill all my rows at once and the reference won't increment.)
Previous answer:
Specific solution for selected terms (ERC, DIELCO)
=regexreplace(join("|",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")),"(^\||\|$)","")
Result:
ERC|DIELCO
As before, the brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
This formula will find any ERC or DIELCO, or both in the block of text. The initial order doesn't matter, but the output will always be ERC followed by DIELCO (the order of appearance is lost). This fixes the shortcoming with the previous answer using "(bra).*(bra)" in that isolated ERC or DIELCO can still be matched.
This also has a simpler form with space separation:
=trim(join(" ",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")))
Result:
ERC DIELCO
Please try:
=SPLIT(regexreplace(A1 ; "(?s)(.)?\(([A-Z]+)\)|(.)" ; "🧸$2");"🧸")
or
=REGEXEXTRACT(A1;"\Q"&REGEXREPLACE(A1;"\([A-Z]+\)";"\\E(.*)\\Q")&"\E")