regex extract exact one string - regex

I have two strings:
"John Johnson Phone Number"
"John Johnson Alternate Phone Number"
I need to extract only the first one. The first and last names may vary.
I have been matching the first string with this regex, which allows for varying names:
^\w+ \w+( \w+)? Phone Number$
It seems easy, but I have been stuck on it for a few hours.
The problem is that the same regex also matches the second string, which I do not want.
Could someone give me a hint on how to match only the first string and reject strings that contain the word "Alternate"? Thanks

If I understand correctly, you want to match the whole string and extract the words before "Phone Number". You can do this with capture groups. Naming a capture group means you do not have to track its index number (which changes if you add or remove groups later).
The syntax is (?P<name>...).
For your situation I put the first two \w+ into a capture group called name. The returned matches hold the full matched string at index 0; the indices after that are the subgroups. You can use re.SubexpIndex("name") to find the index of the named subgroup name.
https://goplay.tools/snippet/dcwWg3FBWUd
re := regexp.MustCompile(`^(?P<name>\w+ \w+)( \w+)? Phone Number$`)
str := "John Johnson Alternate Phone Number"
index := re.SubexpIndex("name")
matches := re.FindStringSubmatch(str)
if len(matches) > 0 {
    fmt.Printf("Name: %s\n", matches[index])
} else {
    fmt.Println("No Match")
}
EDIT: I thought this was a golang question :facepalm:
This still works using capture groups to extract the relevant sub matches out.
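Note that the optional ( \w+)? group lets the pattern above match the "Alternate" string as well. One possible way to reject it is a negative lookahead. The sketch below uses Python only as an assumption, since the question does not name a language; the function name extract_name is hypothetical. Go's regexp package (RE2 syntax) does not support lookaheads, so there the "Alternate" check would need to be a separate step.

```python
import re

# (?!.*\bAlternate\b) fails the match if "Alternate" appears anywhere in the string.
PATTERN = re.compile(
    r"^(?!.*\bAlternate\b)(?P<name>\w+ \w+)( \w+)? Phone Number$"
)


def extract_name(s):
    """Return the captured name, or None when the string does not match."""
    m = PATTERN.match(s)
    return m.group("name") if m else None


print(extract_name("John Johnson Phone Number"))
print(extract_name("John Johnson Alternate Phone Number"))
```

The first call returns the name and the second returns None, because the lookahead rejects the string before the rest of the pattern is tried.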

Related

How to capture text between a specific word and the semicolon immediately preceding it with regex?

I have many rows of people and titles in Excel, and am looking to filter out certain people by title. For example, cells may contain the following:
John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder
These cells are varying lengths and have varying numbers of people and titles. My plan is to add semicolons at the beginning and end to standardize it. This would give me:
;John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;
Currently, I have code that iterates through the cells using the regex Founder.*?; which returns each instance of "Founder" (i.e. Founder;Founder;), but I can't figure out how to also capture the names of the people. I think I need to anchor on the semicolon immediately preceding "Founder", but so far I have not managed it. My ultimate goal is to return something like the following; I already have the code for it, apart from the correct regular expression.
;John Smith, Co-Founder;James Jackson, Co-Founder;
Depending on your version of Excel, you could also do this with a formula:
=FILTERXML("<t><s>" & SUBSTITUTE(A1,";","</s><s>")&"</s></t>","//s[contains(.,'Co-Founder')]")
However, for a regex, you could use
(?:^|;)([^;]*?Co-Founder)
which will return the Co-Founders in capturing group 1.
There is no need for leading/trailing semicolons.
Even though VBA regex does not support look-behind, you can work with that limitation.
the Co-Founders Regex
(?:^|;)([^;]*?Co-Founder)
Options: Case sensitive (or not, as you prefer); ^$ match at line breaks
Match the regular expression below: (?:^|;)
    Match this alternative: ^
        Assert position at the beginning of the string: ^
    Or match this alternative: ;
        Match the character “;” literally: ;
Match the regex below and capture its match into backreference number 1: ([^;]*?Co-Founder)
    Match any character that is NOT a “;”: [^;]*?
        Between zero and unlimited times, as few times as possible, expanding as needed (lazy): *?
    Match the character string “Co-Founder” literally: Co-Founder
Created with RegexBuddy
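As a rough illustration of how capturing group 1 behaves, here is the same pattern applied in Python rather than VBA (an assumption made only so the sketch is easy to run; the variable names are hypothetical). With a single capturing group, re.findall returns just the group contents, so no leading or trailing semicolons are needed.

```python
import re

cell = "John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder"

# (?:^|;) anchors each entry at the start of the string or right after a semicolon;
# the capturing group then takes everything up to and including "Co-Founder".
co_founders = re.findall(r"(?:^|;)([^;]*?Co-Founder)", cell)

# Rebuild the semicolon-delimited form asked for in the question.
result = ";" + ";".join(co_founders) + ";"
print(result)
```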
Alternatively, split the whole string on semicolons and apply a positive filter. The getCoFounders() function below returns an array of the findings:
Sub ExampleCall()
    Dim s As String
    s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
    Debug.Print Join(getCoFounders(s), "|")
End Sub

Function getCoFounders(s As String)
    getCoFounders = Filter(Split(s, ";"), "Co-Founder", True, vbTextCompare)
End Function
Results in VB Editor's immediate window
John Smith, Co-Founder|James Jackson, Co-Founder

Regex get every string from start until new line?

I have a string like this :
Name: Yoza Jr
Address: Street 123, Canada
Email: yoza#gmail.com
I need to get the data on each line with a regex, reading up to the newline. For example,
starting at Name: I want to get Yoza Jr, reading until the newline, as the name data,
so that I end up with 3 values: Name, Address, and Email.
How can a regex get every string from the start of a line until the newline?
btw I will use it in golang : https://regex-golang.appspot.com/assets/html/index.html
The pattern ^.*$ should work, provided ^ and $ match at line boundaries (multiline mode). This assumes that .* is not running in dot-all mode, meaning that .* will not extend past the \r?\n newline at the end of each line.
If you want to capture the field value, then use:
^[^:]+:\s*(.+)$
The value you want will be in the first capture group.
I would suggest you use the pattern ^(.+):\s*(.*)$
Demo: https://regex101.com/r/Q9D4RM/1
Not only will it result in 3 distinct matches for the string given by you, the field name (before the ":") will be read as group 1 of the match, and the value (after the ":") will be read as group 2. So, if you want the key-value pairs, you can just search for groups 1 and 2 for each match.
Please let me know if it's unclear so I can elaborate.
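For illustration, the two-group pattern can be exercised as follows. This is a sketch in Python rather than Go (an assumption made for brevity); the multiline flag is what lets ^ and $ match at each line, and Go's regexp package accepts the same idea via the inline (?m) flag.

```python
import re

text = "Name: Yoza Jr\nAddress: Street 123, Canada\nEmail: yoza#gmail.com"

# Group 1 is the field name before the colon, group 2 is the value after it.
pairs = re.findall(r"^(.+):\s*(.*)$", text, flags=re.MULTILINE)
fields = dict(pairs)

for key, value in fields.items():
    print(key, "=>", value)
```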

Match Regex Starting After "X" Number of Characters

I am using regex in a Google script to normalize company names, and while I am getting very close to perfect with a combination of replacing certain words, punctuation, and spaces, my last step was to replace any word with 3 or fewer letters.
But that gets rid of a few companies with acronyms at the start of their name, e.g. AB Holding Company. I don't want this to match AB; I want it to find the rare "the", or a company code (particularly foreign ones like SPA and NV, along with Co and Inc). These codes are not necessarily at the end of the string, but they seem to always be at least 4 characters after the beginning.
I am currently using
text = text.replace(/\b[a-z]{1,3}\b/i," ");
Ignore the fact that [a-z] misses capitals; I've dealt with that separately.
What I think would work is to "skip over" the first few characters, probably 4 to be safe, and maybe learn how to include spaces and/or digits in there for the future. So I wrote this after seeing one other related question here.
text = text.replace(/((.{4})(.*)\b[a-z]{1,3}\b)/i," ");
Scripts does not seem to allow a lookbehind, and my version doesn't seem to work. I'm lost.
I appreciate your help.
Here is a solution:
text = text.replace(/^(.{4}.*)(\b[a-z]{1,3}\b)(.*)/gmi, "$1$3");
What I have changed is:
Enclosed all groups in parentheses, so that they can be captured and used in the replacement.
Since you mentioned that the word to be replaced might not be at the end of the string, I also added a third group to match everything after it.
Included the parts before and after the word in the replacement string (group 1 and group 3).
However, note that it might return false positives: if a company name is Company ABC, Inc., it will also capture ABC. Thus, if you know the words you want to replace, it might be better to just use an alternation:
text = text.replace(/^(.{4}.*)\b(Co|Inc|SPA|NV|the)\b(.*)/gmi, "$1$3");
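The alternation approach can be tried out with a small sketch. It is written in Python instead of Apps Script (an assumption made so it runs standalone), and the helper name strip_company_code is hypothetical. Because .{4} consumes the first four characters, a code can only match from the fifth character onward, which is what protects a leading acronym such as AB.

```python
import re

# Group 1: the first 4 characters plus everything up to the code.
# Group 2: the company code to drop. Group 3: everything after it.
CODE_PATTERN = re.compile(
    r"^(.{4}.*)\b(Co|Inc|SPA|NV|the)\b(.*)",
    flags=re.IGNORECASE | re.MULTILINE,
)


def strip_company_code(name):
    """Remove one known company code that starts after the 4th character."""
    return CODE_PATTERN.sub(r"\1\3", name).strip()


print(strip_company_code("AB Holding Company Inc"))
print(strip_company_code("AB Holding Company"))
```

The greedy .* in group 1 means that only the last code on a line is removed per call; a name with no code after the first four characters is returned unchanged.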

Regular expression string followed by numbers

I am writing a regular expression to extract phrases like #Question1# or #Question125# from an HTML string like
Patitent name #Question1#, Patient was suffering from #Question2#, Patient's gender is #Question3#, patient has #Question4# drinking for the last month. His DOB is #Question5#
The first half of the expression is simple, just #Question, but I also need to match a series of digits of unspecified length, and the whole string ends with #.
Once I find a matching phrase, how do I extract only the digits from it? For example, from #Question312# I just want to get 312.
Any suggestions?
The regexp you are looking for is
/#Question[0-9]+#/
If you need to extract the number you can just wrap the [0-9]+ part in parentheses
/#Question([0-9]+)#/
making it a group. How you use a captured group depends on the specific regexp implementation (e.g. python, perl, javascript ...). For example in python you can replace all those questions with corresponding answers from a list with
import re

answers = ["Andrea", "Griffini"]
text = "My first name is #Question1# and my last name is #Question2#"
# Replace each #QuestionN# with the N-th answer (numbering is 1-based)
print(re.sub(r"#Question([0-9]+)#",
             lambda x: answers[int(x.group(1)) - 1],
             text))
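If only the digits are needed, as the question asks, a minimal Python sketch could use re.findall, which returns the contents of the single capturing group for every match. The sample string here is a shortened, made-up variant of the one in the question.

```python
import re

html = "Patient was suffering from #Question2#, gender is #Question3#, DOB is #Question312#"

# With one capturing group, findall returns only the captured digits.
digit_strings = re.findall(r"#Question([0-9]+)#", html)
numbers = [int(d) for d in digit_strings]

print(numbers)
```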
I think what you are looking for is:
#Question[0-9]+#
#Question
Any character in this class: [0-9], one or more repetitions
#

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224, or 1234, 4657, which appears after the > in the text above.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which matches the text up to the : but I want only the part after the email, up to the :
Is there an easy regular expression to do this, or should I use split instead?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example, 1234 and 4567 are 2 separate IDs), and query the tables.
The above is the output from the tool, which I call via my Perl script; I will get many entries like this.
My idea was to use a regular expression to get the database IDs.
You can fudge the parts you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two captured groups, the (1234) and the (1234, 4657); the second I can only assume from your pattern to mean "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\)). Also, it is not a good idea to rely too heavily on the * quantifier, because it also matches the empty string (it matches zero or more times), so it should only be used for things that are truly optional. Otherwise use +, which matches 1 or more times.
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted digits in a list
        # return $1;           # just return the string as-is
    } else { return undef }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
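To show the same loose matching idea outside Perl, here is a rough Python equivalent (an assumption made only for illustration; the function name extract_ids is hypothetical). It anchors on the closing > of the email, captures digits, commas and spaces up to the colon, then pulls the individual numbers out of that capture.

```python
import re


def extract_ids(line):
    """Return the database IDs found between the <...> email and the colon."""
    m = re.search(r"<[^>]*> *([\d, ]+):", line)
    if m is None:
        return []
    # Split the captured "1234, 4657" text into separate integers.
    return [int(n) for n in re.findall(r"\d+", m.group(1))]


lines = [
    "(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf",
    "(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf",
]
for line in lines:
    print(extract_ids(line))
```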
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*((?:[0-9]|,|\s)+):.+/;
$numbers = $1;
Not tested but you get the idea.