Extract different variations of hyphenated personal names with regex

I need to extract names that follow titles, including hyphenated names, which can come in different variations.
The script below fails to pick up hyphenated names.
import re
text = 'This is the text where Lord Lee-How and Sir Alex Smith are mentioned.\
Dame Ane Paul-Law is mentioned too. And just Lady Ball.'
names = re.compile(r'(Lord|Baroness|Lady|Baron|Dame|Sir) ([A-Z][a-z]+)[ ]?([A-Z][a-z]+)?')
names_with_titles = list(set(names.findall(text)))
print(names_with_titles)
The current output is:
[('Lord', 'Lee', ''), ('Sir', 'Alex', 'Smith'), ('Dame', 'Ane', 'Paul'), ('Lady', 'Ball', '')]
The desired output should be:
[('Lord', 'Lee-How', ''), ('Sir', 'Alex', 'Smith'), ('Dame', 'Ane', 'Paul-Law'), ('Lady', 'Ball', '')]
I managed to extract hyphenated names with this pattern -
hyph_names = re.compile(r'(Lord|Baroness|Lady|Baron|Dame|Sir) ([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)')
But I cannot figure out how to combine the two. Will appreciate your help!

You may add a (?:-[A-Z][a-z]+)? optional group to the name part patterns:
(Lord|Baroness|Lady|Baron|Dame|Sir)\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?)(?:\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?))?
See the regex demo
Details
(Lord|Baroness|Lady|Baron|Dame|Sir) - capturing group #1: one of the titles
\s+ - one or more whitespace chars
([A-Z][a-z]+(?:-[A-Z][a-z]+)?) - capturing group #2:
[A-Z][a-z]+ - an uppercase letter followed by 1+ lowercase ones
(?:-[A-Z][a-z]+)? - an optional non-capturing group matching a hyphen and then an uppercase letter followed by 1+ lowercase ones
(?:\s+([A-Z][a-z]+(?:-[A-Z][a-z]+)?))? - an optional non-capturing group:
\s+ - 1+ whitespaces
([A-Z][a-z]+(?:-[A-Z][a-z]+)?) - capturing group #3, with the same pattern as in group #2.
In Python 3.6 and later, you may build it with an f-string:
title = r'(Lord|Baroness|Lady|Baron|Dame|Sir)'
name = r'([A-Z][a-z]+(?:-[A-Z][a-z]+)?)'
rx = rf'{title}\s+{name}(?:\s+{name})?'
In older versions, use str.format:
rx = r'{0}\s+{1}(?:\s+{1})?'.format(title, name)
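Putting the pieces together, a minimal sketch that runs the combined pattern against the sample text from the question might look like this. The variable name found is only illustrative, and a space is used after "mentioned." instead of the line continuation for readability.

```python
import re

# Building blocks from the answer above.
title = r'(Lord|Baroness|Lady|Baron|Dame|Sir)'
name = r'([A-Z][a-z]+(?:-[A-Z][a-z]+)?)'
rx = re.compile(rf'{title}\s+{name}(?:\s+{name})?')

text = ('This is the text where Lord Lee-How and Sir Alex Smith are mentioned. '
        'Dame Ane Paul-Law is mentioned too. And just Lady Ball.')

# findall returns one (title, first, second) tuple per match;
# a group that did not take part in the match is returned as ''.
found = rx.findall(text)
print(found)
```

Wrapping the result in set(), as the question does, removes duplicates but loses the order in which the names appear in the text.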

Related

Match specific letter from group N Regex

I have the following log message:
Aug 25 03:07:19 localhost.localdomainASM:unit_hostname="bigip1",management_ip_address="192.168.41.200",management_ip_address_2="N/A",http_class_name="/Common/log_to_elk_policy",web_application_name="/Common/log_to_elk_policy",policy_name="/Common/log_to_elk_policy",policy_apply_date="2020-08-10 06:50:39",violations="HTTP protocol compliance failed",support_id="5666478231990524056",request_status="blocked",response_code="0",ip_client="10.43.0.86",route_domain="0",method="GET",protocol="HTTP",query_string="name='",x_forwarded_for_header_value="N/A",sig_ids="N/A",sig_names="N/A",date_time="2020-08-25 03:07:19",severity="Eror",attack_type="Non-browser Client,HTTP Parser Attack",geo_location="N/A",ip_address_intelligence="N/A",username="N/A",session_id="0",src_port="39348",dest_port="80",dest_ip="10.43.0.201",sub_violations="HTTP protocol compliance failed:Bad HTTP version",virus_name="N/A",violation_rating="5",websocket_direction="N/A",websocket_message_type="N/A",device_id="N/A",staged_sig_ids="",staged_sig_names="",threat_campaign_names="N/A",staged_threat_campaign_names="N/A",blocking_exception_reason="N/A",captcha_result="not_received",microservice="N/A",tap_event_id="N/A",tap_vid="N/A",vs_name="/Common/adv_waf_vs",sig_cves="N/A",staged_sig_cves="N/A",uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"
And I have the following RegEx:
request="(?<Flag1>.*?)"
I am now trying to match part of the text already captured by the group named "Flag1": the new match I want to capture, as Flag2, is /random?name=' or 1 = 1'.
How can I match the needed text from another matched group (by number or name) without inserting the new group inside the targeted group, like this:
request="(?<Flag1>\w+\s+(?<Flag2>.*?)\s+HTTP.*?)"
https://regex101.com/r/EcBv7p/1
Thanks.
You can use
request="(?<Flag1>[A-Z]+\s+(?<Flag2>\/\S+='[^']*')[^"]*)"
See the regex demo.
Details:
(?<Flag1> - Flag1 group:
[A-Z]+ - one or more uppercase ASCII letters
\s+ - one or more whitespaces
(?<Flag2>\/\S+='[^']*') - Group Flag2: /, one or more non-whitespace chars, =', zero or more chars other than ', and then a ' char
[^"]* - zero or more chars other than "
) - end of Flag1 group.
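For anyone testing this outside regex101, here is a rough Python sketch of the same nested-group pattern. Two things are assumptions of this sketch: Python's re module spells named groups as (?P<name>...) rather than (?<name>...), and the log line is shortened to its last few fields.

```python
import re

# Shortened tail of the log line from the question (raw string, so \r\n stays literal text).
line = r'''uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"'''

# Same pattern as above, with Python's (?P<name>...) syntax for named groups.
pattern = re.compile(r'''request="(?P<Flag1>[A-Z]+\s+(?P<Flag2>/\S+='[^']*')[^"]*)"''')

m = pattern.search(line)
flag1 = m.group('Flag1')  # the whole request value
flag2 = m.group('Flag2')  # only the path and query part
print(flag1)
print(flag2)
```

Both groups come from a single match, so Flag2 is always a substring of Flag1.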
If I understand you correctly, you want to match whatever string a previous group has matched, right?
In that case you can use a backreference \n, where n is the group number. Here, \1 matches the same text that your first capture group matched.
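As a small illustration of the backreference syntax only (it does not use the log line), here is a sketch in Python. The sample sentences are made up for the example.

```python
import re

# (\w+) captures a word; \1 then requires exactly the same text again.
repeated_word = re.compile(r'\b(\w+) \1\b')

hit = repeated_word.search('the parser saw saw a duplicate')
miss = repeated_word.search('no duplicates here')

print(hit.group(0))  # the doubled word, including the space between
print(miss)          # None, because no word is immediately repeated
```

Note that a backreference re-matches the captured text further along in the input; it does not search inside the text that the group already captured.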

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The name will always be prefixed with Name:.
My regex to get the full name, (?<=Name:\s)(.*), works fine and returns the whole name. This one, (?<=Name:\s)([a-zA-Z]+), seems to get the first name.
So one expression each for the first, middle and last name would be ideal. Could someone guide me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you have the full name, you can split the result and get the names as a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
As you stated in the comments that you use .NET, you can use a quantifier in the lookbehind to select which "word" after Name: you want to match.
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non-whitespace chars instead of word characters only, you can use \S+ instead of \w+.
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
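The lookbehind above relies on .NET accepting a quantifier inside a lookbehind. Python's built-in re module requires fixed-width lookbehinds, so a rough Python equivalent has to use a capture group instead. The helper name name_part below is hypothetical.

```python
import re

def name_part(text, index):
    """Return the word at zero-based position `index` after 'Name:', or None."""
    # Skip `index` words after 'Name:', then capture the next one.
    pattern = rf'\bName:(?:\s+\w+){{{index}}}\s+(\w+)'
    m = re.search(pattern, text)
    return m.group(1) if m else None

text = 'ID Number / 1234\nName: John Doe Smith\nNationality: US'
first, middle, last = (name_part(text, i) for i in range(3))
print(first, middle, last)
```

Because \s also matches a newline, asking for an index beyond the last name part would run on into the next line of the input; replacing \s with a literal space or [^\S\r\n] would avoid that.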

Terminating match at multiple space in Regex Pattern

I am reading a text which is like this:
BROKER : 0012301 AB ABCDEF/ABC
VENDOR NUMBER: 511111 A/P NUMBER: 3134
VENDOR NAME: KING ARTHUR FLOURCO INC OUR INVOICE #: 553121117 DATE: 05/03/2021
I want to extract the field Vendor Name, Vendor Number. Hence I'm using the regex
(?<=:\s).[^\s]*
But this only extracts fields whose values contain no whitespace. Fields with spaces in their values, like Vendor Name, aren't extracted properly. How do I modify my regex pattern to fetch all fields? I've tried (?<=:\s).[^\s\s]* but that didn't work.
One option could be to match either VENDOR NAME or VENDOR NUMBER and capture what follows until the first encounter of 3 whitespace chars.
Note that \s can also match a newline.
\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}
The pattern matches:
\bVENDOR\s+(?:NAME|NUMBER) A word boundary to prevent a partial match, then VENDOR, 1+ whitespace chars, and either NAME or NUMBER
:\s+ Match : and 1+ whitespace chars
(\S.*?) Capture group 1, match a non-whitespace char followed by as few chars as possible
\s{3} Match 3 whitespace chars
See a regex demo.
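A minimal Python sketch of this approach follows. The sample text is an assumption: the question as shown has single spaces between fields, so the runs of three or more spaces that the pattern depends on are reconstructed here for illustration.

```python
import re

# Assumed layout: fields on a line are separated by runs of 3+ spaces.
text = (
    "BROKER : 0012301   AB ABCDEF/ABC\n"
    "VENDOR NUMBER: 511111     A/P NUMBER: 3134\n"
    "VENDOR NAME: KING ARTHUR FLOURCO INC     OUR INVOICE #: 553121117   DATE: 05/03/2021\n"
)

pattern = re.compile(r'\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}')

# findall returns the contents of capture group 1 for every match.
values = pattern.findall(text)
print(values)
```

A value that sits at the very end of a line, with nothing after it, would not be followed by three whitespace chars and would be missed; replacing \s{3} with (?=\s{3}|$) and compiling with re.MULTILINE is one possible way around that.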

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values where the values themselves can also contain commas.
So far I have only managed to match the occurrences that don't contain a ,.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current regex is this unoptimized piece:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is close to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015
The pattern does not work because the fields group matches the literal string field, which does not occur in the example string, and the pattern then tries to repeat that named group.
Note that [^,] matches any char except a comma. You can omit the capture group inside the named group fields, as it is already a group, and \w also matches \d.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char, as few as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
Regex demo
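To show how the two capture groups can be used, here is a short Python sketch that turns the matches into a dictionary. The sample string is a shortened excerpt of the anonymized example from the question.

```python
import re

# Shortened excerpt of the example string; some values contain commas, one is empty.
sample = ('asm01=Predictable Resource Location,Information Leakage,asm02=N/A,'
          'asm15=,asm46=200000028,200100015,'
          'asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622')

pattern = re.compile(r'\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)')

# Each match yields a (key, value) pair, which dict() accepts directly.
fields = dict(pattern.findall(sample))
for key, value in fields.items():
    print(key, '->', repr(value))
```

The lookahead is what keeps the commas inside a value: the lazy (.*?) only stops at a comma that is immediately followed by the next asmNN= key, or at the end of the string.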
A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

Splunk - regex extract fields from source

I am trying to extract the job name and region from a Splunk source using regex.
Below is the format of my sample source:
/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log
With the pattern below, I am able to extract the job name:
(?<logdir>\/[\W\w]+\/[\W\w]+\/)(?<date>[^\/]+)\/job_(?<jobname>.+)_\d+
Here is the match so far:
Full match 0-53 /home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414
Group `logdir` 0-19 /home/app/abc/logs/
Group `date` 19-27 20200817
Group `jobname` 32-47 DAILY_HR_REPORT
I also need USA (the region) from the source. Can you please suggest how to get it?
The region will always appear after the number field (44414), which can vary in number of digits.
Ex: 123, 1234, 56789
Thank you in advance.
You could make the pattern a bit more specific about what you allow to match, as [\W\w]+ and .+ will cause more backtracking to fit the rest of the pattern.
Then for the region you can add a named group at the end, (?<region>[^\W_]+), matching one or more word characters except an underscore.
The full pattern, followed by its parts:
(?<logdir>\/(?:[^\/]+\/)*)(?<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))\/job_(?<jobname>\w+)_\d+_(?<region>[^\W_]+)_log
(?<logdir> Group logdir
\/(?:[^\/]+\/)* match / and optionally repeat any char except / followed by matching the / again
) Close group
(?<date> Group date
(?:19|20)\d{2} Match a year starting with 19 or 20
(?:0?[1-9]|1[012]) Match a month
(?:0[1-9]|[12]\d|3[01]) Match a day
) Close group
\/job_ Match /job_
(?<jobname>\w+) Group jobname, match 1+ word chars
_\d+_ Match 1+ digits between underscores
(?<region>[^\W_]+) Group region Match 1+ occurrences of a word char except _
_log Match literally
Regex demo
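The pattern can be checked outside Splunk with a short Python sketch. Python's re module writes named groups as (?P<name>...), and / needs no escaping there, so the pattern below is an adapted spelling of the one above rather than a copy.

```python
import re

source = '/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log'

# Same pattern as above, split over three lines and using Python's named-group syntax.
pattern = re.compile(
    r'(?P<logdir>/(?:[^/]+/)*)'
    r'(?P<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))'
    r'/job_(?P<jobname>\w+)_\d+_(?P<region>[^\W_]+)_log'
)

m = pattern.search(source)
fields = m.groupdict()
print(fields)
```

The jobname group uses the greedy \w+, which also matches digits and underscores, so the engine backtracks from the end of the string until _\d+_, the region, and _log all fit. That is why the last number before the region is treated as the separator.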