Terminating match at multiple spaces in Regex Pattern - regex

I am reading a text which is like this:
BROKER : 0012301 AB ABCDEF/ABC
VENDOR NUMBER: 511111 A/P NUMBER: 3134
VENDOR NAME: KING ARTHUR FLOURCO INC OUR INVOICE #: 553121117 DATE: 05/03/2021
I want to extract the field Vendor Name, Vendor Number. Hence I'm using the regex
(?<=:\s).[^\s]*
But this helps me to extract any field which doesn't have any white space. However, the fields having spaces in between aren't extracted properly like Vendor Name. How do I modify my regex pattern to fetch all fields? I've tried (?<=:\s).[^\s\s]* but that didn't work.

One option could be to match either VENDOR NAME or VENDOR NUMBER and capture what follows until the first encounter of 3 whitespace chars.
Note that \s can also match a newline.
\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}
The pattern matches:
\bVENDOR\s+(?:NAME|NUMBER) A word boundary to prevent a partial match, match VENDOR, 1+ whitespace chars and then either NAME or NUMBER
:\s+ Match : and 1+ whitespace chars
(\S.*?) Capture group 1, match a non whitespace char followed by as few chars as possible
\s{3} Match 3 whitespace chars
See a regex demo.
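As a rough Python sketch of the pattern above: the sample text here is an assumption, since the spacing in the question is not preserved. It presumes the columns are separated by runs of at least 3 spaces, which is what \s{3} relies on.

```python
import re

# Assumed sample: columns separated by runs of 3+ spaces, as the answer
# presumes (the exact spacing is not visible in the question).
text = (
    "BROKER : 0012301 AB ABCDEF/ABC\n"
    "VENDOR NUMBER: 511111      A/P NUMBER: 3134\n"
    "VENDOR NAME: KING ARTHUR FLOURCO INC      OUR INVOICE #: 553121117     DATE: 05/03/2021\n"
)

# Capture what follows "VENDOR NAME:" or "VENDOR NUMBER:" up to 3 whitespace chars.
pattern = re.compile(r"\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}")

# findall returns the contents of capture group 1 for every match.
vendor_fields = pattern.findall(text)
print(vendor_fields)
```

Because . does not match a newline by default, a value that sits at the very end of a line with fewer than 3 trailing whitespace chars would not be captured by this pattern.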

Related

How can I write a regex expression to capture characters up to a certain character

I am trying to write a PCRE regex to pull specific information from syslog. I am needing a portion of the log and do not care about anything that comes after it. The problem I am facing is that the character I am trying to cause the "no match" is still showing. Here is the full log:
Aug 15 20:41:30 10.240.8.160 42286: servername Aug 16 2022 01:41:28.245 +0000: %ICM_Router_CallRouter-3-1050042: %[comp=Router-*][pname=rtr][iid=prod][mid=1050042][sev=error]: **No default label available for dialed number: SM01.GGB.ACCT.BILLING.5555550778** (ID: 44043).
The part I am needing is No default label available for dialed number: SM01.GGB.ACCT.BILLING.5555550778
The closest I have gotten is by using \bNo.+[\(] which matches No default label available for dialed number: SM01.GGB.ACCT.BILLING.5555550778 (. I have also tried using ^\s with no success. When I anchor the parentheses \bNo.+[^\(] the following is still matched:
No default label available for dialed number: SM01.GGB.ACCT.BILLING.5555550778 (ID: 44043).
Can someone let me know what I am missing?
If the portion always ends on a dot followed by digits and you don't want to match ( in between:
\bNo\b[^(]+\.\d+
Explanation
\bNo\b Match the word No between word boundaries
[^(]+ Match 1+ chars other than (
\.\d+ Match a dot and 1+ digits
Regex demo
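A minimal Python check of this pattern against the log line from the question (the variable names are only illustrative):

```python
import re

log = (
    "Aug 15 20:41:30 10.240.8.160 42286: servername Aug 16 2022 01:41:28.245 +0000: "
    "%ICM_Router_CallRouter-3-1050042: "
    "%[comp=Router-*][pname=rtr][iid=prod][mid=1050042][sev=error]: "
    "**No default label available for dialed number: "
    "SM01.GGB.ACCT.BILLING.5555550778** (ID: 44043)."
)

# [^(]+ is greedy but cannot cross "(", so it backtracks to the last
# "dot followed by digits" that occurs before the opening parenthesis.
match = re.search(r"\bNo\b[^(]+\.\d+", log)
extracted = match.group(0) if match else None
print(extracted)
```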
Or taking the ** into account:
\*\*\KNo\b[^(]+(?=\*\*\s*\()
Explanation
\*\* Match **
\K Clear the current match buffer (forget what is matched until now)
No\b Match the word No
[^(]+ Match 1+ chars other than (
(?= Positive lookahead
\*\*\s*\( Match ** followed by optional spaces and (
) Close the lookahead
Regex demo
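Note that \K is a PCRE feature and Python's built-in re module does not support it. If you test this outside PCRE, one possible substitute (a sketch, not the pattern from the answer) is to wrap the wanted part in a capture group instead:

```python
import re

# Shortened log fragment taken from the question.
log = (
    "[sev=error]: **No default label available for dialed number: "
    "SM01.GGB.ACCT.BILLING.5555550778** (ID: 44043)."
)

# Group 1 plays the role of "everything after \K" in the PCRE pattern.
pattern = re.compile(r"\*\*(No\b[^(]+)(?=\*\*\s*\()")

match = pattern.search(log)
message = match.group(1) if match else None
print(message)
```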
With your shown samples, please try following regex. Here is the Online demo for used regex.
\bNo\b.*?\sdialed number:.*?\bACCT\.BILLING\.\d+
Explanation:
\bNo\b ##Matching string/word No with word boundaries.
.*?\s ##using lazy match matching till space here.
dialed number: ##Matching dialed number: here.
.*?\bACCT ##using lazy match followed by word boundaries followed by ACCT.
\.BILLING ##Matching literal dot followed by BILLING.
\.\d+ ##Matching literal dot followed by 1 or more occurrences by digits.
How about using a positive lookahead and optional characters like this?
\bNo.+?(?=\**\s?\()
Well, there must be a certain rule which denotes the text you want to match.
I think the rule is something like:
The text "No default label available for dialed number: ", followed by some alphanumeric, dot-separated identification code.
An associated regex would then be:
No default label available for dialed number: [0-9A-Za-z.]+
Note that the [0-9A-Za-z.]+ part is simplified, as this matches dots at the beginning and end of the identification code, as well as multiple consecutive dots, which may be undesired. Also note that you may as well loosen the identification code regex as you like. For example, \S matches all non-whitespace characters, and if you assume that the identification code part is always followed by a whitespace, the pattern \S+ works just fine.
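A small Python comparison of the two variants, assuming the ** markers are literally part of the log line (they may just be formatting in the question). Under that assumption the looser \S+ also picks up the trailing asterisks:

```python
import re

# Shortened log fragment from the question; the "**" is assumed to be literal.
log = (
    "[sev=error]: **No default label available for dialed number: "
    "SM01.GGB.ACCT.BILLING.5555550778** (ID: 44043)."
)

prefix = r"No default label available for dialed number: "

# Character class: stops at the first char that is not a letter, digit or dot.
strict = re.search(prefix + r"[0-9A-Za-z.]+", log).group(0)

# \S+: runs until the next whitespace char, so it includes the "**".
loose = re.search(prefix + r"\S+", log).group(0)

print(strict)
print(loose)
```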

I need to extract all words prior to 4th Space in a line

Good Day
I need to extract all words prior to the 4th space in a line.
Sample Data
Article Number Crt.DI No. Date
6ZZ 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
I have written an expression to extract what is in between Date and Article and the result is this
(?<=Date)(.|\n)*(?=Article)
6RU 999 123 S 000000093 19.01.2021
however I need to retrieve all the characters before the 4th space
6ZZ 999 123 S
This is a material number and this can be 13 or 14 characters before the 4th space.
Appreciate your support.
Sample Data
Article Number Crt.DI No. Date
6RU 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
(Please note: there are new lines in between; these are three consecutive lines and each line is followed by an Enter key)
Regards,
Manjesh
You can use a capture group, and use \s to match a whitespace character or a newline.
The capture group approach can be more flexible in case you want to match more than one whitespace chars or newlines after Date and a quantifier in a lookbehind assertion is not supported.
\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b
See a regex demo.
Or using lookarounds to get a match only.
(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)
The pattern matches:
(?<=\bDate\s) Positive lookbehind to assert Date to the left followed by a whitespace char that can also match a newline
\S+ Match 1 or more non whitespace chars
(?:\s+\S+){3} Repeat 3 times matching 1+ whitespace chars followed by 1+ non whitespace chars
(?= Positive lookahead to assert that what at the right is
[\s\S]*? Match any character including newlines
\bArticle\b Match the word Article
) Close the lookahead
See another regex demo.
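Both patterns can be tried in Python as follows. This is only a sketch on the first sample from the question; the lookbehind works with the re module here because (?<=\bDate\s) has a fixed width.

```python
import re

text = (
    "Article Number Crt.DI No. Date\n"
    "6ZZ 999 123 S 000000093 19.01.2021\n"
    "Article description Replace DI No. Date"
)

# Variant 1: capture group, the value is in group 1.
with_group = re.search(r"\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b", text)
material_from_group = with_group.group(1) if with_group else None

# Variant 2: lookarounds, the value is the whole match.
with_lookarounds = re.search(
    r"(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)", text
)
material_from_match = with_lookarounds.group(0) if with_lookarounds else None

print(material_from_group)
print(material_from_match)
```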

RegEx for example three comma-separated words

We are doing loose validation on zipcode of form CITY, ST, ZIP. These can span countries, so all of the following are valid:
PITTSBURGH, PA, 15020
HAMILTON,ONTARIO,L8E 4B3
All I want to validate is that we have three comma-separated words (whitespace is fine). All of these would be valid:
foo, bar, baz
foo,bar,baz123
However these would be invalid because they don't have exactly two commas and three words:
foo, bar
boo,bar,baz,bang
foo, bar,
foo,bar,baz,
What I've Tried Unsuccessfully
^[\w],[\w],[\w]$
^[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*$ (Doesn't allow spaces)
Also just curious - do yall typically allow whitespaces in regex or prefer an application filters whitespace first and then applies the regex? We can do either.
The pattern ^[\w],[\w],[\w]$ that you tried, can be written as ^\w,\w,\w$ and matches 3 times a single word char with a comma in between.
The pattern ^[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*$ matches 3 times repeating 0 or more times any of the listed chars/ranges in the character class with a comma in between.
As the quantifier * is 0 or more times, it could possibly also match ,,
If the word chars should be present at all 3 occasions, and there can not be spaces at the start and end:
^\w+(?:\s*,\s*\w+){2}$
^ Start of string
\w+ Match 1+ word chars
(?:\s*,\s*\w+){2} Repeat 2 times matching a comma between optional whitspace chars and 1+ word chars
$ End of string
Regex demo
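A quick way to check the pattern against the samples from the question is sketched below in Python. Note that, as written, the pattern requires each part to be a single run of word chars, so a value with an inner space such as L8E 4B3 is not accepted.

```python
import re

pattern = re.compile(r"^\w+(?:\s*,\s*\w+){2}$")

samples = [
    "foo, bar, baz",
    "foo,bar,baz123",
    "PITTSBURGH, PA, 15020",
    "HAMILTON,ONTARIO,L8E 4B3",  # inner space in the third part
    "foo, bar",
    "boo,bar,baz,bang",
    "foo, bar,",
    "foo,bar,baz,",
]

# Map every sample to True/False depending on whether the pattern matches.
results = {s: pattern.search(s) is not None for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

If inner spaces in a part should be allowed, the \w+ parts would need to be loosened; that is a design choice the question leaves open.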
Note that \s can also match a newline. If you want to match spaces only, and the string can also start and end with a space, you could use the pattern from @anubhava in the comments.
Try
^\w*\W?,\W?\w*\W?,\W?(\w| ){1,}
(I tested by your examples)

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with the Name: pre appended.
My regex expression to get the fullname is (?<=Name:\s)(.*) works fine to get the whole name. This (?<=Name:\s)([a-zA-Z]+) seems to get the first name.
So an expression each to get for first,middle & last name would be ideal. Could someone guide me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> import re
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
As you stated in the comments that you use .NET you can make use of a quantifier in the lookbehind to select which part of a "word" you want to select after Name:
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
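The quantifier inside the lookbehind relies on .NET's support for variable-length lookbehinds. Python's built-in re module only allows fixed-width lookbehinds, so a comparable sketch there uses a capture group instead. The helper name name_part is hypothetical and only for illustration.

```python
import re
from typing import Optional


def name_part(text: str, index: int) -> Optional[str]:
    """Return the word at position `index` (0-based) after 'Name:'.

    Hypothetical helper: skips `index` words, then captures the next one.
    """
    pattern = rf"\bName:(?:\s+\w+){{{index}}}\s+(\w+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None


text = "ID Number / 1234\nName: John Doe Smith\nNationality: US"

first = name_part(text, 0)
middle = name_part(text, 1)
last = name_part(text, 2)
print(first, middle, last)
```

Since \s also matches a newline, an index beyond the last name part would continue into the next line, so the index should stay within the number of name parts.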

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed, that within the dataset, that the Province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code, so that the expression only targets the second line, then matches the first string in-between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7' which I dont want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert what follows is 3 times a line.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1, match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
Regex demo
You could also turn the lookahead into a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
Regex demo
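For reference, a short Python sketch of the lookahead variant on the sample text from the question. It assumes the text has no trailing newline after the last line, and the variable names are only illustrative.

```python
import re

text = (
    "123 anywhere Avenue\n"
    "Winnipeg, Manitoba R3E 0L7\n"
    "Canada\n"
    "Pharmacy Manager: person person\n"
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd."
)

# Without re.MULTILINE, ^ and $ anchor to the start and end of the whole
# string, so the match is tied to the line that is 3 lines above the last.
pattern = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)")

match = pattern.search(text)
province = match.group(1) if match else None
print(province)
```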