C# Regex: trying to understand the logic

I'm struggling to understand the logic of regex. Let's say I have this string:
1 SM-TEST S/M-BLEU, 25.00 EA 96.00
private void Window_Loaded(object sender, RoutedEventArgs e)
{
    var test = ReadPdfFile("C:\\Users\\mducharme\\Desktop\\PO # 70882.pdf");
    // Split the extracted text into lines, whatever the line-ending style.
    var result = Regex.Split(test, "\r\n|\r|\n");
    foreach (var lines in result)
    {
        // Only process lines that start with a single digit followed by whitespace.
        if (Regex.IsMatch(lines, @"^\d\s"))
        {
            string line = lines.ToString();
            // Verbatim strings use the @ prefix, so backslashes are not escaped.
            string pattern = @"^(\S+\s+\S+).*?,(?=\s*\d+\.\d+\b)";
            string replacement = "$1";
            string result2 = Regex.Replace(line, pattern, replacement);
            System.Diagnostics.Debug.WriteLine(result2);
        }
    }
}
Each line holds a different value, like the first one, and so on:
2 SM-BLABLA S-M-YELLOW, 50.00 EA 96.00...
In the end, for the first value only, I want to show this in my MessageBox:
1 SM-TEST 25.00 EA 96.00
but the regex doesn't seem to do its job compared to the code on the regex101 website.
Thank you,

Use
^(\d+\s+(?:(?!\d+\.\d+\s+[xX]\s+\d+\.\d+)[A-Z0-9-])+).*?(?=\s\d+\.\d+\s)
See regex proof. Replace with $1.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
[xX] any character of: 'x', 'X'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[A-Z0-9-] any character of: 'A' to 'Z', '0' to
'9', '-'
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
) end of look-ahead
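As a quick sanity check, the replacement can be sketched in Python. The question uses C#, but Python's re module accepts this pattern unchanged; the .NET replacement $1 is written \1 there. The two sample lines come from the question, and the name trim_description is made up for illustration.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(
    r'^(\d+\s+(?:(?!\d+\.\d+\s+[xX]\s+\d+\.\d+)[A-Z0-9-])+).*?(?=\s\d+\.\d+\s)'
)

def trim_description(line):
    """Keep the line number and item code, drop the text up to the first price."""
    # \1 in Python plays the role of $1 in a .NET replacement string.
    return PATTERN.sub(r'\1', line)

lines = [
    "1 SM-TEST S/M-BLEU, 25.00 EA 96.00",
    "2 SM-BLABLA S-M-YELLOW, 50.00 EA 96.00",
]
for line in lines:
    print(trim_description(line))
# 1 SM-TEST 25.00 EA 96.00
# 2 SM-BLABLA 50.00 EA 96.00
```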

Related

Regex pattern to extract Hearst patterns

I am new to Regex and I am unable to extract hyponym-hypernym pairs in the form of a list or tuple.
I tried using this pattern but I get no matches
(NP_[\w.]*(, NP_[\w.]*)*,? (and)? other NP_[\w.]*)
I have the following annotated sentences for 'and other' pattern:
NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges .
The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites .
I want to extract a list such as :
[NP_dui,NP_fleeing or NP_evading_police, NP_possible_charges]
OR
(NP_dui,NP_possible_charges)
(NP_fleeing or NP_evading_police,NP_possible_charges)
Similarly for sentence 2:
[NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear, NP_old_favorites]
or similar tuples.
Any help would be appreciated.
Use
NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+
This extracts the substrings that contain your matches. Next, extract the expected entities from each substring with NP_[\w.]*\b.
Python code:
import re
test_strs = ["NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges.",
             "The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites ."]
p = r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'
for test_str in test_strs:
    matches = []
    # Each match is a whole "NP_a , NP_b and other NP_c" run.
    for match in re.findall(p, test_str):
        # Pull the individual NP_ chunks out of the run.
        matches.extend(re.findall(r'NP_[\w.]*\b', match))
    print(matches)
Results:
['NP_dui', 'NP_fleeing', 'NP_evading_police', 'NP_possible_charges']
['NP_frog', 'NP_miss_piggy', 'NP_fozzie_bear', 'NP_old_favorites']
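If tuples are preferred, as asked in the question, each match can be turned into (hyponym, hypernym) pairs. This is only a sketch: it assumes the last NP_ chunk of a match (the one after "other") is the hypernym and every earlier chunk is a hyponym, which will not hold for runs that lack "and other". The helper name hearst_pairs is made up for illustration.

```python
import re

# Same pattern as in the answer above.
P = r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) tuples for every matched run in the sentence."""
    pairs = []
    for match in re.findall(P, sentence):
        chunks = re.findall(r'NP_[\w.]*\b', match)
        # Assumption: the final chunk is the hypernym, the rest are hyponyms.
        *hyponyms, hypernym = chunks
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs

sentence = ("NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , "
            "and other NP_possible_charges .")
print(hearst_pairs(sentence))
# [('NP_dui', 'NP_possible_charges'), ('NP_fleeing', 'NP_possible_charges'),
#  ('NP_evading_police', 'NP_possible_charges')]
```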
EXPLANATION
--------------------------------------------------------------------------------
NP_ 'NP_'
--------------------------------------------------------------------------------
[\w.]* any character of: word characters (a-z, A-
Z, 0-9, _), '.' (0 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
or 'or'
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
,? ',' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
and 'and'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
other 'other'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
NP_ 'NP_'
--------------------------------------------------------------------------------
[\w.]* any character of: word characters (a-z,
A-Z, 0-9, _), '.' (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)+ end of grouping

condition regex capture only if previous group matching else set capture to null?

I'm trying to get a conditional regex to capture the domain and user-agent values from the 2 events in https://regex101.com/r/51mp2i/1
but I'm only getting 1 match. How can I update the regex to get 2 matches using a conditional regex? Thanks.
Match 1:
domain: example.org
useragent: "" or not capture
Match 2:
domain: example.org
useragent: Mozilla/5.0 (compatible;example-checks/1.0;+https://www.example.com/; check-id: 9EXc112795a4766a)
Use
"headers":\s+\[{"name":\s+"Host",\s+"value":\s+"(?<domain>[^"]+)(?:.*?"(?i)User-?(?i)Agent",\s+"value":\s+"(?<useragent>[^"]*))?
See proof.
EXPLANATION
--------------------------------------------------------------------------------
"headers": '"headers":'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
{"name": '{"name":'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
"Host", '"Host",'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
"value": '"value":'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^"]+ any character except: '"' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
"User '"User'
--------------------------------------------------------------------------------
-? '-' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
Agent", 'Agent",'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
"value": '"value":'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
)? end of grouping
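For readers using Python, the same idea can be sketched as follows. Two adaptations are needed and are assumptions of this sketch, not part of the answer: Python writes named groups as (?P<name>...), and recent Python versions reject (?i) in the middle of a pattern, so a scoped (?i:...) group is used around the header name only. The two events below are made-up stand-ins for the data behind the regex101 link.

```python
import re

# Pattern from the answer, adapted to Python syntax (see the notes above).
PATTERN = re.compile(
    r'"headers":\s+\[{"name":\s+"Host",\s+"value":\s+"(?P<domain>[^"]+)'
    r'(?:.*?"(?i:User-?Agent)",\s+"value":\s+"(?P<useragent>[^"]*))?'
)

# Hypothetical events, one per line; the second one carries a User-Agent header.
events = "\n".join([
    '"headers": [{"name": "Host", "value": "example.org"}]',
    '"headers": [{"name": "Host", "value": "example.org"}, '
    '{"name": "User-Agent", "value": "Mozilla/5.0 (compatible;example-checks/1.0;'
    '+https://www.example.com/; check-id: 9EXc112795a4766a)"}]',
])

# The optional group leaves useragent as None when no User-Agent header follows.
results = [(m.group("domain"), m.group("useragent")) for m in PATTERN.finditer(events)]
for domain, useragent in results:
    print(domain, useragent)
```

Because . does not match a newline by default, the lazy .*? cannot run from the first event into the second one, so each event yields its own match.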

How can I change this regex in order to validate phone numbers without an international prefix?

I am not very experienced with regex and I am having some trouble adapting this regex, which verifies a phone number, to my use case.
I have this regex that validates phone numbers with an international prefix: https://www.regextester.com/97440
(([+][(]?[0-9]{1,3}[)]?)|([(]?[0-9]{4}[)]?))\s*[)]?[-\s\.]?[(]?[0-9]{1,3}[)]?([-\s\.]?[0-9]{3})([-\s\.]?[0-9]{3,4})
It correctly validates strings such as +39 3298494333, but it doesn't validate a string representing a number without the international prefix. For example, the string 3298494333 doesn't match my regex.
How can it be changed so that it also accepts phone numbers that don't have the prefix?
Use this fix:
(?:(?:\+\(?[0-9]{1,3}|\(?\b[0-9]{4})\)?)?\s*\)?[-\s.]?\(?[0-9]{1,3}\)?[-\s.]?[0-9]{3}[-\s.]?[0-9]{3,4}
See proof.
Explanation
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\+ '+'
--------------------------------------------------------------------------------
\(? '(' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
[0-9]{1,3} any character of: '0' to '9' (between
1 and 3 times (matching the most
amount possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\(? '(' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
[0-9]{4} any character of: '0' to '9' (4 times)
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
[-\s.]? any character of: '-', whitespace (\n, \r,
\t, \f, and " "), '.' (optional (matching
the most amount possible))
--------------------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
[0-9]{1,3} any character of: '0' to '9' (between 1
and 3 times (matching the most amount
possible))
--------------------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
[-\s.]? any character of: '-', whitespace (\n, \r,
\t, \f, and " "), '.' (optional (matching
the most amount possible))
--------------------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
--------------------------------------------------------------------------------
[-\s.]? any character of: '-', whitespace (\n, \r,
\t, \f, and " "), '.' (optional (matching
the most amount possible))
--------------------------------------------------------------------------------
[0-9]{3,4} any character of: '0' to '9' (between 3
and 4 times (matching the most amount
possible))
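A small Python check of the fixed pattern against the two numbers from the question might look like this. The pattern itself is unanchored, so using fullmatch to require that the whole string is a phone number is an assumption of this sketch; the name is_valid_phone and the third sample are made up for illustration.

```python
import re

# Fixed pattern from the answer above, split over two lines for readability.
PHONE = re.compile(
    r'(?:(?:\+\(?[0-9]{1,3}|\(?\b[0-9]{4})\)?)?\s*\)?[-\s.]?'
    r'\(?[0-9]{1,3}\)?[-\s.]?[0-9]{3}[-\s.]?[0-9]{3,4}'
)

def is_valid_phone(text):
    # Assumption: the entire string must match, hence fullmatch.
    return PHONE.fullmatch(text) is not None

for sample in ["+39 3298494333", "3298494333", "12345"]:
    print(sample, is_valid_phone(sample))
# +39 3298494333 True
# 3298494333 True
# 12345 False
```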

Regex to separate groups of key, operator and value of a clause

What is the best regex to get groups of keys, operator and values from a clause like the image below?
What I have done so far is not accurate and is only able to get the first group: (^.*?(=|!=)+([^.]*))
Use
(\w+(?:\.\w+)*)\s*(!=|=)\s*(\w+)
See proof
Explanation
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1
or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
!= '!='
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \3
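To see the three groups in action, here is a short Python sketch. The clause in the question was an image that is not available here, so the input string below is a made-up example.

```python
import re

# Pattern from the answer above: key (with optional dotted parts), operator, value.
CLAUSE = re.compile(r'(\w+(?:\.\w+)*)\s*(!=|=)\s*(\w+)')

# Hypothetical clause; the original one is not reproduced in the question text.
text = "user.name = john and user.age != 30"

# findall returns one (key, operator, value) tuple per match.
triples = CLAUSE.findall(text)
print(triples)
# [('user.name', '=', 'john'), ('user.age', '!=', '30')]
```

Listing != before = in the alternation keeps the two operators clearly separated when reading the pattern.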

What regex to use in this case

Correct code:
"key1=val1;key2=val2;key3=val3" -- Valid, as every pair except the last is followed by ";"
Incorrect code:
"key1=val1;key2=val2; key3=val3;" -- Invalid, as the last pair is followed by ";"
"key1=val1;;;key2=val2;;;key3=val3" -- Invalid, as there are multiple ";" in the middle
I got the regex below from an old Stack Overflow post, but it does not work for the cases above:
^(?:\s*\w+\s*=\s*[^;]*;)+$
You might use
^\w+\s*=\s*\w+(?:;\s*\w+\s*=\s*\w+)*$
Explanation
^ Start of string
\w+\s*=\s*\w+ Match 1+ word chars, = and 1+ word chars with optional whitespace chars
(?: Non capture group
;\s*\w+\s*=\s*\w+ Match ; and the same pattern as mentioned above
)* Close the group and repeat 0+ times
$ End of string
Regex demo
With the doubled backslashes
^\\w+\\s*=\\s*\\w+(?:;\\s*\\w+\\s*=\\s*\\w+)*$
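The first pattern can be checked against the three strings from the question with a few lines of Python. This is only a sketch; the name is_valid_pairs is made up for illustration.

```python
import re

# First pattern from this answer; a raw string avoids doubling the backslashes.
PAIRS = re.compile(r'^\w+\s*=\s*\w+(?:;\s*\w+\s*=\s*\w+)*$')

def is_valid_pairs(text):
    return PAIRS.match(text) is not None

samples = [
    "key1=val1;key2=val2;key3=val3",      # valid
    "key1=val1;key2=val2; key3=val3;",    # trailing ";"
    "key1=val1;;;key2=val2;;;key3=val3",  # repeated ";"
]
for sample in samples:
    print(is_valid_pairs(sample), sample)
# True key1=val1;key2=val2;key3=val3
# False key1=val1;key2=val2; key3=val3;
# False key1=val1;;;key2=val2;;;key3=val3
```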
Also, an alternative that allows optional whitespace before each key and at the end of the string:
^(?:\s*\w+\s*=\s*\w+(?:;(?!\s*$)|\s*$))+\s*$
See proof
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end
of the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
Try below:
.*\w;\w.*\w;\w.*[^;]$
Test here
Explanation:
.* --> matches any character (except a newline), zero or more times
\w --> matches any word character
[^;]$ --> requires the last character of the line to be something other than ;
I find things like this much easier without regex. For example, with JavaScript:
function isValid(string) {
    // Split into pairs on ";", then require each pair to split into exactly two parts on "=".
    return string.split(/;/).map(e => e.split(/=/)).every(e => e.length === 2);
}