Support required in writing a regex in PCRE - regex

I am not very good at regex and am learning it day by day. I have an issue where I want to extract the data after # and before > if they exist in the field value; otherwise it should return the value as is.
Data example: <abc#xyz.com>, chene.com abc.xyz#xyz.com
Expected output of my regex should be xyz.com, chene.com and xyz.com.
What I wrote is
([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
but this is not fetching all of the required data.

I suggest capturing the part you need using
(?:<?[\w.-]+#)?\b(?<from_domain>\w[\w.-]*\.[a-zA-Z]{2,5})\b
See the regex demo
Details
(?:<?[\w.-]+#)? - an optional non-capturing group that matches
<? - an optional < char
[\w.-]+ - 1+ word chars, . or - chars
# - a # char
\b - a word boundary
(?<from_domain>\w[\w.-]*\.[a-zA-Z]{2,5}) - Group "from_domain":
\w[\w.-]* - a word char followed with 0 or more word, dot or hyphen chars
\. - a dot
[a-zA-Z]{2,5} - two to five ASCII letters
\b - a word boundary
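The pattern can be tried out with a short Python sketch. Python's re module is not PCRE and spells named groups as (?P<name>...), so the group syntax is adapted here; the helper name extract_domains is made up for illustration.

```python
import re

# Same pattern as above, with the named group written in
# Python's (?P<name>...) syntax instead of (?<name>...).
DOMAIN_RE = re.compile(
    r"(?:<?[\w.-]+#)?\b(?P<from_domain>\w[\w.-]*\.[a-zA-Z]{2,5})\b"
)


def extract_domains(text):
    """Return the from_domain group of every match found in text."""
    return [m.group("from_domain") for m in DOMAIN_RE.finditer(text)]


# Data example from the question.
print(extract_domains("<abc#xyz.com>, chene.com abc.xyz#xyz.com"))
# ['xyz.com', 'chene.com', 'xyz.com']
```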

Related

Regex exclude whitespaces from a group to select only a number

I need to take only a number (a float number) from a text, but I can't remove the whitespaces...
Update: I have a problem with this method. As a rule, I only need to consider digits and ',' between '- EUR' and 'Fee'.
You can use
- EUR\W*(.*?)\W*Fee
See the regex demo.
Variations of the regex that might work in different regex engines:
- EUR\W*\K.*?(?=\W*Fee)
(?<=- EUR\W*).*?(?=\W*Fee)
Details:
- EUR - literal text
\W* - zero or more non-word chars
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\W* - zero or more non-word chars
Fee - a string.
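As a rough illustration, the capture-group version can be run in Python. Python's re module supports neither \K nor variable-length lookbehind, so only the first form is shown; the sample string and the helper name amount_between are assumptions, not taken from the question.

```python
import re

# Capture-group form of the pattern; Group 1 holds the text between
# "- EUR" and "Fee" with surrounding non-word chars trimmed.
AMOUNT_RE = re.compile(r"- EUR\W*(.*?)\W*Fee")


def amount_between(text):
    """Return Group 1 of the first match, or None if nothing matches."""
    m = AMOUNT_RE.search(text)
    return m.group(1) if m else None


sample = "Payment - EUR   12,50   Fee"  # made-up input
print(amount_between(sample))  # 12,50
```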
You could also match the number format in capture group 1
- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b
- EUR\b Match - EUR and a word boundary
\D* Match 0+ times any char except a digit
( Capture group 1
\d+(?:,\d+)? Match 1+ digits with an optional decimal part
) Close group 1
\s+Fee\b Match 1+ whitespace chars, Fee and a word boundary
Regex demo
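A small Python sketch of the stricter pattern, which only accepts digits with an optional comma part. The sample strings, the helper names, and the assumption that the comma is a decimal separator are mine.

```python
import re

# Group 1 must look like a number: digits, optionally ",digits".
NUMBER_RE = re.compile(r"- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b")


def fee_amount(text):
    """Return the matched number as a string, or None."""
    m = NUMBER_RE.search(text)
    return m.group(1) if m else None


def fee_amount_as_float(text):
    """Convert the match to float, assuming ',' is the decimal separator."""
    raw = fee_amount(text)
    return float(raw.replace(",", ".")) if raw is not None else None


print(fee_amount("Payment - EUR   12,50   Fee"))  # 12,50
print(fee_amount_as_float("- EUR 7 Fee"))         # 7.0
```

Unlike the (.*?) version, this one returns no match when the text between - EUR and Fee does not end in a number.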
This is working; I removed the , from (.) in the test string.
Regex example - working

Regex exclude trailing text from company names

CURRENTLY
I am trying to match valid company names from strings, with 4 conditions:
the name can ONLY contain alphanumeric characters + spaces + hyphens
the name can contain a hyphen (inside the name)
there are company suffixes that should be excluded from the company name i.e. Pty Ltd, Pty. Ltd., Limited, and Ltd.
If there are additional matches on the same line, these are to be excluded
What I am trying to achieve:
My regex so far:
(?:\s|^)([a-zA-Z0-9]+[a-zA-Z0-9\s-]*?[a-zA-Z0-9]+)(?: Pty Ltd| Ltd(\.){0,1}| Limited){0,1}(?:\s|$)
ISSUES
https://regex101.com/r/Gpbdln/4
It seems I am struggling with:
Excluding the suffixes to be ignored
Making the capture include spaces for the company name (while at the same time excluded suffixes)
I have been stuck on this for over an hour and would appreciate some help.
You may use
^[a-zA-Z0-9]+(?:[\s-]+[a-zA-Z0-9]+)*?(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$)
See the regex demo
If you only need to get matches that do not span across lines, replace \s with \h or [\p{Zs}\t] if supported, or [^\S\r\n], to only match horizontal whitespaces.
Details
^ - start of string
[a-zA-Z0-9]+ - 1+ ASCII alphanumeric chars
(?:[\s-]+[a-zA-Z0-9]+)*? - 0 or more (but as few as possible) occurrences of
[\s-]+ - 1+ whitespaces or hyphens
[a-zA-Z0-9]+ - 1+ ASCII alphanumeric chars
(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$) - immediately to the right, there must be
(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)? - an optional occurrence of a sequence of patterns:
\s+ - 1+ whitespaces
(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]) - any of
(?:Pty\.?\s+)?Ltd\.?| - an optional sequence of Pty, an optional dot and then 1+ whitespaces and then Ltd string and an optional . char, or
Limited| - Limited string, or
[a-zA-Z0-9]*[^a-zA-Z0-9\s] - any 0 or more ASCII alphanumeric chars followed with a char other than whitespace and alphanumeric char
.* - the rest of the string
$ - end of string.
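To see how the lazy quantifier and the lookahead interact, here is a minimal Python sketch. The company names are invented test inputs and company_name is a hypothetical helper; each input is treated as a single line.

```python
import re

# The pattern from the answer, split over two literals for readability.
COMPANY_RE = re.compile(
    r"^[a-zA-Z0-9]+(?:[\s-]+[a-zA-Z0-9]+)*?"
    r"(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$)"
)


def company_name(line):
    """Return the name without suffix or trailing text, or None."""
    m = COMPANY_RE.match(line)
    return m.group(0) if m else None


for line in [
    "Acme Widgets Pty Ltd",
    "Foo-Bar Limited",
    "Example Co Ltd. (formerly Other)",
    "Simple Name",
]:
    print(repr(company_name(line)))
# 'Acme Widgets'
# 'Foo-Bar'
# 'Example Co'
# 'Simple Name'
```

The name grows one word at a time and stops at the first position where the rest of the line is empty, a suffix, or a token containing a non-alphanumeric character.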

Regex to match variable length, spaces and special chars?

I've got some strings like so
2020-03-05 11:23:25: zone 10 type Interior name 'Study PIR'
2020-03-05 11:57:15: zone 13 type Entry/Exit 1 name 'Front Door'
I've got the below regex that works for the first string; however, I'm not sure how to get the product group to match the full value "Entry/Exit 1". The number can range from 1 to 100.
(?<Date>[0-9]{4}-[0-2][1-9]-[0-2][1-9]) (?<Time>2[0-3]|[01][0-9]:[0-5][0-9]:[0-5][0-9]): (?<msgType>\w+) (?<id>[0-9]+) (?<type>\w+) (?<product>\w+) \w+ (?<deviceName>'([^']*)')
Any ideas how I can modify this to match?
Your product group pattern should be
(?<product>\w+(?:\/\w+\s+\d+)?)
See the regex demo
Details
\w+ - 1+ word chars
(?:\/\w+\s+\d+)? - an optional sequence of
\/ - a / char
\w+ - 1+ word chars
\s+ - 1+ whitespaces
\d+ - 1+ digits.
If the format is unknown, or does not fit the above description, just use (?<product>.*?), see demo.
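A possible way to check the product group against both sample lines in Python. This is only a sketch: the date and time parts are simplified to plain digit groups (an assumption, not the pattern from the question), and named groups use Python's (?P<name>...) spelling.

```python
import re

# Simplified date/time groups plus the suggested product group.
LINE_RE = re.compile(
    r"(?P<Date>\d{4}-\d{2}-\d{2}) (?P<Time>\d{2}:\d{2}:\d{2}): "
    r"(?P<msgType>\w+) (?P<id>[0-9]+) (?P<type>\w+) "
    r"(?P<product>\w+(?:/\w+\s+\d+)?) \w+ (?P<deviceName>'[^']*')"
)


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


lines = [
    "2020-03-05 11:23:25: zone 10 type Interior name 'Study PIR'",
    "2020-03-05 11:57:15: zone 13 type Entry/Exit 1 name 'Front Door'",
]
for line in lines:
    print(parse_line(line)["product"])
# Interior
# Entry/Exit 1
```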

Regex search the 2 nearest keywords

I want to search for the keyword TIMESTAMP inside a CREATE TABLE statement. This is my regex:
(?i)(\s+|^)CREATE\s+TABLE\s+\[\s*\bdbo\b\s*\]\.\[\w+\]\s*\(\s*((.|\n)*)\bTIMESTAMP
But it matches CREATE TABLE in one query and TIMESTAMP in another query.
Like this
Can you help me, please?
When you just want to search Create Table and Timestamp you can use this simple regex:
(?i)(CREATE TABLE|TIMESTAMP)
The (?i) is optional and makes the match case-insensitive.
You may use
(?im)^CREATE\s+TABLE\s+\[\s*dbo\s*\]\.\[\w+\]\s*\(\s*(.*(?:\n(?!CREATE\s+TABLE\b).*)*)\bTIMESTAMP\b
See this regex demo
If your regex can't match a CR char with . add \r? before \n.
Note you do not need \b word boundaries on both ends of dbo as it is inside [...].
Details
(?im) - ignore case and multiline modes on
^ - start of a line
CREATE\s+TABLE\s+\[ - CREATE TABLE [ with any 1+ whitespaces in between words
\s*dbo\s* - a dbo string enclosed with 0+ whitespaces
\]\.\[ - ].[ string
\w+ - 1+ word chars
\] - ] char
\s*\(\s* - a ( enclosed with 0+ whitespaces
(.*(?:\n(?!CREATE\s+TABLE\b).*)*) - Group 1:
.* - any 0+ chars other than line break chars
(?:\n(?!CREATE\s+TABLE\b).*)* - 0 or more sequences of
\n(?!CREATE\s+TABLE\b) - a newline not followed with CREATE TABLE
.* - any 0+ chars other than line break chars
\bTIMESTAMP\b - a whole word TIMESTAMP
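The effect of the \n(?!CREATE\s+TABLE\b) guard can be checked with a short Python sketch. The SQL text is made up for the test and assumes every CREATE TABLE starts at the beginning of a line; find_timestamp_tables is a hypothetical helper.

```python
import re

# The pattern from the answer; (?im) turns on ignore-case and multiline.
TABLE_TS_RE = re.compile(
    r"(?im)^CREATE\s+TABLE\s+\[\s*dbo\s*\]\.\[\w+\]\s*\(\s*"
    r"(.*(?:\n(?!CREATE\s+TABLE\b).*)*)\bTIMESTAMP\b"
)

SQL = """CREATE TABLE [dbo].[Orders] (
    Id INT,
    Total DECIMAL(10, 2)
)
CREATE TABLE [dbo].[Events] (
    Id INT,
    CreatedAt TIMESTAMP
)
"""


def find_timestamp_tables(sql):
    """Return the matched text for each CREATE TABLE that reaches TIMESTAMP."""
    return [m.group(0) for m in TABLE_TS_RE.finditer(sql)]


for found in find_timestamp_tables(SQL):
    print(found.splitlines()[0])
# CREATE TABLE [dbo].[Events] (
```

The first statement is skipped because the group cannot cross a line that begins with CREATE TABLE, so it never reaches the TIMESTAMP in the second statement.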
It might be easier to do it in two steps.
Step 1: find "complete" CREATE TABLE statements. Actually, find the span of the outermost parentheses.
(?i)(^ *)CREATE\s+TABLE\s+[^()]*\(([^()]*\([^()]*\))*[^()]*\)
Test here.
Step 2: find timestamp in the resulting found strings.
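The two steps might look like this in Python. This is a sketch under assumptions: multiline mode is switched on so that ^ matches at each line start, the step 2 pattern \bTIMESTAMP\b is my choice, and the SQL sample and function name are invented.

```python
import re

# Step 1: a CREATE TABLE header plus its outermost parentheses
# (one level of nested parentheses is allowed).
STATEMENT_RE = re.compile(
    r"(^ *)CREATE\s+TABLE\s+[^()]*\(([^()]*\([^()]*\))*[^()]*\)",
    re.IGNORECASE | re.MULTILINE,
)
# Step 2: look for the keyword inside one statement only.
TIMESTAMP_RE = re.compile(r"\bTIMESTAMP\b", re.IGNORECASE)

SQL = """CREATE TABLE [dbo].[Orders] (
    Id INT,
    Total DECIMAL(10, 2)
)
CREATE TABLE [dbo].[Events] (
    Id INT,
    CreatedAt TIMESTAMP
)
"""


def tables_with_timestamp(sql):
    """Return the complete statements that contain TIMESTAMP."""
    statements = [m.group(0) for m in STATEMENT_RE.finditer(sql)]
    return [s for s in statements if TIMESTAMP_RE.search(s)]


for stmt in tables_with_timestamp(SQL):
    print(stmt.splitlines()[0])
# CREATE TABLE [dbo].[Events] (
```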

Regex help for Event Match that are unique, though the pattern is same

Here is my regex: https://regex101.com/r/g56UzY/1
I have this pattern:
pdlvkw6v INFO 18:25:03.994 pdlvkw6v WARN 18:25:03.994 pdlvkw6v INFO
18:25:03.994 rg9n9bz7 INFO 18:23:52.987 rg9n9bz7 ERROR 19:23:52.987
rg9n9bz7 INFO 21:23:52.987 5y6n9bz7 WARN 18:23:52.987
and my current regex is: [\w]{8}\s+(INFO|WARN|ERROR)\s+\d\d:\d\d:\d\d\.\d\d\d
I want the regex to match each unique string only once, i.e. show pdlvkw6v, after that rg9n9bz7 and then 5y6n9bz7; it should not match the repeated strings.
What I am trying to do is break events out of multiline text based on this fixed string. Since one event can contain the string several times, I want to break on the first matching string and leave the rest inside the event.
You need to capture the word you are interested in and add a negative lookahead check:
(?s)\b(\w{8})\b(?!.*\b\1\b)\s+(?:INFO|WARN|ERROR)\s+\d\d(?::\d\d){2}\.\d{3}
    ^^^^^^^^^^^^^^^^^^^^^^^
Or, if (?s) modifier is not supported:
\b(\w{8})\b(?![\s\S]*\b\1\b)\s+(?:INFO|WARN|ERROR)\s+\d\d(?::\d\d){2}\.\d{3}
See the regex demo
Explanation:
(?s) - a DOTALL modifier making . match any char
\b - a word boundary
(\w{8}) - Group 1: 8 word chars
\b - a word boundary
(?!.*\b\1\b) - the negative lookahead that fails the match if immediately to the right of the current location, after 0+ chars, there is a whole word equal to the one stored in the Group 1 buffer
\s+ - 1+ whitespaces
(?:INFO|WARN|ERROR) - one of the three substrings
\s+ - 1+ whitespaces
\d\d - 2 digits
(?::\d\d){2} - 2 sequences of :, digit, digit
\. - a dot
\d{3} - three digits
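Running the pattern over the sample from the question in Python shows which occurrences survive the lookahead. The helper name unique_ids is made up. Because the lookahead rejects any ID that appears again later in the text, each ID is reported once, at its last occurrence.

```python
import re

# (?s) lets . cross line breaks inside the negative lookahead.
EVENT_RE = re.compile(
    r"(?s)\b(\w{8})\b(?!.*\b\1\b)\s+(?:INFO|WARN|ERROR)\s+\d\d(?::\d\d){2}\.\d{3}"
)

# Sample text from the question.
LOG = (
    "pdlvkw6v INFO 18:25:03.994 pdlvkw6v WARN 18:25:03.994 pdlvkw6v INFO\n"
    "18:25:03.994 rg9n9bz7 INFO 18:23:52.987 rg9n9bz7 ERROR 19:23:52.987\n"
    "rg9n9bz7 INFO 21:23:52.987 5y6n9bz7 WARN 18:23:52.987"
)


def unique_ids(text):
    """Return Group 1 of every match, in order of appearance."""
    return [m.group(1) for m in EVENT_RE.finditer(text)]


print(unique_ids(LOG))
# ['pdlvkw6v', 'rg9n9bz7', '5y6n9bz7']
```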