Regex to match variable length, spaces and special chars? - regex

I've got some strings like so
2020-03-05 11:23:25: zone 10 type Interior name 'Study PIR'
2020-03-05 11:57:15: zone 13 type Entry/Exit 1 name 'Front Door'
I've got the regex below, which works for the first string. However, I'm not sure how to get the product group to match the full value "Entry/Exit 1". The number can range from 1 to 100.
(?<Date>[0-9]{4}-[0-2][1-9]-[0-2][1-9]) (?<Time>2[0-3]|[01][0-9]:[0-5][0-9]:[0-5][0-9]): (?<msgType>\w+) (?<id>[0-9]+) (?<type>\w+) (?<product>\w+) \w+ (?<deviceName>'([^']*)')
Any ideas how I can modify this to match?

Your product group pattern should be
(?<product>\w+(?:\/\w+\s+\d+)?)
See the regex demo
Details
\w+ - 1+ word chars
(?:\/\w+\s+\d+)? - an optional sequence of
\/ - a / char
\w+ - 1+ word chars
\s+ - 1+ whitespaces
\d+ - 1+ digits.
If the format is unknown, or does not fit the above description, just use (?<product>.*?), see demo.
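To check the product group against both sample strings, here is a minimal Python sketch. Two things in it are assumptions rather than part of the question: Python's re module writes named groups as (?P<name>...) instead of (?<name>...), and the date and time parts are simplified to plain digit runs to keep the example short.

```python
import re

# Assumption: Python syntax, so named groups are written (?P<name>...).
# Assumption: date and time are simplified to fixed-width digit runs.
LINE_RE = re.compile(
    r"(?P<Date>\d{4}-\d{2}-\d{2}) (?P<Time>\d{2}:\d{2}:\d{2}): "
    r"(?P<msgType>\w+) (?P<id>\d+) (?P<type>\w+) "
    r"(?P<product>\w+(?:/\w+\s+\d+)?) \w+ (?P<deviceName>'[^']*')"
)

lines = [
    "2020-03-05 11:23:25: zone 10 type Interior name 'Study PIR'",
    "2020-03-05 11:57:15: zone 13 type Entry/Exit 1 name 'Front Door'",
]

# Collect the product group for every sample line.
products = [LINE_RE.match(line).group("product") for line in lines]
print(products)
```

The optional (?:/\w+\s+\d+)? part is skipped for "Interior" and consumed for "Entry/Exit 1", so one pattern covers both shapes.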

Related

I need to extract all the characters behind a certain date. using regex

I need to extract all the characters behind a certain date using regex.
I tried something like this before, since I knew the pattern:
(\d{7,8})|([A-Za-z0-9\/]{12})|([0-9\/-]{8,9})
As and when I receive new invoices, the PDFs contain invoice numbers in different formats. One thing that is certain is that the invoice number is followed by an invoice date in the format DD/MM/YYYY.
So I need all the data before this date
Sample Data
91504458 26/04/2022
TYRES/REEXPORT 04/07/2022
TYRES/RE-EXPORT 23/09/2022
SAM0112/2022 23/05/2021
020/22-23 17/02/2022
SAM0141/2022 19/03/1975
91/22-23 01/01/2022
SAM0159/2022 15/08/2021
111/22-23 09/09/2021
SAM0106/2022 09/09/2022
017/2022 08/08/2022
91/22-23 07/07/2022
Expected Output Data
91504458
TYRES/REEXPORT
TYRES/RE-EXPORT
SAM0112/2022
020/22-23
SAM0141/2022
91/22-23
SAM0159/2022
111/22-23
SAM0106/2022
017/2022
91/22-23
Appreciate your feedback on the same.
Regards,
Manjesh
You could use word boundaries and an alternation to list and capture the allowed formats in group 1 before matching the date format at the end of the string.
\b([a-zA-Z]+(?:[/-][A-Za-z]+)+|\d{7,8}|(?:[a-zA-Z]+\d+|\d+\/\d\d-?\d\d)(?:/\d{4})?)\s+\d\d/\d\d/\d{4}\b
The pattern matches:
\b A word boundary to prevent a partial word match
( Capture group 1
[a-zA-Z]+ Match 1+ ASCII letters
(?:[/-][A-Za-z]+)+ Repeat 1+ times: either - or /, followed by 1+ letters
| Or
\d{7,8} Match 7-8 digits
| Or
(?: Non capture group
[a-zA-Z]+\d+ Match 1+ letters and 1+ digits
| Or
\d+\/\d\d-?\d\d Match digits / and then 2 digits, optional - and 2 digits
) Close non capture group
(?:/\d{4})? Optionally match / and 4 digits
) Close group 1
\s+\d\d/\d\d/\d{4} Match 1+ whitespace chars and a date like format
\b A word boundary
See a regex demo.
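As a rough check of the alternation pattern, the sketch below runs it over a few of the sample rows in Python and collects group 1. The pattern is copied unchanged; the variable names are only illustrative.

```python
import re

# The alternation pattern from above, split over three lines for readability.
INVOICE_RE = re.compile(
    r"\b([a-zA-Z]+(?:[/-][A-Za-z]+)+|\d{7,8}"
    r"|(?:[a-zA-Z]+\d+|\d+\/\d\d-?\d\d)(?:/\d{4})?)"
    r"\s+\d\d/\d\d/\d{4}\b"
)

rows = [
    "91504458 26/04/2022",
    "TYRES/RE-EXPORT 23/09/2022",
    "SAM0112/2022 23/05/2021",
    "020/22-23 17/02/2022",
    "017/2022 08/08/2022",
]

# Group 1 holds the invoice number; rows that do not match are skipped.
invoice_numbers = [m.group(1) for row in rows if (m := INVOICE_RE.search(row))]
print(invoice_numbers)
```

Because every allowed format is listed explicitly, a row with an invoice number in a new shape would simply not match, which may or may not be what you want.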
Assuming that the dates would always end each row, you could try doing a regex replacement:
Find: \s*\b\d{2}/\d{2}/\d{4}$
Replace: (empty)
Demo
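The same find-and-replace idea can be sketched in Python with re.sub, assuming each row is processed as its own string so that $ refers to the end of that row:

```python
import re

# Trailing whitespace plus a DD/MM/YYYY-shaped date at the end of the row.
DATE_SUFFIX_RE = re.compile(r"\s*\b\d{2}/\d{2}/\d{4}$")

rows = [
    "91504458 26/04/2022",
    "TYRES/REEXPORT 04/07/2022",
    "91/22-23 07/07/2022",
]

# Replace the date suffix with nothing, keeping whatever precedes it.
stripped = [DATE_SUFFIX_RE.sub("", row) for row in rows]
print(stripped)
```

This version makes no assumption about the invoice number format, only about the date at the end.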

Regex exclude whitespaces from a group to select only a number

I need to take only a number (a float number) from a text, but I can't remove the whitespaces...
** Update
I have a problem with this method, I only need to consider numbers and ',' between '- EUR' and 'Fee' as rule.
You can use
- EUR\W*(.*?)\W*Fee
See the regex demo.
Variations of the regex that might work in different regex engines:
- EUR\W*\K.*?(?=\W*Fee)
(?<=- EUR\W*).*?(?=\W*Fee)
Details:
- EUR - literal text
\W* - zero or more non-word chars
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\W*- zero or more non-word chars
Fee - a string.
You could also match the number format in capture group 1
- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b
- EUR\b Match - EUR and a word boundary
\D* Match 0+ times any char except a digit
( Capture group 1
\d+(?:,\d+)? Match 1+ digits with an optional decimal part
) Close group 1
\s+Fee\b Match 1+ whitespace chars, Fee and a word boundary
Regex demo
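Since the original test string is not shown here, the following Python sketch uses a made-up input (the sample value is purely an assumption) to compare the lazy capture with the stricter number-format capture:

```python
import re

# Hypothetical input: the real test string is not part of this thread.
sample = "Payment - EUR   12,50   Fee"

# Lazy capture: anything between the surrounding non-word characters.
lazy = re.search(r"- EUR\W*(.*?)\W*Fee", sample).group(1)

# Stricter capture: digits with an optional comma-separated decimal part.
strict = re.search(r"- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b", sample).group(1)

# Convert the comma decimal separator before parsing as a float.
value = float(strict.replace(",", "."))
print(lazy, strict, value)
```

Note that the \K and variable-length lookbehind variants listed above are not supported by Python's built-in re module, so only the capture-group forms are used here.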
This is working: I removed the , from (.) in the test string.
Regex example - working

How to get only the first match of a regex Grok filter

goal
I want to retrieve only this string "14" from this message with a logstash Grok
3/03/0 EE 14 GFR 20 AAA XXXXX 50 3365.00
this is my grok code
grok{
match => {
field1 => [
"(?<number_extract>\d{0}\s\d{1,3}\s{1})"
]
}
}
I would like to match just the first match "14" but my Grok filter returns all matches:
14 20 50
If you need to find the first occurrence of a number that consists of 1, 2 or 3 digits only, you may use
^(?:.*?\s)?(?<number_extract>\d{1,3})(?!\S)
Details
^ - start of string
(?:.*?\s)? - an optional substring of any 0+ chars other than line break chars as few as possible, and then a whitespace (this enables a match at the start of the string if it is there)
(?<number_extract>\d{1,3}) - 1 to 3 digits
(?!\S) - a negative lookahead that makes sure there is a whitespace or end of string immediately to the right (enables a match at the end of the string).
Alternative solution
If you know that the number you are looking for is after a date-like field and another field, and you want to force this pre-validation, you may use
^\d+/\d+/\d+\s+\S+\s+(?<number_extract>\d+)
See the regex demo
If you do not have to check if the first field is date-like, you may simply use
^\S+\s+\S+\s+(?<number_extract>\d+)
^(?:\S+\s+){2}(?<number_extract>\d+) // Equivalent
See the regex demo here.
Details
^ - start of string
\d+/\d+/\d+ - 1+ digits, /, 1+ digits, /, 1+ digits
\s+ - 1+ whitespaces
\S+ - 1+ chars other than whitespace
\s+ - 1+ whitespaces
(?<number_extract>\d+) - Capturing group "number_extract": 1+ digits.
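To try these patterns outside Logstash, here is a small Python sketch run against the message from the question. It is only an approximation of Grok: Grok uses the Oniguruma engine with (?<name>...) groups, while Python's re needs (?P<name>...), so the group syntax is adapted.

```python
import re

message = "3/03/0 EE 14 GFR 20 AAA XXXXX 50 3365.00"

# Same three patterns as above, with Python-style named groups.
patterns = [
    r"^(?:.*?\s)?(?P<number_extract>\d{1,3})(?!\S)",
    r"^\d+/\d+/\d+\s+\S+\s+(?P<number_extract>\d+)",
    r"^(?:\S+\s+){2}(?P<number_extract>\d+)",
]

# Each pattern is anchored at the start, so only the first number is captured.
results = [re.search(p, message).group("number_extract") for p in patterns]
print(results)
```

The ^ anchor is what limits the result to a single match: the engine cannot restart the search further along the string.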

Regex to Capture rest of the line

I have a regex that captures the following expression
XPT 123A
Now I need to add "something" to my regex to capture the remaining string as a group
XPT 123A I AM VERY HAPPY
So XPT would be group 1, 123A group 2, and I AM VERY HAPPY group 3.
Here is my regex (also here http://regexr.com/4mocf):
^([A-Z]{2,4}).((?=\d)[a-zA-Z\d]{0,4})
EDIT:
I don't want to name my groups (editing because some people thought it was a duplicate of another question)
Assuming Group 3 is optional, you may use
^([A-Z]{2,4}) (\d[a-zA-Z\d]{0,3})(?: (.*))?$
^([A-Z]{2,4})\s+(\d[a-zA-Z\d]{0,3})(?:\s+(.*))?$
The \s+ matches any 1+ whitespace chars.
See the regex demo.
Details
^ - start of string
([A-Z]{2,4}) - Group 1: two, three or four uppercase ASCII letters
\s+ - 1+ whitespaces
(\d[a-zA-Z\d]{0,3}) - Group 2: a digit followed with 0 to 3 alphanumeric chars
(?:\s+(.*))? - an optional non-capturing group matching 1 or 0 occurrences of:
\s+ - 1+ whitespaces
(.*) - Group 3: any 0+ chars other than line break chars as many as possible
$ - end of string
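A quick Python sketch of the \s+ variant, showing that Group 3 is filled when the rest of the line is present and left empty (None in Python) when it is not:

```python
import re

# Unnamed groups only, as requested in the question.
XPT_RE = re.compile(r"^([A-Z]{2,4})\s+(\d[a-zA-Z\d]{0,3})(?:\s+(.*))?$")

with_rest = XPT_RE.match("XPT 123A I AM VERY HAPPY").groups()
without_rest = XPT_RE.match("XPT 123A").groups()

print(with_rest)
print(without_rest)
```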
Just add the following suffix to your regex to capture the rest of the line:
(?<rest>.+)?$

Regex Length issue

I'm trying to build a regex where it accepts domain names with the following conditions:
Allows DNS names (only hyphens, periods and alphanumeric characters allowed) up to 255 characters.
Hyphens can only appear in between letters
Should start with a letter and end with a letter. It will have minimum 3 characters (letters and periods mandatory, hyphen is optional.)
The length of each label before a period should be at most 63 characters
Possible Cases:
a.b.c
a-a.b
Cases that should not pass
a-.b
qwertqwertqwertqwertqwertqwertqwertqwertqwertqwertqwertqwertqwerhhg.v
aaaa
aaa-a
What I have built looks like this:
^(([a-zA-z0-9][A-Z0-9a-z-]{1,61}[a-zA-Z0-9][.])+[a-zA-Z0-9]+)$
But this does not accept a.b.c
You may use
^(?=.{1,255}$)(?=[^.]{1,63}(?![^.]))[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*(?:[.](?=[^.]{1,63}(?![^.]))[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)+(?:[.][a-zA-Z0-9-]*[a-zA-Z0-9])?$
See the regex demo here.
Pattern details
^ - start of string
(?=.{1,255}$) - the whole string should have 1 to 255 chars
(?=[^.]{1,63}(?![^.])) - the current label must consist of 1 to 63 chars other than ., followed by a . or the end of string
[a-zA-Z0-9]+ - 1 or more alphanumeric chars
(?: - start of a non-capturing group:
- - a hyphen
[a-zA-Z0-9]+ - 1+ alphanumeric chars
)* - zero or more repetitions
(?: - start of a non-capturing group...
[.] - a dot
(?=[^.]{1,63}(?![^.])) - the current label must consist of 1 to 63 chars other than ., followed by a . or the end of string
[a-zA-Z0-9]+ - 1+ alphanumeric chars
(?:-[a-zA-Z0-9]+)* - 0 or more repetitions of a - followed with 1+ alphanumeric chars
)+ -... 1 or more times
(?: - start of a non-capturing group...
[.] - a dot
[a-zA-Z0-9-]* - 0 or more alphanumeric or - chars
[a-zA-Z0-9] - an alphanumeric char (no hyphens at the end)
)? -... 1 or 0 times (it is optional)
$ - end of string.
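The pattern can be exercised against the cases from the question with a short Python sketch. The is_valid helper is a hypothetical name, and the over-long label is built programmatically (64 characters) rather than copied from the question, so its length is certain.

```python
import re

# The pattern from above, split over three lines for readability.
DNS_RE = re.compile(
    r"^(?=.{1,255}$)(?=[^.]{1,63}(?![^.]))[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*"
    r"(?:[.](?=[^.]{1,63}(?![^.]))[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)+"
    r"(?:[.][a-zA-Z0-9-]*[a-zA-Z0-9])?$"
)

def is_valid(name: str) -> bool:
    """Hypothetical helper: True when the whole name matches the pattern."""
    return DNS_RE.match(name) is not None

should_pass = ["a.b.c", "a-a.b", "a" * 63 + ".v"]
should_fail = ["a-.b", "aaaa", "aaa-a", "a" * 64 + ".v"]

accepted = [n for n in should_pass if is_valid(n)]
rejected = [n for n in should_fail if not is_valid(n)]
print(accepted, rejected)
```

The 63 versus 64 character pair shows the per-label lookahead doing its job: the label length is the only difference between the two names.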
You can use the following regex:
/^(?=[A-Z])((?:[A-Z\d]|(?<=[A-Z])-(?=[A-Z])){1,63})(?<=[A-Z])(?:\.[A-Z\d]+){1,2}$/im
Details:
^ - Start of the string.
(?=[A-Z]) - Positive lookahead: The whole string must start with a letter.
( - A capturing group - the domain name.
(?: - Start of a non-capturing group, needed due to the following quantifier.
[A-Z\d] - The first alternative: Either a letter or a digit.
| - Or.
(?<=[A-Z])-(?=[A-Z]) - The second alternative: A hyphen, preceded with a letter
and followed with a letter.
) - End of the non-capturing group.
{1,63} - This group (either alternative) must occur 1 to 63 times.
) - End of the capturing group.
(?<=[A-Z]) - Positive lookbehind: The capturing group just matched (domain name)
must end with a letter.
(?: - A non-capturing group, also needed due to the following quantifier.
\.[A-Z\d]+ - A dot and a sequence of letters or digits.
) - End of the non-capturing group.
{1,2} - This group must occur 1 or 2 times.
$ - End of the string.
You should definitely use i (case insensitive) option and if you check
a number of strings, each in a separate row, also m (multiline) option.
I didn't include any test for the whole length, but you didn't include it either.
I think, the main task here was to show how to match the case your regex failed.
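For illustration, here is how the i and m options map onto Python, where the /.../im delimiters become the re.IGNORECASE and re.MULTILINE flags. This is a sketch with the question's test cases placed one per row; the variable names are only illustrative.

```python
import re

# Same pattern as above; the i and m options are passed as flags.
DOMAIN_RE = re.compile(
    r"^(?=[A-Z])((?:[A-Z\d]|(?<=[A-Z])-(?=[A-Z])){1,63})"
    r"(?<=[A-Z])(?:\.[A-Z\d]+){1,2}$",
    re.IGNORECASE | re.MULTILINE,
)

# One candidate per row, so ^ and $ must match at line boundaries.
text = "a.b.c\na-a.b\na-.b\naaaa\naaa-a"

# group(0) is the whole matched row, not just the captured first label.
matches = [m.group(0) for m in DOMAIN_RE.finditer(text)]
print(matches)
```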