Regular expression enforcing at least one in two groups - regex

I have to parse a string using regular expressions in which at least one group in a set of two is required. I cannot figure out how to write this case.
To illustrate the problem, consider parsing this case:
String: aredhouse theball bluegreencar the
Match: ✓ ✓ ✓ ✗
Items are separated by spaces
Each item is composed of an article, a colour and an object, defined by groups in the following expression: (?P<article>the|a)?(?P<colour>(red|green|blue|yellow)*)(?P<object>car|ball|house)?\s*
An item can have an 'article' but must have a 'colour' and/or an 'object'.
Is there a way of making 'article' optional but require at least one 'colour' or 'object' using regular expressions?
Here is the coded Go version of this example, however I guess this is generic regexp question that applies to any language.

This works with your test cases.
/
(?P<article>the|a)? # optional article
(?: # non-capture group, mandatory
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
(?P<object>car|ball|house) # followed by 1 object
| # OR
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
| # OR
(?P<object>car|ball|house) # 1 object
) # end group
/x
It can be reduced to:
/
(?P<article>the|a)? # optional article
(?: # non-capture group, mandatory
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
(?P<object>car|ball|house)? # followed by optional object
| # OR
(?P<object>car|ball|house) # 1 object
) # end group
/x
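For readers who want to try the reduced pattern, here is a minimal sketch in Python. Python's re module rejects a pattern that defines the same group name twice, so the second object group is renamed to object_only (a hypothetical name, not part of the answer) and merged back in code. The helper parse_item is likewise only illustrative.

```python
import re

# Reduced pattern from the answer. Python's re does not allow duplicate
# group names, so the stand-alone object gets its own (hypothetical) name.
ITEM = re.compile(r"""
    (?P<article>the|a)?                         # optional article
    (?:
        (?P<colour>(?:red|green|blue|yellow)+)  # one or more colours
        (?P<object>car|ball|house)?             # optionally followed by an object
      |
        (?P<object_only>car|ball|house)         # or an object on its own
    )
""", re.VERBOSE)


def parse_item(item):
    """Return the parts of one item, or None if the item is invalid."""
    m = ITEM.fullmatch(item)
    if m is None:
        return None
    return {
        "article": m.group("article"),
        "colour": m.group("colour"),
        "object": m.group("object") or m.group("object_only"),
    }


results = [parse_item(s) for s in "aredhouse theball bluegreencar the".split()]
for r in results:
    print(r)
```

The last item, "the", comes back as None because neither a colour nor an object follows the article.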

In regex, there are a few special quantifiers that indicate the expected number of matches for a character or a group:
* - zero or more
+ - one or more
? - zero or one
These applied, your regex looks like this:
(?P<article>(the|a)?)(?P<colour>(red|green|blue|yellow)+)(?P<object>(car|ball|house)+)\s*
None or one article.
One or more colors.
Finally one or more objects.

Related

Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I want to parse out each key-value pair in the string that follows the key(value) format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could appear without () like such: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"
You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
Note that I restructured it so that each optional field is a single optional group: inside the group, the leading whitespace and the field itself are mandatory, and the group as a whole is made optional. This makes matching more efficient.
The main thing changed is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, it matches
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": ( (matched with \(), then zero or more repetitions of any char other than ( and ) or the Group "senderpartyname" pattern recursed ( see (?:[^()]++|\g<senderpartyname>)*) and then a ) char (matched with \))
)? - end of the group, one or zero repetitions (optional)
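The recursive \g<senderpartyname> call relies on the Oniguruma engine behind grok. As a rough illustration of the same idea outside grok, the sketch below uses Python's built-in re module, which has no recursion, so it swaps in a pattern that tolerates only one level of nested parentheses. The names PAIR and parse_fields are hypothetical.

```python
import re

# key(value) pairs where the value may contain ONE level of nested (...).
# This is a non-recursive stand-in for the recursive pattern above.
PAIR = re.compile(r"(\w+)\(((?:[^()]|\([^()]*\))*)\)")


def parse_fields(line):
    """Return a dict of every key(value) pair found in a log line."""
    return dict(PAIR.findall(line))


line = (
    "2021/05/23 16:01:26.094 High Messaging.Message.Delivered "
    "Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) "
    "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
    "ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)"
)

print(parse_fields(line)["SenderPartyName"])
```

Unlike the grok group, this captures the value without its surrounding parentheses, and it works the same way when the name contains no parentheses at all.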

Regex for returning multiple values between strings separated by new line

I'm using PowerShell to read output from an executable and need to parse the output into an array. I've tried regex101 and I'm getting close, but I'm not able to return everything.
Identity type: group
Group type: Generic
Project scope: PartsUnlimited
Display name: [PartsUnlimited]\Contributors
Description: {description}
5 member(s):
[?] test
[A] [PartsUnlimited]\PartsUnlimited-1
[A] [PartsUnlimited]\PartsUnlimited-2
[?] test2
[A] [PartsUnlimited]\PartsUnlimited 3
Member of 3 group(s):
e [A] [org]\Project Collection Valid Users
[A] [PartsUnlimited]\Endpoint Creators
e [A] [PartsUnlimited]\Project Valid Users
I need returned an array of:
test
[PartsUnlimited]\PartsUnlimited-1
[PartsUnlimited]\PartsUnlimited-2
test2
[PartsUnlimited]\PartsUnlimited 3
At first I tried:
$pattern = "(?<=\[A|\?\])(.*)"
$matches = ([Regex]$pattern).Matches(($output -join "`n")).Value
But that will return also the "Member of 3 group(s):" section which I don't want.
I also can only get the first value under 5 member(s) with (?<=member\(s\):\n).*?\n ([?] test).
No matches are returned when I add in a positive lookahead: (?<=member\(s\):\n).*?\n(?=Member).
I feel like I'm getting close, just not sure how to handle multiple \n and get strings in between strings if that's needed.
You could do it in two steps (not sure if \G is supported in PowerShell).
The first step would be to separate the block in question with
^\d+\s+member.+[\r\n]
(?:.+[\r\n])+
With the multiline and verbose flags, see a demo on regex101.com.
On this block we then need to perform another expression such as
^\s+\[[^][]+\]\s+(.+)
Again with the multiline flag enabled, see another demo on regex101.com.
The expressions explained:
^\d+\s+member.+[\r\n] # start of the line (^), digits,
# spaces, "member", anything else + newline
(?:.+[\r\n])+ # match any consecutive line that is not empty
The second would be
^\s+ # start of the line (multiline mode), whitespaces
\[[^][]+\]\s+ # [...] (anything allowed within the brackets),
# whitespaces
(.+) # capture the rest of the line into group 1
If \G was supported, you could do it in one rush:
(?:
\G(?!\A)
|
^\d+\s+member.+[\r\n]
)
^\s+\[[^][]*\]\s+
(.+)
[\r\n]
See a demo for the latter on regex101.com as well.
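The two-step approach can be sketched in Python as follows. The sample text is a reconstruction: it assumes the tool indents the member lines and puts a blank line between sections, which both expressions rely on but which the flattened listing in the question no longer shows. The names BLOCK, ENTRY and members are hypothetical.

```python
import re

# Assumed layout: indented member lines, blank line between sections.
output = "\n".join([
    r"Description: {description}",
    r"5 member(s):",
    r"  [?] test",
    r"  [A] [PartsUnlimited]\PartsUnlimited-1",
    r"  [A] [PartsUnlimited]\PartsUnlimited-2",
    r"  [?] test2",
    r"  [A] [PartsUnlimited]\PartsUnlimited 3",
    r"",
    r"Member of 3 group(s):",
    r"e [A] [org]\Project Collection Valid Users",
    r"  [A] [PartsUnlimited]\Endpoint Creators",
    r"",
])

# Step 1: the "N member(s):" line plus every non-empty line after it.
BLOCK = re.compile(r"^\d+\s+member.+[\r\n](?:.+[\r\n])+", re.MULTILINE)
# Step 2: within that block, skip the leading [..] flag and keep the rest.
ENTRY = re.compile(r"^\s+\[[^\]\[]+\]\s+(.+)", re.MULTILINE)


def members(text):
    """Return the member names listed under the 'N member(s):' header."""
    block = BLOCK.search(text)
    return ENTRY.findall(block.group(0)) if block else []


for name in members(output):
    print(name)
```

Because the first expression stops at the first empty line, the "Member of 3 group(s):" section never reaches the second step.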

Regular Expression to match many coordinate formats

I am working on a regex that will match many different types of location coordinates. So far it matches about 90% of the formats:
([SNsn][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([SNsn]?)[^\\dSNsnEWew]+([EWew][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([EWew]?)
Testing the formats:
N 45° 55.732 W 122° 29.882
N 047° 38.938', W 122° 20.887'
40.123, -74.123
40.123° N 74.123° W
40° 7´ 22.8" N 74° 7´ 22.8" W
40° 7.38’ , -74° 7.38’
N40°7’22.8, W74°7’22.8"
40°7’22.8"N, 74°7’22.8"W
40 7 22.8, -74 7 22.8
40.123 -74.123
40.123°,-74.123°
144442800, -266842800
40.123N74.123W
4007.38N7407.38W
40°7’22.8"N, 74°7’22.8"W
400722.8N740722.8W
N 40 7.38 W 74 7.38
40:7:23N,74:7:23W
40:7:22.8N 74:7:22.8W
40°7’23"N 74°7’23"W
40°7’23" -74°7’23"
40d 7’ 23" N 74d 7’ 23" W
40.123N 74.123W
40° 7.38, -74° 7.38
Testing if it works: https://regexr.com/3ivu2
As you can see there are issues with the spaces and commas that are causing the regex to not match some of these formats.
I am trying to match the coordinate strings so that they can be highlighted in my iOS app and allow the user to tap them.
What can I do to update the regex and fix the matching issues?
Overview
I'm sure there are many ways to go about this. Since you haven't specified a regex engine or programming language, I'll post one pattern that works in PCRE and one that should work in most engines. The PCRE regex is much easier to understand than the non-PCRE regex, but both use the exact same logic.
The patterns defined below match each string you've presented in your question and properly separate each part of the coordinate (x, y).
Code
PCRE
This method uses the DEFINE construct to pre-define patterns. The beauty of this construct is that you can define reusable parts of your regex in one location, thus, you can edit most of the regex just by editing these subpatterns.
See regex in use here
(?(DEFINE)
(?<ns>[ns])
(?<ew>[ew])
(?<d>[°´’'"d:])
(?<n>[+-]?\d+(?:\.\d+)?)
)
(
(?&ns)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ns)?
)
\ ?,?\ ?
(
(?&ew)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ew)?
)
Flags: gix
Non-PCRE
See regex in use here
(
[ns]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ns]?
)
\ ?,?\ ?
(
[ew]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ew]?
)
Flags: gix.
Some engines don't have the x flag. For those engines you can use the following one-liner (as seen here):
([ns]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ns]?) ?,? ?([ew]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ew]?)
Explanation
Since both patterns are essentially the same (the non-PCRE one is just an expanded version of the PCRE one), I'll explain the PCRE pattern since it's easier to grasp.
Note that the patterns that use x have escaped spaces since they would otherwise be ignored (x ignores whitespace within the pattern). The i flag allows us to match text regardless of case (i makes our pattern case-insensitive).
DEFINE
(?(DEFINE)...) The DEFINE group is never matched against the input itself. It works like a variable declaration (name=value): each named subpattern inside it can be recalled later by its name.
(?<ns>[ns]) The group ns matches any character in the set nsNS
(?<ew>[ew]) The group ew matches any character in the set ewEW
(?<d>[°´’'"d:]) The group d matches any character in the set °´’'"d:
(?<n>[+-]?\d+(?:\.\d+)?) The group n matches any number that matches the following structure
[+-]? Optionally match any character in the set +-
\d+ Match one or more digits
(?:\.\d+)? Optionally match a decimal point followed by one or more digits
Pattern
The pattern is composed of 3 larger parts. The first and last are capture groups (the coordinates themselves) and the second is what separates the two.
Capture 1:
(?&ns)? Optionally match the group ns
(?:\ ?(?&n)(?&d)?){1,3} Matches [an optional space, followed by the group n then optionally group d] between one and three times
\ ?(?&ns)? Optionally match a space, optionally match the group ns
\ ?,?\ ? Match an optional space, comma and space (this separates each coordinate part)
Capture 2: This is the same as Capture 1 but replaces the group ns with the group ew
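As a rough sketch of the same structure in an engine without DEFINE, the snippet below uses Python's re module and composes the pattern from string constants, which play the role of the named subpatterns. The names NUM, UNIT, PART, COORD and split_coordinate are hypothetical, and the captured halves are stripped because the optional space before the hemisphere letter can end up inside a capture.

```python
import re

NUM = r"[+-]?\d+(?:\.\d+)?"          # same as the subpattern n
UNIT = r"""[°´’'"d:]"""              # same as the subpattern d
PART = rf"(?: ?{NUM}{UNIT}?){{1,3}}"  # one to three number(+unit) pieces

COORD = re.compile(
    rf"([ns]?{PART} ?[ns]?) ?,? ?([ew]?{PART} ?[ew]?)",
    re.IGNORECASE,
)


def split_coordinate(text):
    """Return (first half, second half) of a coordinate pair, or None."""
    m = COORD.search(text)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()


print(split_coordinate("40.123, -74.123"))
print(split_coordinate("N 45° 55.732 W 122° 29.882"))
print(split_coordinate('40°7’22.8"N, 74°7’22.8"W'))
```

Inputs whose two halves are separated by nothing but a space give the greedy first group room to take more than its share, so it is worth checking those formats against your own data before relying on the split.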
This simplified regex literally matches all the patterns you've given:
^((?:[NW]? ?(?:[-\d.d]+[NW:°´’'",]?[ NW]?)+[, ]*)+[NW]?)$
I'm not an expert on coordinates, but you can modify it easily if I didn't take some specifics into account.
A full test is here.

How to parse a path with regex - optional fields

I am using the following regex: (example here: https://regex101.com/r/dVTUrM/1)
\/(?<field1>.{4})\/(?<field2>.*?)\/(?<field3>.*?)\/(?<field4>.*?)\/(?<field5>.*?)\/(?<field6>.*)
to parse the following text:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}/9/abc.tif
giving the following result:
Group `field1` 9-13 `A49E`
Group `field2` 14-46 `18DA-6FAB-4921-8AEB-45A07B162DA5`
Group `field3` 47-85 `{E3646FA1-4652-45E9-885A-3756FC574057}`
Group `field4` 86-124 `{F1864679-1D9D-4084-B38D-231D793AA15D}`
Group `field5` 125-126 `9`
Group `field6` 127-134 `abc.tif`
But if field5 and field6 are missing:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}
I would like this to work and for field5 and field6 to be blank.
Is this possible by modifying the regex statement?
Note: only field6 may be missing as well.
Here you go:
(?x)^pyramid:
/(?P<field1>[^/]{4})
/(?P<field2>[^/]+)
/(?P<field3>[^/]+)
/(?P<field4>[^/]+)
(?:
/(?P<field5>[^/]+)
/(?P<field6>[^/]+)
)?
See a demo on regex101.com.
Or, in short (without the verbose flag):
^pyramid:/(?P<field1>[^/]{4})/(?P<field2>[^/]+)/(?P<field3>[^/]+)/(?P<field4>[^/]+)(?:/(?P<field5>[^/]+)/(?P<field6>[^/]+))?
Depending on the programming language / flavour used, you might use other delimiters like ~ so that you don't need to escape the forward slashes anymore. The (?: ... ) construct is a non capturing group which is made optional with ? to allow 4 or 6 (but not five!) fields.
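Since the pattern already uses the (?P<name>...) syntax, it can be tried directly in Python. The sketch below is only an illustration: the names PATH and parse_path are hypothetical, and missing fields come back as None rather than as empty strings.

```python
import re

PATH = re.compile(r"""
    ^pyramid:
    /(?P<field1>[^/]{4})
    /(?P<field2>[^/]+)
    /(?P<field3>[^/]+)
    /(?P<field4>[^/]+)
    (?:                      # fields 5 and 6 are optional as a pair
        /(?P<field5>[^/]+)
        /(?P<field6>[^/]+)
    )?
""", re.VERBOSE)


def parse_path(path):
    """Return a dict of field1..field6, or None if the path does not match."""
    m = PATH.match(path)
    return m.groupdict() if m else None


full = ("pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/"
        "{E3646FA1-4652-45E9-885A-3756FC574057}/"
        "{F1864679-1D9D-4084-B38D-231D793AA15D}/9/abc.tif")
short = ("pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/"
         "{E3646FA1-4652-45E9-885A-3756FC574057}/"
         "{F1864679-1D9D-4084-B38D-231D793AA15D}")

print(parse_path(full))
print(parse_path(short))
```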

Removing everything except a "part" of the string

Here is the string, a full example:
('1416851040', '1416851040', '50.62.177.118', '84.161.97.189', 'humpy_electro', 393883, '385962628'),
('1416851046', '1416851046', '2607:5300:60:6097::', '80.187.100.105', 'lagbugdc', 393884, '737537953'),
('1416851067', '1416851067', '174.66.174.101', '98.148.244.151', 'maihym', 393885, '1473193487'),
('1416851094', '1416851094', '2607:5300:60:6097::', '92.157.2.230', 'xeosse26', 393886, '737537953'),
I'd like to remove -EVERYTHING- from it except: facebook:jens.pettersson.7568
(the username slot)
And where facebook:jens.pettersson.7568 is actually 'facebook:jens.pettersson.7568', I'd like it to appear as:
facebook:jens.pettersson.7568 (see the white space there?)
Then sort my list where all 361k lines line up like so:
x x xx xcx xzx xyx xtz
All with spaces, in technically 1 line, if possible.
Or, if removing everything else and just collecting the one field I need would suffice, I could do the sorting manually, I suppose.
I'm going to read between the lines and guess that what you want is this:
BEFORE:
('1416851040', '1416851040', '50.62.177.118', '84.161.97.189', 'humpy_electro', 393883, '385962628'),
^ this is username
AFTER:
facebook:humpy_electro
You could handle that with the following regex:
s/(?:[^,]*,){4}[\s'"]*([^'",]*).*/facebook:$1, /
i.e.
(?: # begin non-capturing group
[^,]*, # zero or more non-comma characters, followed by a comma
){4} # end non-capturing group, and repeat 4 times
# this skips the first 4 columns of data
[\s'"]* # matches any whitespace and the first quote
( # begin capturing group 1
[^'",]* # capture everything up to the closing quote (no quotes or commas)
) # end capturing group 1
.* # match rest of line
# REPLACE WITH
facebook: # literal text
$1 # capturing group 1
, # comma and a trailing space (not shown here)
And voila.
This turns this:
('1416851040', '1416851040', '50.62.177.118', '84.161.97.189', 'humpy_electro', 393883, '385962628'),
('1416851046', '1416851046', '2607:5300:60:6097::', '80.187.100.105', 'lagbugdc', 393884, '737537953'),
('1416851067', '1416851067', '174.66.174.101', '98.148.244.151', 'maihym', 393885, '1473193487'),
('1416851094', '1416851094', '2607:5300:60:6097::', '92.157.2.230', 'xeosse26', 393886, '737537953'),
Into this
facebook:humpy_electro, facebook:lagbugdc, facebook:maihym, facebook:xeosse26,
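The same substitution can be sketched in Python, assuming the rows are available as a list of strings. The names USERNAME and usernames_line are hypothetical, and \1 is Python's spelling of the $1 backreference.

```python
import re

# Skip four comma-terminated columns, then capture the fifth (the username).
USERNAME = re.compile(r"""(?:[^,]*,){4}[\s'"]*([^'",]*).*""")


def usernames_line(rows):
    """Rewrite each row to 'facebook:<username>, ' and join them into one line."""
    return "".join(USERNAME.sub(r"facebook:\1, ", row) for row in rows)


rows = [
    "('1416851040', '1416851040', '50.62.177.118', '84.161.97.189', 'humpy_electro', 393883, '385962628'),",
    "('1416851046', '1416851046', '2607:5300:60:6097::', '80.187.100.105', 'lagbugdc', 393884, '737537953'),",
]

print(usernames_line(rows))
```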
I got it working with help from a friend, in two steps. First step: replace ^((.? '){4}) with nothing. Second step: replace '((.?$){1}) with nothing.