Regex a Block of Yaml Data - regex

I'm currently using regex101 to work out the following: I'd like to capture a full item's data, for example name_template_2 and its associated description, define and write values.
Here's my data model
templates:
  name_template:
    description: test_description
    define: yes
    write: true
  name_template_2:
    description: test_description2
    define: false
    write: true
I can capture the lines I need with the following
^[[:space:]][[:space:]][[:space:]][[:space:]].*
and
^[[:space:]][[:space:]]name_template_2:
but I am unable to join both patterns together to filter just the key and data related to name_template_2. The more I read online, the less I understand. Has anyone achieved this before, or is there a more efficient way of doing this?

Using 2 capture groups:
^[^\S\n]{2}(name_template_2:)((?:\n[^\S\n]{4}\S.*)+)
Explanation
^ Start of the line (with the multiline flag enabled)
[^\S\n]{2} Match 2 whitespace chars other than a newline
(name_template_2:) Match the string and capture in group 1
( Capture group 2
(?: Non capture group
\n Match a newline
[^\S\n]{4} Match 4 whitespace chars other than a newline
\S.* Match a non whitespace char and the rest of the line
)+ Repeat the non capture group 1 or more times
) Close group 2
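For anyone who wants to try the pattern outside regex101, here is a minimal sketch in Python. It assumes the YAML is indented with two spaces per level (which is what the patterns in the question imply) and that the multiline flag is enabled so ^ can match at the start of any line. The variable names are only illustrative.

```python
import re

# Sample data, assuming two-space indentation per level.
data = """templates:
  name_template:
    description: test_description
    define: yes
    write: true
  name_template_2:
    description: test_description2
    define: false
    write: true
"""

# re.MULTILINE lets ^ match at the start of each line,
# not only at the start of the whole string.
pattern = re.compile(
    r"^[^\S\n]{2}(name_template_2:)((?:\n[^\S\n]{4}\S.*)+)",
    re.MULTILINE,
)

match = pattern.search(data)
key = match.group(1)
# Group 2 begins with a newline, so strip it before splitting into lines.
values = [line.strip() for line in match.group(2).strip("\n").split("\n")]

print(key)
print(values)
```

Group 1 holds the key and group 2 holds the indented lines that follow it, so the two pieces can be handled separately.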

I suggest using a structure-aware tool like yq to manipulate YAML and not using regular expressions.
#!/bin/bash
INPUT='
templates:
  name_template:
    description: test_description
    define: yes
    write: true
  name_template_2:
    description: test_description2
    define: false
    write: true
'
echo "$INPUT" | yq '{"name_template_2": .templates.name_template_2}'
Output
name_template_2:
  description: test_description2
  define: false
  write: true

Related

Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I want to parse out each key-value pair in the string that follows the key(value) format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could also appear without (), like this: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"
You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
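Note that I restructured the pattern: instead of an optional space followed by an optional field, each optional field is now a single optional group that requires one or more whitespace characters followed by the field itself. Making the whole sequence optional, rather than its parts, makes matching more efficient.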
The main thing changed is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, it matches
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": ( (matched with \(), then zero or more repetitions of any char other than ( and ) or the Group "senderpartyname" pattern recursed ( see (?:[^()]++|\g<senderpartyname>)*) and then a ) char (matched with \))
)? - end of the group, one or zero repetitions (optional)
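To experiment with the balanced-parentheses idea outside grok, here is a rough sketch in Python. Note the swap: Python's built-in re module has no recursion, so instead of \g<senderpartyname> this sketch allows exactly one level of nested parentheses, which is enough for the sample line but is not equivalent to the recursive pattern. The names FIELD and parse_fields are made up for the illustration.

```python
import re

# key( ... ) where the value may contain plain text or ONE level of (...) groups.
# This is a simplified stand-in for the recursive pattern, not a replacement for it.
FIELD = re.compile(r"(\w+)\(((?:[^()]|\([^()]*\))*)\)")


def parse_fields(line):
    """Return a dict of key -> value for every key(value) pair in the line."""
    return dict(FIELD.findall(line))


with_parens = (
    "2021/05/23 16:01:26.094 High Messaging.Message.Delivered "
    "Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) "
    "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
    "ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)"
)
without_parens = with_parens.replace("Mcdonalds (CFH) Restaurant", "Mcdonalds Restaurant")

print(parse_fields(with_parens)["SenderPartyName"])
print(parse_fields(without_parens)["SenderPartyName"])
```

Both forms of SenderPartyName come back complete, and ReceiverPartyName is no longer swallowed into the sender value.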

Regex capture groups and use OR statement

I'm trying to create a regex that has multiple conditions separated by | (OR). I want to use capture groups, but I can't get it to work fully.
3 sample strings:
--- {source-charset: '', encoding-error-limit: '', class: stat-direct, directory: \\\myserver\C\FOLDER\SUB_FOLDER}
--- {odbc-connect-string-extras: '', server: hello.sample.com, dbname: X_DB, port: '80', class: hello, username: USERX}
--- {cleaning: 'no', filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx, dataRefreshTime: '', interpretationMode: '0'}
For each sample string I would like the regex to return:
\\\myserver\C\FOLDER\SUB_FOLDER
X_DB
//myserver/D/FOLDER/SUB_FOLDER/File name.xlsx
Basically the value after either directory:, dbname: or filename: and ending with } for one of them and , for two.
I've managed to use OR statements to get the three conditions in.
regex extract
'directory: [^}]+|dbname: [^,]+|filename: [^,]+'
That returns:
directory: \\\myserver\C\FOLDER\SUB_FOLDER}
dbname: X_DB,
filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx,
If I introduce capturing groups I only get the right return for one of the parts:
'directory: ([^}]+)|dbname: ([^,]+)|filename: ([^,]+)'
That returns:
\\\myserver\C\FOLDER\SUB_FOLDER
null
null
I've managed to get it working with a nested regex that takes the result from
'directory: [^}]+|dbname: [^,]+|filename: [^,]+'
and uses:
': ([^,}]+)'
That gives me the result I want but I would like to do this as one regex.
Any help would be greatly appreciated.
/Aron
You could use negated character classes that exclude {, } or a comma, match any of the key names in a non-capturing group, and use a single capturing group to capture the value:
{[^{]+(?:filename|directory|dbname): ([^,}]+)[^}]*}
Explanation
{ Match {
[^{]+ Match 1+ times not { using a negated character class
(?:filename|directory|dbname): Match any of the listed options followed by : and a space
( Capture group1
[^,}]+ Match 1+ times not , or }
) Close group 1
[^}]*} Match 0+ times not }, then match }
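As a quick check, the pattern can be run against the three sample strings. This is a minimal sketch in Python, with the literal braces escaped for safety; the original question does not say which tool is used, so the choice of Python and the variable names are assumptions.

```python
import re

# Same pattern as above, with the literal braces escaped.
PATTERN = re.compile(r"\{[^{]+(?:filename|directory|dbname): ([^,}]+)[^}]*\}")

samples = [
    r"--- {source-charset: '', encoding-error-limit: '', class: stat-direct, directory: \\\myserver\C\FOLDER\SUB_FOLDER}",
    r"--- {odbc-connect-string-extras: '', server: hello.sample.com, dbname: X_DB, port: '80', class: hello, username: USERX}",
    r"--- {cleaning: 'no', filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx, dataRefreshTime: '', interpretationMode: '0'}",
]

# Group 1 is the only capturing group, whichever key matched.
results = [PATTERN.search(s).group(1) for s in samples]

for value in results:
    print(value)
```

Because all three alternatives share one capturing group, the value is always in group 1 and no group comes back empty.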

Regex for returning multiple values between strings separated by new line

I'm using PowerShell to read output from an executable and need to parse the output into an array. I've tried regex101 and I'm getting close, but I can't get everything returned.
Identity type: group
Group type: Generic
Project scope: PartsUnlimited
Display name: [PartsUnlimited]\Contributors
Description: {description}
5 member(s):
[?] test
[A] [PartsUnlimited]\PartsUnlimited-1
[A] [PartsUnlimited]\PartsUnlimited-2
[?] test2
[A] [PartsUnlimited]\PartsUnlimited 3
Member of 3 group(s):
e [A] [org]\Project Collection Valid Users
[A] [PartsUnlimited]\Endpoint Creators
e [A] [PartsUnlimited]\Project Valid Users
I need an array returned containing:
test
[PartsUnlimited]\PartsUnlimited-1
[PartsUnlimited]\PartsUnlimited-2
test2
[PartsUnlimited]\PartsUnlimited 3
At first I tried:
$pattern = "(?<=\[A|\?\])(.*)"
$matches = ([Regex]$pattern).Matches(($output -join "`n")).Value
But that will return also the "Member of 3 group(s):" section which I don't want.
I also can only get the first value under 5 member(s) with (?<=member\(s\):\n).*?\n ([?] test).
No matches are returned when I add in a positive lookahead: (?<=member\(s\):\n).*?\n(?=Member).
I feel like I'm getting close, just not sure how to handle multiple \n and get strings in between strings if that's needed.
You could do it in two steps (not sure if \G is supported in PowerShell).
The first step would be to separate the block in question with
^\d+\s+member.+[\r\n]
(?:.+[\r\n])+
With the multiline and verbose flags, see a demo on regex101.com.
On this block we then need to perform another expression such as
^\s+\[[^][]+\]\s+(.+)
Again with the multiline flag enabled, see another demo on regex101.com.
The expressions explained:
^\d+\s+member.+[\r\n] # start of the line (^), digits,
# spaces, "member", anything else + newline
(?:.+[\r\n])+ # match any consecutive line that is not empty
The second would be
^\s+ # start of the line, whitespaces
\[[^][]+\]\s+ # [...] (anything allowed within the brackets),
# whitespaces
(.+) # capture the rest of the line into group 1
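The two steps can be sketched in Python as follows (the question uses PowerShell, so this only illustrates the logic, not the exact call syntax). It rests on two assumptions about the real output that the flattened sample above does not show: the member lines are indented, and a blank line separates the member list from the "Member of" section. The variable names are illustrative.

```python
import re

# Assumed layout: indented member lines, blank line before "Member of ...".
output = "\n".join([
    "Identity type: group",
    "Group type: Generic",
    "Project scope: PartsUnlimited",
    r"Display name: [PartsUnlimited]\Contributors",
    "Description: {description}",
    "5 member(s):",
    "  [?] test",
    r"  [A] [PartsUnlimited]\PartsUnlimited-1",
    r"  [A] [PartsUnlimited]\PartsUnlimited-2",
    "  [?] test2",
    r"  [A] [PartsUnlimited]\PartsUnlimited 3",
    "",
    "Member of 3 group(s):",
    r"e [A] [org]\Project Collection Valid Users",
    r"  [A] [PartsUnlimited]\Endpoint Creators",
    r"e [A] [PartsUnlimited]\Project Valid Users",
])

# Step 1: isolate the "N member(s):" header plus the non-empty lines after it.
block = re.search(r"^\d+\s+member.+[\r\n](?:.+[\r\n])+", output, re.MULTILINE)

# Step 2: within that block, capture what follows the leading [..] marker.
members = re.findall(r"^\s+\[[^\]\[]+\]\s+(.+)", block.group(0), re.MULTILINE)

print(members)
```

Restricting the second expression to the block from step 1 is what keeps the "Member of 3 group(s):" entries out of the result.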
If \G is supported, you could do it in one go:
(?:
\G(?!\A)
|
^\d+\s+member.+[\r\n]
)
^\s+\[[^][]*\]\s+
(.+)
[\r\n]
See a demo for the latter on regex101.com as well.

How can I do multiple replace using a shared backreference?

I have a need to do some data-transformation for data load compatibility. The nested key:value pairs need to be flattened and have their group id prepended to each piece of child data.
I've been trying to understand the page at "Repeating a Capturing Group vs. Capturing a Repeated Group" but can't seem to wrap my head around it.
My expression so far:
"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?
Working sample: https://regex101.com/r/Wobej7/1
I'm aware that using 1 or more intermediate steps would simplify the process but at this point I want to know if it's even possible.
Source Data Example:
"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}
Desired transformation:
{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}
TL;DR
Step 1
Search for:
("[^"]+"):\s*{[^}]*},?\K
Replace with \1
Step 2
Search for:
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
Replace with:
{\3,\1,\2}\4\n
Whole philosophy
This is not going to be a one-liner regex solution, for several reasons. The most important one is that in PCRE we can neither store part of a match to refer to it in later matches nor use unbounded lookbehinds. Fortunately, most problems of this kind can be solved in two steps.
The first step is to copy the group name to the end of its {...} block. That way the group name is available every time we transform a key:value pair into a single output line.
("[^"]+"):\s*{[^}]*},?\K
( Start of capturing group #1
"[^"]+" Match a group name
) End of CG #1
:\s*{ The group name is followed by a colon and the opening brace
[^}]*},? Move on to the end of the block
\K Throw away everything matched so far
The group name is held in the first capturing group. Because \K has reset the match to an empty string at the end of the block, replacing with the following inserts the name there:
\1
Now a block like this:
"g1": {
.
.
.
},
Appears like this one:
"g1": {
.
.
.
},"g1"
The next step is to match the key:value pairs of each block while also capturing the newly added group name at the end of the block.
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
(?: Start of a non-capturing group
"[^"]+" Try to match a group name
:\s*{ The group name is followed by a colon and the opening brace
| Or
\G(?!\A) Continue from previous match
) End of NCG
\s*("[^"]+"):\s*((?1)) Then match and capture a key:value pair
(?=[^}]*},?((?1))) Look ahead and capture the group name at the end of the block
(?|(,)|\s*}(,?).*\R*) Match the remaining characters such as commas, the closing brace or newlines
This way, every successful match yields four captures, and what matters is the order in which the replacement uses them:
{\3,\1,\2}\4\n
\3 Group name (that one added at the end of block)
\1 Key
\2 Value
\4 Comma (may be there or may not)
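If the transformation can run in a scripting language instead of a single search-and-replace, the same idea (one group name shared by several key:value pairs) can be expressed with two nested regexes. The sketch below is a different technique from the \K and \G approach above, since Python's built-in re module supports neither. It assumes the blocks are not nested and that values contain no } character, and the names GROUP and PAIR are made up for the example.

```python
import re

source = '''"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}'''

# Outer pattern: a group name and the raw body of its {...} block.
GROUP = re.compile(r'"(\w+)":\s*\{([^}]*)\}')
# Inner pattern: one "key": "value" pair inside a block body.
PAIR = re.compile(r'"([^"]+)":\s*"([^"]+)"')

rows = [
    f'{{"{group.group(1)}","{key}","{value}"}}'
    for group in GROUP.finditer(source)
    for key, value in PAIR.findall(group.group(2))
]
result = ",\n".join(rows)

print(result)
```

The outer match keeps the group name available while the inner pattern walks the pairs, which is the role the appended group name and the lookahead play in the pure-regex version.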

How to parse a path with regex - optional fields

I am using the following regex: (example here: https://regex101.com/r/dVTUrM/1)
\/(?<field1>.{4})\/(?<field2>.*?)\/(?<field3>.*?)\/(?<field4>.*?)\/(?<field5>.*?)\/(?<field6>.*)
to parse the following text:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}/9/abc.tif
giving the following result:
Group `field1` 9-13 `A49E`
Group `field2` 14-46 `18DA-6FAB-4921-8AEB-45A07B162DA5`
Group `field3` 47-85 `{E3646FA1-4652-45E9-885A-3756FC574057}`
Group `field4` 86-124 `{F1864679-1D9D-4084-B38D-231D793AA15D}`
Group `field5` 125-126 `9`
Group `field6` 127-134 `abc.tif`
But if field5 and field6 are missing:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}
I would like this to work and for field5 and field6 to be blank.
Is this possible by modifying the regex statement?
Note: it is also possible that only field6 is missing.
Here you go:
(?x)^pyramid:
/(?P<field1>[^/]{4})
/(?P<field2>[^/]+)
/(?P<field3>[^/]+)
/(?P<field4>[^/]+)
(?:
/(?P<field5>[^/]+)
/(?P<field6>[^/]+)
)?
See a demo on regex101.com.
Or, in short (without the verbose flag):
^pyramid:/(?P<field1>[^/]{4})/(?P<field2>[^/]+)/(?P<field3>[^/]+)/(?P<field4>[^/]+)(?:/(?P<field5>[^/]+)/(?P<field6>[^/]+))?
Depending on the programming language / flavour used, you might use other delimiters like ~ so that you don't need to escape the forward slashes anymore. The (?: ... ) construct is a non capturing group which is made optional with ? to allow 4 or 6 (but not five!) fields.
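Here is a minimal sketch of the verbose pattern in Python, which accepts the (?P<name>...) syntax as written. The question does not state the language in use, so Python and the helper name parse_path are assumptions. groupdict(default="") is used so that missing optional fields come back as empty strings rather than None.

```python
import re

PATH = re.compile(
    r"""^pyramid:
    /(?P<field1>[^/]{4})
    /(?P<field2>[^/]+)
    /(?P<field3>[^/]+)
    /(?P<field4>[^/]+)
    (?:                      # optional tail: field5 and field6 together
        /(?P<field5>[^/]+)
        /(?P<field6>[^/]+)
    )?
    """,
    re.VERBOSE,
)


def parse_path(text):
    """Return the six fields as a dict; unmatched optional fields are ''."""
    match = PATH.match(text)
    return match.groupdict(default="") if match else None


full = ("pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/"
        "{E3646FA1-4652-45E9-885A-3756FC574057}/"
        "{F1864679-1D9D-4084-B38D-231D793AA15D}/9/abc.tif")
short = ("pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/"
         "{E3646FA1-4652-45E9-885A-3756FC574057}/"
         "{F1864679-1D9D-4084-B38D-231D793AA15D}")

print(parse_path(full))
print(parse_path(short))
```

With the short path, field5 and field6 are empty strings while field1 through field4 are filled as before.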