How can I do multiple replace using a shared backreference? - regex

I have a need to do some data-transformation for data load compatibility. The nested key:value pairs need to be flattened and have their group id prepended to each piece of child data.
I've been trying to understand the page at
Repeating a Capturing Group vs. Capturing a Repeated Group but can't seem to wrap my head around it.
My expression so far:
"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?
Working sample: https://regex101.com/r/Wobej7/1
I'm aware that using 1 or more intermediate steps would simplify the process but at this point I want to know if it's even possible.
Source Data Example:
"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}
Desired transformation:
{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}

TL;DR
Step 1
Search for:
("[^"]+"):\s*{[^}]*},?\K
Replace with \1
Live demo
Step 2
Search for:
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
Replace with:
{\3,\1,\2}\4\n
Live demo
Whole philosophy
This is not going to be a one-liner regex solution, for several reasons. The most important one is that in PCRE we can neither store part of a match to refer to it later nor use unbounded-length lookbehinds. Fortunately, most similar problems can be solved in two steps.
The first step is to copy the group name to the end of its {...} block. That way the group name is available each time we turn a match into a single line of output.
("[^"]+"):\s*{[^}]*},?\K
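( Start of capturing group #1
"[^"]+" Match a group name
) End of CG #1
:\s*{ The group name is followed by a colon, optional whitespace and an opening brace
[^}]*},? Consume everything up to the end of the block (and an optional comma)
\K Discard everything matched so far from the reported match
The group name is held in the first capturing group. Because \K leaves an empty match positioned right after the block, replacing that match with the following inserts the group name there: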
\1
Now a block like this:
"g1": {
.
.
.
},
becomes this:
"g1": {
.
.
.
},"g1"
The next step is to match the key:value pairs of each block while also capturing the group name we just appended to the end of the block.
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
(?: Start of a non-capturing group
"[^"]+" Try to match a group name
:\s*{ The group name is followed by a colon, optional whitespace and an opening brace
| Or
\G(?!\A) Continue from where the previous match ended (but not at the very start of the input)
) End of NCG
\s*("[^"]+"):\s*((?1)) Then match and capture a key:value pair; (?1) reuses the pattern of group #1 for the value
(?=[^}]*},?((?1))) Look ahead to the end of the block and capture the group name appended there
(?|(,)|\s*}(,?).*\R*) Match the remaining characters such as commas, the closing brace or newlines (a branch-reset group, so both alternatives capture into group #4)
This way each successful match yields four captures, and the order in which we emit them is the key:
{\3,\1,\2}\4\n
\3 Group name (the one added at the end of the block)
\1 Key
\2 Value
\4 Comma (may be there or may not)
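For comparison, when the transformation can be scripted rather than done in an editor, the same two-stage idea (first get hold of the group name, then walk the key:value pairs of its block) can be sketched in Python. This is only an illustration, assuming the input looks like the sample in the question, with no nested braces and no escaped quotes. The names BLOCK, PAIR and flatten are made up for this example.

```python
import re

# One block: "group": { ...body without braces... }
BLOCK = re.compile(r'"(?P<group>\w+)":\s*\{(?P<body>[^}]*)\}')
# One pair inside a block body: "key": "value"
PAIR = re.compile(r'"(?P<key>[^"]+)":\s*"(?P<value>[^"]+)"')


def flatten(text):
    """Return one {"group","key","value"} row per key:value pair."""
    rows = []
    for block in BLOCK.finditer(text):
        group = block.group("group")
        for pair in PAIR.finditer(block.group("body")):
            rows.append('{"%s","%s","%s"}' % (group, pair.group("key"), pair.group("value")))
    return ",\n".join(rows)


# Shortened version of the sample data from the question
sample = '''"g1": {
"k1": "v1",
"k2": "v2"
},
"g2": {
"k4": "v4"
}'''

print(flatten(sample))
```

Because the group name is read once per block and reused for every pair, there is no need for the intermediate "append the name to the block" step that the pure search-and-replace approach requires.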

Related

Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I want to parse out each key-value pair in the string that follows the Key(value) format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could also appear without () like this: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"
You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
Note that I revamped the pattern: inside each optional field, the leading whitespace (\s+) and the field itself are now obligatory, and the sequence as a whole is made optional, which makes matching more efficient.
The main change is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, which matches:
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": a ( char (matched with \(), then zero or more repetitions of either one or more chars other than ( and ) (matched possessively with [^()]++) or the whole Group "senderpartyname" pattern recursed (\g<senderpartyname>), and then a ) char (matched with \)). Note that the captured value therefore includes the outer parentheses.
)? - end of the group, one or zero repetitions (optional)
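The balanced-parentheses idea can also be tried outside Grok. Python's built-in re module has no pattern recursion, so the sketch below swaps in a simpler technique: it allows at most one level of nested parentheses inside a value. That is enough for the sample line but is not equivalent to the recursive pattern above. FIELD and parse_fields are names invented for this example.

```python
import re

# Key(value) where value may contain at most one level of nested (...)
FIELD = re.compile(r"(\w+)\(((?:[^()]|\([^()]*\))*)\)")


def parse_fields(line):
    """Map each Key(value) token in the line to its value."""
    return dict(FIELD.findall(line))


log = ("2021/05/23 16:01:26.094 High Messaging.Message.Delivered "
       "Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) "
       "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
       "ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)")

fields = parse_fields(log)
print(fields["SenderPartyName"])
```

Here the outer parentheses sit outside the capturing group, so the value comes back without them, both for "Mcdonalds (CFH) Restaurant Glen" and for "Mcdonalds Restaurant Glen".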

Regex for SQL Query

Hello everyone, I have the following problem:
I have a long list of SQL queries which I would like to adapt to one of my changes. Ultimately it is a renaming problem, and I'm afraid I'm making the solution more complicated than necessary.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
In the "Insert" there was another column, which I removed (I did this simply in Notepad++ by typing the search term "example, " and replacing it with an empty field). Only the corresponding entry in Values I can't remove using this method, because the text varies here. So far I have only worked with the text file in which I adjusted the list of queries.
So as you can see there is one more entry in Values than in the insertions (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove this including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th entry in the comma-separated list should be removed. However, even after some research I do not know how to implement this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach, or is there another method? Regex is the only tool I know of for this problem, since the values differ from query to query.
And how do I then apply the regex to adapt the queries (they are stored locally on my computer and are not yet included in the code)?
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)
As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bVALUES +\( - Match a word-boundary, literally 'VALUES', followed by at least a single space and a literal open parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.
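If the queries are to be rewritten by a script instead of in Notepad++, the same idea can be sketched in Python. Python's re module has neither \K nor the (?1) subroutine call, so this sketch swaps in a different technique: it builds the repeated value pattern by string concatenation and keeps the prefix in a capturing group that is put back by the replacement. It assumes values are either N'...' literals without embedded quotes or bare words such as numbers and NULL. VALUE, PATTERN and drop_14th_value are names chosen for this example.

```python
import re

# A single SQL value as in the sample: N'...' string literal or a bare word (number, NULL)
VALUE = r"\s*(?:N'[^']*'|\w+)"

# Group 1 keeps "VALUES (" plus the first 13 values; the 14th is matched outside it
PATTERN = re.compile(r"(\bVALUES\s*\(" + VALUE + r"(?:," + VALUE + r"){12})," + VALUE)


def drop_14th_value(query):
    """Remove the 14th entry of the VALUES list, including its leading comma."""
    return PATTERN.sub(r"\1", query, count=1)


# Hypothetical shortened query; note the comma inside the third string literal
query = ("INSERT t (a) VALUES (1, N'a', N'b, c', 4, 5, 6, 7, 8, 9, 10, "
         "11, 12, 13, N'delete me', 15, NULL);")
print(drop_14th_value(query))
```

Matching each value as a quoted literal or a word, rather than as "anything but a comma", keeps the count correct when a string value itself contains a comma.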

Regex for returning multiple values between strings separated by new line

I'm using PowerShell to read output from an executable and need to parse the output into an array. I've tried regex101 and I'm getting close, but I'm not able to return everything.
Identity type: group
Group type: Generic
Project scope: PartsUnlimited
Display name: [PartsUnlimited]\Contributors
Description: {description}
5 member(s):
[?] test
[A] [PartsUnlimited]\PartsUnlimited-1
[A] [PartsUnlimited]\PartsUnlimited-2
[?] test2
[A] [PartsUnlimited]\PartsUnlimited 3
Member of 3 group(s):
e [A] [org]\Project Collection Valid Users
[A] [PartsUnlimited]\Endpoint Creators
e [A] [PartsUnlimited]\Project Valid Users
I need an array returned containing:
test
[PartsUnlimited]\PartsUnlimited-1
[PartsUnlimited]\PartsUnlimited-2
test2
[PartsUnlimited]\PartsUnlimited 3
At first I tried:
$pattern = "(?<=\[A|\?\])(.*)"
$matches = ([Regex]$pattern).Matches(($output -join "`n")).Value
But that will return also the "Member of 3 group(s):" section which I don't want.
I also can only get the first value under 5 member(s) with (?<=member\(s\):\n).*?\n ([?] test).
No matches are returned when I add in a positive lookahead: (?<=member\(s\):\n).*?\n(?=Member).
I feel like I'm getting close, just not sure how to handle multiple \n and get strings in between strings if that's needed.
You could do it in two steps (not sure if \G is supported in PowerShell).
The first step would be to separate the block in question with
^\d+\s+member.+[\r\n]
(?:.+[\r\n])+
With the multiline and verbose flags, see a demo on regex101.com.
On this block we then need to perform another expression such as
^\s+\[[^][]+\]\s+(.+)
Again with the multiline flag enabled, see another demo on regex101.com.
The expressions explained:
^\d+\s+member.+[\r\n] # start of the line (^), digits,
# spaces, "member", anything else + newline
(?:.+[\r\n])+ # match any consecutive line that is not empty
The second would be
^\s+ # start of the line (multiline mode), whitespaces
\[[^][]+\]\s+ # [...] (anything except brackets inside),
# whitespaces
(.+) # capture the rest of the line into group 1
If \G is supported, you could do it in one pass:
(?:
\G(?!\A)
|
^\d+\s+member.+[\r\n]
)
^\s+\[[^][]*\]\s+
(.+)
[\r\n]
See a demo for the latter on regex101.com as well.
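The two-step approach (isolate the member block, then pull the names out of it) can be sketched in Python as follows. It assumes the output has been joined with "\n" as in the question, that the section after the members starts with "Member of N group", and that entries may or may not be indented, since the original indentation is not visible here. BLOCK, ENTRY and members are names made up for this example.

```python
import re

# Step 1: everything after the "N member(s):" header, up to the next section or end of text
BLOCK = re.compile(r"^\d+ member\(s\):\n(.*?)(?=^Member of \d+ group|\Z)", re.M | re.S)
# Step 2: a "[x] name" entry; optional indentation is an assumption about the real output
ENTRY = re.compile(r"^[ \t]*\[[^\[\]]+\][ \t]+(.+)$", re.M)


def members(output):
    """Return the names listed under the "N member(s):" header."""
    block = BLOCK.search(output)
    return ENTRY.findall(block.group(1)) if block else []


# Shortened sample modelled on the output in the question
output = r"""Display name: [PartsUnlimited]\Contributors
5 member(s):
  [?] test
  [A] [PartsUnlimited]\PartsUnlimited-1
  [?] test2
Member of 3 group(s):
e [A] [org]\Project Collection Valid Users
  [A] [PartsUnlimited]\Endpoint Creators
"""
print(members(output))
```

Restricting the second pattern to the text captured by the first is what keeps the "Member of 3 group(s):" entries out of the result.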

Using RegEx to grab a field in brackets

I have multiple square-bracketed values in each line of a Splunk log. I am trying to find a particular field named UserDataGuid and then extract the data in the bracket after it. My only option seems to be regular expressions in a flavor that looks similar to Perl's, yet my attempts do not work. What am I doing wrong here?
| rex "\]\s(?<UserDataGuid>.*?)\s*$"
// this trial looks more promising but grabs the last bracket :( and doesn't name the field, to be used in a subSearch.
| rex "(?i)UserDataGuid\s*\[([^\}]*)\]"
the data looks like this
[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] [(null)] [(null)] [(null)]
and I want only the guid
fas08f0da-faf6-4308-aad6-hfld5643gs
and I would love for it to be a field I could reuse like fields are used in splunk.
It looks like you want
(?<=UserDataGuid\s\[)([^\]]*)
I'd try the following regex:
/(?<=UserDataGuid \[).*?(?=\])/g
This will capture fas08f0da-faf6-4308-aad6-hfld5643gs. See a demo here.
With
\]\s(?<UserDataGuid>.*?)\s*$
you are saying: match a ] (\]), followed by exactly one whitespace character (\s), followed by a group named UserDataGuid ((?<UserDataGuid> ... )) that lazily matches any characters except newline, zero or more times (.*?), followed by zero or more whitespace characters (\s*), followed by the end of the string ($).
I don't think the (?<UserDataGuid> ... ) part does what you want:
you want to match the literal text UserDataGuid, not to give the name UserDataGuid to a group that lazily matches any characters (.*?).
In
(?i)UserDataGuid\s*\[([^\}]*)\]
change the } to a ], and your GUID is captured in group #1.
However, you don't need to include the "UserDataGuid\s*\[" prefix in the match.
You could use:
(?<=UserDataGuid \[)([^\]]*)
Then the match consists only of the GUID, which is also available in group #1.
You can remove the parentheses of group #1, because the group is identical to the full match:
(?<=UserDataGuid \[)[^\]]*
https://regex101.com/r/sI3kW4/1
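Both variants discussed above can be checked quickly with a small Python sketch. It uses the sample line from the question. Note that Python spells named groups (?P<name>...) rather than the (?<name>...) form used in the question, and the variable names are invented for this example. The named-group variant is included because the question asks for a reusable field. How a given tool maps group names to fields is outside what this sketch shows.

```python
import re

line = "[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] [(null)] [(null)] [(null)]"

# Variant 1: lookbehind, so the whole match is the GUID
LOOKBEHIND = re.compile(r"(?<=UserDataGuid \[)[^\]]*")
# Variant 2: match the label too, and expose the GUID as a named group
NAMED = re.compile(r"UserDataGuid\s*\[(?P<UserDataGuid>[^\]]*)\]")

guid_from_match = LOOKBEHIND.search(line).group(0)
guid_from_group = NAMED.search(line).group("UserDataGuid")
print(guid_from_match, guid_from_group)
```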

Go ReplaceAllString

I read the example code from golang.org website. Essentially the code looks like this:
re := regexp.MustCompile("a(x*)b")
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
The output is like this:
-T-T-
--xx-
---
-W-xxW-
I understand the first output, but I don't understand the other three. Can someone explain results 2, 3 and 4 to me? Thanks.
The most intriguing is the fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W")) line. The docs say:
Inside repl, $ signs are interpreted as in Expand
And Expand says:
In the template, a variable is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores.
A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.
In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.
So, in the 3rd replacement, $1W is treated as ${1W} and since this group is not initialized, an empty string is used for replacement.
When I say "the group is not initialized", I mean that the group is not defined in the regex pattern and so was never populated during the match operation. Replacing means finding all matches and then substituting each one with the replacement pattern. Backreferences ($xx constructs) are populated during the matching phase. There is no group named 1W in the pattern, so nothing was captured for it during matching, and an empty string is used in the replacement phase.
The 2nd and 4th replacements are easy to understand and have been described in the above answers. Just $1 backreferences the characters captured with the first capturing group (the subpattern enclosed with a pair of unescaped parentheses), same is with Example 4.
You can think of {} as a means to disambiguate the replacement pattern.
Now, if you need to make the results consistent, use a named capture (?P<1W>....):
re := regexp.MustCompile("a(?P<1W>x*)b") // <= See here, pattern updated
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
Results:
-T-T-
--xx-
--xx-
-W-xxW-
The 2nd and 3rd lines now produce consistent output since the named group 1W is also the first group, and $1 numbered backreference points to the same text captured with a named capture $1W.
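The "name is taken to be as long as possible" rule quoted from the docs can be imitated in a few lines of Python to see why $1W and ${1}W differ. This is a rough emulation written for illustration, not Go's implementation: it ignores details such as the $$ escape, and expand and replace_all are hypothetical helper names.

```python
import re

# $name or ${name}; name is the longest run of ASCII letters, digits and underscores
TOKEN = re.compile(r"\$(?:\{([A-Za-z0-9_]+)\}|([A-Za-z0-9_]+))")


def expand(template, match):
    """Fill $name / ${name} references from a match; unknown names become ''."""
    def lookup(token):
        name = token.group(1) or token.group(2)
        if name.isdigit():
            index = int(name)
            known = index <= match.re.groups
            return (match.group(index) or "") if known else ""
        known = name in match.re.groupindex
        return (match.group(name) or "") if known else ""
    return TOKEN.sub(lookup, template)


def replace_all(pattern, text, template):
    """Replace every match of pattern in text with the expanded template."""
    return re.sub(pattern, lambda m: expand(template, m), text)


for template in ("T", "$1", "$1W", "${1}W"):
    print(replace_all(r"a(x*)b", "-ab-axxb-", template))
```

In the third template the tokenizer reads 1W as a single name, finds no such group and substitutes an empty string, while the braces in the fourth template end the name after 1.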
$number or $name refers to a subgroup in the regex, either by index or by name.
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
$1 is subgroup 1 in regex = x*
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
$1W: there is no subgroup named 1W => every match is replaced with an empty string
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
$1 and ${1} are the same, so every match is replaced with the content of subgroup 1 followed by W.
For more information: https://golang.org/pkg/regexp/
$1 is a shorthand for ${1}
${1} is the value of the first (1) group, e.g. the content of the first pair of (). This group is (x*) i.e. any number of x.
ReplaceAllString replaces every match. There are two matches. The first is ab, the second is axxb.
No 2. replaces any match with the content of the group: This is "" in the first match and "xx" in the second.
No 4. adds a "W" after the content of the group.
No 3. is left as an exercise. Hint: The twelfth capturing group would be $12.