Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I would want to parse each key-value out from the string that follow the () format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could appear without () like such: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"

You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
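Note that I reworked the pattern so that each optional field is wrapped in an optional non-capturing group: inside the group, the one or more whitespace characters and the field itself are required, and the group as a whole is optional. Making the whole sequence optional, rather than each piece separately, makes matching more efficient.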
The main thing changed is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, it matches
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": ( (matched with \(), then zero or more repetitions of any char other than ( and ) or the Group "senderpartyname" pattern recursed ( see (?:[^()]++|\g<senderpartyname>)*) and then a ) char (matched with \))
)? - end of the group, one or zero repetitions (optional)
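For engines without recursion support, a non-recursive alternative can cover the common case. The following is a minimal Python sketch, assuming at most one level of nested parentheses inside SenderPartyName(...). Python's built-in re module does not support \g<name> recursion, so this is a swapped-in technique rather than the pattern above, and the names SENDER and sender_party_name are hypothetical.

```python
import re

# Assumption: at most ONE level of nested parentheses inside the value.
# Each repetition consumes either a non-parenthesis char or a "(...)" chunk.
SENDER = re.compile(
    r"SenderPartyName\((?P<senderpartyname>(?:[^()]|\([^()]*\))*)\)"
)

def sender_party_name(line):
    """Return the SenderPartyName value without the outer parentheses, or None."""
    m = SENDER.search(line)
    return m.group("senderpartyname") if m else None

nested = ("Id(ci1) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
          "ReceiverPartyName(TEST_HERE_AGAIN)")
flat = ("Id(ci1) SenderPartyName(Mcdonalds Restaurant Glen) "
        "ReceiverPartyName(TEST_HERE_AGAIN)")

print(sender_party_name(nested))
print(sender_party_name(flat))
```

Unlike the recursive pattern, this sketch keeps the outer parentheses out of the captured value.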

Related

Regex expression to bold using asterisks

I have a question regarding using a regular expression to bold text within a string using asterisks.
The other questions on this topic work well for simple scenarios however we have encountered some issues.
Our particular scenario is for asterisks to be replaced with <bold></bold> tags.
It must also be able to handle multiple asterisks as well as an uneven number of asterisks.
Our example input text is as follows;
string exampleText1 = "**** PLEASE NOTE *** Testing, *nuts*, **please note..., test";
string exampleText2 = "**Test text (10)";
Our current regex is as follows;
Regex _boldRegex = new Regex(@"(\*)+([^*?$]+)+(\*)");
string value = _boldRegex.Replace(exampleText1, @"<bold>$2</bold>");
Example 1 should show "<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test" where the groups of asterisks are treated as single asterisks and an unfinished tag is ignored.
Example 2 crashes the program because it expects a 'closing' asterisk. It should show "*Test text (10)"
Can anyone help by suggesting a new regex, bearing in mind the ability to handle groups of asterisks and also an uneven number of asterisks?
Thanks in advance.

For your example data, you might use an optional part with a capture group that captures the repeated character class (excluding newlines) between runs of one or more *
In the callback of replace, you can test for the existence of group 1, and do the replacements based on that.
\*+(?:([^*?$\n\r]+)\*+)?
The pattern matches:
\*+ Match 1+ times *
(?: Non capture group
( Capture group 1
[^*?$\n\r]+ Match 1+ times any char other than the listed in the character class
) Close group 1
\*+ Match 1+ times *
)? Close the non capture group and make it optional
See a regex demo.
For example
Regex _boldRegex = new Regex(@"\*+(?:([^*?$\n\r]+)\*+)?");
string exampleText1 = @"**** PLEASE NOTE *** Testing, *nuts*, **please note..., test
**Test text (10)";
string value = _boldRegex.Replace(exampleText1, m =>
m.Groups[1].Success ? String.Format("<bold>{0}</bold>", m.Groups[1].Value) : "*"
);
Console.WriteLine(value);
Output
<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test
*Test text (10)
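For readers not on .NET, the same idea carries over to other engines that accept a replacement callback. Below is a minimal Python sketch of it, using the pattern from this answer unchanged; the names BOLD and to_bold are hypothetical.

```python
import re

# Same pattern as above: a run of asterisks, optionally followed by
# captured text and a closing run of asterisks.
BOLD = re.compile(r"\*+(?:([^*?$\n\r]+)\*+)?")

def to_bold(text):
    def repl(m):
        # Group 1 only participates when a closing run of asterisks was found.
        if m.group(1) is not None:
            return "<bold>{}</bold>".format(m.group(1))
        return "*"  # unfinished tag: collapse the asterisks into a single one
    return BOLD.sub(repl, text)

print(to_bold("**** PLEASE NOTE *** Testing, *nuts*, **please note..., test"))
print(to_bold("**Test text (10)"))
```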

Regex for SQL Query

Hello together I have the following problem:
I have a long list of SQL queries which I would like to adapt to one of my changes. Finally, I have a renaming problem and I'm afraid I want to solve it more complicated than expected.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a@b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
In the "Insert" column list there was another column, which I removed (I did this in Notepad++ by searching for the term "example, " and replacing it with an empty field). Only the corresponding entry in Values I can't get out using this method, because the text varies here. So far I have only worked with the text file in which I adjusted the list of queries.
So as you can see there is one more entry in Values than in the insertions (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove this including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th digit after the comma should be removed. However, even after research I do not know how to realize this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)

As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bVALUES +\( - Match a word-boundary, literally 'VALUES', followed by at least a single space and a literal open parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
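To apply this to a file of queries outside an editor, the same find/replace can be scripted. The following is a minimal Python sketch, assuming no value before the unwanted one contains a comma (the pattern counts commas, so a value such as N'Doe, John' would shift the count); QUERY, PATTERN and fixed are hypothetical names.

```python
import re

QUERY = (
    "INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, "
    "fax, bem, anrede, salutation, email, name2, name3, association, project) "
    "VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', "
    "N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, "
    "N'Dear Mr. Doe', N'a@b.com', N'This is the text i want to delete', "
    "N'Name2', N'Name3', NULL, NULL);"
)

# Group 1 keeps "VALUES (" plus the first 13 comma-terminated values;
# the 14th value and its trailing comma are matched but not kept.
PATTERN = re.compile(r"(\bVALUES +\((?:[^,]+,){13})[^,]+,")

fixed = PATTERN.sub(r"\1", QUERY, count=1)
print(fixed)
```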

You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.
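Python's built-in re module supports neither \K nor (?1) subroutine calls, so a script has to spell the same idea out. The sketch below is an approximation under that assumption: the value pattern is reused through string concatenation instead of (?1), and a capture group plus \1 stands in for \K. VALUE, PATTERN, drop_14th_value and the comma inside the sample street value are hypothetical and only there to show that quoted commas are handled.

```python
import re

# One SQL value: optional whitespace, then N'...' or a bare word (number, NULL).
VALUE = r"\s*(?:N'[^']*'|\w+)"

# Group 1 keeps "VALUES (" and the first 13 values; the 14th is dropped.
PATTERN = re.compile(
    r"(\bVALUES\s*\(" + VALUE + r"(?:," + VALUE + r"){12})," + VALUE
)

def drop_14th_value(query):
    return PATTERN.sub(r"\1", query, count=1)

sample = (
    "VALUES (2005, N'John', N'Doe', N'Street 4711, Apt 2', N'1234', N'Town', "
    "N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, "
    "N'Dear Mr. Doe', N'a@b.com', N'delete me', N'Name2', N'Name3', NULL, NULL);"
)

print(drop_14th_value(sample))
```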

How to handle redundant cases in regex?

I have to parse a file data into good and bad records the data should be of format
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex expression I have written is
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)
(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it's also considering the records with redundant entries
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How to handle these kind of records and how to write a better expression if the list of diseases is very large.
Note: I am using hadoop maponly job for parsing so give answer in context with java.

What you might do is capture the last part with all the diseases in one group (named capturing group disease) and then use split to get the individual ones and then make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
String[] parts = matcher.group("disease").split("\\|");
Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
System.out.println(uniqueDiseases);
}
Result:
[HIV, Cancer]
Regex demo | Java demo

You need the negative lookahead.
Try using this regex: ^\d*::[^(]+?\s*\(\d{4}\)::(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$.
Explanation:
The initial string ^\d*::[^(]+?\s*\(\d{4}\):: is just an optimized one to match Alex.jr example (your version did not respect any non-alphabetic symbols in names)
The negative lookahead block (?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1) stands for "look ahead for any disease name that is encountered twice, and reject the string if one is found". Its distinctive feature is the (?! ... ) signature.
Finally, ((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$ is also an optimized version of your block (HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*, oriented to avoid redundant listing.
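A quick way to check the lookahead behaviour is to run the regex against one good and one redundant record. This is a minimal Python sketch rather than the Java the question asks for; the pattern is the one from this answer, and RECORD and is_good are hypothetical names.

```python
import re

RECORD = re.compile(
    r"^\d*::[^(]+?\s*\(\d{4}\)::"
    # Reject the line if any disease name reappears later right after a pipe.
    r"(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)"
    r"((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$"
)

def is_good(record):
    return RECORD.match(record) is not None

print(is_good("23::Alex.jr (1969)::HIV|Cancer|flu"))
print(is_good("24::Robin (1980)::HIV|Cancer|Cancer|HIV"))
```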

Probably an easier method to maintain is to use a slightly changed regex, like below:
^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$
It contains:
^ and $ anchors (you want the entire string to be matched, not just part of it).
A capturing group that includes a repeated non-capturing group (a container for alternatives). One of these alternatives is |, with a negative lookahead for an immediately following | (this way you disallow two or more consecutive |).
Then, if this regex matches a particular row, you should:
Split group No 1 by |.
Check the resulting string array for uniqueness (it should not contain repeating entries).
Only if this check succeeds should you accept the row in question.
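The match, split and uniqueness steps can be sketched as follows. This is a Python illustration rather than Java, and it adds one check that is not in the answer: each split part must be in an allowed set, because the regex on its own also accepts names run together without a pipe (for example HIVCancer). ALLOWED, ROW and is_good_record are hypothetical names.

```python
import re

# Assumption: the allowed list lives in a set, which stays cheap to look up
# when the list of diseases grows.
ALLOWED = {"HIV", "Cancer", "flu", "Arthritis", "OCD"}

ROW = re.compile(
    r"^\d*::[a-zA-Z.]+\s\(\d{4}\)::"
    r"((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$"
)

def is_good_record(row):
    m = ROW.match(row)
    if m is None:
        return False
    parts = m.group(1).split("|")
    # Added check: every part must be a known disease name.
    if not all(p in ALLOWED for p in parts):
        return False
    # Uniqueness check described above.
    return len(parts) == len(set(parts))

print(is_good_record("23::Alex.jr (1969)::HIV|Cancer|flu"))
print(is_good_record("24::Robin (1980)::HIV|Cancer|Cancer|HIV"))
```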

How can I do multiple replace using a shared backreference?

I have a need to do some data-transformation for data load compatibility. The nested key:value pairs need to be flattened and have their group id prepended to each piece of child data.
I've been trying to understand the page at
Repeating a Capturing Group vs. Capturing a Repeated Group but can't seem to wrap my head around it.
My expression so far:
"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?
Working sample: https://regex101.com/r/Wobej7/1
I'm aware that using 1 or more intermediate steps would simplify the process but at this point I want to know if it's even possible.
Source Data Example:
"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}
Desired transformation:
{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}

TL; DR
Step 1
Search for:
("[^"]+"):\s*{[^}]*},?\K
Replace with \1
Live demo
Step 2
Search for:
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
Replace with:
{\3,\1,\2}\4\n
Live demo
Whole philosophy
This is not going to be a one-liner regex solution, for several reasons. The most important one is that in PCRE we can neither store part of a match to refer to it later nor use infinite lookbehinds. Fortunately, most similar problems can be solved in two steps.
Very first step should be moving group name to end of {...} block. This way we can have group name each time we want to transform our matches into a single line output.
("[^"]+"):\s*{[^}]*},?\K
( Start of capturing group #1
"[^"]+" Match a group name
) End of CG #1
:\s*{ Group name should precede bunch of other characters
[^}]*},? We have to go further up to end of block
\K Throw away every thing matched so far
We have our group name held in first capturing group and have to replace whole match with it:
\1
Now a block like this:
"g1": {
.
.
.
},
Appears like this one:
"g1": {
.
.
.
},"g1"
The next step is to match the key:value pairs of each block while also capturing the group name that was just added at the end of the block.
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
(?: Start of a non-capturing group
"[^"]+" Try to match a group name
:\s*{ A group name should come after bunch of other characters
| Or
\G(?!\A) Continue from previous match
) End of NCG
\s*("[^"]+"):\s*((?1)) Then try to match and capture a key:value pair
(?=[^}]*},?((?1))) Simultaneously match and capture group name at the end of block
(?|(,)|\s*}(,?).*\R*) Match remaining characters such as commas, brace or newlines
This way, each successful match gives us four captured values, and their order in the replacement is the key:
{\3,\1,\2}\4\n
\3 Group name (that one added at the end of block)
\1 Key
\2 Value
\4 Comma (may be there or may not)
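Outside an editor, the same transformation is usually simpler with an intermediate step. The sketch below is a Python alternative, not a port of the two PCRE steps: Python's built-in re module has no \K, \G, (?1) or branch reset, so it uses one pattern to find each block and a second one to walk the pairs inside it. BLOCK, PAIR, flatten and SOURCE are hypothetical names, and it assumes blocks are not nested and values contain no } character.

```python
import re

# Outer pass: a group name followed by its {...} block.
BLOCK = re.compile(r'"(?P<group>\w+)":\s*\{(?P<body>[^}]*)\}')
# Inner pass: each "key": "value" pair inside one block.
PAIR = re.compile(r'"(?P<key>[^"]+)":\s*"(?P<value>[^"]+)"')

def flatten(text):
    rows = []
    for block in BLOCK.finditer(text):
        for pair in PAIR.finditer(block.group("body")):
            rows.append('{"%s","%s","%s"}' % (
                block.group("group"), pair.group("key"), pair.group("value")))
    return ",\n".join(rows)

SOURCE = '''"g1": {
"k1": "v1",
"k2": "v2"
},
"g2": {
"k4": "v4"
}'''

print(flatten(SOURCE))
```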

Using RegEx to grab a field in brackets

I have multiple square bracketed data in the log file of a splunk log. I am attempting to find a particular field named UserDataGuid and then gather the data in the bracket after this. My only option seems to be regular expressions in a standard that seems similar to perl to me. Yet does not work what am I doing wrong here ?
| rex "\]\s(?<UserDataGuid>.*?)\s*$"
// this trial looks more promising but grabs the last bracket :( and doesn't name the field, to be used in a subSearch.
| rex "(?i)UserDataGuid\s*\[([^\}]*)\]
the data looks like this
[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] [(null)] [(null)] [(null)]
and I want only the guid
fas08f0da-faf6-4308-aad6-hfld5643gs
and I would love for it to be a field I could reuse like fields are used in splunk.

It looks like you want
(?<=UserDataGuid\s\[)([^\]]*)

I'd try the following regex:
(?<=UserDataGuid \[).*?(?=\])/g
This will capture fas08f0da-faf6-4308-aad6-hfld5643gs. See a demo here.

With
\]\s(?<UserDataGuid>.*?)\s*$
you are saying: match a ] (\]), followed by exactly one whitespace character (\s), followed by a group named UserDataGuid ((?<UserDataGuid> ... )) that contains any character except a newline, zero or more times, in lazy mode (.*?), followed by zero or more whitespace characters (\s*), followed by the end of the string ($).
I don't think this is what you want from (?<UserDataGuid> ... ):
you want to match the literal text UserDataGuid (in some way), not give the name UserDataGuid to a group that matches any run of characters.
In
(?i)UserDataGuid\s*\[([^\}]*)\]
change the } to a ], and then your GUID is captured in group #1.
However, you don't need to match "UserDataGuid\s*\[" itself;
you could use a lookbehind:
(?<=UserDataGuid \[)([^\]]*)
Then you match only the GUID, and you can also find it in group #1.
You can remove the parentheses of group #1, because the GUID is then the full match:
(?<=UserDataGuid \[)[^\]]*
https://regex101.com/r/sI3kW4/1
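Both variants discussed here, the lookbehind with the GUID as the full match and a named group after the literal UserDataGuid, can be compared side by side. This is a minimal Python sketch against the sample line from the question, not Splunk syntax; Python spells named groups (?P<name>...), and LINE, LOOKBEHIND and NAMED are hypothetical names.

```python
import re

LINE = ("[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] "
        "[(null)] [(null)] [(null)]")

# Variant 1: fixed-width lookbehind, so the GUID is the whole match.
LOOKBEHIND = re.compile(r"(?<=UserDataGuid \[)[^\]]*")

# Variant 2: match the literal label, capture the GUID in a named group.
NAMED = re.compile(r"UserDataGuid\s*\[(?P<UserDataGuid>[^\]]*)\]", re.IGNORECASE)

guid_full_match = LOOKBEHIND.search(LINE).group(0)
guid_named = NAMED.search(LINE).group("UserDataGuid")

print(guid_full_match)
print(guid_named)
```

The named-group form is the one that yields a reusable field name, which is what the question asks for.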
