Regex capture groups and use OR statement - regex

I'm trying to create a regex that has multiple conditions separated by | (OR). I want to use capture groups, but I can't get it to work fully.
3 sample strings:
--- {source-charset: '', encoding-error-limit: '', class: stat-direct, directory: \\\myserver\C\FOLDER\SUB_FOLDER}
--- {odbc-connect-string-extras: '', server: hello.sample.com, dbname: X_DB, port: '80', class: hello, username: USERX}
--- {cleaning: 'no', filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx, dataRefreshTime: '', interpretationMode: '0'}
For each sample string I would like the regex to return:
\\\myserver\C\FOLDER\SUB_FOLDER
X_DB
//myserver/D/FOLDER/SUB_FOLDER/File name.xlsx
Basically the value after either directory:, dbname: or filename: and ending with } for one of them and , for two.
I've managed to use OR statements to get the three conditions in.
regex extract
'directory: [^}]+|dbname: [^,]+|filename: [^,]+'
That returns:
directory: \\\myserver\C\FOLDER\SUB_FOLDER}
dbname: X_DB,
filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx,
If I introduce capturing groups I only get the right return for one of the parts:
'directory: ([^}]+)|dbname: ([^,]+)|filename: ([^,]+)'
That returns:
\\\myserver\C\FOLDER\SUB_FOLDER
null
null
I've managed to get it working with a nested regex that takes the result from
'directory: [^}]+|dbname: [^,]+|filename: [^,]+'
and uses:
': ([^,}]+)'
That gives me the result I want but I would like to do this as one regex.
Any help would be greatly appreciated.
/Aron

You could use a negated character class to match anything except {, } or a comma, match any of the options in a non-capturing group, and use a single capturing group to capture the value:
{[^{]+(?:filename|directory|dbname): ([^,}]+)[^}]*}
Explanation
{ Match {
[^{]+ Match 1+ times not { using a negated character class
(?:filename|directory|dbname): Match any of the listed options followed by : and a space
( Capture group1
[^,}]+ Match 1+ times not , or }
) Close group 1
[^}]*} Match 0+ times not }, then match }
Regex demo
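As a rough illustration, the pattern above can be tried in Python against the three sample strings from the question. This is only a sketch: the names PATTERN, SAMPLES and extract_value are made up for the example, and the braces are escaped because that is the safer spelling in Python's re module.

```python
import re

# The answer's pattern, with the literal braces escaped for Python's re module.
PATTERN = re.compile(r"\{[^{]+(?:filename|directory|dbname): ([^,}]+)[^}]*\}")

# The three sample strings from the question (raw strings keep the backslashes).
SAMPLES = [
    r"--- {source-charset: '', encoding-error-limit: '', class: stat-direct, directory: \\\myserver\C\FOLDER\SUB_FOLDER}",
    r"--- {odbc-connect-string-extras: '', server: hello.sample.com, dbname: X_DB, port: '80', class: hello, username: USERX}",
    r"--- {cleaning: 'no', filename: //myserver/D/FOLDER/SUB_FOLDER/File name.xlsx, dataRefreshTime: '', interpretationMode: '0'}",
]


def extract_value(line):
    """Return the value after directory:, dbname: or filename:, or None."""
    match = PATTERN.search(line)
    # Whichever keyword matched, the value always lands in group 1.
    return match.group(1) if match else None


for sample in SAMPLES:
    print(extract_value(sample))
```

Because the alternation sits inside a non-capturing group, there is only one capturing group to read, which avoids the null results seen with three separate groups.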

Regex a Block of Yaml Data

I'm currently using regex101 to try to work out the following. I'd like to be able to capture a full item's data, for example name_template_2 and its associated description, define and write data.
Here's my data model
templates:
  name_template:
    description: test_description
    define: yes
    write: true
  name_template_2:
    description: test_description2
    define: false
    write: true
I can capture the lines I need with the following
^[[:space:]][[:space:]][[:space:]][[:space:]].*
and
^[[:space:]][[:space:]]name_template_2:
but I am unable to join both patterns together to filter just the key and data related to name_template_2. The more I read online, the less I understand. Has anyone achieved this before, or is there a much more efficient way of doing this?
Using 2 capture groups:
^[^\S\n]{2}(name_template_2:)((?:\n[^\S\n]{4}\S.*)+)
Explanation
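^ Start of a line (this requires the multiline flag, since the block is not at the start of the string)
[^\S\n]{2} Match 2 whitespace characters other than a newline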
(name_template_2:) Match the string and capture in group 1
( Capture group 2
(?: Non capture group
\n Match a newline
[^\S\n]{4} Match 4 whitespace characters other than a newline
\S.* Match a non whitespace char and the rest of the line
)+ Repeat the non capture group 1 or more times
) Close group 2
Regex demo
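A minimal Python sketch of the two-group pattern, assuming the data is indented with 2 and 4 spaces as the pattern expects. DATA, BLOCK and find_block are illustrative names, and the multiline flag is needed so that ^ can match at the start of the name_template_2 line.

```python
import re

# The data model from the question, built line by line so the
# 2-space and 4-space indentation is explicit.
DATA = "\n".join([
    "templates:",
    "  name_template:",
    "    description: test_description",
    "    define: yes",
    "    write: true",
    "  name_template_2:",
    "    description: test_description2",
    "    define: false",
    "    write: true",
])

# Group 1 is the key, group 2 is every following line indented by 4 spaces.
BLOCK = re.compile(
    r"^[^\S\n]{2}(name_template_2:)((?:\n[^\S\n]{4}\S.*)+)",
    re.MULTILINE,
)


def find_block(text):
    """Return (key, body) for the name_template_2 block, or None."""
    match = BLOCK.search(text)
    return (match.group(1), match.group(2)) if match else None


key, body = find_block(DATA)
print(key + body)
```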
I suggest using a structure-aware tool like yq to manipulate YAML and not using regular expressions.
#!/bin/bash
INPUT='
templates:
  name_template:
    description: test_description
    define: yes
    write: true
  name_template_2:
    description: test_description2
    define: false
    write: true
'
echo "$INPUT" | yq '{"name_template_2": .templates.name_template_2}'
Output
name_template_2:
  description: test_description2
  define: false
  write: true

vscode snippet - transform and replace filename

my filename is
some-fancy-ui.component.html
I want to use a vscode snippet to transform it to
SOME_FANCY_UI
So basically
apply upcase to each character
Replace all - with _
Remove .component.html
Currently I have
'${TM_FILENAME/(.)(-)(.)/${1:/upcase}${2:/_}${3:/upcase}/g}'
which gives me this
'SETUP-PRINTER-SERVER-LIST.COMPONENT.HTML'
The docs don't explain how to combine a replacement with their transforms on regex groups.
If the chunks you need to uppercase are separated with - or . you may use
"Filename to UPPER_SNAKE_CASE": {
  "prefix": "usc_",
  "body": [
    "${TM_FILENAME/\\.component\\.html$|(^|[-.])([^-.]+)/${1:+_}${2:/upcase}/g}"
  ],
  "description": "Convert filename to UPPER_SNAKE_CASE dropping .component.html at the end"
}
You may check the regex workings here.
\.component\.html$ - matches .component.html at the end of the string
| - or
(^|[-.]) capture start of string or - / . into Group 1
([^-.]+) capture any 1+ chars other than - and . into Group 2.
The ${1:+_}${2:/upcase} replacement means:
${1:+ - if Group 1 is not empty,
_ - replace with _
} - end of the first group handling
${2:/upcase} - put the uppered Group 2 value back.
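To see the replacement logic outside the editor, here is a small Python emulation of the transform. It is not how VS Code evaluates snippets; it only mirrors the rules described above, with a callback standing in for ${1:+_} and ${2:/upcase}. The names PATTERN and to_upper_snake are invented for the sketch.

```python
import re

# Same regex as in the snippet body (with the JSON escaping removed).
PATTERN = re.compile(r"\.component\.html$|(^|[-.])([^-.]+)")


def to_upper_snake(filename):
    """Emulate ${1:+_}${2:/upcase} applied globally to the filename."""
    def replace(match):
        if match.group(2) is None:
            # The first alternative matched (.component.html): drop it.
            return ""
        # ${1:+_}: emit an underscore only if group 1 is non-empty.
        separator = "_" if match.group(1) else ""
        # ${2:/upcase}: put the uppercased chunk back.
        return separator + match.group(2).upper()

    return PATTERN.sub(replace, filename)


print(to_upper_snake("some-fancy-ui.component.html"))
```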
Here is a pretty simple alternation regex:
"upcaseSnake": {
  "prefix": "rf1",
  "body": [
    "${TM_FILENAME_BASE/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}",
    "${TM_FILENAME/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}"
  ],
  "description": "upcase and snake the filename"
},
Either version works.
(\\..*)|(-)|(.) alternation of three capture groups is conceptually simple. The order of the groups is important, and it is also what makes the regex so simple.
(\\..*) everything after and including the first dot . in the filename goes into group 1 which will not be used in the transform.
(-) group 2, if there is a group 2, replace it with an underscore ${2:+_}.
(.) group 3, all other characters go into group 3 which will be upcased ${3:/upcase}.
See regex101 demo.
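The three-group alternation can be mimicked in Python to check the ordering argument. Again this is only an emulation of the snippet transform, not VS Code's own engine, and upcase_snake is a hypothetical helper name.

```python
import re

# Group 1: first dot and everything after it; group 2: a dash; group 3: any other char.
PATTERN = re.compile(r"(\..*)|(-)|(.)")


def upcase_snake(filename):
    """Emulate ${2:+_}${3:/upcase} applied globally to the filename."""
    def replace(match):
        if match.group(2):
            return "_"                      # ${2:+_}
        if match.group(3):
            return match.group(3).upper()   # ${3:/upcase}
        return ""                           # group 1 is not used in the output

    return PATTERN.sub(replace, filename)


print(upcase_snake("some-fancy-ui.component.html"))
```

Since group 1 is tried first, the first dot swallows the rest of the name before group 3 gets a chance to uppercase it.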

Regex to extract text within exact given function

I am reading text from a .config file and then I have a long string where I need to extract a text which matches the below given pattern. .config file has 2 functions defined (input and filter)
This is the text extracted from the .config file
input {
name: "abc",
age: "20"
}
filter {
name: "pqr",
age: "25"
}
I need to extract only the text within the filter function including the filter text itself
expected output
filter {
name: "pqr",
age: "25"
}
Here I have written a regex that extracts all the text within the { } braces.
Created Regex
At the moment it extracts text from the whole file. Can anyone help me update the regex so that it extracts only the filter function, including its name? (We need to handle both cases: with and without a space between filter and the opening brace.)
scenario 1 - space between filter text and the opening brace
filter {
name: "pqr",
age: "25"
}
and
scenario 2 - no space between filter text and the opening brace
filter{
name: "pqr",
age: "25"
}
You can use this regex, which matches your filter block. The whitespace between filter and { is optional, so it matches with or without a space.
^filter\s*\{[^{}]+\}$
Notice: I have enabled the m flag in the demo, so you will need to enable it in your programming language or use the inline modifier before the regex, like this: (?m)^filter\s*\{[^{}]+\}$
Regex Explanation:
^filter - Matches filter at the start of a line
\s* - Allows for matching optional whitespace
\{ - Matches literal {
[^{}]+ - Matches one or more characters other than { or }
\}$ - Matches the closing } at the end of a line (with the m flag enabled)
Regex Demo
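In Python, for instance, the m flag corresponds to re.MULTILINE. The sketch below runs the pattern against both scenarios from the question; FILTER_BLOCK, extract_filter and the two sample constants are illustrative names only.

```python
import re

# re.MULTILINE plays the role of the m flag: ^ and $ match at line boundaries.
FILTER_BLOCK = re.compile(r"^filter\s*\{[^{}]+\}$", re.MULTILINE)


def extract_filter(config_text):
    """Return the filter block including its name, or None if absent."""
    match = FILTER_BLOCK.search(config_text)
    return match.group(0) if match else None


# Scenario 1: a space between filter and the opening brace.
WITH_SPACE = "\n".join([
    'input {', 'name: "abc",', 'age: "20"', '}',
    'filter {', 'name: "pqr",', 'age: "25"', '}',
])
# Scenario 2: no space between filter and the opening brace.
WITHOUT_SPACE = WITH_SPACE.replace("filter {", "filter{")

print(extract_filter(WITH_SPACE))
print(extract_filter(WITHOUT_SPACE))
```

Note that [^{}]+ stops at the first brace, so this approach assumes the filter block contains no nested braces.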

Conditionals and regex doubts with grok filter in logstash

I'm taking my first steps with elastic-stack with a practical approach, trying to make it work with an application in my environment. I'm having difficulties understanding from scratch how to write grok filters. I would like to have one like this one working, so that I can work out the rest of them from it.
I've taken some udemy courses, I'm reading "Elastic Stack 6.0", and I'm reading the documentation, but I can't find a way to make this work as intended.
So far, the only grok filter I'm using that actually works, is as simple as (/etc/logstash/config.d/beats.conf)
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { 'message' => "%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel}"
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
This is one of the log entries I'll need to work with, but there are many with different forms. I just need to have this one sorted out so I can adapt the filters to the rest.
2019-02-05 19:13:04,394 INFO [qtp1286783232-574:http://localhost:8080/service/soap/AuthRequest] [name=admin#example.com;oip=172.16.1.69;ua=zclient/8.8.9_GA_3019;soapId=3bde7ed0;] SoapEngine - handler exception: authentication failed for [admin], invalid password
I'd like to have this info, only when there is a "soapId" and when the field next to "INFO" starts with "qtp":
date: 2019-02-05
time: 19:13:04,394
loglevel: INFO
identifier: qtp1286783232-574
soap: http://localhost:8080/service/soap/AuthRequest
Which could also end in things like "GetInfoRequest" or "NoOpRequest"
account: admin#example.com
oip: 172.16.1.69
client: zclient/8.8.9_GA_3019
soapid: 3bde7ed0
error: true (if either "invalid password" or "authentication failed" are found in the line)
If the conditions are not met, then I will apply other filters (which hopefully I will be able to write adapting this one as a base).
You can't have false in the output if you have invalid password in the input. You can only match what is there in the string.
I think you may use
%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel} *\[(?<identifier>qtp[^\]\[:]*):(?<soap>[^\]\[]*)]\s*\[name=(?<account>[^;]+);oip=(?<oip>[0-9.]+);ua=(?<client>[^;]+);soapId=(?<soapId>[^;]+);].*?(?:(?<error>authentication failed).*)?$
Here are the details of the added patterns:
* - 0+ spaces
\[ - a [ char
(?<identifier>qtp[^\]\[:]*) - Named group "identifier": qtp and then 0+ chars other than :, ] and [
: - a colon
(?<soap>[^\]\[]*) - Named group "soap": 0+ chars other than ] and [
]\s*\[name= - a ], then 0+ whitespaces and [name= substring
(?<account>[^;]+) - Named group "account": 1+ chars other than ;
;oip= - a literal substring
(?<oip>[0-9.]+) - Named group "oip": 1+ digits and/or dots
;ua= - a literal substring
(?<client>[^;]+) - Named group "client": 1+ chars other than ;
;soapId= - a literal substring
(?<soapId>[^;]+) - Named group "soapId": 1+ chars other than ;
;] - a literal substring
.*? - any 0+ chars other than line break chars, as few as possible
(?:(?<error>authentication failed).*)? - an optional group matching 1 or 0 occurrences of
Named group "error": authentication failed substring
.* - all the rest of the line
$ - end of input.
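The custom part of the pattern can be checked outside Logstash with a plain regex engine. The Python sketch below is an approximation under stated assumptions: the three %{...} grok macros are replaced by simple hand-written stand-ins (they are not the real grok definitions), named groups use Python's (?P<name>...) spelling, and the error field is turned into a boolean in code, since the regex itself can only capture text that is present. LOG_RE, LINE and parse_line are made-up names.

```python
import re

LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "        # stand-in for %{DATE:date} (assumption)
    r"(?P<time>\d{2}:\d{2}:\d{2},\d+) "    # stand-in for %{TIME:time} (assumption)
    r"(?P<loglevel>[A-Z]+) *"              # stand-in for %{LOGLEVEL:loglevel} (assumption)
    r"\[(?P<identifier>qtp[^\]\[:]*):(?P<soap>[^\]\[]*)\]\s*"
    r"\[name=(?P<account>[^;]+);oip=(?P<oip>[0-9.]+);"
    r"ua=(?P<client>[^;]+);soapId=(?P<soapId>[^;]+);\]"
    r".*?(?:(?P<error>authentication failed).*)?$"
)

# The log entry from the question.
LINE = (
    "2019-02-05 19:13:04,394 INFO "
    "[qtp1286783232-574:http://localhost:8080/service/soap/AuthRequest] "
    "[name=admin#example.com;oip=172.16.1.69;ua=zclient/8.8.9_GA_3019;soapId=3bde7ed0;] "
    "SoapEngine - handler exception: authentication failed for [admin], invalid password"
)


def parse_line(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The regex only captures the text; convert its presence to a flag here.
    fields["error"] = fields["error"] is not None
    return fields


print(parse_line(LINE))
```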

How can I do multiple replace using a shared backreference?

I have a need to do some data-transformation for data load compatibility. The nested key:value pairs need to be flattened and have their group id prepended to each piece of child data.
I've been trying to understand the page at
Repeating a Capturing Group vs. Capturing a Repeated Group but can't seem to wrap my head around it.
My expression so far:
"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?
Working sample: https://regex101.com/r/Wobej7/1
I'm aware that using 1 or more intermediate steps would simplify the process but at this point I want to know if it's even possible.
Source Data Example:
"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}
Desired transformation:
{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}
TL;DR
Step 1
Search for:
("[^"]+"):\s*{[^}]*},?\K
Replace with \1
Live demo
Step 2
Search for:
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
Replace with:
{\3,\1,\2}\4\n
Live demo
Whole philosophy
This is not going to be a one-liner regex solution, for several reasons. The most important one is that in PCRE we can neither store part of a match to refer to it in later matches nor use unbounded lookbehinds. Fortunately, most similar problems can be solved in two steps.
The very first step is to copy the group name to the end of its {...} block. This way the group name is available every time we transform a match into a single output line.
("[^"]+"):\s*{[^}]*},?\K
( Start of capturing group #1
"[^"]+" Match a group name
) End of CG #1
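:\s*{ Match the colon, optional whitespace and the opening brace that follow the group name
[^}]*},? Consume the rest of the block, up to and including the closing brace and an optional comma
\K Reset the start of the match, so that nothing matched so far is part of the reported match
The group name is held in the first capturing group, and because of \K the match itself is empty, so replacing it with the following inserts the group name right after the block: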
\1
Now a block like this:
"g1": {
.
.
.
},
Appears like this one:
"g1": {
.
.
.
},"g1"
The next step is to match the key:value pairs of each block while also capturing the group name that was just added at the end of the block.
(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
(?: Start of a non-capturing group
"[^"]+" Try to match a group name
:\s*{ Followed by a colon, optional whitespace and the opening brace
| Or
\G(?!\A) Continue from previous match
) End of NCG
\s*("[^"]+"):\s*((?1)) Then try to match and capture a key:value pair
(?=[^}]*},?((?1))) Simultaneously match and capture group name at the end of block
(?|(,)|\s*}(,?).*\R*) Match remaining characters such as commas, brace or newlines
This way, each successful match yields four captures, and the order in which we emit them is the key:
{\3,\1,\2}\4\n
\3 Group name (that one added at the end of block)
\1 Key
\2 Value
\4 Comma (may be there or may not)
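For comparison, here is what the same flattening looks like when intermediate steps are allowed, as a Python sketch. This is a different technique from the two PCRE replacements above: Python's built-in re module has no \K, \G, (?1) or branch reset, so the sketch uses an outer match per group and an inner match per key:value pair. GROUP_RE, PAIR_RE, flatten and SOURCE are illustrative names.

```python
import re

# Outer pattern: a group name and the body between its braces.
GROUP_RE = re.compile(r'"(\w+)":\s*\{([^}]*)\}')
# Inner pattern: one "key": "value" pair inside a body.
PAIR_RE = re.compile(r'"([^"]+)":\s*"([^"]+)"')

SOURCE = '''"g1": {
"k1": "v1",
"k2": "v2",
"k3": "v3"
},
"g2": {
"k4": "v4",
"k5": "v5",
"k6": "v6"
},
"g3": {
"k7": "v7",
"k8": "v8",
"k9": "v9"
}'''


def flatten(text):
    """Prepend each group name to its key:value pairs, one row per pair."""
    rows = []
    for group_match in GROUP_RE.finditer(text):
        group, body = group_match.groups()
        for key, value in PAIR_RE.findall(body):
            rows.append('{"%s","%s","%s"}' % (group, key, value))
    return ",\n".join(rows)


print(flatten(SOURCE))
```

The group name stays available in a variable between the two passes, which is exactly the thing a single regex replacement cannot do.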