How to capture a group up to the end of the line in a JS regex?

I'm trying to capture a text into 3 groups. I have managed to capture 2 groups but am having an issue with the 3rd group.
This is the text:
<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3]
Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending
itemsCount=1
I'm using the following regex:
(?=- )(.*?)(?= - )|(?=])(.*?)(?= -)
My 3rd group should be: "After sending itemsCount=1"
Any suggestions?

Your original expression is fine, just missing a $:
(?=- )(.*?)(?= - |$)|(?=])(.*?)(?= -)
Demo
We could also modify it slightly, to an expression similar to:
(?=-\s+).*?([A-Z].*?)(?=\s+-\s+|$)|(?=]\s+).*?([A-Z].*?)(?=\s+-)
Demo
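To see the effect of adding |$ to the first lookahead, here is a minimal sketch that runs the original and the amended pattern over the sample line. It assumes the log entry sits on a single line, and the helper name matchesOf is made up for illustration.

```javascript
// Sample log line from the question, assumed to be on a single line.
const line = "<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] " +
  "Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1";

// Original pattern: the first lookahead only accepts " - " after the match.
const original = /(?=- )(.*?)(?= - )|(?=])(.*?)(?= -)/g;
// Amended pattern: the first lookahead also accepts the end of the string.
const fixed = /(?=- )(.*?)(?= - |$)|(?=])(.*?)(?= -)/g;

// Hypothetical helper: collect the full text of every match.
const matchesOf = (re) => Array.from(line.matchAll(re), (m) => m[0]);

const before = matchesOf(original);
const after = matchesOf(fixed);
console.log(before); // two matches, the trailing message is missing
console.log(after);  // three matches, the last one runs to the end of the line
```

Each match still carries its leading "] " or "- ", since the lookaheads only assert those characters and .*? then consumes them.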

You have 2 capturing groups. You don't get a match for the third part because the positive lookahead in the first alternative does not account for the end of the string. You can solve that by using an alternation inside the lookahead that either matches " - " or asserts the end of the string:
(?=[-\]] )(.*?)(?= - |$)
^^
If those matches are ok, you could simplify that pattern by making use of a character class to match either - or ] like [-\]] and omit the alternation and the group as you now have only the matches.
Your pattern then might look like (also capturing the leading hyphen like the first 2 matches)
(?=[-\]] ).*?(?= - |$)
Regex demo
If this is your string and you want to have 3 capturing groups, you might use:
^.*?\[\d+\]([^-]+)-([^-]+)-\s*([^-]+)$
^ Start of string
.*? Match any char except a newline non greedy
\[\d+\] match [ 1+ digits ]
([^-]+)- Capture group 1, match 1+ times not -, then match -
([^-]+)- Capture group 2, match 1+ times not -, then match -
\s* Match 0+ whitespace chars
([^-]+) Capture group 3, match 1+ times not -
$ End of string
Regex demo
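As a rough illustration of the three-group pattern in JavaScript, the sketch below applies it to the sample line and trims the captures, because groups 1 and 2 include the spaces around each hyphen. It assumes the whole entry is on one line and that none of the three parts contains a hyphen.

```javascript
// Sample log line from the question, assumed to be on a single line.
const line = "<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] " +
  "Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1";

// Three capture groups after the [digits] marker, separated by hyphens.
const pattern = /^.*?\[\d+\]([^-]+)-([^-]+)-\s*([^-]+)$/;

const m = pattern.exec(line);
// Drop the full match (index 0) and trim the whitespace kept by [^-]+.
const parts = m ? m.slice(1).map((s) => s.trim()) : [];
console.log(parts);
// [ 'Drivers.KafkaInvoker', 'KafkaInvoker.SendMessages', 'After sending itemsCount=1' ]
```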
For example, to create the desired object from the comments, you could first collect every match[0] in an array.
Once you have all the values, assemble the object from the keys and the values.
var output = {};
var regex = new RegExp(/(?=[-\]] ).*?(?= - |$)/g);
var str = `<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1`;
var match;
var values = [];
var keys = ['Thread', 'Class', 'Message'];
while ((match = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (match.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    values.push(match[0]);
}
keys.forEach((key, index) => output[key] = values[index]);
console.log(output);

Related

How to extract parameter definitions using regex?

I am trying to extract parameter definitions from a Jenkins script and can't work out an appropriate regex (I'm working in Dyalog APL, which supports PCRE8).
Here's what the subject looks like:
pipeline {
    agent none
    parameters {
        string(name: 'foo', defaultValue: 'bar')
        string(name: 'goo', defaultValue: 'hoo')
    }
    stages {
        stage('action') {
            steps {
                echo "foo = ${params.foo}"
            }
        }
    }
}
I would like to get the individual param definitions captured in group 1 (in other words: I'm looking for a result that reports two matches: string(name: 'foo', defaultValue: 'bar') and string(name: 'goo', defaultValue: 'hoo') ), but the matches are either too long or too short (depending on greediness).
My regex:
parameters\s*{(\s*\D*\(.*\)\s*)*} (dot matches nl)
Parameter types may vary, so my best idea was to use \D* for those (any # of non-digits). I am suspicious that this captures more than I expected - but replacing that with \w did not help.
An alternative idea was
parameters\s*{(\s*(\w*)\(([^\)]*)\))*\s*}
which seemed more precise wrt matching parameter types and also the content of the parens - but surprisingly that returned goo only and skipped foo.
What am I missing?
Using PCRE you can use this regex in MULTILINE mode:
(?m)(?:^\h*parameters\h*{|(?!^)\G).*\R\h*\w+\(\w+:\h*'\K[^']+
RegEx Demo
RegEx Details:
(?m): Enable MULTILINE mode
(?:: Start non-capture group
^\h*parameters\h*{: Match a line that starts with parameters {
|: OR
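(?!^)\G: Continue from the end of the previous match (\G), but not at the start of a line
): End non-capture group
.*: Match the rest of the current line (any chars except line breaks)
\R: Match a line break
\h*: Match 0 or more horizontal whitespace chars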
\w+: Match 1+ word chars
\(: Match (
\w+: Match 1+ word chars
:: Match a :
\h*: Match 0 or more horizontal whitespace chars
': Match a '
\K: Reset the start of the reported match, discarding everything matched so far
[^']+: Match 1+ of any char that is not ' (this is our parameter name)
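The PCRE answer above relies on \G, \K, \h and \R, which many other flavours lack. It also helps to know why the question's second attempt returned only goo: a repeated capturing group keeps only its last iteration. Where the PCRE-only features are unavailable, one alternative is a two-step approach, sketched here in JavaScript. This swaps in a different technique from the answer's single regex, and it assumes that the parameters block contains no nested braces and that no definition contains a closing parenthesis inside its arguments.

```javascript
// Jenkins script from the question; the indentation is an assumption.
const script = [
  "pipeline {",
  "    agent none",
  "    parameters {",
  "        string(name: 'foo', defaultValue: 'bar')",
  "        string(name: 'goo', defaultValue: 'hoo')",
  "    }",
  "    stages {",
  "        stage('action') {",
  "            steps {",
  "                echo \"foo = ${params.foo}\"",
  "            }",
  "        }",
  "    }",
  "}",
].join("\n");

// Step 1: isolate the body of the parameters { ... } block (no nested braces assumed).
const block = /parameters\s*\{([^{}]*)\}/.exec(script);

// Step 2: collect every type(...) definition inside that body with a global match,
// instead of relying on a repeated capture group.
const definitions = block ? block[1].match(/\w+\([^)]*\)/g) || [] : [];
console.log(definitions);
```

Because step 2 uses \w+ for the type name, other parameter types would be picked up the same way as string.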

Regex to extract string if there is or not a specific word

Hi I'm a regex noob and I'd like to make a regex in order to extract the penultimate string from the URL if the word "xxxx" is contained or the last string if the word "xxxx" is not contained.
For example, I could have 2 scenarios:
www.hello.com/aaaa/1adf0023efae456
www.hello.com/aaaa/1adf0023efae456/xxxx
In both cases I want to extract the string 1adf0023efae456.
I've tried something like (?=(\w*xxxx\w*)\/.*\/(.*?)\/|[^\/]+$) but it doesn't work properly.
You can match the digits and assert that what follows is either /xxxx or the end of the string.
\d+(?=/xxxx|$)
Regex demo
If there should be a / before matching the digits, you could use a capturing group and get the value from group 1
/(\d+)(?=/xxxx|$)
/ Match /
(\d+) Capture group 1, match 1+ digits
(?=/xxxx|$) Positive lookahead, assert what is on the right is either /xxxx or the end of the string
Regex demo
Edit
If there could possibly also be alphanumeric characters instead of digits, you could use a character class [a-z0-9]+ with an optional non capturing group.
/([a-z0-9]+)(?:/xxxx)?$
Regex demo
To match any char except a whitespace char or a forward slash, use [^\s/]+
Using lookarounds, you could assert a / on the left, match 1+ alphanumerics and assert what is at the right is either /xxxx or the end of the string which did not end with /xxxx
(?<=/)[a-z0-9]+(?=/xxxx$|$(?<!/xxxx))
Regex demo
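A small JavaScript sketch of the capturing-group variant, applied to the two URLs from the question. The names idPattern and extractId are made up for illustration, and the pattern assumes lowercase alphanumeric segments as in the examples.

```javascript
// The slash has to be escaped inside a JavaScript regex literal.
const idPattern = /\/([a-z0-9]+)(?:\/xxxx)?$/;

// Hypothetical helper: return the captured segment, or null when nothing matches.
function extractId(url) {
  const m = idPattern.exec(url);
  return m ? m[1] : null;
}

console.log(extractId("www.hello.com/aaaa/1adf0023efae456"));      // 1adf0023efae456
console.log(extractId("www.hello.com/aaaa/1adf0023efae456/xxxx")); // 1adf0023efae456
```

The regex engine reports the leftmost match, so in the second URL the match starts at the slash before 1adf0023efae456 and the optional group consumes /xxxx.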
You could avoid Regex:
using System.Linq;

string[] strings =
{
    "www.hello.com/aaaa/1adf0023efae456",
    "www.hello.com/aaaa/1adf0023efae456/xxxx"
};
// Split on '/', then take the last segment, or the one before it when the last is "xxxx"
var x = strings.Select(s => s.Split('/'))
    .Select(arr => new { upper = arr.GetUpperBound(0), arr })
    .Select(z => z.arr[z.upper] == "xxxx" ? z.arr[z.upper - 1] : z.arr[z.upper]);

How to capture group no of every group in a repeated capturing group

My regex is something like this: (A)(([+-]\d{1,2}[YMD])*) which is matching as expected, like A+3M, A-3Y+5M+3D etc.
But I want to capture all the groups of this sub-pattern: ([+-]\d{1,2}[YMD])*
For the following example A-3M+2D, I can see only 4 groups. A-3M+2D (group 0), A(group 1), -3M+2D (group 2), +2D (group 3)
Is there a way I can get -3M as a separate group?
Repeated capturing groups usually capture only the last iteration. This is true for Kotlin, as well as Java, as the languages do not have any method that would keep track of each capturing group stack.
What you may do as a workaround, is to first validate the whole string against a certain pattern the string should match, and then either extract or split the string into parts.
For the current scenario, you may use
val text = "A-3M+2D"
if (text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex())) {
    val results = text.split("(?=[-+])".toRegex())
    println(results)
}
// => [A, -3M, +2D]
See the Kotlin demo
Here,
text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex()) makes sure the whole string matches A and then 0 or more occurrences of + or -, 1 or 2 digits followed with Y, M or D
.split("(?=[-+])".toRegex()) splits the text with an empty string right before a - or +.
Pattern details
^ - implicit in .matches() - start of string
A - an A substring
(?: - start of a non-capturing group:
[+-] - a character class matching + or -
\d{1,2} - one to two digits
[YMD] - a character class that matches Y or M or D
)* - end of the non-capturing group, repeat 0 or more times (due to * quantifier)
\z - implicit in matches() - end of string.
When splitting, we just need to find locations before - or +, hence we use a positive lookahead, (?=[-+]), that matches a position that is immediately followed with + or -. It is a non-consuming pattern, the + or - matched are not added to the match value.
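JavaScript behaves the same way with repeated capturing groups (only the last iteration is kept), so the validate-then-split idea carries over. The following is a rough port, not part of the Kotlin answer; the function name splitTerms is hypothetical, and ^ and $ are written out because JavaScript's test() does not anchor implicitly.

```javascript
// Hypothetical helper: validate the whole string first, then split before each sign.
function splitTerms(text) {
  // Explicit ^ and $ play the role of Kotlin's implicit anchors in matches().
  if (!/^A(?:[+-]\d{1,2}[YMD])*$/.test(text)) {
    return null;
  }
  // The lookahead matches the empty position before + or -, so the sign is kept.
  return text.split(/(?=[-+])/);
}

console.log(splitTerms("A-3M+2D")); // [ 'A', '-3M', '+2D' ]
console.log(splitTerms("A-3X"));    // null
```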
Another approach with a single regex
You may also use a \G based regex to check the string format first at the start of the string, and only start matching consecutive substrings if that check is a success:
val regex = """(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$))[^-+]+""".toRegex()
println(regex.findAll("A-3M+2D").map{it.value}.toList())
// => [A, -3M, +2D]
See another Kotlin demo and the regex demo.
Details
(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$)) - either the end of the previous successful match and then + or - (see \G(?!^)[+-]) or (|) start of string that is followed with A and then 0 or more occurrences of +/-, 1 or 2 digits and then Y, M or D till the end of the string (see ^(?=A(?:[+-]\d{1,2}[YMD])*$))
[^-+]+ - 1 or more chars other than - and +. We need not be too careful here since the lookahead did the heavy lifting at the start of string.

regex capture with g modifier captures only first occurrence

Using EcmaScript 6 RegExp
From this : "-=Section A=- text A -=Section B=- text b"
I want to get this: ['Section A', 'text A', 'Section B', 'text B']
Apart from the delimiters, everything else is variable. (Eventually '-=someString=-' will be '' but for now I did not want to clutter things up or create errors with characters that need escaping.)
I am not a regex expert, but I have searched all day for an example or guidance to make this work without success.
For example using this code:
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = RegExp('-=(.*?)=-(.*?)','g');
I only get this: ["-=Section A=-", "Section A", ""]
I am not sure how to make the second of the captures capture 'text A'. Also I do not understand why the g modifier is not making it continue after the first match and go on to find 'Section B' and 'text B'.
Any pointers to some examples would be appreciated - I have failed to find any.
Note that a lazy (.*?) at the very end of a pattern always matches an empty string: a lazy quantifier expands only when the rest of the pattern needs it to, and here nothing follows it. text A cannot be captured because each match ends right after =-.
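A quick way to check this is to iterate over all matches of the question's pattern. The sketch below uses String.prototype.matchAll and shows that both section markers are found, while the trailing lazy group is empty each time.

```javascript
const templateString = "-=Section A=- text A -=Section B=- text b";
// Pattern from the question: the trailing (.*?) has nothing after it to force it to grow.
const regex = /-=(.*?)=-(.*?)/g;

// Collect [full match, group 1, group 2] for every match.
const found = Array.from(templateString.matchAll(regex), (m) => [m[0], m[1], m[2]]);
console.log(found);
// [ [ '-=Section A=-', 'Section A', '' ], [ '-=Section B=-', 'Section B', '' ] ]
```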
You may use
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = /\s*-=(.*?)=-\s*/;
console.log(templateString.split(regex).filter(Boolean));
The \s*-=(.*?)=-\s* pattern finds
\s* - 0+ whitespaces
-= - a -= substring
(.*?) - Group 1: any 0+ chars, as few as possible up to the first occurrence of the subsequent subpatterns
=- - a =- substring
\s* - 0+ whitespaces.
The String#split method adds to the resulting array all substrings captured into Group 1.
If you want to use a matching approach, you would need to match any char, 0 or more occurrences, that does not start the leading char sequence, which seems to be -= in your scenario:
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = /-=(.*?)=-\s*([^-]*(?:-(?!=)[^-]*)*)/g;
let m, res=[];
while (m=regex.exec(templateString)) {
res.push([m[1], m[2].trim()]);
}
console.log(res);
See this regex demo
Details
-=(.*?)=-\s* - same as in the first regex (see the split regex above)
([^-]*(?:-(?!=)[^-]*)*) - Group 2 that matches and captures:
[^-]* - 0+ chars other than -
(?: - start of a non-capturing group that matches
-(?!=) - a hyphen that is not immediately followed with =
[^-]* - 0+ chars other than -
)* - ...zero or more times

regex how to match a capture group more than once

I have the following regex:
\{(\w+)(?:\{(\w+))+\}+\}
I need it to match any of the following
{a{b}}
{a{b{c}}}
{a{b{c{d...}}}}
But when I use the regex on the last one, for example, it only captures two groups, a and c; it doesn't capture b or any other words that might be in between.
How do I get the group to match each single one like:
group #1: a
group #2: b
group #3: c
group #4: d
group #5: etc...
or like
group #1: a
group #2: [b, c, d, etc...]
Also, how do I make sure there are the same number of { on the left as there are } on the right, and otherwise not match?
Thanks for the help,
David
In .NET, a regex can 1) check for balanced constructs and 2) store a capture collection for each capturing group in a group stack.
With the following regex, you may extract all the texts inside each {...} only if the whole string starting with { and ending with } contains a balanced amount of those open/close curly braces:
^{(?:(?<c>[^{}]+)|(?<o>{)|(?<-o>}))*(?(o)(?!))}$
See the regex demo.
Details:
^ - start of string
{ - an open brace
(?: - start of a group of alternatives:
(?<c>[^{}]+) - 1+ chars other than { and } captured into "c" group
| - or
(?<o>{) - { is matched and a value is pushed to the Group "o" stack
| - or
(?<-o>}) - a } is matched and a value is popped from Group "o" stack
)* - end of the alternation group, repeated 0+ times
(?(o)(?!)) - a conditional construct that fails the match if the Group "o" stack is not empty (i.e. some { was never closed)
} - a close }
$ - end of string.
C# demo:
var pattern = "^{(?:(?<c>[^{}]+)|(?<o>{)|(?<-o>}))*(?(o)(?!))}$";
var result = Regex.Matches("{a{bb{ccc{dd}}}}", pattern)
.Cast<Match>().Select(p => p.Groups["c"].Captures)
.ToList();
Output for {a{bb{ccc{dd}}}} is [a, bb, ccc, dd] while for {{a{bb{ccc{dd}}}} (a { is added at the beginning), results are empty.
For regex flavours supporting recursion (PCRE, Ruby) you may employ the following generic pattern:
^({\w+(?1)?})$
It lets you check whether the input matches the defined pattern, but it does not capture the desired groups. See the Matching Balanced Constructs section in http://www.regular-expressions.info/recurse.html for details.
In order to capture the groups we may convert the pattern checking regex into a positive lookahead which would be checked only once at the start of string ((?:^(?=({\w+(?1)?})$)|\G(?!\A))) and then just capture all "words" using global search:
(?:^(?=({\w+(?1)?})$)|\G(?!\A)){(\w+)
The a, b, c, etc. are now in the second capture group.
Regex demo: https://regex101.com/r/2wsR10/2. PHP demo: https://ideone.com/UKTfcm.
Explanation:
(?: - start of alternation group
[first alternative]:
^ - start of string
(?= - start of positive lookahead
({\w+(?1)?}) - the generic pattern from above
$ - end of string
) - end of positive lookahead
| - or
[second alternative]:
\G - end of previous match
(?!\A) - ensure the previous \G does not match the start of the input if the first alternative failed
) - end of alternation group
{ - opening brace literally
(\w+) - a "word" captured in the second group.
Ruby has different syntax for recursion and the regex would be:
(?:^(?=({\w+\g<1>?})$)|\G(?!\A)){(\w+)
Demo: http://rubular.com/r/jOJRhwJvR4
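JavaScript regexes have neither balancing groups nor recursion, so neither approach above ports directly. One possible workaround, sketched below, is to match the opening and closing parts separately and compare their counts in code. This is a swapped-in technique rather than the method from the answers; the function name nestedWords is hypothetical, and the sketch accepts the same strictly nested {word{word{...}}} shape as the recursive pattern.

```javascript
// Hypothetical helper: return the nested words, or null if the braces do not balance.
function nestedWords(input) {
  // Group 1: one or more "{word" openers; group 2: the run of closing braces.
  const m = /^((?:\{\w+)+)(\}+)$/.exec(input);
  if (!m) {
    return null;
  }
  // Pull each word out of the opener run.
  const words = Array.from(m[1].matchAll(/\{(\w+)/g), (w) => w[1]);
  // Balanced only when there is exactly one closing brace per opener.
  return words.length === m[2].length ? words : null;
}

console.log(nestedWords("{a{bb{ccc{dd}}}}"));  // [ 'a', 'bb', 'ccc', 'dd' ]
console.log(nestedWords("{{a{bb{ccc{dd}}}}")); // null
```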