repeated, arbitrary capture groups - regex

Given a string, e.g.:
static_string.name__john.id__6.foo__bar.final_string
but with an arbitrary number of label__value. components, how can I repeat the capture groups, split each one into label and value, and also capture the terminating final_string?
For the above I'd want [name, john, id, 6, foo, bar, final_string]
Is something like this possible when I don't know the number of label__value. components in advance?
This is for golang / RE2 if that matters.
Update: I don't have the luxury of doing this in a few lines of code, and would need to do this in a single regex. The regex is defined in a config file to an application I don't control, so a code based loop with conditionals etc is unfortunately not possible.

This depends entirely on what the application you are putting the regex into expects.
This answer focuses on getting you the capture groups in a basic way, avoiding anything likely to cause trouble with RE2 or with the application that consumes the regex.
Note: with this method final_string may not land in the capture group index you expect, but again that depends on what you are putting the regex into.
The following regular expression matches zero or one key/value pair; the two strings under it are examples it matches:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
static_string.final_string
static_string.name__john.final_string
To support one more key/value pair we repeat part of the regular expression:
Part repeated:
(?:\.([^.]+?)__([^.]+))?
So to support 2 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This now supports the following additional example:
static_string.name__john.foo__bar.final_string
So if I expand that out to support 12 key value pairs the regular expression is:
^[^.]+(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+?)__([^.]+))?(?:\.([^.]+))$
This supports the following additional examples:
static_string.name__john.id__6.foo__bar.final_string
static_string.name2_1b__john.id__6.foo__bar.final_string
static_string.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.name__john.id__6.foo__bar.final_string
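For anyone who has to produce the longer variants, here is a minimal Python sketch that generates the pattern for a chosen maximum number of pairs and checks it against the examples above. The names build_pattern, extract and max_pairs are made up for illustration, and Python's re module is only a stand-in for RE2 here; the pattern itself uses nothing beyond non-capturing groups and a lazy quantifier.

```python
import re

def build_pattern(max_pairs):
    """Return the regex from the answer, with the optional pair repeated max_pairs times."""
    pair = r"(?:\.([^.]+?)__([^.]+))?"    # one optional .label__value component
    final = r"(?:\.([^.]+))$"             # the mandatory terminating component
    return "^[^.]+" + pair * max_pairs + final

def extract(text, max_pairs=12):
    """Return the captured labels, values and final string, skipping unused groups."""
    match = re.match(build_pattern(max_pairs), text)
    if match is None:
        return None
    # Optional pairs that did not take part in the match come back as None.
    return [group for group in match.groups() if group is not None]

print(build_pattern(2))
print(extract("static_string.name__john.id__6.foo__bar.final_string"))
```

The generated string is what would go into the config file. Pairs that are not present leave their groups empty, which is why final_string always sits in the last group (group 25 for 12 pairs) rather than directly after the last pair that matched.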

Related

How to extract multiple values with a regular expression in Jmeter

I am running tests with jmeter and I need to extract with a Regular Expression:
insertar?sIws2kyXGJJA_01==
insertar?sIws2kyXGJJA_02==
in the following String:
[\"EMBPAGE1_00010001\",\"**insertar?sIws2kyXGJJA_01==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"[\"EMBPAGE1_00010002\",\"**insertar?sIws2kyXGJJA_02==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"
Use the little-known option: a negative Match No., which extracts every match instead of a single one.
UPD: g2 appears in my example because I extract two groups from each match.
In each match, g1 is the "uuid" and g2 is the second part, and the second part is what I need here.
That is why I use the $2$ template and g2. If each of your matches has a single group, you will most likely use the $1$ template, which places every match into g1.
If you have only one capture group, you do not actually need the _gN suffix at all.
To see which variables exist after the extraction, add a "Debug PostProcessor" and inspect its output in the View Results Tree listener.
It is nice to know that controllers such as "ForEach" understand these variables: given a prefix like regexUUID_, they can walk through all the matches. In most cases that is the next thing you do after extraction.
UPD2. A primitive version of the regexp for the question: (insertar\?sIws2kyXGJJA_\d*)==([^[]*)
With the template $1$$2$
you will have the first parts in the g1 group and the second parts in g2.
With the answer given by DMC, you need to add the Regular Expression Extractor twice, with different Match No. values (1 and 2), to retrieve both values. That is correct, but here is a better way to achieve the same result.
Another Approach:
1. Capture Both Values:
You can use the Template field to capture both values at the same time and refer to them later by index.
Please check the following screen shot:
Here, we captured both values using two groups and two templates, $1$ and $2$. By default, the templates store the data in the order in which the groups appear in the regular expression. (FYI, you can change the order by rearranging the templates, e.g. $2$ and then $1$.)
So, as in the screenshot, we capture two values and store them using the templates $1$ (the first group's match) and $2$ (the second group's match).
2. Retrieve Values:
Now, refer these values in your script by using the following syntax:
${insert_values_gn} (n refers to the group no.)
e.g.:
${insert_values_g1} - refers to the first group's match
${insert_values_g2} - refers to the second group's match
To keep it simple, you can think of "insert_values" as a list of strings captured by the groups, and use n (1, 2, 3, etc.) as the index to retrieve each value.
Note: using templates, a single Regular Expression Extractor can retrieve any number of values through multiple groups, and you refer to them by index.
I'm sure there is a more efficient way but this worked:
\*\*(.*?)\*\*.*\"\*\*(.*?)\*\*
You can also use only \*\*(.*?)\*\*
It will match both of them anyway, so make sure you set the right 'Match No.' in JMeter if you only want one of the values:
The Match No. should be 1 for the first match and 2 for the second, I believe.
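To see what the two expressions above actually capture, here is a small Python sketch run against the string from the question. Python's re module is only a stand-in for JMeter's regex engine, and the variable names are invented for the example.

```python
import re

# Response text from the question; the \" and ** are literal characters in it.
response = (
    r'[\"EMBPAGE1_00010001\",\"**insertar?sIws2kyXGJJA_01==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"'
    r'[\"EMBPAGE1_00010002\",\"**insertar?sIws2kyXGJJA_02==**\",1,100,\"%\",300,\"px\",0,\"center\",\"\",\"'
)

# One match with two groups: comparable to reading group 1 and group 2 of a single match.
both = re.search(r'\*\*(.*?)\*\*.*\"\*\*(.*?)\*\*', response)
first, second = both.group(1), both.group(2)

# One group, several matches: comparable to choosing Match No. 1, then Match No. 2.
all_matches = re.findall(r'\*\*(.*?)\*\*', response)

print(first, second)
print(all_matches)
```

Both routes end up with the same two values; the difference is only whether they arrive as two groups of one match or as two separate matches.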

Regular expression to get value with duplicate data

Hi, I am trying to extract a required string from a given string. The given string looks like this:
1|a1|id11-name11,x|a2|id21-name21,y|a3|id31-name31~id32-name32,y4|a4|id41-name41~id42-name42~id43-name43
Expected output:
a1~name11|a2~name21|a3~name31|a3~name32|a4~name41|a4~name42|a4~name43
Regular Expression:
(^|,)[^|]{0,}\|([^|]{0,})\|(~){0,}[^-]{0,}-([^,~]{0,})
Extracting $2~$4| or \2~\4|
Regular Expression output:
a1~name11|a2~name21|a3~name31|
Is it possible to get a3~name32 along with a3~name31 using a regular expression? Using multiple regular expressions is also fine. The number of values in the third part, after the pipe symbol, is not limited to the ones shown (id41-name41~id42-name42~id43-name43). It could be something like id41-name41~id42-name42~id43-name43~id43-name43~id43-name43~id43-name43...
You have two choices. The first is to split the string into its parts and pick out what you want.
The second depends on how many times the repeated part can occur. In your case the repeated part is idxx-namexx.
If the number of repetitions is limited to a reasonable value, you can repeat that part in your regex so that you capture every occurrence. For instance, for 2 repetitions you add the second part as follows:
([a-zA-Z]\d)\|(id\d+-(name\d+))(~?id\d+-(name\d+))?
______________|------ 1 ------||------- 2 --------|
The groups will be
\1~\3 and
\1~\5
You can check it in Regex101 Site
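Both choices can be sketched in a few lines of Python, assuming a scripting step is available at all. The function names by_splitting and by_fixed_repeat are invented for this example; the regex is the one shown above, unchanged.

```python
import re

record = ("1|a1|id11-name11,x|a2|id21-name21,y|a3|id31-name31~id32-name32,"
          "y4|a4|id41-name41~id42-name42~id43-name43")

def by_splitting(text):
    """Choice 1: split on ',' then '|' then '~', so any number of repeats is handled."""
    out = []
    for item in text.split(","):
        _, label, values = item.split("|")
        for value in values.split("~"):
            name = value.split("-", 1)[1]   # drop the idNN part, keep nameNN
            out.append(f"{label}~{name}")
    return "|".join(out)

PAIR_REGEX = re.compile(r"([a-zA-Z]\d)\|(id\d+-(name\d+))(~?id\d+-(name\d+))?")

def by_fixed_repeat(text):
    """Choice 2: the regex with the repeated part written out twice."""
    out = []
    for m in PAIR_REGEX.finditer(text):
        out.append(f"{m.group(1)}~{m.group(3)}")
        if m.group(5) is not None:          # the optional second repeat matched
            out.append(f"{m.group(1)}~{m.group(5)}")
    return "|".join(out)

print(by_splitting(record))
print(by_fixed_repeat(record))
```

With the repeated part written out only twice, the regex version stops at a4~name42 and never reports a4~name43, which is the limit described above: every extra repetition needs another copy of the optional group.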

regular expression multiple matches

For reference, this is the regex tester I am using:
http://www.rsyslog.com/regex/
How can I modify this regular expression:
[^;]+
to receive multiple sub-matches for the following test string:
;first;second;third;fourth;fifth and sixth;seventh;
I currently only receive one sub-match:
first
Basically I want each sub-match to consist of the content between ; characters, I am hoping for a sub-match list like this:
first
second
third
fourth
fifth and sixth
seventh
Following information given in the comments, I discovered that the reason I can't get more than one sub-match is that I need to specify the global modifier, and I can't figure out how to do that in the rsyslog regex tester I am using.
However, this did lead me to solve my problem in a slightly different manner. I came up with this regular expression which still only gives one match, but the number near the end acts as the index for the desired match, so for example:
(?:;([^;]+)){5}
matches this from my test string in the question:
fifth and sixth
While this solution allows me to achieve what I wanted - though in a different manner - the true answer to my question is found in HamZa's comments. More specifically:
How can I modify the regular expression to receive multiple sub-matches?
The answer is, you can't modify the regular expression itself in order to get multiple sub-matches. Setting the global modifier is required in order to do that.
Based on this information I have posted a new question on serverfault targeted specifically to the rsyslog regular expression system.
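As an illustration of the difference, here is a short Python sketch. Python is only a stand-in here (the rsyslog tester behaves differently), with re.findall playing the role of a global match; the variable names are made up.

```python
import re

test_string = ";first;second;third;fourth;fifth and sixth;seventh;"

# "Global" matching: the same pattern applied repeatedly yields every sub-match.
all_parts = re.findall(r"[^;]+", test_string)

# Single match with a repeat count: the group keeps only its last repetition,
# so {5} acts as an index for the fifth field.
fifth = re.search(r"(?:;([^;]+)){5}", test_string).group(1)

print(all_parts)
print(fifth)
```

The repeat-count trick works because a capture group inside a repeated group retains only the text from its final iteration.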

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence, because a repeated group does not work that way. It just gives you the last one.
If you think in Python, on the matcher object obtained after applying your regex, you will have access to matcher.group(1) and matcher.group(2), because you have two matching ( ) groups in the regex.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
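A minimal two-step sketch of that idea in Python, assuming the tag appears once and its content is plain text as in the question (the variable names are invented for the example):

```python
import re

xml = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"

# Step 1: isolate the text inside the tags.
inner = re.search(r"<aaa>(.*?)</aaa>", xml).group(1)

# Step 2: run the simple pattern over that text to get every key:value pair.
pairs = re.findall(r"\w+:\w+", inner)

print(pairs)
```

For anything beyond a single flat tag, an XML parser is the safer way to do step 1.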
I uploaded this one at parsemarket, and I'm not sure its what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific case is a program that accepts regex syntax to pull information from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using "starts with" and "ends with" syntax (the lookbehind and lookahead above) is important, since I know the field names but not the values of the fields.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
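string input = "1234567890";
string pattern = @"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
// Groups[0] is the whole match; the two captured pieces are Groups[1] and Groups[2].
string result = m.Groups[1].Value + m.Groups[2].Value;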
The same approach can be applied to many other languages as well.
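For example, a rough Python equivalent of the grouping approach might look like this. The function name concat_groups is made up, and the pattern is the same one as in the C# snippet.

```python
import re

def concat_groups(text):
    """Capture the two pieces with groups and join them in code."""
    match = re.search(r"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)", text)
    if match is None:
        return None
    # group(0) would be the whole match; the captured pieces are groups 1 and 2.
    return match.group(1) + match.group(2)

print(concat_groups("1234567890"))
```

The concatenation happens in the calling code, not in the pattern, which is the point of the approach: the regex only marks the pieces.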
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.