regular expression multiple matches - regex

For reference, this is the regex tester I am using:
http://www.rsyslog.com/regex/
How can I modify this regular expression:
[^;]+
to receive multiple sub-matches for the following test string:
;first;second;third;fourth;fifth and sixth;seventh;
I currently only receive one sub-match:
first
Basically, I want each sub-match to consist of the content between ; characters. I am hoping for a sub-match list like this:
first
second
third
fourth
fifth and sixth
seventh

Following the information given in the comments, I discovered that I can't get more than one sub-match because I need to specify the global modifier, and I can't figure out how to do that in the rsyslog regex tester I am using.
However, this did lead me to solve my problem in a slightly different manner. I came up with this regular expression which still only gives one match, but the number near the end acts as the index for the desired match, so for example:
(?:;([^;]+)){5}
matches this from my test string in the question:
fifth and sixth
While this solution allows me to achieve what I wanted - though in a different manner - the true answer to my question is found in HamZa's comments. More specifically:
How can I modify the regular expression to receive multiple
sub-matches?
The answer is, you can't modify the regular expression itself in order to get multiple sub-matches. Setting the global modifier is required in order to do that.
Based on this information I have posted a new question on serverfault targeted specifically to the rsyslog regular expression system.
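For comparison, here is a minimal Python sketch of the same two ideas. It only illustrates the general regex behaviour and says nothing about what the rsyslog tester supports: re.findall plays the role of the global modifier, and the repeated group reproduces the index trick. The variable names are mine.

```python
import re

text = ";first;second;third;fourth;fifth and sixth;seventh;"

# findall scans the whole string, which is what the global modifier does.
all_parts = re.findall(r"[^;]+", text)

# search stops at the first match, like a tester without the global modifier.
first_part = re.search(r"[^;]+", text).group(0)

# Index workaround: a group inside a counted repetition keeps only the
# capture from its last iteration, so {5} yields the fifth field.
fifth_part = re.search(r"(?:;([^;]+)){5}", text).group(1)

print(all_parts)
print(first_part)
print(fifth_part)
```

Changing the 5 to another number selects a different field, which is why the workaround behaves like an index.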

REGEX to match a pattern within a GET variable

I am looking for a pattern that matches if a special character was used within a GET variable, for example, in this case, if the character '<' was used by a user at some point in the variable 'x'. Currently, I've created the following regex pattern:
^example\.php\?(.*)x=(.*)<(.*)&
So, for example, the following strings would be matched:
example.php?x=abc<def&y=ghi
example.php?x=abcde<f&y=ghi
But with the pattern that I'm currently using, every other GET variable that might be using "<" is also being matched, for example:
example.php?x=abcdef&y=ghi<&z=klm
How should I overcome this so that it matches everything up to the first appearance of the '&' character (so that it only analyses the 'x' variable)?
I don't know if the question is understandable, if not, I'll try to provide further details.
It took me some time, but I finally discovered what I should do.
I couldn't parse the URL and get the value for x because the regex is going to run in a SQL query (as far as I'm aware, parsing a URL there is no easy task).
The answer would be something along the lines of:
example\.php\?(.*)x=[^&]*<
I hope this helps someone.
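To see how that pattern behaves, here is a small Python sketch (Python rather than SQL, purely for illustration). LOOSE is the pattern from the answer. STRICT is a hypothetical variant of my own that also requires x to be the first parameter or to follow an '&', since the original would also fire on a parameter whose name merely ends in x. The test URLs are made up.

```python
import re

# Pattern from the answer above.
LOOSE = re.compile(r"example\.php\?(.*)x=[^&]*<")

# Hypothetical stricter variant (an assumption, not from the answer):
# "x=" must come right after "?" or right after an "&".
STRICT = re.compile(r"example\.php\?(.*&)?x=[^&]*<")


def flags_x(pattern, url):
    """Return True if the pattern finds '<' inside the value of x."""
    return pattern.search(url) is not None


for url in [
    "example.php?x=abc<def&y=ghi",        # '<' inside x
    "example.php?x=abcdef&y=ghi<&z=klm",  # '<' only inside y
    "example.php?max=a<b&x=1",            # '<' inside max, not x
]:
    print(url, flags_x(LOOSE, url), flags_x(STRICT, url))
```

The character class [^&]* is what keeps the match inside a single variable: it cannot step over the '&' that starts the next one.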

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex, which could be quite a long operation, and then rewrite its internals to produce the output, as in:
^ca[tr]
or to run the original regex and then a second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried. You need a lookahead assertion rather than a lookbehind, and a parenthesis was misplaced. The right approach is to put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.
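Since the original regex arrives as an argument in Python, the lookahead approach can be wrapped in a small helper. This is only a sketch: prefilter is a name I made up, and it assumes the extra condition is a literal prefix (hence re.escape). It uses a non-capturing group instead of plain parentheses so the original pattern's group numbering is left alone.

```python
import re


def prefilter(original: str, prefix: str) -> re.Pattern:
    """Wrap an arbitrary pattern so it only matches text starting with prefix."""
    # (?=...) checks the prefix without consuming it; (?:...) keeps any
    # top-level | in the original pattern inside the wrapper.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")


words = ["cat", "car", "bat", "cafe", "bake", "cabins"]

three = prefilter("cat|car|bat", "ca")
four = prefilter("[a-z]{4}", "ca")

print([w for w in words if three.fullmatch(w)])
print([w for w in words if four.fullmatch(w)])
```

Because the lookahead consumes nothing, the wrapped [a-z]{4} still matches exactly four letters, not six.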

Regular expression to get value with duplicate data

Hi, I'm trying to extract a required string from a given string. The given string looks like this:
1|a1|id11-name11,x|a2|id21-name21,y|a3|id31-name31~id32-name32,y4|a4|id41-name41~id42-name42~id43-name43
Expected output:
a1~name11|a2~name21|a3~name31|a3~name32|a4~name41|a4~name42|a4~name43
Regular Expression:
(^|,)[^|]{0,}\|([^|]{0,})\|(~){0,}[^-]{0,}-([^,~]{0,})
Extracting $2~$4| or \2~\4|
Regular Expression output:
a1~name11|a2~name21|a3~name31|
Is it possible to get a3~name32 along with a3~name31 using a regular expression? Using multiple regular expressions is also fine. The third part after the pipe symbol is not limited to a fixed number of values (as in id41-name41~id42-name42~id43-name43). It could be longer, like id41-name41~id42-name42~id43-name43~id43-name43~id43-name43~id43-name43...
You have two choices. The first is to split the string into parts and pick out what you want.
The second depends on the longest repeated part. In your case it is idxx-namexx.
If the number of repetitions is limited to a reasonable value, you can repeat that part in your regex so you get all the parts. For instance, for 2 repetitions you need to add the second part as follows:
([a-zA-Z]\d)\|(id\d+-(name\d+))(~?id\d+-(name\d+))?
______________-------1-------- _---------2--------_________
The groups will be
\1~\3 and
\1~\5
You can check it on the Regex101 site.
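The first choice (splitting) copes with any number of repeated idxx-namexx parts. A minimal Python sketch, assuming every record has exactly three pipe-separated fields and every item contains a '-'; flatten is a hypothetical name:

```python
def flatten(records: str) -> str:
    """Turn 'n|key|id-name~id-name,...' into 'key~name|key~name|...'."""
    pairs = []
    for record in records.split(","):
        # Each record is prefix|key|items; the prefix is not needed.
        _, key, items = record.split("|")
        for item in items.split("~"):
            # Keep only the part after the first '-' (the name).
            _, name = item.split("-", 1)
            pairs.append(f"{key}~{name}")
    return "|".join(pairs)


source = (
    "1|a1|id11-name11,x|a2|id21-name21,"
    "y|a3|id31-name31~id32-name32,"
    "y4|a4|id41-name41~id42-name42~id43-name43"
)
print(flatten(source))
```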

Google Analytics Regular Expressions

Kinda new to regular expressions and, for the benefit of learning, wanted to know how to do the following on one line:
page matching regular expression: .pdf/$
and page containing "somestring"
and page excluding "someotherstring"
I can obtain my desired output using the 3 rules above. My question is can I put all into one line using regular expression? So the first line would be something like:
page matching reg exp: .pdf/$ somestring+ (then regex for does not contain in GA) someotherstring
Is it possible to put it all into one expression?
Lookahead will help you match multiple independent things in one expression, and it even lets you require that something does not match. In your case:
/^(?=.*somestring)(?!.*someotherstring).*\.pdf$/
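Here is how that expression behaves in Python, as a quick sanity check. This only demonstrates the lookahead logic; whether the Google Analytics regex field accepts lookaheads is not something I have verified. The paths are invented examples.

```python
import re

# Same expression as above, without the surrounding / delimiters.
PAGE = re.compile(r"^(?=.*somestring)(?!.*someotherstring).*\.pdf$")


def page_matches(path: str) -> bool:
    """True if the path contains somestring, lacks someotherstring, ends in .pdf."""
    return PAGE.search(path) is not None


for path in [
    "/docs/somestring/report.pdf",
    "/docs/somestring/someotherstring/report.pdf",
    "/docs/report.pdf",
    "/docs/somestring/report.html",
]:
    print(path, page_matches(path))
```

Both lookaheads are anchored at the start and consume nothing, so the three conditions are checked independently of the order in which the strings appear.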

Looking for a regex to match more than one reference string in TortoiseSVN

We used two different methods to reference external documents and Bugzilla bug numbers.
I'm now looking for a regular expression that matches these two kinds of reference strings for convenient display and linking in the TortoiseSVN 1.6.16 log screen. The first is a Bugzilla entry of the form [BZ#123]; the second is [some text and numbers], which should not be converted into a URL.
This can be matched with
\[BZ#\d+\]
and
\[.*?\]
My problem now is to concatenate those two match strings together. Usually this would be done by the regex (first|second), and I've done it this way:
(\[.*?\]|\[BZ#\d+\])
Unfortunately, in this case TortoiseSVN seems to catch it all as the bug number because of the round braces. Even if I add a second expression, which (according to the documentation) is meant to be used to extract the issue number itself, this second expression seems to be ignored:
(\[.*?\]|\[BZ#\d+\])
\[BZ#(\d+)\]
In this case TortoiseSVN displays the bug and document references correctly in the separate column, but uses them completely for the bugtracker url, which is of course not working:
https://mybugzillaserver/show_bug.cgi?id=[BZ#949]
BTW, Mercurial uses a better way by using {1}, {2}, ... as the placeholder in URLs.
Does anybody have an idea how to solve this problem?
EDIT
In short: We have used [BZ#123] as bug number references and [anytext] as references to other (partly non-electronic) documents. We would like to have both patterns listed in TortoiseSVN's extra column, but only the bug number from the first part should be used as %BUGID% in the URL string.
EDIT 2
Supposedly TortoiseSVN cannot handle nested regex groups (round braces), so this question doesn't have any satisfactory answer at the moment.
I'm not familiar with TortoiseSVN regex, but it looks like the problem is that the first piece of the regex (\[.*?\]) would always match, so you would never even get to evaluating the second part, \[BZ#(\d+)\]
Try this one instead:
((?<=\[BZ#)\d+(?=\])|\[.*?\])
Explanation:
( #Opening group.
(?<=\[BZ#) #Look behind for a bugzilla placeholder.
\d+ #Capture just the digits.
(?=\]) #Look ahead for the closing bracket (probably not necessary.)
| #Or, if that fails,
\[.*?\] #Find all other placeholders.
) #Closing the group.
Edit: I've just looked at TortoiseSVN docs. You could also try to keep the Message part expression the same, but change the Bug-ID expression to:
(?<=\[BZ#)(\d+)(?=\])
Edit: ?<= represents a zero-width lookbehind. See http://www.regular-expressions.info/lookaround.html. It is possible that TortoiseSVN doesn't support lookbehinds.
What happens if you just use (\d+) for your Bug-ID expression?
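To make the two-expression idea concrete, here is a rough Python emulation: one expression picks out every bracketed reference, and the Bug-ID expression is then applied to each of those pieces. This is only a sketch of the intended logic under that assumption, not a description of how TortoiseSVN actually evaluates the two expressions. The function names and the log message are made up.

```python
import re

# Message part expression: every bracketed reference.
MESSAGE_RE = re.compile(r"\[.*?\]")
# Bug-ID expression from the edit above: digits inside [BZ#...] only.
BUG_ID_RE = re.compile(r"(?<=\[BZ#)(\d+)(?=\])")


def references(log_message: str) -> list:
    """All bracketed references, for display in the extra column."""
    return MESSAGE_RE.findall(log_message)


def bug_ids(log_message: str) -> list:
    """Bare Bugzilla numbers, suitable for substituting into a URL."""
    ids = []
    for ref in references(log_message):
        ids.extend(BUG_ID_RE.findall(ref))
    return ids


msg = "Fix crash on save [BZ#949], as described in [Design doc 12]"
print(references(msg))
print(bug_ids(msg))
```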