Regex PCRE matching on a URL with multiple parameters and random values - regex

Sample GET request I want to match on with regex PCRE:
random.php?blue=value1&green=value2&red=value3&orange=value4&grey=value5&black=value6
Facts:
random.php - The filename is random, only the ".php?" is fixed
I have about 10 colors defined as parameters
No specific order to the colors - .php?blue=[a-zA-Z0-9]{1,20}
There can be as few as 2 colors as parameters, or all 10, and I want to match all such GET requests. Multiple parameters are joined with &
Values are always 1-20 alphanumeric characters - .php?blue=[a-zA-Z0-9]{1,20}
How would you approach this?

Perhaps something like:
[^\s/?]+\.php\?((?:blue|orange|red|black)=[a-zA-Z0-9]{1,20})(?:&(?1)){1,9}(?:$|#.*)
(complete with the colours you want)
(?1) is a reference to the first capture group subpattern.
I added support for an optional fragment part (#.*). Feel free to remove it if you don't need or want it.
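
For readers testing this outside PCRE: Python's built-in re module has no (?1) subroutine call, so the sketch below repeats the parameter subpattern with string formatting instead. The color list is an assumed subset and the function name matches is hypothetical.

```python
import re

# Assumed subset of the ~10 colors; extend as needed.
COLORS = ["blue", "green", "red", "orange", "grey", "black"]

# One "color=value" pair, value being 1-20 alphanumeric characters.
PARAM = r"(?:%s)=[a-zA-Z0-9]{1,20}" % "|".join(COLORS)

# First pair, then 1 to 9 more pairs joined by "&", then end or a fragment.
URL_RE = re.compile(r"[^\s/?]+\.php\?%s(?:&%s){1,9}(?:$|#.*)" % (PARAM, PARAM))


def matches(url):
    """Return True if the URL carries 2 to 10 known color parameters."""
    return URL_RE.search(url) is not None
```

As in the PCRE version, a request with a single parameter or an unknown parameter name is rejected.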


Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of Regex it is - they believe it's Python but I'm unsure.
Regex Image
Update
Unfortunately none of the patterns below have worked so far. I tested each one live (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
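
A minimal Python sketch of these two steps, assuming a replace operation is available (the function name find_codes is hypothetical):

```python
import re


def find_codes(text):
    """Strip the ?NN positioners, then collect every T + 7 digits code."""
    cleaned = re.sub(r"\?[0-9]{2}", "", text)  # step 1: drop positioners
    return re.findall(r"T[0-9]{7}", cleaned)   # step 2: match the codes
```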
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
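
The concatenation step could look like this in Python. This is a sketch that assumes at most one positioner per line; the function name is hypothetical.

```python
import re


def join_around_positioner(line):
    """Return the code with a single ?NN positioner cut out of it."""
    m = re.match(r"(T.*?)\?\d{2}(.*)", line)
    if m is None:
        return line  # no positioner found, nothing to join
    return m.group(1) + m.group(2)
```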
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (or \1).
Finally, replace # with ?21 or something you like.
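A Python script may look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')  # the positioner
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print each match with the # placeholder still in place
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print each match with the placeholder restored to ?21
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))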
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
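
In Python, this pattern and replacement could be applied as follows (a sketch; the function name strip_positioner is hypothetical):

```python
import re

POSITIONER_RE = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def strip_positioner(text):
    """Rewrite each code as group 1 + group 2, dropping the ?NN part."""
    return POSITIONER_RE.sub(r"\1\2", text)
```

Codes without a positioner also match; the regex backtracks so that group 2 takes the last digit and the replacement leaves the code unchanged.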
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is solved very easily by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
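
As one possible equivalent, here is the \b variant of the substitution written in Python (the function name strip_words is hypothetical):

```python
import re

# Each unwanted word is removed together with one adjacent hyphen.
UNWANTED_RE = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def strip_words(name):
    """Remove the words 'production' and 'public' from a hyphenated name."""
    return UNWANTED_RE.sub("", name)
```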
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
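
To see the repeated matches from code, here is a small Python sketch. It uses a plain non-capturing group (?:...) in place of the atomic group (?>...), because Python's re only accepts atomic groups from version 3.11; for this pattern the matches are the same. The function name is hypothetical.

```python
import re

PIECE_RE = re.compile(r"([\s\S]*?)(?:production|public)")


def pieces_before_keywords(text):
    """Return group 1 of every match: the text preceding each keyword."""
    return [m.group(1) for m in PIECE_RE.finditer(text)]
```

The pieces still carry their hyphens, and any text after the last keyword is not captured, so further cleanup is needed before joining them.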

regex to separate numbers from comma delimited list

I need a pure regex (no language) to separate the numbers of this input array:
L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900
Issues:
The first field (L1) is fixed. The array will always start with L1.
The other fields will always be 0 or positive numbers.
But I need to acquire each data separately, so it would be:
A regex for second data (number 3 in the example)
A regex for third data (number 5 in the example)
....
A regex for sixteenth data (number 900 in the example)
I tried this regex [^;,]* but it wasn't able to get each data separately.
Can anyone help me on this issue?
With 'pure regex' to get each field, you have to use separate capture groups:
^L(\d),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)$
(Note: In Python, Perl, Ruby, Java, etc you can have a global find and capture like /(\d+)/g but that is the language gathering up the matches into a list...)
If you want just one specific field, you can use numbered repetition.
^L(\d)(,(\d+)){N}
Capture group 3 would always hold field N+1 (counting L1 as the first field), so to capture 213, the 15th field, in your example:
^L(\d)(,(\d+)){14}
Trying to refine dawg's approach so that fewer capturing groups are used:
The fourth field can be matched by
^L1(?:,(\d+)){3}
The fifth field can be matched by
^L1(?:,(\d+)){4}
etc.
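
The same idea can be parameterised. The Python sketch below builds the pattern for an arbitrary field number, relying on the fact that a repeated capture group keeps only its last iteration. The function name nth_field is hypothetical, and fields are counted with L1 as field 1.

```python
import re


def nth_field(record, n):
    """Return field n (n >= 2) of an 'L1,...' record, or None if absent."""
    if n < 2:
        raise ValueError("field 1 is the fixed L1 prefix")
    # n - 1 repetitions: the group ends up holding the (n-1)th number after L1.
    m = re.match(r"^L1(?:,(\d+)){%d}" % (n - 1), record)
    return m.group(1) if m else None
```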

How to define this regex with alternation to have the same number of groups in the match?

I am trying to parse such strings
99_GOG_A_X1_FOO X-2014-09
99_YAK_A_YZ1_BAR YZY-2014-10
with this regex
99_\w{3}_(A|B)_((X)(0*[1-9][0-9]?)_(FOO|BAR) X-(\b0*20(1[4-9]|[2-9][0-9])\b)-\b0*([1-9]|1[0-2])\b|(YZ)(0*[1-9][0-9]?)_(FOO|BAR) YZY-(\b0*20(1[4-9]|[2-9][0-9])\b)-\b0*([4]|1[0])\b)
The first input can only have 1 to 12 at the end while the second input can only have 04 or 10.
This works, but I would like to have a solution which only returns matching groups.
With this solution I have these groups http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDwsSBlJlY2lwZRiwuuALDA/index.html
I have superfluous groups and the matching groups are not on the same indexes for both inputs.
Is there a way to get rid of the empty groups and to align the indexes?
Update:
I have to keep the following rules.
If the input matches _(X)(0*[1-9][0-9]?) it must also contain X- and allow only this range at the end: \b0*([1-9]|1[0-2])\b
If the input matches _(YZ)(0*[1-9][0-9]?) it must also contain YZY- and allow only these values at the end: \b0*([4]|1[0])\b
So I want to combine these regex:
^99_\w{3}_(A|B)_(X)(0*[1-9][0-9]?)_(FOO|BAR) X-(\b0*20(1[4-9]|[2-9][0-9])\b)-\b0*([1-9]|1[0-2])\b$
^99_\w{3}_(A|B)_(YZ)(0*[1-9][0-9]?)_(FOO|BAR) YZY-(\b0*20(1[4-9]|[2-9][0-9])\b)-\b0*([4]|1[0])\b$
From what I understand, this could work for you:
99_\w{3}_(A|B)_((X|(?>YZ))(0*[1-9][0-9]?)_(FOO|BAR) \3Y?-(\b0?20(1[4-9]|[2-9][0-9])\b)-\b0?((?:[1-9]|1[0-2])|(?:[4]|1[0]))\b$)
If this fails (which it might), you should consider using actual code to enforce the rules from your update: X must be followed by X- and a month from 1 to 12, while YZ must be followed by YZY- and a month of 04 or 10.
Use the capture groups to check if \3 == 'X' or if \3 == 'YZ' and then apply the rest of the checks as necessary.
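
One way to do that check in code is sketched below in Python. It is a simplified variant rather than the regex above: a single permissive pattern with the same seven groups for both inputs, followed by a per-prefix rule table. The names LINE_RE, RULES and parse are hypothetical.

```python
import re

# Same groups for both input kinds: letter, kind, number, word, tag, year, month.
LINE_RE = re.compile(
    r"^99_\w{3}_(A|B)_(X|YZ)(0*[1-9][0-9]?)_(FOO|BAR) "
    r"(X|YZY)-0*(20(?:1[4-9]|[2-9][0-9]))-0*([1-9]|1[0-2])$"
)

# kind -> (required tag, allowed months), taken from the rules in the question.
RULES = {"X": ("X", set(range(1, 13))), "YZ": ("YZY", {4, 10})}


def parse(line):
    """Return the 7 captured groups, or None if the line breaks a rule."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    kind, tag, month = m.group(2), m.group(5), int(m.group(7))
    expected_tag, allowed_months = RULES[kind]
    if tag != expected_tag or month not in allowed_months:
        return None
    return m.groups()
```

Because the alternation is moved out of the regex, there are no empty groups and the indexes line up for both inputs.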

Name Capture, need a list of numbers

I need to produce a named capture of numbers in a list
Example Source Data
This is a comment on line 1
Here is another Comment Line 2
Log ID 1234,5555,2342
Using (?<id>(\d+)*) I pick up these results:
1
2
1234
5555
2342
But this picks up 1 and 2 in error. I need it to pick up only the items after Log ID.
I am looking for a regular expression that will return
1234
5555
2342
In a named group called id
If your language supports variable length lookbehinds, you should be able to use the following:
(?<=Log ID.*)(?<id>\d+)
I also made some modifications to your original regex, because I don't really see the point of the additional capture group inside of the named capture group, or the nested repetition ((\d+)* is equivalent to (\d*), but I think you actually want \d+ so that it requires you to match at least one digit).
If you can't use variable length lookbehinds (most languages), then you will probably need to do this in two steps. First match any lines with 'Log ID' then look for numbers in those lines.
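
A sketch of that two-step approach in Python (the function name log_ids is hypothetical):

```python
import re


def log_ids(text):
    """Collect the numbers that follow 'Log ID' on any line of the text."""
    ids = []
    for line in text.splitlines():
        m = re.search(r"Log ID(.*)", line)  # step 1: keep only Log ID lines
        if m:
            ids.extend(re.findall(r"\d+", m.group(1)))  # step 2: the numbers
    return ids
```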
Would a negative look behind assertion do the trick?
(?<![Ll]ine )(?<id>\d+)
You can do this also without look(ahead|behind):
"Log\s+ID\s+((?<id>\d+),?)+"
This will give you each of the numbers in a separate group named id
Log\s+ID\s+: match the ID that you are after, but don't capture
(?<id>\d+),?: capture the number and allow an optional comma after it (but don't capture)
+: repeat at least once
However, this introduces a problem because you will have several groups with the same name - it depends on the language how this will be handled.
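
In Python's re module, for instance, a group that repeats keeps only its last match, so the named group ends up holding just the final number. The sketch below shows that behaviour next to one possible workaround, which is to run \d+ over the overall match. Python spells named groups (?P<id>...), and the function name is hypothetical.

```python
import re

LOG_RE = re.compile(r"Log\s+ID\s+((?P<id>\d+),?)+")


def last_and_all_ids(text):
    """Return (value left in the 'id' group, every number in the match)."""
    m = LOG_RE.search(text)
    if m is None:
        return None, []
    # group("id") holds only the last repetition; re-scan the whole match.
    return m.group("id"), re.findall(r"\d+", m.group(0))
```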
Alternatively you can use this regex to capture the whole string after Log ID into one group:
"Log\s*ID\s+(?<id>(?:\d+,?)+)"