I need to produce a named capture of the numbers in a list.
Example Source Data
This is a comment on line 1
Here is another Comment Line 2
Log ID 1234,5555,2342
using: (?<id>(\d+)*) I will pick up the results of
1
2
1234
5555
2342
But this picks up 1 and 2 in error. I need it to pick up only the items after Log ID.
I am looking for a regular expression that will return
1234
5555
2342
In a named group called id
If your language supports variable length lookbehinds, you should be able to use the following:
(?<=Log ID.*)(?<id>\d+)
I also made some modifications to your original regex, because I don't really see the point of the additional capture group inside of the named capture group, or the nested repetition ((\d+)* is equivalent to (\d*), but I think you actually want \d+ so that it requires you to match at least one digit).
If you can't use variable length lookbehinds (most languages), then you will probably need to do this in two steps. First match any lines with 'Log ID' then look for numbers in those lines.
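The two-step approach could be sketched in Python as follows. This is only an illustration: the sample text is the one from the question, Python's built-in `re` module spells named groups as `(?P<id>...)` rather than `(?<id>...)`, and the simple `in` check for the 'Log ID' marker is an assumption about how such lines are recognised.

```python
import re

text = """This is a comment on line 1
Here is another Comment Line 2
Log ID 1234,5555,2342"""

ids = []
for line in text.splitlines():
    # Step 1: keep only the lines that contain the 'Log ID' marker.
    if "Log ID" in line:
        # Step 2: collect every run of digits on that line via the named group.
        ids.extend(m.group("id") for m in re.finditer(r"(?P<id>\d+)", line))

print(ids)
```

The numbers on the two comment lines are never seen by the second regex, so no lookbehind is needed.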
Would a negative look behind assertion do the trick?
(?<![Ll]ine )(?<id>\d+)
You can do this also without look(ahead|behind):
"Log\s+ID\s+((?<id>\d+),?)+"
This will give you each of the numbers in a separate group named id
Log\s+ID\s+: match the ID that you are after, but don't capture
(?<id>\d+),?: capture the number and allow an optional comma after it (but don't capture)
+: repeat at least once
However, this introduces a problem because you will have several groups with the same name - it depends on the language how this will be handled.
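As one example of that language dependence, here is a small sketch using Python's built-in `re` module (named groups are written `(?P<id>...)` there). A repeated group in `re` keeps only its last repetition, so the earlier numbers are not available from the match object.

```python
import re

line = "Log ID 1234,5555,2342"

# Same pattern as above, with Python's (?P<name>...) syntax for the named group.
match = re.search(r"Log\s+ID\s+((?P<id>\d+),?)+", line)

# The group was matched three times, but only the final repetition is retained.
last_id = match.group("id")
print(last_id)
```

Engines that expose every repetition (for example through a captures collection) would behave differently here.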
Alternatively you can use this regex to capture the whole string after Log ID into one group:
"Log\s*ID\s+(?<id>(?:\d+,?)+)"
I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
However, this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
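In Python, the equivalent of the `s///g` substitution above might look like this. It is a minimal sketch using `re.sub` on the two sample inputs from the question.

```python
import re

# Each alternative removes the unwanted word together with one adjacent hyphen.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

samples = [
    "california-public-local-card-production",
    "production-nevada-public",
]

# re.sub replaces every non-overlapping match, like the /g flag.
cleaned = [UNWANTED.sub("", s) for s in samples]
print(cleaned)
```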
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
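Matching multiple times could be sketched in Python with `re.finditer`. The atomic group `(?>...)` is swapped here for a plain non-capturing group `(?:...)`, since not every Python version supports atomic groups; for this input the matches are the same.

```python
import re

s = "california-public-local-card-production"

# Lazily capture everything up to the next unwanted word.
pattern = re.compile(r"([\s\S]*?)(?:production|public)")

# One match per unwanted word; group 1 holds the text in front of it.
pieces = [m.group(1) for m in pattern.finditer(s)]
print(pieces)
```

Note that the captured pieces still carry their separating hyphens, so they would need trimming before being joined.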
Sample GET request I want to match on with regex PCRE:
random.php?blue=value1&green=value2&red=value3&orange=value4&grey=value5&black=value6
Facts:
random.php - The filename is random, only the ".php?" is fixed
I have about 10 colors defined as parameters
No specific order to the colors - .php?blue=[a-zA-Z0-9]{1,20}
Can be just 2 colors as parameters, or all the 10, but I want to match on all GET requests, multiple parameters are joined with \&
Values are always between 1-20 and with alphanumerical - .php?blue=[a-zA-Z0-9]{1,20}
How would you approach this?
Perhaps something like:
[^\s/?]+\.php\?((?:blue|orange|red|black)=[a-zA-Z0-9]{1,20})(?:&(?1)){1,9}(?:$|#.*)
(complete with the colours you want)
(?1) is a reference to the first capture group subpattern.
I added support for an optional trailing fragment part, #.*. Feel free to remove it if you don't need or want it.
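The pattern above relies on the PCRE subroutine call `(?1)`. For comparison only, here is a sketch of the same idea in Python, whose built-in `re` module has no `(?1)`: the parameter subpattern is built once as a string and reused. The colour list is an assumption taken from the sample request and should be extended to the full set.

```python
import re

# Assumed colour names (from the sample URL); complete with the real list.
colours = ["blue", "green", "red", "orange", "grey", "black"]

# One "colour=value" parameter, reused in place of PCRE's (?1).
param = r"(?:%s)=[a-zA-Z0-9]{1,20}" % "|".join(colours)

pattern = re.compile(
    r"[^\s/?]+\.php\?" + param + r"(?:&" + param + r"){1,9}(?:$|#.*)"
)

url = ("random.php?blue=value1&green=value2&red=value3"
       "&orange=value4&grey=value5&black=value6")

matches_sample = pattern.search(url) is not None
matches_single = pattern.search("random.php?blue=value1") is not None
print(matches_sample, matches_single)
```

Because of the `{1,9}` quantifier, a request with only one parameter does not match, in line with the "2 to 10 colours" requirement.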
EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex i'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex or should i go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens you will need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K and replace with -. I'm not sure if Sublime uses the multiline modifier by default; if it does, you might drop the (?m).
The answer might be different if you were to use a programming language that supports a callback function for regex replace operations, where you could do the space-to-hyphen replacement inside that function.
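As an illustration of the callback idea, here is a minimal Python sketch. It reuses the regex from the question, and the function name `add_slug` is made up for this example. `re.sub` accepts a function as the replacement, so the slug can be built with ordinary string methods.

```python
import re

# The regex from the question: number, NAME, optional description, price.
LINE = re.compile(r"(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):(\d+)")

def add_slug(m):
    """Rebuild the line, inserting a lowercase hyphenated slug as field two."""
    number, name, description, price = m.groups()
    slug = name.lower().replace(" ", "-") + "-" + number
    return f"{number} {name}:{slug}:{description}:{price}"

lines = [
    "1 PRODUCT description:120",
    "2 PRODUCT NAME TWO second description, maybe:80",
    "3 THIRD PROD:18",
]

converted = [LINE.sub(add_slug, line) for line in lines]
for row in converted:
    print(row)
```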
I need a pure regex (no language) to separate the numbers of this input array:
L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900
Issues:
The first field (L1) is fixed. The array will always start with L1.
The other fields will always be 0 or positive numbers.
But I need to acquire each data separately, so it would be:
A regex for second data (number 3 in the example)
A regex for third data (number 5 in the example)
....
A regex for sixteenth data (number 900 in the example)
I tried this regex [^;,]* but it wasn't able to get each data separately.
Can anyone help me on this issue?
With 'pure regex' to get each field, you have to use separate capture groups:
^L(\d),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)$
(Note: In Python, Perl, Ruby, Java, etc you can have a global find and capture like /(\d+)/g but that is the language gathering up the matches into a list...)
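For reference, the "global find" mentioned in the note could look like this in Python. It is a small sketch with `re.findall`, where the language rather than the regex collects the matches into a list.

```python
import re

data = "L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900"

# Every run of digits, in order; the first entry is the "1" from "L1".
fields = re.findall(r"\d+", data)

print(len(fields))
print(fields[14])  # the 15th field
```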
If you want just one specific field, you can use numbered repetition.
^L(\d)(,(\d+)){N}
Capture group 3 always holds field N+1 (counting L1 as the first field), so to capture 213, the 15th field, in your example:
^L(\d)(,(\d+)){14}
Trying to refine dawg's approach so that fewer capturing groups are used:
The fourth field can be matched by
^L1(?:,(\d+)){3}
The fifth field can be matched by
^L1(?:,(\d+)){4}
etc.
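The pattern family above can be generated for any field. The sketch below is an illustration in Python; the helper name `nth_field` is hypothetical. It relies on the fact that a repeated capture group keeps the value of its last repetition.

```python
import re

def nth_field(record, n):
    """Return field n (1-based, field 1 being the fixed 'L1'); needs n >= 2."""
    # Repeat the ",digits" group n-1 times; group 1 keeps the last repetition.
    m = re.match(r"^L1(?:,(\d+)){%d}" % (n - 1), record)
    return m.group(1) if m else None

data = "L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900"

print(nth_field(data, 4))
print(nth_field(data, 15))
print(nth_field(data, 17))  # no such field, so None
```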
Presently the regex is:
[A-Z]+(?=-\d+$)
This pulls out the correct value for most of the strings which follow the below format:
ANG-RGN-SOR-BCP-0004 i.e. BCP
However it pulls out SS for the following document instead of PMR:
ANG-B31-OPS-PMR-MACE-SS-0229
So basically I want to pull out the fourth term (between the hyphens), so it should pick BCP and PMR.
The following regex will get the 4th item in group 1:
(?:[A-Z0-9]+-){3}([A-Z0-9]+)
The first bit in (?:...) is a "non-capturing group" which acts like a group but won't appear in the backreference list.
The next bit means "3 of these non-capturing groups".
And finally, a capturing group to collect what you want.
I have assumed here that all the groups contain only uppercase letters and digits, you should modify the parts in [square brackets] to represent what these groups could be.
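A quick way to check the pattern against both document codes from the question is a short Python sketch like this one. It uses `re.match` so that the count of three groups starts at the beginning of the string.

```python
import re

# Skip three "TERM-" groups, then capture the fourth term.
FOURTH = re.compile(r"(?:[A-Z0-9]+-){3}([A-Z0-9]+)")

codes = ["ANG-RGN-SOR-BCP-0004", "ANG-B31-OPS-PMR-MACE-SS-0229"]

terms = [FOURTH.match(code).group(1) for code in codes]
print(terms)
```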
A more easily understandable method in Python:
a = "ANG-B31-OPS-PMR-MACE-SS-0229"
part = a.split('-')[3]
print(part)
This gives "PMR".
This should suit your needs:
(?:.+?-){3}([^-]+)
You'll be able to access the fourth term in the first capturing group.