Regex to separate numbers from a comma-delimited list

I need a pure regex (no language) to separate the numbers of this input array:
L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900
Issues:
The first field (L1) is fixed. The array will always start with L1.
The other fields will always be 0 or positive numbers.
But I need to get each value separately, so there would be:
A regex for the second value (number 3 in the example)
A regex for the third value (number 5 in the example)
....
A regex for the sixteenth value (number 900 in the example)
I tried the regex [^;,]* but it could not return each value separately.
Can anyone help me on this issue?

To get each field with 'pure regex', you have to use separate capture groups:
^L(\d),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)$
(Note: In Python, Perl, Ruby, Java, etc you can have a global find and capture like /(\d+)/g but that is the language gathering up the matches into a list...)
If you want just one specific field, you can use numbered repetition.
^L(\d)(,(\d+)){N}
Capture group 3 always holds field N+1 (counting the fixed L1 as field 1), so to capture 213, the 15th field in your example:
^L(\d)(,(\d+)){14}

Trying to refine dawg's approach so that fewer capturing groups are used:
The fourth field can be matched by
^L1(?:,(\d+)){3}
The fifth field can be matched by
^L1(?:,(\d+)){4}
etc.
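As a rough illustration of the numbered-repetition idea, here is a minimal Python sketch that builds the pattern for any field. The helper name nth_field and the 1-based field numbering (with L1 as field 1) are assumptions made for this example, not part of the answers above. It relies on the fact that a repeated capture group keeps the value of its last iteration.

```python
import re
from typing import Optional


def nth_field(line: str, n: int) -> Optional[str]:
    """Return field n (1-based; field 1 is the fixed 'L1'), or None if absent."""
    if n < 2:
        raise ValueError("fields after the fixed 'L1' start at n = 2")
    # Repeat ",(\d+)" n-1 times; group 1 keeps the last repetition's digits.
    pattern = rf"^L1(?:,(\d+)){{{n - 1}}}"
    match = re.match(pattern, line)
    return match.group(1) if match else None


line = "L1,3,5,0,5,80,40,31,0,0,0,0,512,412,213,900"
print(nth_field(line, 4))   # fourth field
print(nth_field(line, 15))  # fifteenth field
```

For n = 4 the generated pattern is ^L1(?:,(\d+)){3}, the same one shown above for the fourth field.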


RegEx works everywhere except in Pentaho RegEx Evaluation Step

I have a couple of RegEx that work on the online regex websites but not in Pentaho. Could you please help?
Here's the string:
:6585d0f0ba88767ac3b590f719596d864d73e9c1:
harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp
harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp
:8302994b565553c83a048b8905ae597349d99627:
emp/src/emp/PhasePairSingleParticleReynoldsNumber.h
emp/src/emp/TomiyamaDragCoefficientMethod.cpp
:9da194f17ec08bb20ad1be8df68b78ca137ab18a:
combustion/src/combustion/ReactingSpeciesTransportBasedModel.cpp
combustion/src/complexchemistry/TurbulentFlameClosure.cpp
:6a59f0be1e347a65e525e58742bb304639ea9bc4:
meshing/src/meshing/SurfaceMeshManipulation.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.h
physics/src/discretization/FvRepresentation.cpp
physics/src/discretization/FvRepresentation.h
:64b7f6d36b11b6cd94c20cad53463b7deef8c85a:
resourceclient/src/resourceclient/ResourcePool.cpp
resourceclient/src/resourceclient/ResourcePool.h
resourceclient/src/resourceclient/RestClient.cpp
resourceclient/src/resourceclient/RestClient.h
resourceclient/src/resourceclient/test/ResourcePoolTest.cpp
I would like to capture two groups. The first group should extract each commit SHA1 and the second group should extract the file names.
Below are the expressions I tried:
(?:^:([A-Za-z0-9]+):|(?!^)\G)\n+([A-Za-z/.-]+)
https://regex101.com/r/3IBkPz/1
^:(\w+):\s+((?:\s*(?!:)[^\s]+)+)
https://regex101.com/r/oIoDvM/1
Thoughts?
AFAIK (as of PDI-8.0), the Regex Evaluation step does NOT support the regex 'g' modifier, so your regex pattern must cover all of the text to be able to make a match.
For example: the following pattern will not match anything in Regex Evaluation step:
:([0-9a-f]+):\s+([^:]+)
but if I prepend .* to this pattern and pick "Enable dotall mode":
.*:([0-9a-f]+):\s+([^:]+)
it will match the last commit (sha1 + filenames). You can try moving .* to the end of the original pattern, which will get you the first commit. So if you want to retrieve the full list of commits (sha1 + filenames) with the g modifier, this step is probably not a solution for you.
As the fields are basically split by colons ':' and new lines, you can probably try the following approach:
Use the Split field to rows step with Delimiter=':' and include rownum in the output; this rownum can be used to filter rows, where even numbers are sha1 values and odd numbers are filenames
Use the Analytic Query step to create a new field with LEAD = 1, so that sha1 and filenames end up in the same row
Use the Calculator and Filter steps to calculate the remainder of rownum/2 and keep only rows with an odd rownum
Use Split field to rows again to split filenames into individual filename rows using "\n" (with "Delimiter is a Regular Expression" enabled). You might want to filter out EMPTY filenames, since the delimiter only supports one character
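For comparison, outside Pentaho the pattern :([0-9a-f]+):\s+([^:]+) does return every commit once it is applied globally. The following Python sketch is only an illustration of that behaviour. The function name parse_commits and the dictionary layout are assumptions, and it is not a Pentaho step.

```python
import re


def parse_commits(log: str) -> dict:
    """Map each commit SHA1 to the list of file names that follow it."""
    commits = {}
    # findall applies the pattern repeatedly, like the 'g' modifier would.
    for sha1, block in re.findall(r":([0-9a-f]+):\s+([^:]+)", log):
        # The second group holds all file names of one commit, one per line.
        commits[sha1] = block.split()
    return commits


sample = """:6585d0f0ba88767ac3b590f719596d864d73e9c1:
harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp
harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp
:8302994b565553c83a048b8905ae597349d99627:
emp/src/emp/PhasePairSingleParticleReynoldsNumber.h
emp/src/emp/TomiyamaDragCoefficientMethod.cpp
"""
for sha1, files in parse_commits(sample).items():
    print(sha1, files)
```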

Regex capture into group everything from string except part of string

I'm trying to create a regex that captures everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
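As one possible equivalent, the substitution with \b could be sketched in Python like this. The names UNWANTED and strip_words are made up for the example.

```python
import re

# Same alternation as the s///g expression, including the word boundaries.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def strip_words(name: str) -> str:
    """Remove 'production' and 'public' together with one adjacent hyphen."""
    return UNWANTED.sub("", name)


print(strip_words("california-public-local-card-production"))
print(strip_words("production-nevada-public"))
```

re.sub replaces every match by default, so no separate global flag is needed.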
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

Regex PCRE matching on an URL with multiple parameters random values

Sample GET request I want to match on with regex PCRE:
random.php?blue=value1&green=value2&red=value3&orange=value4&grey=value5&black=value6
Facts:
random.php - The filename is random, only the ".php?" is fixed
I have about 10 colors defined as parameters
No specific order to the colors - .php?blue=[a-zA-Z0-9]{1,20}
There can be just 2 colors as parameters, or all 10, and I want to match all such GET requests; multiple parameters are joined with &
Values are always 1-20 alphanumeric characters - .php?blue=[a-zA-Z0-9]{1,20}
How would you approach this?
Perhaps something like:
[^\s/?]+\.php\?((?:blue|orange|red|black)=[a-zA-Z0-9]{1,20})(?:&(?1)){1,9}(?:$|#.*)
(complete it with the colours you want)
(?1) is a reference to the first capture group subpattern.
I added support for an optional fragment (anchor) part, #.*. Feel free to remove it if you don't need or want it.
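Python's built-in re module has no (?1) subroutine call, so a rough Python equivalent has to interpolate the parameter sub-pattern twice instead. In this sketch the colour list is taken from the sample request, and the names COLOURS, PARAM, REQUEST and is_colour_request are assumptions for illustration only.

```python
import re

# Colour names from the sample request; extend this to the full list of ten.
COLOURS = ["blue", "green", "red", "orange", "grey", "black"]

# One "colour=value" pair, the value being 1-20 alphanumeric characters.
PARAM = rf"(?:{'|'.join(COLOURS)})=[a-zA-Z0-9]{{1,20}}"

# File name, ".php?", a first pair, 1-9 further pairs joined by "&",
# then either the end of the string or a "#..." fragment.
REQUEST = re.compile(rf"[^\s/?]+\.php\?{PARAM}(?:&{PARAM}){{1,9}}(?:$|#.*)")


def is_colour_request(url: str) -> bool:
    """True if the URL carries between 2 and 10 known colour parameters."""
    return REQUEST.search(url) is not None


print(is_colour_request(
    "random.php?blue=value1&green=value2&red=value3"
    "&orange=value4&grey=value5&black=value6"
))
```

Like the PCRE version, this does not reject a colour that appears twice in the same request.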

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex i'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex or should I go for a second pass?
(For now I'm using the Sublime Text editor.)
Thanks.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens, you will need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single literal space (the character being replaced) and replace it with -. I'm not sure whether Sublime uses the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports a callback function for regex replace operations, where you could do the space-to-hyphen replacement inside that function.
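To make the callback idea concrete, here is a minimal Python sketch that reuses the regex from the question and builds the slug inside the replacement function. The names LINE, add_slug and convert are hypothetical, and this is just one way such a callback could look.

```python
import re

# The pattern from the question: number, NAME, optional description, price.
LINE = re.compile(r"(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)")


def add_slug(match: re.Match) -> str:
    """Rebuild one product line with a lower-case, hyphenated slug."""
    number, name, description, price = match.groups()
    slug = name.lower().replace(" ", "-")
    return f"{number} {name}:{slug}-{number}:{description}:{price}"


def convert(text: str) -> str:
    # re.sub calls add_slug once per match and inserts its return value.
    return LINE.sub(add_slug, text)


print(convert("2 PRODUCT NAME TWO second description, maybe:80"))
```

Because the callback sees all four groups at once, the rearranging and the space-to-hyphen replacement happen in a single pass.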

Name Capture, need a list of numbers

I need to produce a named capture of the numbers in a list
Example Source Data
This is a comment on line 1
Here is another Comment Line 2
Log ID 1234,5555,2342
Using (?<id>(\d+)*) I pick up these results:
1
2
1234
5555
2342
But this picks up 1 and 2 in error. I need it to pick up only the items after Log ID.
I am looking for a regular expression that will return
1234
5555
2342
In a named group called id
If your language supports variable length lookbehinds, you should be able to use the following:
(?<=Log ID.*)(?<id>\d+)
I also made some modifications to your original regex, because I don't really see the point of the additional capture group inside of the named capture group, or the nested repetition ((\d+)* is equivalent to (\d*), but I think you actually want \d+ so that it requires you to match at least one digit).
If you can't use variable length lookbehinds (most languages), then you will probably need to do this in two steps. First match any lines with 'Log ID' then look for numbers in those lines.
Would a negative look behind assertion do the trick?
(?<![Ll]ine )(?<id>\d+)
You can do this also without look(ahead|behind):
"Log\s+ID\s+((?<id>\d+),?)+"
This will give you each of the numbers in a separate group named id
Log\s+ID\s+: match the ID that you are after, but don't capture
(?<id>\d+),?: capture the number and allow an optional comma after it (but don't capture)
+: repeat at least once
However, this introduces a problem because you will have several groups with the same name - it depends on the language how this will be handled.
Alternatively you can use this regex to capture the whole string after Log ID into one group:
"Log\s*ID\s+(?<id>(?:\d+,?)+)"
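As an illustration of the two-step approach, the last pattern can be combined with a second pass that pulls out the individual numbers. This Python sketch assumes Python's (?P<id>...) spelling for named groups, and the names LOG_LINE and log_ids are made up for the example.

```python
import re

# Same idea as the pattern above, written with Python's (?P<name>...) syntax.
LOG_LINE = re.compile(r"Log\s*ID\s+(?P<id>(?:\d+,?)+)")


def log_ids(text: str) -> list:
    """Collect the numbers that follow 'Log ID', ignoring all other lines."""
    ids = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            # Second step: split the captured "1234,5555,2342" into numbers.
            ids.extend(re.findall(r"\d+", match.group("id")))
    return ids


sample = """This is a comment on line 1
Here is another Comment Line 2
Log ID 1234,5555,2342"""
print(log_ids(sample))
```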