Repeating named capture groups - regex

I have a string with a field like this: id="ID-120-1, ID-141-5, ID-92-5, N/A"
I'd like to capture only the "ID"s to a named capture group (i.e. without the "N/A" or other items that might creep in). I thought this might work, but no luck:
\bid=\"(?<id>(ID-\d+-\d+)+)
Any ideas?

The expression you are using only returns one match because it requires id=" to appear in front of each ID value. The following adjustment should fix that.
(?:(?:=\")|(?:,\s))(?<id>(?:ID-\d+-\d+)*)
Another option would be to drop the id=" check altogether:
(?<id>(?:ID-\d+-\d+))
Or you could add a check for ", " or a closing quote after the ID to make sure you are inside the attribute:
(?<id>(?:ID-\d+-\d+))(?:(?:,\s)|(?:"))

You would need to capture commas and spaces also, as they are repeated in your string:
\bid=\"(?<id>(ID-\d+-\d+, )+)

I believe what you are trying to do is not possible with pure regex, especially if IDs and 'N/A' can be intermixed. You will need to have a loop in your program, or if you use Perl or PHP, you can run code in the replacement part of the regex (/e switch) to add the matches to an array.
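The loop suggested above might look like this in Python. This is only a sketch: the helper name extract_ids is made up for illustration, and note that Python spells named groups (?P<id>...) rather than (?<id>...).

```python
import re

# Hypothetical helper: first isolate the id="..." attribute, then loop over
# every ID-n-n token inside it, so "N/A" and other items are skipped.
ATTR = re.compile(r'\bid="([^"]*)"')
ID = re.compile(r'\b(?P<id>ID-\d+-\d+)\b')

def extract_ids(text):
    attr = ATTR.search(text)
    if attr is None:
        return []  # no id="..." attribute found
    return [m.group('id') for m in ID.finditer(attr.group(1))]

print(extract_ids('id="ID-120-1, ID-141-5, ID-92-5, N/A"'))
# ['ID-120-1', 'ID-141-5', 'ID-92-5']
```

Splitting the work into two patterns keeps each one simple, and IDs and other items can be intermixed in any order.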

Related

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
However, this common kind of task is easily solved by replacing the unwanted words with empty strings, like so:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
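As one possible "equivalent in a language you're using", here is a minimal Python sketch of the same substitution with re.sub. The names UNWANTED and strip_words are hypothetical.

```python
import re

# Same alternation as the s///g expression above, with the \b guards.
UNWANTED = re.compile(r'-production\b|\bproduction-|-public\b|\bpublic-')

def strip_words(name):
    # Replace every unwanted word (plus one adjoining hyphen) with nothing.
    return UNWANTED.sub('', name)

print(strip_words('california-public-local-card-production'))
# california-local-card
print(strip_words('production-nevada-public'))
# nevada
```

Because each alternative includes a hyphen, a string consisting only of "production" or "public" is left unchanged.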
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
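Matching multiple times and stitching the pieces together could be sketched in Python as follows. The function name keep_wanted is hypothetical, a plain non-capturing group stands in for the atomic group, and the hyphen cleanup at the end is an assumption about the desired output format.

```python
import re

# Lazily capture whatever precedes each unwanted word.
PATTERN = re.compile(r'([\s\S]*?)(?:production|public)')

def keep_wanted(name):
    pieces = []
    last_end = 0
    for m in PATTERN.finditer(name):
        pieces.append(m.group(1))
        last_end = m.end()
    pieces.append(name[last_end:])  # text after the final unwanted word
    # Assumption: pieces are hyphen-separated, so trim stray hyphens and rejoin.
    cleaned = [p.strip('-') for p in pieces]
    return '-'.join(p for p in cleaned if p)

print(keep_wanted('california-public-local-card-production'))
# california-local-card
```

Unlike the \b version of the substitution, this pattern also matches inside longer words such as 'subproduction'.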

RegExp: How to remove leading and trailing groups if present from a string

I am executing a regular expression against a long string, capturing portions of it.
One of these portions is between quotes and can have any number of subportions delimited by slashes, such as:
'george'
'paul/john'
'john/peter/charles'
...
the subportions are unknown and can be in any order.
I need to retrieve the string between the quotes, but also I would like to be able to remove unwanted leading and trailing groups while executing it.
For example, if the string starts with bruce or bongo, I want to remove it
'bruce/peter/marc' -> peter/marc
'bongo/bob/kevin/chris' -> bob/kevin/chris
However, if the string starts with anything else, then I want to keep it
'alfie/george/paul' -> alfie/george/paul
Only one word in the group can be present at a time; in the example above only bruce or bongo can be present at the beginning.
To do it I successfully used the following regular expression:
/'(?:bruce|bongo|)\/?([^']+)'/
In a similar way I want to remove a trailing group.
Let's say that if the string ends with sam or mark I want to remove this portion as well, for example:
'emily/grace/poppy/sam' -> emily/grace/poppy
'connor/barnaby/mark' -> connor/barnaby
Again, only one word of the group can be present at the end, in the example only sam or mark can end the string.
I thought to use the same as above and going with something similar to:
/'(?:bruce|bongo|)\/?([^']+)(?:sam|mark|)'/
But it's not working: bruce or bongo are removed if present, while sam or mark are always kept if present.
I know I can extract the match as it is and remove it with string manipulation methods. I am using javascript at the moment, and I can use:
"bruce/john/charles/sam".replace(/^(?:bruce|bongo)\//, '').replace(/\/(?:sam|mark)$/, '');
But I was wondering if there's a way to achieve the same result using directly the initial regular expression I execute against the long original string.
What am I missing?
You just have to make the middle part lazy, by adding a ? after the +:
'(?:bruce|bongo|)\/?([^']+?)(?:sam|mark|)'
And if you want the capture group to exclude the / that occurs before sam or mark, then:
'(?:bruce|bongo|)\/?([^']+?)(?:\/sam|\/mark|)'
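To see the lazy quantifier at work, here is a small Python sketch that runs the second pattern against the examples from the question. The question uses JavaScript, but the pattern is the same; the helper name inner is hypothetical.

```python
import re

# Lazy middle part, so the optional trailing /sam or /mark gets a chance to match.
QUOTED = re.compile(r"'(?:bruce|bongo|)/?([^']+?)(?:/sam|/mark|)'")

def inner(text):
    m = QUOTED.search(text)
    return m.group(1) if m else None

for s in ("'bruce/peter/marc'", "'emily/grace/poppy/sam'",
          "'alfie/george/paul'", "'bruce/john/charles/sam'"):
    print(s, '->', inner(s))
```

With a greedy [^']+ the capture group swallows everything up to the closing quote, so the empty alternative in (?:sam|mark|) always wins; making it lazy lets the engine try the trailing alternatives first.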

Regex Wrapping Quotes

I am trying to wrap quotes around a certain section of content in a CSV file. The current layout is something like this:
###element1,element2,element3,element4,element5,element6,element7,element8, "element9,
element9,""element9"",element9,
element9,element9,""element9",element10,
###
The ### symbols mark the start of a new record, and each record should have one. The problem is that I need to get all of element9 into one set of double quotes; however, there are multiple instances of double quotes within that area which break the element up into new fields, making my table expand beyond the fields I initially set. So I believe I need to remove all the " marks between the start and end of element9 and then reintroduce one set to enclose the whole section.
I approached this first by trying to select the 8th comma from the start and the 2nd comma from the end:
^((?:[^,]+,){8})(.+)((?:,[^,]*){2})$
and replacing with
$1"$2"$3
I tried to target the starting ### and ending ### to select those two elements but with no success.
Any suggestions on how I can do this?
UPDATE
###BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,BLAHBLAH,
BLAHBLAH,
BLAHBLAH,
BLAHBLAH, BLAHBLAH,
BLAHBLAH, BLAHBLAH,
BLAHBLAH,
"BLAHBLAH""",E,
###
The last field always seems to contain a capital letter. The fields before it vary in quotation placement, so to really target that whole section I need to work out how many commas along and how many back I need to go, remove the quotes, and then reinstate them in the correct positions.
###(?:[^,]*,){8}\K([\s\S]*?)(?=,[^,]*,[^,]*?###)
Try this. Replace by "\1" or "$1". See demo.
https://regex101.com/r/tD0dU9/13
/^(?:[^,]*,){8}([^#]*),[^,]*,[^,]*$/s
https://regex101.com/r/hU8yO6/1
I think the regexp you had is about right, except for needing the /s modifier.
For Notepad++, get the s modifier by ticking ". matches newline":
^(?:[^,]*,){8}([^#]*),[^,]*,[^,]*$
This looks like a good reference: http://docs.notepad-plus-plus.org/index.php/Regular_Expressions
You'll probably want to add parens appropriately to make capture groups also.
^#+[^"]+"([^#]+),[^,]+,[^,]+###\s*$
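For anyone scripting this instead of using an editor, the first regex above can be adapted to Python roughly as follows. This is a sketch under assumptions: Python's re module has no \K, so the first eight fields are captured and re-emitted instead; the function name requote is hypothetical; and stripping whitespace around the ninth field is my own choice, not something the question asks for.

```python
import re

# ###, eight comma-terminated fields, the ninth field (lazy), then the last
# two comma-separated pieces, which must be followed by the next ###.
RECORD = re.compile(r'###((?:[^,]*,){8})([\s\S]*?)(,[^,]*,[^,]*?)(?=###)')

def requote(text):
    def fix(m):
        head, ninth, tail = m.groups()
        # Drop every quote inside the ninth field, then wrap it in one pair.
        cleaned = ninth.replace('"', '').strip()
        return '###' + head + '"' + cleaned + '"' + tail
    return RECORD.sub(fix, text)

sample = '###a,b,c,d,e,f,g,h, "i1,\ni2,""i3"",i4,\n"i5",J,\n###'
print(requote(sample))
```

The lookahead leaves the closing ### unconsumed, so it can also serve as the opening ### of the next record.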

Regex: using alternatives

Let's say I would like to get all the 'href' values from HTML.
I could run a regex like this on the content:
a[\s]+href[\s]*=("|')(.)+("|')
which would match
a href="something"
OR
a href = 'something' // quotes, spaces ...
which is OK; but with ("|') I get too many groups captured which is something I do not want.
How does one use alternative in regex without capturing groups as well?
The question could also be stated like: how do I delimit alternatives to match? (start and stop). I used parenthesis since this is all that worked...
(I know that the given regex is not perfect or very good, I'm just trying to figure this alternating with two values thing since it is not perfectly clear to me)
Thanks for any tips
Use non-capturing groups, like this: (?:"|'), the key part being the ?: at the beginning. They act as a group but do not create a separate capture.
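A quick Python sketch of the difference: with the quote alternation made non-capturing, the value is the only group left, so findall returns just the href values. The pattern is a tidied variant of the one in the question (lazy .+? and optional whitespace after =), not a robust HTML parser, and it does not check that the opening and closing quotes are the same character.

```python
import re

# (?:"|') groups the alternatives without capturing; (.+?) is the only capture.
HREF = re.compile(r'''a\s+href\s*=\s*(?:"|')(.+?)(?:"|')''')

html = '''<a href="one.html">x</a> <a href = 'two.html'>y</a>'''
print(HREF.findall(html))
# ['one.html', 'two.html']
```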

Extract querystring value from url using regex

I need to pull a variable out of a URL or get an empty string if that variable is not present.
Pseudo code:
String foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda";
String bar = "http://abcdefg.hij.klmnop.com/a/b/c.file";
When I run my regex I want to get 123 in the first case and an empty string in the second.
I'm trying this as my pattern: .*?foo=(.*?)&?.*
and replacing it with $1, but that doesn't work when foo= isn't present.
I can't just do a match, it has to be a replace.
You can try this:
[^?]+(?:\?foo=([^&]+).*)?
If there are parameters and the first parameter is named "foo", its value will be captured in group #1. If there are no parameters the regex will still succeed, but I can't predict what will happen when you access the capturing group. Some possibilities:
it will contain an empty string
it will contain a null reference, which will be automatically converted to an empty string or to the word "null"
your app will throw an exception because group #1 didn't participate in the match.
This regex matches the sample strings you provided, but it won't work if there's a parameter list that doesn't include "foo", or if "foo" is not the first parameter. Those options can be accommodated too, assuming the capturing group thing works.
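What happens to the non-participating group depends on the engine. As one data point, here is a sketch in Python, where re.sub (since Python 3.5) substitutes an empty string for a group that did not take part in the match. The function name foo_by_replace is hypothetical.

```python
import re

# The regex from this answer, used as a pure replace.
PATTERN = re.compile(r'[^?]+(?:\?foo=([^&]+).*)?')

def foo_by_replace(url):
    # \1 expands to '' when group 1 did not participate (Python 3.5+).
    return PATTERN.sub(r'\1', url)

foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda"
bar = "http://abcdefg.hij.klmnop.com/a/b/c.file"
print(repr(foo_by_replace(foo)))  # '123'
print(repr(foo_by_replace(bar)))  # ''
```

As noted above, this only works when "foo" is the first parameter; other parameter orders give a wrong result.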
I think you need to do a match, then a replace. That way you can extract the value if it is present, and use "" if it is not. Something like this:
String bar;
// matches() must cover the whole string, hence the surrounding .*
if (foo.matches(".*[?&]foo=[^&]*.*")) {
    // keep only the captured value of foo
    bar = foo.replaceAll(".*[?&]foo=([^&]*).*", "$1");
} else {
    bar = "";
}
I haven't tested the regex, so I don't know if it will work.
In perl you could use this:
s/[^?*]*\??(foo=)?([\d]*).*/$2/
This will get everything up to the ? to start, and then isolate the foo, grab the numbers in a group and let the rest fall where they may.
There's an important rule when using regular expressions: don't try to put unnecessary processing into them. Sometimes things can't be done using only one regular expression. Sometimes it is more advisable to use the host programming language.
Marius' answer makes use of this rule: rather than finding a convoluted way of replacing-something-only-if-it-exists, it is better to use your programming language to check for the pattern's presence, and replace only if necessary.
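Taking "use the host programming language" one step further, a language's own URL tools can replace the regex entirely. The sketch below uses Python's standard urllib.parse rather than anything from the answers above; the function name query_value is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def query_value(url, name):
    # Parse the query string into a dict of lists, e.g. {'foo': ['123'], ...}.
    params = parse_qs(urlsplit(url).query)
    # Fall back to an empty string when the parameter is absent.
    return params.get(name, [''])[0]

foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda"
bar = "http://abcdefg.hij.klmnop.com/a/b/c.file"
print(repr(query_value(foo, 'foo')))  # '123'
print(repr(query_value(bar, 'foo')))  # ''
```

This does not satisfy the question's "it has to be a replace" constraint, but it handles parameter order and missing parameters without any pattern tuning.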