Exclude substring from capture group

I am using a system which takes a PCRE compatible regular expression.
The system stores capture group 1 into a database.
I need to capture two halves of a string with a delimiter, excluding the delimiter, as a single capture group.
Given the string: "I want to capture this bit but not this bit and definitely this bit"
I get that I could create a regex like:
([A-Za-z\s]*) but not this bit([A-Za-z\s]*)
This would give me two capture groups:
Group 1: "I want to capture this bit"
Group 2: " and definitely this bit"
However, I miss out on half my result, as group 1 is all that is stored.

You may be thinking about the branch reset feature. But this is only an assumption.
(?|([a-zA-Z\s]+) but not this bit|([a-zA-Z\s]+))
As stated in the comments, you can fix this using the correct syntax.
([A-Za-z\s]+) but not this bit([A-Za-z\s]+)

So it turned out I had to do this programmatically, rather than relying on a single regex. Turns out Casimir was correct that it wasn't possible to do this with a single capture group, even following hwnd's suggestion, as below:
branch-reset does not result in a combined capture group
Also, yes, I had the wrong slash :-P
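For readers who can post-process the match, the programmatic route mentioned above might look like the following Python sketch. It assumes the two-group pattern from the question and simply joins the groups afterwards; the function name `capture_without_delimiter` is made up for illustration and is not part of the original system.

```python
import re

# Two-group pattern from the question: text before and after the delimiter.
PATTERN = re.compile(r"([A-Za-z\s]*) but not this bit([A-Za-z\s]*)")

def capture_without_delimiter(text):
    """Return both halves joined together, or None if the delimiter is absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # A single capture group cannot skip the delimiter, so join in code.
    return match.group(1) + match.group(2)

sample = "I want to capture this bit but not this bit and definitely this bit"
print(capture_without_delimiter(sample))
```

The joined value is what would then be stored in place of group 1.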

Related

Ways to exclude a word in a regex (without lookahead?)

If I have the input:
hello cat
hellocat
hello gat
I would like to find a line that starts with the word "hello" and doesn't have the word "cat" after it.
Is it possible to negate a group, for example:
hello[^(\s?cat)]
Or are you only able to negate a set of characters in that position? If not, what are some ways to accomplish this? The only way that I've been able to do this is with a negative lookahead:
hello(?!\s?cat)
But I was wondering if there were alternative approaches to doing this.

There is also another way without lookarounds which I think is worth mentioning as an interesting concept: /hello(?:\scat)|(hello\s.*)/
In this case we first match what we don't want (but don't capture it) then we only capture the second part if first part failed, which means that in the capture you will always have something that does not contain cat.
You can check in this example https://regex101.com/r/bydCGb/3, in the match information box, the "group 1" capture - and also check the substitution part - we never have the cat part.
In your case, you can then say: if capturing group 1 is populated, then do something.
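A minimal Python sketch of this "match what you don't want, capture what you do" idea, run against the three sample lines from the question. The helper name `wanted_part` is only illustrative.

```python
import re

# The first alternative consumes the unwanted "hello cat" without capturing it;
# only the second alternative ever fills group 1.
PATTERN = re.compile(r"hello(?:\scat)|(hello\s.*)")

def wanted_part(line):
    """Return the group 1 capture, or None when the line is rejected."""
    match = PATTERN.match(line)
    return match.group(1) if match else None

for line in ["hello cat", "hellocat", "hello gat"]:
    print(repr(line), "->", wanted_part(line))
```

A line is accepted only when group 1 is not None, so a successful overall match is not enough on its own.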

I don't think it's possible easily without using the negative lookahead.
You can exclude specific characters using the [^abc] convention. However you'd have to explicitly exclude cat but then permit everything that is almost cat.
E.g.
((hello)ca[^t]|(hello)c[^a]|(hello)[^c])
Then read whichever inner captured group holds hello. Allowing an optional space after hello makes it a bit harder. The optional space can be handled with the following:
((hello)\sca[^t]|(hello)\sc[^a]|(hello)\s[^c]|(hello)ca[^t]|(hello)c[^a]|(hello)[^c ])
NB: It has all six alternatives, and the final one adds a space to its negated class so that "hello cat" cannot match through that last alternative.
Tested here: https://regex101.com/r/sgoHyJ/1
I guess you can see why they invented negative look-aheads...
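To make the comparison concrete, here is a small Python sketch that runs the six-alternative pattern and the negative lookahead from the question side by side. The helper `compare` is a hypothetical name used only here.

```python
import re

# Six alternatives that spell out "anything except (optional space +) cat".
SIX_WAY = re.compile(
    r"((hello)\sca[^t]|(hello)\sc[^a]|(hello)\s[^c]"
    r"|(hello)ca[^t]|(hello)c[^a]|(hello)[^c ])"
)
# The lookahead version from the question.
LOOKAHEAD = re.compile(r"hello(?!\s?cat)")

def compare(line):
    """Return (six_way_matches, lookahead_matches) for one line."""
    return (SIX_WAY.match(line) is not None, LOOKAHEAD.match(line) is not None)

for line in ["hello cat", "hellocat", "hello gat", "hello"]:
    print(repr(line), compare(line))
```

The two agree on the three sample lines but are not fully equivalent: every alternative needs at least one character after hello, so a bare "hello" is rejected by the alternation and accepted by the lookahead.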

You can't easily do this with pure regex without using a negative lookahead. However, if you are making these regex calls via an API in some programming language, you could phrase a match using the following positive pattern:
^hello\b.*
and the following negative pattern:
^hello cat\b
That is, a valid match is positive on the first pattern and negative on the second pattern. In Java, this proposed solution would look like this:
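public class HelloMatch {
    public static void main(String[] args) {
        String input = "hello gat";
        // String.matches() must match the whole input, hence the trailing .*
        if (input.matches("hello\\b.*") && !input.matches("hello cat\\b.*")) {
            System.out.println("MATCH");
        } else {
            System.out.println("NO MATCH");
        }
    }
}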

Excluding 3dots additional to other characters with regex in a string

I have the following HTTP URL detector regex:
(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s<"]*)
It works pretty well for the following url representation:
http://www.acer.com/clearfi/download/
What modification can I make to extract
http://schemas.microsoft.com/office/word/2003/wordml2450
from
Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()
?

You can modify it to capture:
- a group of http stuff
- followed by a (group of) subdomain stuff
- followed by as many groups as possible of:
  - one dot or slash
  - followed by a group of characters (non-dot, non-space, non-", non-<)
(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*
I made capturing groups to visualize the results
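A quick way to try this is the Python sketch below. It assumes the pattern is written with the colon after the scheme (as in the question's original regex) and uses a made-up helper name, `first_url`.

```python
import re

# Scheme, "://", first host label, then repeated (dot-or-slash + non-dot run).
# A trailing "..." cannot be consumed because each repeat needs a non-dot run.
URL = re.compile(r'(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*')

def first_url(text):
    """Return the first URL-like match in text, or None."""
    match = URL.search(text)
    return match.group(0) if match else None

noisy = 'Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()'
print(first_url(noisy))
print(first_url("http://www.acer.com/clearfi/download/"))
```

Note that the whole match (group 0) is used; the capturing group only holds the last repeated segment.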

I've changed your expression here and there: (.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*) and take only the second capturing group from it. See the example here: https://regex101.com/r/0viPC5/2
This expression probably can be simplified further but I don't know your exact input and search criteria so let's stick with what you already wrote.
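As a rough check of this three-group variant, the Python sketch below reads group 2 from the sample input. It relies on the trailing "..." being present, which is an assumption about the input rather than something the question guarantees.

```python
import re

# Group 1: leading junk, group 2: the URL, group 3: "..." and whatever follows.
WRAPPED = re.compile(r'(.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*)')

text = 'Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()'
match = WRAPPED.search(text)
url = match.group(2) if match else None
print(url)
```

Because the third group requires three literal dots, a clean URL with no trailing dots does not match this pattern at all.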

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!

The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
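In Python, for instance, the same pattern and replacement could be applied with re.sub. This is only a sketch; `format_phone` is an illustrative name.

```python
import re

def format_phone(number):
    """Reformat exactly ten digits as (123) 456-7890; leave other input unchanged."""
    # Three groups of 3, 3 and 4 digits, anchored to the whole string.
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", number)

print(format_phone("1234567890"))
```

Input that is not exactly ten digits does not match, so re.sub returns it as is.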

Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group.
Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups.
Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by providing objects for accessing repeated captures. (C# with the .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
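Python's standard re module is one implementation that does not expose the individual captures of a repeated group, so the sketch below only illustrates the limitation and one simple workaround: collect the digits with findall, then assemble the output in code.

```python
import re

number = "1234567890"

# A quantified group keeps only its final iteration in Python's re module.
last_digit = re.fullmatch(r"([0-9]){10}", number).group(1)

# Workaround: collect every digit separately, then build the result by slicing.
digits = re.findall(r"[0-9]", number)
formatted = "({}) {}-{}".format(
    "".join(digits[0:3]), "".join(digits[3:6]), "".join(digits[6:10])
)
print(last_digit, formatted)
```

For this particular task the three-group pattern is simpler; the point here is only that per-character access has to happen outside the replacement syntax.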

Capture group that captures an entire string minus a section that matches a pattern

I'm not sure if this is possible, but I figured I'd ask anyways. What I need to do is effectively create a search/replace, but without using the regex s/pattern1/pattern2/ syntax as it is not directly exposed to me.
Is it possible to create a capture group that would take an image path with the image size before the extension and remove the image size?
For instance convert http://example.com/path/to/image/filename-200x200.jpg to http://example.com/path/to/image/filename.jpg using only a capture group and no search/replace bits.
I'm asking as the software I'm working in does not currently have a search/replace functionality.

It's somewhat possible. There's no built-in capability for a match to be anything other than a contiguous segment of the source text, but you can work around that.
One approach you might consider is the use of non-capturing groups and concatenation. In regex, groups beginning with ?: aren't captured as matches.
For example, given the regex (A)(?:B)(C) and the string "ABC", the result would be:
1. "A"
2. "C"
In your case, then, you could capture around the part you want to ignore, then concatenate the parts you want.
Given the string you provided, http://example.com/path/to/image/filename-200x200.jpg, the regex (.+)(?:-200x200)(.+) returns:
1. "http://example.com/path/to/image/filename"
2. ".jpg"
You could then add the first and second capture groups to produce your intended result.
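The capture-and-concatenate idea might be sketched in Python as follows. Two things here are assumptions rather than part of the answer: the size is generalized from the literal -200x200 to -<digits>x<digits>, and the helper name `strip_image_size` is invented for the example.

```python
import re

# Assumption: the size suffix always looks like -<width>x<height>
# and sits directly before the file extension.
SIZE_SUFFIX = re.compile(r"(.+)(?:-\d+x\d+)(\.\w+)")

def strip_image_size(url):
    """Join the text before and after the size suffix; return url unchanged if absent."""
    match = SIZE_SUFFIX.fullmatch(url)
    if match is None:
        return url
    return match.group(1) + match.group(2)

print(strip_image_size("http://example.com/path/to/image/filename-200x200.jpg"))
```

Whether this helps depends on the host software allowing the two groups to be joined, which the question leaves open.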

Problem getting nested groups in Regex

Given the following text:
//[&][$][*]\n81723&8992%9892*2343%8734
I need to get:
1. &
2. $
3. *
4. 81723&8992%9892*2343%8734
The first line defines the delimiters that separate the numbers on the second line.
There is an undefined number of delimiters.
I made this regex:
//(?:\[([^\]]+)\])+\n(.+)
But only 2 groups are obtained. The first is the last delimiter and the second is the string containing the numbers. I tried but I couldn't get all the delimiters.
I'm not good at regex, but I think the first group is being overwritten on every iteration of (?:\[([^\]]+)\])+ and I can't solve this.
Any help?
Regards
Victor

That's not a nested group you're dealing with, it's a repeated group. And you're right: when a capturing group is controlled by a quantifier, it gets repopulated on every iteration, so the final value is whatever was captured the last time around.
What you're trying to do isn't possible in any regex flavor I'm familiar with.
Here's a fuller explanation: Repeating a Capturing Group vs. Capturing a Repeated Group
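One way around the limitation is to work in two passes: capture the whole run of bracketed delimiters as one group, then pull the individual delimiters out of that capture. The Python sketch below assumes the \n in the sample text is a real newline character.

```python
import re

# Assumption: "\n" in the question's sample is an actual newline.
text = "//[&][$][*]\n81723&8992%9892*2343%8734"

# The question's pattern: the repeated group keeps only its last iteration.
repeated = re.match(r"//(?:\[([^\]]+)\])+\n(.+)", text)
last_delimiter = repeated.group(1)

# Pass 1: capture the entire delimiter header and the number string.
header = re.match(r"//((?:\[[^\]]+\])+)\n(.+)", text)
# Pass 2: extract each bracketed delimiter from the header.
delimiters = re.findall(r"\[([^\]]+)\]", header.group(1))
numbers = header.group(2)

print(last_delimiter, delimiters, numbers)
```

This handles any number of delimiters, at the cost of a second regex call.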

The best thing I see that you could do (with regex) would be something like this:
(?:\[([^\]]+)\])?(?:\[([^\]]+)\])? #....etc....# \n(.+)

You can’t write something like (foo)+ and match against "foofoofoo" and expect to get three groups back. You only get one per open paren. That means you need more groups than you’ve written.

The following regex works for JavaScript:
(\[.+\])(\[.+\])(\[.+\])\\n(.*)
This assumes exactly three delimiters, and that your & $ * slots each have a value.
