Capture group that captures an entire string minus a section that matches a pattern - regex

I'm not sure if this is possible, but I figured I'd ask anyways. What I need to do is effectively create a search/replace, but without using the regex s/pattern1/pattern2/ syntax as it is not directly exposed to me.
Is it possible to create a capture group that takes an image path with the image size before the extension and removes the image size?
For instance convert http://example.com/path/to/image/filename-200x200.jpg to http://example.com/path/to/image/filename.jpg using only a capture group and no search/replace bits.
I'm asking as the software I'm working in does not currently have a search/replace functionality.

It's somewhat possible. There's no built-in capability for a match to be something other than a contiguous segment of the source text, but you can work around that.
One approach you might consider is the use of non-capturing groups and concatenation. In regex, groups beginning with ?: aren't captured as matches.
For example, given the regex (A)(?:B)(C) and the string "ABC", the result would be:
1. "A"
2. "C"
In your case, then, you could capture around the part you want to ignore, then concatenate the parts you want.
Given the string you provided, http://example.com/path/to/image/filename-200x200.jpg, the regex (.+)(?:-200x200)(.+) returns:
1. "http://example.com/path/to/image/filename"
2. ".jpg"
You could then add the first and second capture groups to produce your intended result.
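As a minimal sketch of that concatenation in Python: the function name strip_size is hypothetical, and generalizing the literal -200x200 to -\d+x\d+ (any width and height) is an assumption that goes beyond the pattern in the answer.

```python
import re

# Group 1: everything before the size suffix; group 2: the file extension.
# The size itself sits in a non-capturing group, so it is matched but not kept.
# Assumption: the size always looks like "-<digits>x<digits>" right before the extension.
SIZE_SUFFIX = re.compile(r"(.+)(?:-\d+x\d+)(\.[^./]+)$")

def strip_size(url):
    """Return url without the size suffix, or url unchanged if there is none."""
    m = SIZE_SUFFIX.match(url)
    if m is None:
        return url
    # "Add" the two captured pieces together to skip the ignored middle part.
    return m.group(1) + m.group(2)

print(strip_size("http://example.com/path/to/image/filename-200x200.jpg"))
```

If the host software can only return individual groups, the same join of group 1 and group 2 would have to happen wherever the groups are consumed.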

Related

Excluding 3dots additional to other characters with regex in a string

I have such an http-url detector regex:
(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s<"]*)
It works pretty well for the following url representation:
http://www.acer.com/clearfi/download/
What modification can I make to extract
http://schemas.microsoft.com/office/word/2003/wordml2450
from
Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()
?
You can modify it to capture:
a group of http stuff
followed by a (group of) subdomain stuff
followed by as many groups as possible of:
one dot or slash
followed by a group of characters (non-dot, non-space, non-", non-<)
(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*
I made capturing groups to visualize the results
I've changed your expression here and there to (.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*) and take only the second capturing group from it. See the example here: https://regex101.com/r/0viPC5/2
This expression probably can be simplified further but I don't know your exact input and search criteria so let's stick with what you already wrote.
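Both suggestions can be checked quickly in Python. This is only a sketch: the first pattern is written with the colon after the scheme (without it, http:// cannot match), and the variable names are made up for illustration.

```python
import re

text = 'Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()'

# First suggestion: repeated "dot-or-slash + run of non-dot characters" groups.
# The repetition stops at "..." because a dot must be followed by a non-dot.
repeated = re.compile(r'(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*')
m1 = repeated.search(text)
url_from_whole_match = m1.group(0) if m1 else None

# Second suggestion: capture prefix, URL, and the "..." tail; keep only group 2.
three_part = re.compile(r'(.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*)')
m2 = three_part.search(text)
url_from_group_2 = m2.group(2) if m2 else None

print(url_from_whole_match)
print(url_from_group_2)
```

Note that the second pattern only matches when the text really contains three dots after the URL, so it is tied to this particular input.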

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
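A minimal Python sketch of this replacement, using re.sub; the function name format_phone is hypothetical.

```python
import re

def format_phone(digits):
    """Reformat a 10-digit string as (XXX) XXX-XXXX; other input is returned unchanged."""
    # Three groups of 3, 3 and 4 digits, referenced as \1, \2 and \3 in the replacement.
    return re.sub(r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', digits)

print(format_phone("1234567890"))
```

Because the pattern is anchored with ^ and $, a string that is not exactly ten digits does not match and comes back as-is.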
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) resolve this limitation by facilitating regex matching with various objects for accessing repeated captures. (C# with the .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
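For comparison, Python's standard re module behaves like the basic case described here: a repeated group keeps only its last repetition. The sketch below shows that, plus one possible workaround that takes the digits from the whole match instead; the variable names are illustrative only.

```python
import re

m = re.fullmatch(r'([0-9]){10}', '1234567890')

# The repeated group is overwritten on every repetition, so only the last digit remains.
last_only = m.group(1)

# Workaround: take the whole match (group 0) and split it into characters.
digits = list(m.group(0))
formatted = '({}{}{}) {}{}{}-{}{}{}{}'.format(*digits)

print(last_only)
print(formatted)
```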

Exclude substring from capture group

I am using a system which takes a PCRE compatible regular expression.
The system stores capture group 1 into a database.
I need to capture two halves of a string with a delimiter, excluding the delimiter, as a single capture group.
Given the string: "I want to capture this bit but not this bit and definitely this bit"
I get that I could create a regex like:
([A-Za-z\s]*) but not this bit([A-Za-z\s]*)
This would give me two capture groups:
Group 1: "I want to capture this bit"
Group 2: " and definitely this bit"
However, I miss out on half my result, as group 1 is all that is stored.
You may be thinking about the branch reset feature. But this is only an assumption.
(?|([a-zA-Z\s]+) but not this bit|([a-zA-Z\s]+))
As stated in the comments, you can fix this using the correct syntax.
([A-Za-z\s]+) but not this bit([A-Za-z\s]+)
So it turned out I had to do this programmatically, rather than relying on a single regex. Turns out Casimir was correct that it wasn't possible to do this with a single capture group, even following hwnd's suggestion, as below:
branch-reset does not result in a combined capture group
Also, yes, I had the wrong slash :-P
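The programmatic route might look roughly like the following in Python. It is a sketch under the assumption that the two groups can be post-processed before being stored; the function name capture_without_delimiter is hypothetical.

```python
import re

# Two capture groups around the delimiter, exactly as in the corrected pattern above.
PATTERN = re.compile(r'([A-Za-z\s]+) but not this bit([A-Za-z\s]+)')

def capture_without_delimiter(text):
    """Join the text before and after the delimiter, or return None if it is absent."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # A single group cannot skip the delimiter, so the join happens in code.
    return m.group(1) + m.group(2)

sample = "I want to capture this bit but not this bit and definitely this bit"
print(capture_without_delimiter(sample))
```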

regex up to a list of strings (without capturing that last string)

I am trying to form a regular expression to match text between a start-word and the first of a list of stop-words. However, I do not want to include the stop-word in my match.
(The use case is replacing a section of a document, stopping before the keyword signifying the next section)
My regular expression is:
(StartWord)[\s\S]*?(StopWord1|StopWord2|$)
However, this match includes the stop-word. See the example here: http://regexr.com/38pb9
Any thoughts? Thank you!
If your regex engine supports look aheads, you could just use this:
((StartWord)[\s\S]*?(?=StopWord1|StopWord2|$))
The lookahead makes the match stop when a stop word or the end of the string is encountered, but the stop word is not actually included in the match.
If you also need to exclude the start word, you can use a look behind (again, assuming your regex engine supports it):
((?<=StartWord)[\s\S]*?(?=StopWord1|StopWord2|$))
But of course the simplest method may just be to use your existing pattern but use a group to extract only the parts that you need:
(StartWord)([\s\S]*?)(StopWord1|StopWord2|$)
Here, group 1 will contain the start word, group 2 will contain the body of the match, and group 3 will contain the stop word. In whatever language you're using, you can extract group 2 to get just the body.
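The three variants can be compared side by side in Python, whose re module supports lookaheads and fixed-width lookbehinds. The sample text and variable names below are made up for illustration.

```python
import re

text = "Intro StartWord first section body StopWord2 next section StopWord1 tail"

# Lookahead only: the match includes the start word but stops before the stop word.
with_start = re.search(r'(StartWord)[\s\S]*?(?=StopWord1|StopWord2|$)', text).group(0)

# Lookbehind + lookahead: neither the start word nor the stop word is in the match.
body_only = re.search(r'(?<=StartWord)[\s\S]*?(?=StopWord1|StopWord2|$)', text).group(0)

# Plain groups: everything is matched, and group 2 is picked out afterwards.
m = re.search(r'(StartWord)([\s\S]*?)(StopWord1|StopWord2|$)', text)
grouped_body = m.group(2)
stop_word = m.group(3)

print(repr(with_start))
print(repr(body_only))
print(repr(grouped_body), repr(stop_word))
```

For the replacement use case in the question, the lookahead variants have the advantage that the stop word is never consumed, so it does not have to be written back.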

What is the purpose of non-matching groups

I was reading the Groovy tutorial and they talk about how you can create non-matching groups by leading the group off with ?:. This way the group will not come up on the matcher. What I don't understand is why you would want to explicitly say do not match this group. Wouldn't it be simpler to just not put it into a group?
?: is used when you want to group part of a pattern but do not want to capture it. This is useful for brevity of code and is sometimes outright necessary. It avoids storing something that we don't need after matching, thus saving space.
Non-capturing groups are most often used in conjunction with the | operator.
The alternation operator has the lowest precedence of all regex
operators. That is, it tells the regex engine to match either
everything to the left of the vertical bar, or everything to the right
of the vertical bar. If you want to limit the reach of the
alternation, you need to use parentheses for grouping.
(http://www.regular-expressions.info/alternation.html).
In this case, you cannot just leave them without putting them in a group. You will need the alternation operator in many usual regexes such as email, url etc. Hope that helps.
/(?:http|ftp):\/\/([^\/\r\n]+)(\/[^\r\n]*)?/g is a sample URL regex in JavaScript which needs the alternation operator and needs grouping. Without grouping the match would be just http for all http urls.
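The effect of the grouping is easy to see by running the pattern with and without it. The sketch below ports the pattern to Python's re syntax; the sample URL is made up.

```python
import re

url = "http://example.com/path/page.html"

# Without grouping, | splits the whole pattern: either "http" alone, or "ftp://...".
ungrouped = re.search(r'http|ftp://([^/\r\n]+)(/[^\r\n]*)?', url)

# With a non-capturing group, the alternation only covers the scheme.
grouped = re.search(r'(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?', url)

print(ungrouped.group(0))                 # only the scheme is matched
print(grouped.group(1), grouped.group(2)) # host and path are groups 1 and 2
```

Because the scheme group is non-capturing, the host and path stay in groups 1 and 2 rather than shifting to 2 and 3.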
There are at least four reasons for using a non-capturing group:
1) Save Memory: When you match a capturing group, the group's content is stored independently in memory, whether you need it or not. That space in memory can add up quickly when you're using regex and storing the results on a large set of data. For instance, [0-9]+(, [0-9]+)* will match a series of integers separated by commas and spaces like 15, 13, 14. Let's assume you only need the whole matching string from the result (group 0). In this case, though, you'll really be storing "15, 13, 14" and ", 14", since the latter is in a captured group. You can save memory and time by using [0-9]+(?:, [0-9]+)* instead. It might not matter for such a simple and short example, but with more complicated regexes, those extra bits of memory usage add up fast. As a bonus, non-capturing groups are also faster to process.
2) Simpler Code: If you've got a regex like ([a-z]+)( \.)* ([a-z]+) ([a-z]+) and want to extract the three words, you'd need to use groups 1, 3, and 4. While that's not terribly difficult, imagine that you need to add another group between the latter two words, like ([a-z]+)( \.)* ([a-z]+)( \.)* ([a-z]+). The words would then be in groups 1, 3, and 5, and if you use these groups in several places later in your code, it may be hard to track them all down. Instead, you can write ([a-z]+)(?: \.)* ([a-z]+) ([a-z]+) at first, and then change it to ([a-z]+)(?: \.)* ([a-z]+)(?: \.)* ([a-z]+), both of which match the words to groups 1, 2, and 3 respectively.
3) External Dependencies: You might have a function or library which needs to receive a regex match with exactly n groups. This is an unusual instance, but making all the other groups non-capturing will satisfy the requirement.
4) Group Count Limits: Some regex implementations limit the overall number of capturing groups in a regex. While it's unusual to need that many groups (Python versions before 3.5 allowed 100, for instance), it is possible. You can use fewer capturing groups, and run up against this limit less often, by using non-capturing groups, which are not limited in that way. For instance:
((one|1), )((two|2), )…((nine_hundred_ninety_nine|999), )
where the … is all the in-between groups wouldn't match in some languages because it has too many capturing groups. But:
(?:(one|1), )(?:(two|2), )…(?:(nine_hundred_ninety_nine|999), )
would match and still return all the groups like one or 22.
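The first two reasons can be illustrated with a short Python sketch that uses the patterns from the list above; the sample strings and variable names are made up.

```python
import re

# Reason 1: what gets stored alongside the whole match.
numbers = "15, 13, 14"
capturing = re.fullmatch(r'[0-9]+(, [0-9]+)*', numbers)
non_capturing = re.fullmatch(r'[0-9]+(?:, [0-9]+)*', numbers)
print(capturing.groups())      # the last repetition of the group is kept
print(non_capturing.groups())  # nothing extra is kept

# Reason 2: group numbers stay stable when separators are non-capturing.
words = "one . two . three"
shifted = re.fullmatch(r'([a-z]+)( \.)* ([a-z]+)( \.)* ([a-z]+)', words)
stable = re.fullmatch(r'([a-z]+)(?: \.)* ([a-z]+)(?: \.)* ([a-z]+)', words)
print(shifted.groups())  # words end up in groups 1, 3 and 5
print(stable.groups())   # words end up in groups 1, 2 and 3
```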