Regex capture into group everything from string except part of string - regex

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.

A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
However, this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
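A minimal Python sketch of the same substitution, assuming Python's re module is available (the function name strip_env_words is made up for illustration):

```python
import re

# Same alternation as the s///g expression above, including the \b guards.
UNWANTED = re.compile(r'-production\b|\bproduction-|-public\b|\bpublic-')

def strip_env_words(name):
    """Return name with the 'production' and 'public' parts removed."""
    return UNWANTED.sub('', name)

print(strip_env_words('california-public-local-card-production'))  # california-local-card
print(strip_env_words('production-nevada-public'))                 # nevada
```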

Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match repeatedly and join the group 1 values to build the result.
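In Python, matching repeatedly could be sketched as below. This assumes re.finditer, and it swaps the atomic group (?>...) for a plain non-capturing group (?:...), because Python's re module only accepts atomic groups from version 3.11 on. The helper name prefixes is hypothetical.

```python
import re

# Lazily capture everything up to the next unwanted keyword.
PREFIX = re.compile(r'([\s\S]*?)(?:production|public)')

def prefixes(text):
    """Return the group 1 value of every successive match."""
    return [m.group(1) for m in PREFIX.finditer(text)]

parts = prefixes('california-public-local-card-production')
print(parts)  # ['california-', '-local-card-']
```

Note that the pieces still carry the hyphens that surrounded the removed words, and any text after the last keyword is not captured at all, so some cleanup is still needed after joining them.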

Related

Regex Conditional Matching in One Capture Group

I have a string that may come in the form:
"filename.first_order.png"
"filename.second_order.png"
"filename.png"
"filename.(jpg|tif|etc)"
I need to match the first part of the string containing the name, and the extension - however, if the string is a first/second order type, I need to match "first_order"/"second_order" as the second group, and not "png", and I can't get those two conditions to co-exist in one capture group. Example matches:
imageondisk.first_order.png -> [imageondisk, first_order]
anotherfile.png -> [anotherfile, png]
meetingminutes.jpeg -> [meetingminutes, jpeg]
I feel like I've used all sorts of combinations of lookaheads, lookbehinds, and ?s, which must look like a desperate, uneducated mess, but whatever I do, I can never get a result where they don't conflict when I join them together - which would look something like
(.+)\.(tif|jpg|<png when not preceded by first/second_order>|<first/second_order, ignoring the .png on the end>)
Except I just went down a frustrating rabbit hole of non-capture groups and lookarounds that seemed to end in the same place, and I feel like I knew less regex than before.
Help would be hugely appreciated.
You could use this regex, which captures the filename in group 1, any first/second order string in group 2 and the extension in group 3:
^([^.]+)(?:\.(.+))?\.(png|jpg|tif)$
Demo on regex101
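As a rough Python sketch of how the three groups could be folded into the [name, tag] pairs from the question: the helper name name_and_tag is hypothetical, and jpeg has been added to the extension list as an assumption, since the pattern as written would not match meetingminutes.jpeg.

```python
import re

# Group 1: file name, group 2: optional middle part, group 3: extension.
# 'jpeg' is an added assumption; the pattern above lists only png, jpg, tif.
PATTERN = re.compile(r'^([^.]+)(?:\.(.+))?\.(png|jpg|jpeg|tif)$')

def name_and_tag(filename):
    """Return [name, order] if a middle part exists, else [name, extension]."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    name, order, ext = m.groups()
    return [name, order if order is not None else ext]

print(name_and_tag('imageondisk.first_order.png'))  # ['imageondisk', 'first_order']
print(name_and_tag('anotherfile.png'))              # ['anotherfile', 'png']
```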

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Update
Unfortunately, none of the suggestions below have worked so far. I tested each one live (rather than only in a regex tester, in case it behaves differently there), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
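If the tool does turn out to be Python, the two operations might be sketched like this (the function name find_ids is made up for illustration):

```python
import re

def find_ids(text):
    """Strip the ?NN positioners, then collect every T + 7 digits."""
    cleaned = re.sub(r'\?[0-9]{2}', '', text)
    return re.findall(r'T[0-9]{7}', cleaned)

print(find_ids('T123?214567\nT?211234567'))  # ['T1234567', 'T1234567']
```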
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
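A small Python sketch of the concatenation step, assuming one positioner per line at most; the helper name join_around_positioner is hypothetical, and lines without a positioner are returned unchanged:

```python
import re

# Group 1: text before the positioner, group 2: text after it.
SPLIT_ID = re.compile(r'(T.*?)\?\d{2}(.*)')

def join_around_positioner(line):
    """Glue together the two captured halves around a single ?NN."""
    m = SPLIT_ID.search(line)
    if m is None:
        return line  # no positioner found
    return m.group(1) + m.group(2)

print(join_around_positioner('T123?214567'))  # T1234567
print(join_around_positioner('T?211234567'))  # T1234567
```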
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or something you like.
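A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
# Pattern for the positioner, and pattern for an ID with positioners masked as '#'.
# The trailing \s means the final line, which has no newline after it, is not matched.
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print the IDs with each positioner masked as '#'.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print them again with '#' restored to '?21'.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))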
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
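A minimal sketch of that replacement, assuming Python's re.sub is usable (the function name drop_positioner is made up for illustration):

```python
import re

# Group 1: T plus leading digits, optional ?NN positioner, group 2: trailing digits.
ID_WITH_POSITIONER = re.compile(r'\b(T\d*)(?:\?\d{2})?(\d+)\b')

def drop_positioner(text):
    """Rewrite each ID as group 1 followed by group 2."""
    return ID_WITH_POSITIONER.sub(r'\1\2', text)

print(drop_positioner('T123?214567'))  # T1234567
print(drop_positioner('T?211234567'))  # T1234567
print(drop_positioner('T0000001'))     # T0000001
```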
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so it ignores words that contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
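To see the effect of \b in isolation, here is a small Python sketch that tests only the keyword part of the pattern. The sample strings are illustrative and has_street_word is a hypothetical helper:

```python
import re

# Only the keyword alternation, wrapped in word boundaries.
STREET_WORD = re.compile(r'\b(Street|St|Drive|Dr)\b')

def has_street_word(line):
    """True if the line contains one of the keywords as a whole word."""
    return STREET_WORD.search(line) is not None

print(has_street_word('Tax Invoice/Statement'))  # False: 'St' is part of 'Statement'
print(has_street_word('12 Main St'))             # True
```

A standalone token such as DR in "163.66 DR" is still a whole word, so \b alone will not exclude it when matching case-insensitively.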
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so it ignores words that contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.

Get last characters up to specific character

Let's say I have a string something-123.
I need to get the last 5 (or fewer) characters of it, but only up to - if there is one in the string, so the result would be thing. If the string has no - in it, like something123, then the result would be ng123, and if the string is like 123 then the result would be 123.
I know how to match the last 5 characters:
/.{5}$/
I know how to match everything up to the first -:
/[^-]*/
But I can not figure out how to combine them, and to make things worse I need to get the match without extracting it from specific groups and similar advanced regex stuff because I want to use it in SQL Anywhere, please help.
Thank you all for the help, but it looks like a complete regex solution is going to be too complicated for my problem, so I did it very simply: SELECT right(regexp_substr('something-123', '[^-]*'), 4).
One option is to group the result:
(.{4})-
Now you have captured the result but without the -.
Or using lookarounds you can:
.{4}(?=-)
which matches any 4 characters that appears before "-".
You can use:
.{5}(?=(?:-[^-]*)?$)
See the regex demo
We match 5 symbols other than a newline only before the last - in the string or at the very end of the string ((?=(?:-[^-]*)?$)). You only need to collect the matches; there is no need to check groups/submatches.
UPDATE
To match any 1 to 5 characters other than a hyphen before the first hyphen (if present in the string), you can use
([^-]{1,5})(?:(?:-[^-]*)*)?$
See demo. Here an optional non-capturing group consumes any sequences of - followed by non-hyphen characters that come after the expected substring, up to the end of the string.
A faster alternative:
^[^-]*?([^-]{1,5})(?:-|$)
This regex lazily skips leading non-hyphen characters and then captures the last 1 to 5 non-hyphen characters that precede the first - or the end of the string.
Note that here, the value we need is in Group 1.
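Outside SQL Anywhere, the behaviour of this pattern can be checked with a short Python sketch (the function name tail_before_hyphen is made up for illustration):

```python
import re

# Lazy prefix, then up to 5 non-hyphen characters right before '-' or the end.
TAIL = re.compile(r'^[^-]*?([^-]{1,5})(?:-|$)')

def tail_before_hyphen(text):
    """Return group 1, or None if the pattern does not match."""
    m = TAIL.match(text)
    return m.group(1) if m else None

print(tail_before_hyphen('something-123'))  # thing
print(tail_before_hyphen('something123'))   # ng123
print(tail_before_hyphen('123'))            # 123
```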
How about:
(.{5})(?:-[^-]+)?$
The result is in group 1
Try this regex:
(.{1,5})(?:-.*|$)
Group 1 has the result you need
demo

How to exclude a certain word in regex?

I'm using this expression and it's perfect for what I need:
.*(cq|conquest).*
It returns any word/phrase/sentence/etc. with the letters 'cq' or the word 'conquest' in it. However, from those matches I want to exclude all that contain the term 'conquest power'.
Examples:
some conquest here (should match)
another cq with some conquest here (should match)
too much cq or conquest power is bad (should not match)
How can I do that to the regex above? It has to be only one regex otherwise the program that I'm using (Advanced Combat Tracker) will create two different tabs.
If you want to match any string which contains "conquest" or "cq", but not if the string contains "conquest power", then the regex is
^(?!.*conquest power).*?(?:cq|conquest).*
The above will attempt to match from the start of the string to the end of the line. If you want to match from the start of each line, switch on multiline mode if available; adding (?m) to the start of the regex may do that.
If you want to match across newlines change . to [\s\S], or switch on singleline mode if available.
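The three sample lines from the question can be checked against this pattern with a short Python sketch. Advanced Combat Tracker's own regex engine may differ, so treat this only as a check of the logic; the helper name wanted is hypothetical:

```python
import re

# Reject lines containing 'conquest power', then require 'cq' or 'conquest'.
PATTERN = re.compile(r'^(?!.*conquest power).*?(?:cq|conquest).*')

def wanted(line):
    """True if the line should be matched."""
    return PATTERN.match(line) is not None

print(wanted('some conquest here'))                    # True
print(wanted('another cq with some conquest here'))    # True
print(wanted('too much cq or conquest power is bad'))  # False
```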
You have confused people by stating "I want to match 'cq' or 'conquest'" but also "I want the regex to extract that line".
I assume you don't really want to match just "cq" or "conquest", you want to match strings/lines (?) containing "cq" or "conquest".
From your original question I got that you want to match all strings which contain "cq" or "conquest" but do not contain "power". For this case the following regexp works:
^([^p]|p(?!ower))*(cq|conquest)([^p]|p(?!ower))*$
(regexpal)