Regex Conditional Matching in One Capture Group - regex

I have a string that may come in the form:
"filename.first_order.png"
"filename.second_order.png"
"filename.png"
"filename.(jpg|tif|etc)"
I need to match the first part of the string containing the name, and the extension - however, if the string is a first/second order type, I need to match "first_order"/"second_order" as the second group, and not "png", and I can't get those two conditions to co-exist in one capture group. Example matches:
imageondisk.first_order.png -> [imageondisk, first_order]
anotherfile.png -> [anotherfile, png]
meetingminutes.jpeg -> [meetingminutes, jpeg]
I feel like I've used all sorts of combinations of lookaheads, lookbehinds, ?s which must look like a desperate uneducated mess, but whatever I do, I can never get a result where they don't conflict when I join them together - which would look something like
(.+)\.(tif|jpg|<png when not preceded by first/second_order>|<first/second_order, ignoring the .png on the end>)
Except I just went down a frustrating rabbit hole of non-capture groups and lookarounds that seemed to end in the same place, and I feel like I knew less regex than before.
Help would be hugely appreciated.

You could use this regex, which captures the filename in group 1, any first/second order string in group 2 and the extension in group 3:
^([^.]+)(?:\.(.+))?\.(png|jpg|tif)$
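A minimal Python sketch of how this pattern could be used to get the two values the question asks for. The helper name split_name is hypothetical, and adding jpeg to the extension list is an assumption made here so that the third example from the question matches; otherwise the pattern is the one above.

```python
import re

# Pattern from the answer; "jpeg" is added here as an assumption.
PATTERN = re.compile(r"^([^.]+)(?:\.(.+))?\.(png|jpg|jpeg|tif)$")


def split_name(filename):
    """Return (name, order_or_extension), or None if the pattern does not match."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    # Group 2 is only set when there is a middle part such as "first_order";
    # otherwise fall back to the extension captured in group 3.
    return m.group(1), m.group(2) or m.group(3)


for name in ["imageondisk.first_order.png", "anotherfile.png", "meetingminutes.jpeg"]:
    print(name, "->", split_name(name))
```

Picking between group 2 and group 3 in code sidesteps the need to force both cases into a single capture group.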

Related

RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challenge
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
Result;
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation | to capture either the pattern starting with a word boundary in group 1 or match 1+ word chars followed by optional whitespace chars.
What you capture in group 1 can be used as the replacement. The matches will not be in the replacement.
Using !? matches an optional exclamation mark. You could prepend that to the pattern, but it is not part of the example data.
\b(APP\d{1,5})\w*|\w+\s*
In the replacement use capture group 1, mostly using $1 or \1
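As a rough illustration of that replacement, here is a sketch in Python. The application in the question only offers a regex replace, so Python is just a stand-in here; it assumes Python 3.5 or later, where an unmatched group in the replacement is substituted as an empty string. The final strip() is an addition to drop the trailing space left after the last ID.

```python
import re

text = "APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP"

# Group 1 captures an ID. The second alternative consumes any other word
# (plus trailing whitespace) and leaves group 1 unset, so the replacement
# inserts nothing for it.
pattern = re.compile(r"\b(APP\d{1,5})\w*|\w+\s*")

result = pattern.sub(r"\1", text).strip()
print(result)
```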

regex to match all subfolders of a URL, except a few special ones

OK, I'm writing a regex that I want to match a certain URL path and all subfolders underneath it, but with a few excluded. For context, this is for use inside Verizon Edgecast, which is a CDN caching system. It supports regex, but unfortunately I don't know which 'flavor' of regex it supports, and the documentation isn't clear about that either. It seems to support all the core regex features though, and that should be all I need. Unfortunately reading the documentation requires an account, but you can get the general idea of Edgecast here: https://www.verizondigitalmedia.com/platform/edgecast-cdn/
so, here is some sample data:
help
help/good
help/better
help/great
help/bad
help/bad/worse
and here is the regex I am using right now:
(^help$|help\/[^bad].*)
link: https://regex101.com/r/CBWUDE/1
broken down:
( - start capture group
^ - start of string
help - 1st thing that should match
$ - end of string
| - or
help - another thing that should match
\/ - escaped / so i can match help/
[^bad] - match any single character that isn't b, a, or d
. - any character
* - any number of times
) - end capture group
I would like the first 4 to match, but not the last 2: help/bad and help/bad/worse should not be matches, and help/anythingelse should be a match.
This regex is working for me, except that help/better is not a match. The reason it's not a match, I'm pretty sure, is that 'better' contains a character that appears inside 'bad'. If I change 'better' to 'getter' then it becomes a match, because it no longer has a b in it.
So what I really want is for my 'bad' to only match the whole word bad, and not match anything with b, a, or d in it. I tried using a word boundary to do this, but it isn't giving me the results I need; perhaps I just have the syntax wrong. This is what I tried:
(^help$|help\/[^\bbad\b].*)
but does not seem to work, the 'bad' urls are no longer excluded, and help/better is still not matching with that. I think it's because / is not a word boundary. I'm positive my problem with the original regex is with the part:
[^bad] - match any single character that isn't b, a, or d
my question is, how can i turn [^bad] into something that matches anything that doesn't contain the full string 'bad'?
You're going to want to use negative look ahead (?!bad) instead of negating specific letters [^bad]
I think (^help$|help\/(?!bad).*) is what you're looking for
Edit: if you mean anything with the word bad at all, not just help/bad you can make it (?!.*bad.*) This would prevent you from matching help/matbadtom for example. Full regex: (^help$|help\/(?!.*bad.*).*)
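A quick way to check the lookahead version against the sample data, sketched in Python. Whether Edgecast's regex engine supports lookaheads is an assumption that would need testing there. The second pattern, named strict, is not from the answer: it is a hypothetical variant that only excludes a path segment that is exactly bad, so that something like help/badge would still match.

```python
import re

paths = ["help", "help/good", "help/better", "help/great", "help/bad", "help/bad/worse"]

# Pattern from the answer: reject anything where "bad" directly follows "help/".
pattern = re.compile(r"(^help$|help\/(?!bad).*)")
matched = [p for p in paths if pattern.search(p)]
print(matched)

# Hypothetical variant (an assumption, not from the answer): only reject when
# the segment is exactly "bad", i.e. "bad" followed by "/" or end of string.
strict = re.compile(r"(^help$|help\/(?!bad(?:\/|$)).*)")
print(bool(pattern.search("help/badge")), bool(strict.search("help/badge")))
```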

Regex capture into group everything from string except part of string

I'm trying to create a regex, which will capture everything from a string, except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But, this popular kind of task is being solved very easily by replacing unnecessary words by empty strings. Like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
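For instance, the \b version of that substitution could look like this in Python (a sketch; the function name strip_words is hypothetical):

```python
import re

# Same alternation as the s///g form above, with \b so that words such as
# "subproduction" or "publication" are left alone.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def strip_words(s):
    """Remove the unwanted words together with one adjacent hyphen."""
    return UNWANTED.sub("", s)


for sample in ["california-public-local-card-production", "production-nevada-public"]:
    print(sample, "->", strip_words(sample))
```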
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
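A sketch of what matching multiple times could look like in Python. One assumption: the atomic group (?>...) is swapped for a plain non-capturing group (?:...), because Python versions before 3.11 do not support atomic groups; for this input the matches are the same.

```python
import re

text = "california-public-local-card-production"

# (?:...) stands in for the atomic group (?>...) from the answer (assumption).
pattern = re.compile(r"([\s\S]*?)(?:production|public)")

# Collect the span of each full match and the text kept in group 1.
matches = [(m.span(), m.group(1)) for m in pattern.finditer(text)]
for span, kept in matches:
    print(span, repr(kept))
```

The group 1 pieces still carry their hyphens, so they need further cleanup before being joined.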

Regex: using alternatives

Let's say I would like to get all the 'href' values from HTML.
I could run a regex like this on the content:
a[\s]+href[\s]*=("|')(.)+("|')
which would match
a href="something"
OR
a href = 'something' // quotes, spaces ...
which is OK; but with ("|') I get too many groups captured which is something I do not want.
How does one use alternative in regex without capturing groups as well?
The question could also be stated like: how do I delimit alternatives to match? (start and stop). I used parenthesis since this is all that worked...
(I know that the given regex is not perfect or very good, I'm just trying to figure this alternating with two values thing since it is not perfectly clear to me)
Thanks for any tips
Use non-capture groups, like this: (?:"|'), the key part being the ?: at the beginning. They act as a group but do not result in a separate match.
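A small Python sketch of the difference. The pattern is a loosened version of the one in the question, not a robust HTML parser: the \s* after the = sign and the lazy .+? are assumptions added so that both sample strings match, and the helper name extract_href is hypothetical.

```python
import re

# The quotes are grouped with (?:...), so the only capturing group is the value.
HREF = re.compile(r"""a\s+href\s*=\s*(?:"|')(.+?)(?:"|')""")


def extract_href(snippet):
    """Return the tuple of captured groups, or None if nothing matches."""
    m = HREF.search(snippet)
    return m.groups() if m else None


print(extract_href('a href="something"'))
print(extract_href("a href = 'something'"))
```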

Regex practicing groups

Hello again Stack Overflow. As I mentioned in my last post, I am trying to get better at regular expressions. I am going through my book's chapters tonight and decided to see if I could, if it is even possible, create multiple groups. I am fully aware regex is not the answer to everything; this is purely for me to learn. I am using VB.NET.
Example input(s):
MyTokenName{%#example1%, %#example2%}
MyTokenName{example1, example2}
Now this is a completely made-up output of my own to test against. The consistent factors of this expression are Name{ }. There will always be a name consisting of only a-z first, followed by curly brackets. The MAIN delimiter that separates the two groups is , and before each group there will be an OPTIONAL %# and after it an OPTIONAL %.
So to summarize i only want to match groups defined between the curly brackets of only a-z unlimited times.
MyTokenName{%#example%, %#example%} ----- Would match Two groups example1 and example2
MyTokenName{example, example} --- Would match Two groups example1 and example2
My attempt that's not working.
(?<=[a-zA-Z]+\{[^a-zA-Z#]+?)[a-zA-Z, ]+(?=%?})
Any advice would be amazing. Thanks guys for such a great forum. Please remember i am only trying to practice regex. I can do this with other .Net methods.
An interesting way could maybe be this one:
/(?i)(?<=\{|\G|\{%#|\G%#)([a-z0-9]+)(?:%?\s*(?:,\s*|\}))/g
http://regex101.com/r/bU0zY5
By interesting I mean the usage of lookbehind with \G ;) and it should match all your examples
This variable-length lookbehind is expensive performance-wise and of no real value in this case, when all you want to do is capture what you're interested in.
This might work.
[a-zA-Z]+ { \s*(?:%#)? ([a-z]+) %?\s* , \s*(?:%#)? ([a-z]+) %?\s* }
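To try that spaced-out pattern, here is a sketch in Python rather than VB.NET, using re.VERBOSE so the spaces are ignored. Two adjustments are assumptions made for verbose mode: # is escaped because it would otherwise start a comment, and the braces are escaped to keep them literal. The inputs are the a-z-only examples from the question, since [a-z]+ does not accept digits.

```python
import re

TOKEN = re.compile(
    r"""
    [a-zA-Z]+ \{              # token name followed by the opening brace
    \s* (?:%\#)? ([a-z]+) %?  # first value, with optional %# ... % wrapper
    \s* , \s*                 # the main delimiter
    (?:%\#)? ([a-z]+) %?      # second value, same optional wrapper
    \s* \}                    # closing brace
    """,
    re.VERBOSE,
)

for sample in ["MyTokenName{%#example%, %#example%}", "MyTokenName{example, example}"]:
    m = TOKEN.search(sample)
    print(sample, "->", m.groups() if m else None)
```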
Does the pattern (\w+) serve your purpose here?
It'll match MyTokenName, example1, and example2 in both sample cases.
If you always wanted to ignore MyTokenName you could just refer to any matches other than the first match in the list.
Like:
Dim txt As String = "MyTokenName{%#example1%, %#example2%}"
Dim matches = System.Text.RegularExpressions.Regex.Matches(txt, "(\w+)")
For i As Integer = 1 To matches.Count - 1
    DoSomethingWith(matches(i).Value) ' start at 1 so we skip over MyTokenName
Next
Something like that.