Match then exclude without lookbehinds - regex

In Rust with the Regex crate, I've been trying to wrap my head around a regex expression to capture and extract things between square brackets [] yet exclude the brackets from the capture. Given:
// template[tags(foo,bar,baz)]
# template[replace_all(foo:bar)]
I'd like:
tags(foo,bar,baz)
replace_all(foo:bar)
I can easily get the [] capture group, but I don't understand how to capture while excluding characters after the match. I've been manually replacing these, but it seems gross to me. I would love to be able to do it all in one expression.
Update: I am aware that I can get these in multiple capture groups, but I'm really curious whether there's a way to capture only the single one, hence the exclusion.
Looking over the docs, I'm just not picking up a way this can be done. There are a lot of great examples using lookaheads and lookbehinds, but those don't appear to be a part of the Rust regex crate. Am I missing something obvious here? Thanks for the help.
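For what it's worth, a single capture group is the usual way to get the inner text without lookarounds: the brackets are matched but sit outside the group. Below is a minimal sketch in Python, used here only for illustration; the Rust regex crate exposes the same group through `caps.get(1)` or `&caps[1]`. The helper name `extract_bracketed` is hypothetical, and the pattern assumes the bracketed text never contains a `]`.

```python
import re

# Group 1 holds the text between the brackets; the brackets themselves
# are part of the overall match but not part of the group.
BRACKETED = re.compile(r"\[([^\]]*)\]")

def extract_bracketed(line):
    """Return the text inside the first [...] on the line, or None."""
    m = BRACKETED.search(line)
    return m.group(1) if m else None

lines = [
    "// template[tags(foo,bar,baz)]",
    "# template[replace_all(foo:bar)]",
]
results = [extract_bracketed(line) for line in lines]
```

Reading group 1 instead of the whole match avoids any manual stripping of the brackets afterwards.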

Related

Replace a Tag Name while keeping the rest as it is

I want to preface by saying I am a novice at regex, and I've spent a considerable amount of time trying to solve this myself using tutorials, online docs, etc. I have also gone through the suggested answers here.
Now here is my problem: I have 267 lines like this, and each county is different.
<SimpleData name="NAME">Angelina</SimpleData>
What I need to do is to replace NAME with COUNTY and keep the rest the same including the proper county name:
<SimpleData name="COUNTY">Angelina</SimpleData>
I used the following Find to find all the lines that I wanted to change, and was successful.
<SimpleData name="NAME">[\S\s\n]*?</SimpleData>
It's probably not the best way to do this, but it worked.
I hope I've explained this so it can be understood. Thanks, Paul
You need to use capturing groups with backreferences in the replacement field:
Find What: (<SimpleData name=")NAME(">[\S\s\n]*?</SimpleData>)
Replace With: $1COUNTY$2
See the regex demo
As per regular-expressions.info:
Besides grouping part of a regular expression together, parentheses also create a numbered capturing group. It stores the part of the string matched by the part of the regular expression inside the parentheses.
If your regular expression has named or numbered capturing groups, then you can reinsert the text matched by any of those capturing groups in the replacement text. Your replacement text can reference as many groups as you like, and can even reference the same group more than once. This makes it possible to rearrange the text matched by a regular expression in many different ways.
Note that, in VSCode, you can't use named groups.
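The same find-and-replace can be sketched in code. The example below uses Python's `re.sub`, where `\g<1>` and `\g<2>` play the role of `$1` and `$2` in the editor's replacement field; the function name `rename_attribute` is hypothetical.

```python
import re

# Two capturing groups surround the literal NAME so they can be reinserted.
PATTERN = re.compile(r'(<SimpleData name=")NAME(">[\s\S]*?</SimpleData>)')

def rename_attribute(text):
    """Replace name="NAME" with name="COUNTY", keeping everything else."""
    # \g<1> and \g<2> reinsert the text captured before and after NAME.
    return PATTERN.sub(r"\g<1>COUNTY\g<2>", text)

line = '<SimpleData name="NAME">Angelina</SimpleData>'
renamed = rename_attribute(line)
```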
You really don't want to be using regex for this job. Learn XSLT.
Any attempt to do this using regular expressions will either match things it shouldn't, or will fail to match things that it should. That's not because you're lacking regex skills, it's because of the computer science theory: XML's grammar is defined recursively, and regular expressions can't handle recursively-defined grammars.

Regex for inverse group match

I am trying to create a regex which returns the opposite of the matched groups.
Probably an example will explain it better.
My regex is:
/(\{[\w\s-\\\/_=*%\'\"]+\})/gui
The input text is:
{1}2{3}4{5}6{7}
It currently matches {1}, {3}, {5}, {7}, but I need to have 2, 4, 6.
How can I negate it, please? I've tried fiddling around with negative look-aheads but couldn't achieve what I wanted.
Edit: Unfortunately I can't use functions under my current circumstances and I would really like to solve this with a one-step regex, but I'm not sure if it's possible.
I think this should work
/[^{](\d*)[^}]/g
The captured matches are what you're after. See http://refiddle.com/refiddles/56d84a4875622d5b7a3c3400
Update:
/(?!\{)(\d*)(?!\})/g
This won't capture the braces
https://regex101.com/r/oT9bY4/1
Update by OP:
It seems that this question doesn't have a definite one step answer because it is not possible to achieve simply with a regex. For more information see the comments to this answer.
Since I can't comment yet, I'll answer this way.
You were looking for a {, then the group, then a }, but what you hope to find is }, then the group, then {. So swap the brackets around and keep the group inside, like this:
\}([\w\s-\\\/_=*%\'\"])+\{ (see a test of this regex at https://regex101.com/r/lQ9hC5/2).
From your question, I gather that you need to capture
any text that precedes your pattern,
or
any text that follows your pattern,
where your pattern is
(\{[\w\s-\\\/_=*%\'\"]+\})
So I derived this regex from the above conditions:
\{[\w\s-\\\/_=*%\'\"]+\}(.+?)|(.+?)\{[\w\s-\\\/_=*%\'\"]+\}
Here's a DEMO.
Note that my regex captures all text outside your pattern, but you still need to reassemble it depending on which of the two capture groups matched.
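Where a split function is available (the question says it may not be in the asker's environment), the "inverse" of the matches falls out of splitting on the pattern. This is a hedged sketch in Python; the pattern is a simplified stand-in for the one in the question (an assumption), and `text_between_groups` is a hypothetical helper name.

```python
import re

# Simplified stand-in for the question's pattern (an assumption):
# a brace group is "{", one or more non-brace characters, then "}".
BRACE_GROUP = re.compile(r"\{[^{}]+\}")

def text_between_groups(text):
    """Return the non-empty pieces of text that lie outside brace groups."""
    # split() yields empty strings at the edges when the text starts or
    # ends with a brace group, so those are filtered out.
    return [piece for piece in BRACE_GROUP.split(text) if piece]

pieces = text_between_groups("{1}2{3}4{5}6{7}")
```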

Overcomplicating regular expression

I have the following regular expression, ^(?:\/foo\/)([A-Za-z0-9-]{0,})|^(?:\/foo), which needs to match /foo, /foo/, and /foo/abc-123 but not /foobar. This works (I've tested it), but I'm sure there is a simpler way using something like a lookbehind or lookahead.
How can I simplify it, or do I even need to? Maybe it's just me being overly paranoid about the ugliness of it. Dropping the non-capturing groups gives ^\/foo\/([A-Za-z0-9-]{0,})|^\/foo, which still doesn't look right.
Note that the goal is to capture abc-123 if present, but not to capture the / or the empty string.
You can use this simpler regex for the same purpose:
^\/foo(?:\/([A-Za-z0-9-]*))?$
RegEx Demo
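A quick way to check the simplified pattern against the four paths from the question is sketched below in Python (the helper name `capture` is hypothetical). It reports whether each path matches and what group 1 holds.

```python
import re

# Optional non-capturing group: a slash followed by the captured segment.
FOO = re.compile(r"^/foo(?:/([A-Za-z0-9-]*))?$")

def capture(path):
    """Return (matched, captured_segment) for a path."""
    m = FOO.match(path)
    return (m is not None, m.group(1) if m else None)

checks = {p: capture(p) for p in ["/foo", "/foo/", "/foo/abc-123", "/foobar"]}
```

One detail worth knowing: for /foo/ the optional group still participates, so group 1 is an empty string rather than unset. If an empty capture is unwanted, it has to be filtered in the calling code or the pattern adjusted.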

capture with if-then-else in php regex

I'm very lost with a regular expression. It's just black magic to me. Here's what I need:
there is a filename: some_file.jpg
it might be in the following format: some_file_p250.jpg
the regex to match the file in simple format: /^([a-zA-Z_\-0-9]+)\.(jpg|jpeg|png)$/
the regex to match the file in advanced format: /^([a-zA-Z_\-0-9]+)(_[a-z]?[0-9]{3,4})\.(jpg|jpeg|png)$/
My question is as follows: how do I make the "(_[a-z]?[0-9]{3,4})" part optional? I've tried adding a question mark to the second group like this:
/^([a-zA-Z_\-0-9]+)(_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$/
Even though the pattern works, it always captures the contents of the second group in the first group and leaves the second empty.
How can I make this work to capture the filename, the advanced part (_p250), and the extension separately? I'm thinking it has something to do with the greediness of the first group, but I might be completely wrong, and even if I'm right, I still don't know how to solve it.
Thanks for your thoughts
Adding a question mark after the first plus will make the first capturing expression non-greedy. This worked for me using your test case:
/^([a-zA-Z_\-0-9]+?)(_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$/
I tested in JavaScript, not PHP, but here's my test:
"some_file_p250.jpg".match(/^([a-zA-Z_\-0-9]+?)(_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$/)
and my results:
["some_file_p250.jpg", "some_file", "_p250", "jpg"]
In my experience, making a capturing expression non-greedy makes regular expressions a lot more intuitive and will often make them work the way I expect them to work. In your case, it was doing what you suspected; the first expression was capturing everything and never gave the second expression a chance to capture anything.
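To make the greedy versus non-greedy difference concrete, here is a small side-by-side sketch in Python (not PHP, so treat it as an illustration of the regex behavior rather than of `preg_match`). The two patterns differ only in the `?` after the first `+`.

```python
import re

# Greedy first group: swallows "_p250" because those characters are in its class.
GREEDY = re.compile(r"^([a-zA-Z_\-0-9]+)(_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$")
# Lazy first group: grows one character at a time, giving group 2 a chance.
LAZY = re.compile(r"^([a-zA-Z_\-0-9]+?)(_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$")

name = "some_file_p250.jpg"
greedy_groups = GREEDY.match(name).groups()
lazy_groups = LAZY.match(name).groups()
plain_groups = LAZY.match("some_file.jpg").groups()
```

With the greedy version the second group is left unset; with the lazy version the filename, the suffix, and the extension land in separate groups, and a plain filename still matches with the suffix group unset.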
I think this is what you want:
/^([a-zA-Z_\-0-9]+)(|_[a-z]?[0-9]{3,4})?\.(jpg|jpeg|png)$/
or
/^([\d\w\-]+)(|_[a-z]?[0-9]{3,4})\.(jpg|jpeg|png)$/

regex to find instance of a word or phrase -- except if that word or phrase is in braces

First, a disclaimer. I know a little about regexes, but I'm no expert. They seem to be something that I really need twice a year, so they just don't stay "on top" of my brain.
The situation: I'd like to write a regex to match a certain word, let's call it "Ostrich". Easy. Except Ostrich can sometimes appear inside of a curly brace. If it's inside of a curly brace it's not a match. The trick here is that there can be spaces inside the curly braces. Also the text is typically inside of a paragraph.
This should match:
I have an Ostrich.
This should not match:
My Emu went to the {Ostrich Race Name}.
This should be a match:
My Ostrich went to the {Ostrich Race Name}.
This should not be a match:
My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.
It seems like this is possible with a regex, but I sure don't see it.
I'll offer an alternative solution to doing this, which is a bit more robust (not using regex assertions).
First, remove all the bracketed items, using a regex like {[^}]+} (use replace to change it to an empty string).
Now you can just search for Ostrich (using regex or simple string matching, depending on your needs).
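The two steps above can be sketched as follows in Python. The function name `contains_outside_braces` is hypothetical, and the sketch assumes braces are balanced and not nested.

```python
import re

# Step 1 pattern: a complete {...} group with at least one character inside.
BRACED = re.compile(r"\{[^}]+\}")

def contains_outside_braces(text, word="Ostrich"):
    """True if `word` appears in `text` once all {...} groups are removed."""
    stripped = BRACED.sub("", text)   # step 1: drop the braced items
    return word in stripped           # step 2: plain substring search

cases = [
    "I have an Ostrich.",
    "My Emu went to the {Ostrich Race Name}.",
    "My Ostrich went to the {Ostrich Race Name}.",
    "My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.",
]
outcomes = [contains_outside_braces(c) for c in cases]
```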
While regular expressions can certainly be written to do what you ask, they're probably not the best tool for this particular type of thing.
One major problem with regular expressions is that they're very good at pattern matching for things that are there, but not so much when you start adding except into the mix.
Regular expressions are not stateful enough to handle this properly without a lot of work, so I would try to find a different path towards a solution.
A character tokenizer that handles the braces would be easy enough to write.
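A minimal sketch of such a character scanner, in Python, might look like the following. The name `find_outside_braces` is hypothetical, and the handling of an unmatched `}` (it is simply ignored) is an assumption rather than something the question specifies.

```python
def find_outside_braces(text, word="Ostrich"):
    """Return start indices of `word` occurrences that are not inside {...}."""
    hits = []
    depth = 0  # how many unclosed "{" we are currently inside
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # ignore a stray closing brace
        elif depth == 0 and text.startswith(word, i):
            hits.append(i)
            i += len(word)  # skip past the word we just recorded
            continue
        i += 1
    return hits

sample = "My {Ostrich} went to the Ostrich Race."
positions = find_outside_braces(sample)
```

Because the scanner tracks depth explicitly, the state that a regex struggles with (am I inside braces right now?) is just an integer.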
I believe this will work, using lookahead and lookbehind assertions:
(?<!{[^}]*)Ostrich(?![^{]*})
I also tested the case My {Ostrich} went to the Ostrich Race. (where the second "Ostrich" does match)
Note that the lookahead assertion (?![^{]*}) is optional, but without it:
My {Ostrich has a missing bracket won't match
My Ostrich also} has a missing bracket will match
which may or may not be desirable.
This works in the .NET regex engine. However, it is not PCRE-compatible, because it relies on a variable-length lookbehind assertion, which PCRE does not support.
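For engines without variable-length lookbehind, a different technique (not the one in this answer) is sometimes used: an alternation that first consumes whole {...} groups and only captures the word when it appears elsewhere. Here is a hedged sketch in Python, whose `re` module also lacks variable-length lookbehind; `match_positions` is a hypothetical name, and the sketch assumes braces are balanced, so it makes no claim about the missing-bracket cases discussed above.

```python
import re

# Left branch: swallow an entire {...} group without capturing anything.
# Right branch: capture the word, which can only be reached outside braces.
SKIP_OR_WORD = re.compile(r"\{[^}]*\}|(Ostrich)")

def match_positions(text):
    """Return start indices of 'Ostrich' occurrences outside {...} groups."""
    return [m.start(1) for m in SKIP_OR_WORD.finditer(text) if m.group(1)]

sample = "My Ostrich went to the {Ostrich Race Name}."
found = match_positions(sample)
```

Only matches where group 1 is set count; matches from the left branch are discarded by the caller.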
Here's a very large regex that almost works.
It will return each "raw" occurrence of the word in a group.
However, the group for the last one will be empty; I'm not sure why. (Tested with .Net)
Parse it with pattern whitespace ignored (in .NET, RegexOptions.IgnorePatternWhitespace):
^(?:
(?:
[^{]
|
(?:\{.*?\})
)*?
(?:\W(Ostrich)\W)?
)*$
Using a positive lookahead with a negation appears to properly match all the test cases as well as multiple Ostriches:
(?<!{[^}]*)Ostrich(?=[^}]*)