Non-greedy regex acts greedily - regex

Here's a simple example:
Text: <input name="zzz" value="18754" type="hidden"><input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
Regex: /<input.*?value="(18754|17138)".*?>/
When matches are replaced by an empty string, the result is an empty string. I expected the middle <input> to remain since I am using non-greedy matching (.*?). Could anyone explain why it is removed?

There are two matches:
<input name="zzz" value="18754" type="hidden">
<input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
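In the second case, the first .*? matches name="zzz" value="18311" type="hidden"><input name="zzz". That is still a non-greedy match: the attempt starts at the second <input, and .*? expands only as far as it must to reach the first value="17138" after that point.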
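To see the two matches for yourself, here is a minimal sketch in Python (my choice of language, the question doesn't name one; the variable names are just for illustration). Python's re module treats .*? the same way for this pattern.

```python
import re

text = (
    '<input name="zzz" value="18754" type="hidden">'
    '<input name="zzz" value="18311" type="hidden">'
    '<input name="zzz" value="17138" type="hidden">'
)

# The pattern from the question, unchanged.
pattern = re.compile(r'<input.*?value="(18754|17138)".*?>')

# Collect the full text of every match.
matches = [m.group(0) for m in pattern.finditer(text)]
for m in matches:
    print(m)

# Replacing every match with the empty string leaves nothing behind.
remaining = pattern.sub("", text)
print(repr(remaining))
```

The second printed match spans the 2nd and 3rd tags, which is why nothing is left after the replacement.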

aix already explained why it matches the middle part.
To avoid this behaviour, get rid of the .*? and try this instead:
/<input[^>]*value="(18754|17138)"[^>]*>/
See it here on Regexr
Instead of matching any character, match any character except ">".
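A quick way to check the [^>]* version, sketched in Python (an assumption on my part, any regex flavour with negated character classes should behave the same here):

```python
import re

text = (
    '<input name="zzz" value="18754" type="hidden">'
    '<input name="zzz" value="18311" type="hidden">'
    '<input name="zzz" value="17138" type="hidden">'
)

# [^>]* cannot cross a ">", so every match stays inside a single tag.
pattern = re.compile(r'<input[^>]*value="(18754|17138)"[^>]*>')

remaining = pattern.sub("", text)
print(remaining)
```

With the sample text from the question, only the middle tag (value="18311") should be left.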

aix's answer is correct -- the second match includes the 2nd and 3rd input tags.
One possible fix for your regex would be to change . to [^>], like this:
/<input[^>]*?value="(18754|17138)"[^>]*?>/
That will cause it to match any character except >. But that has the obvious problem of breaking whenever > shows up inside a quoted literal. As everyone always says: Regexes aren't designed to work on HTML. Don't use them unless you have no other choice.
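To illustrate the quoted-literal problem, and what "use a parser instead" could look like, here is a rough sketch using Python's standard html.parser module. The InputCollector class and the sample markup (a name attribute containing >) are made up for this example. It only collects the <input> tags you would keep and does not rebuild a whole document.

```python
import re
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collects the raw text of <input> tags whose value is not unwanted."""

    def __init__(self, unwanted_values):
        super().__init__()
        self.unwanted = set(unwanted_values)
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unquoted.
        if tag == "input" and dict(attrs).get("value") not in self.unwanted:
            self.kept.append(self.get_starttag_text())


# Hypothetical input: the first tag has a ">" inside a quoted attribute value.
markup = (
    '<input name="a>b" value="18754" type="hidden">'
    '<input name="zzz" value="18311" type="hidden">'
)

# The [^>] regex stops at the ">" inside the quotes and misses the first tag.
regex_hit = re.search(r'<input[^>]*value="(18754|17138)"[^>]*>', markup)
print(regex_hit)

# The parser understands quoting, so it sees value="18754" and drops that tag.
collector = InputCollector({"18754", "17138"})
collector.feed(markup)
collector.close()
print(collector.kept)
```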

regular expression exclude match that contains a string pattern

I'm trying to narrow down my RegEx to ignore form elements with type="submit". I only want to select the portion of elements up to the part class="*" but still ignore if type="submit" comes before or after the class.
My regular expression thus far:
(<(?:input|select|textarea){1}.*[^type="submit"]class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
Test case:
Line one should match up to the end of class, and line two should be ignored.
<input type="text" name="name" id="test" class="example-class" max-length="7" required="required">
<input type="submit" class="btn-primary" value="send">
Is this achievable?
Thanks for your comments. The answer was a negative lookahead.
Adding (?!.*type="submit.*) to the start of the regex appears to have given me my desired result.
Working Regex:
(?!.*type="submit.*)(<(?:input|select|textarea).*class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)
This expression is bound to a single tag.
It is better to avoid expressions like .*, since they can run on and match a string that begins inside one tag and ends inside another.
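Here is a small comparison of the two expressions above, sketched in Python (my assumption, the question doesn't name a language). The variable names are only for illustration. The last two checks put both test lines on one line, which is where the .* version gets into trouble.

```python
import re

text_input = ('<input type="text" name="name" id="test" '
              'class="example-class" max-length="7" required="required">')
submit_input = '<input type="submit" class="btn-primary" value="send">'

# The lookahead version: .* may look past the end of the current tag.
lookahead = re.compile(
    r'(?!.*type="submit.*)(<(?:input|select|textarea).*class=")'
    r'(((?!form\-control)[a-zA-Z0-9_ -])*")'
)
# The tag-bound version: attributes are consumed one by one up to ">".
tag_bound = re.compile(
    r'(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)'
)

# One tag per line: the lookahead version does what was asked.
first = lookahead.search(text_input).group(0)
second = lookahead.search(submit_input)

# Both tags on one line: the lookahead sees the later type="submit"
# and rejects the text input as well.
combined = text_input + submit_input
lookahead_combined = lookahead.search(combined)
tag_bound_combined = [m.group(0) for m in tag_bound.finditer(combined)]

print(first)
print(second, lookahead_combined)
print(tag_bound_combined)
```

Note that the tag-bound expression matches the whole tag rather than stopping after class="...", so it is not a drop-in replacement if you need the capture groups of the first one.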

Regex not working in HTML5 pattern

So I have this regex, intended to accept any text except strings that start with the sequence "34":
^(?!34)(?=([\w]+))
The regex is working fine for me in https://regex101.com/r/iN1yN3/2 , check the tests to see the intended behavior.
Any idea why it isn't working in my form?
<form>
<input pattern="^(?!34)(?=([\w]+))" type="text">
<button type="submit">Submit!</button>
</form>
The pattern attribute has to match the entire string. Assertions check for a match, but do not count towards the total match length. Your pattern consists only of assertions, so it consumes no characters and can never cover a non-empty value. Changing the second assertion to \w+ will make the pattern match the entire string.
You can also drop the ^, which is implied, leaving you with just:
<input pattern="(?!34)\w+" type="text">
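The whole-string rule can be approximated outside the browser. A rough sketch in Python: pattern_accepts is a hypothetical helper, and re.fullmatch stands in for the implicit anchoring the browser applies. I pass re.ASCII so that \w means [A-Za-z0-9_], as it does in JavaScript. This is only an approximation of what the browser does, not its actual code.

```python
import re


def pattern_accepts(pattern, value):
    """Hypothetical helper: True if the pattern covers the entire value."""
    return re.fullmatch(pattern, value, flags=re.ASCII) is not None


original = r'^(?!34)(?=([\w]+))'   # assertions only, consumes nothing
fixed = r'(?!34)\w+'               # consumes the whole value

for value in ["abc", "1234", "3456"]:
    print(value, pattern_accepts(original, value), pattern_accepts(fixed, value))
```

The original pattern rejects every value because it never consumes a character. The fixed one accepts "abc" and "1234" and rejects "3456".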

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(&lt;)\/a(&gt;.*?&lt;)\/p(&gt;)" replace="\1#\2##\3" byline="true" flags="g" />
As for your attempt that replaces what's between the tags with .*: I haven't seen .* used in a replacement/substitution expression. It is probably taken as a literal . and *.
As for </a>.*</p>, the >.*< part will not work when you have multiple occurrences of </a> and </p> on the same line, such as:
</a>1234abcdefg</P>abcde</a>123456789. </p>
which would be replaced with
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need to use the non-greedy quantifier .*?. See the wiki for the use of .*? vs .*.
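The difference is easy to try outside Ant. A minimal sketch in Python, using the same capture groups and replacement string. The sample line is made up and uses lowercase </p> throughout, since the match is case-sensitive unless you add a case-insensitive flag.

```python
import re

# Hypothetical line with two </a>...</p> pairs.
line = '</a>1234abcdefg</p>abcde</a>123456789. </p>'

# Greedy: .* runs to the last </p> on the line, so only the outermost
# </a> and </p> are rewritten.
greedy = re.sub(r'(<)/a(>.*<)/p(>)', r'\1#\2##\3', line)

# Non-greedy: .*? stops at the nearest </p>, so each pair is rewritten.
lazy = re.sub(r'(<)/a(>.*?<)/p(>)', r'\1#\2##\3', line)

print(greedy)  # <#>1234abcdefg</p>abcde</a>123456789. <##>
print(lazy)    # <#>1234abcdefg<##>abcde<#>123456789. <##>
```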
Solution 1: You can try this.
Store the middle part in a capturing group (parentheses), then replace the whole match with that group:
exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference: Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

Why does Regex Replace delete a quote?

I'm trying to sanitize HTML tags, e.g. turn
<input type="image" name="name" src="image.png">
into the correct empty-element form
<input type="image" name="name" src="image.png" />
with a slash at the end.
I'm using Eclipse's Find/Replace with regular expressions like this:
Find: <(input .*)[^/]>
Replace with: <\1 />
But I end up with
<input type="image" name="name" src="image.png />
I.e. the last quote is missing.
Is that an error in my regex, or a bug in Eclipse?
The term [^/] is consuming the quote. Move it inside the captured group:
Find: <(input .*[^/])>
Replace: <\1 />
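The same behaviour can be reproduced outside Eclipse. A minimal sketch in Python, assuming its re module treats these two patterns like Eclipse's regex engine does (it should, they only use basic syntax):

```python
import re

tag = '<input type="image" name="name" src="image.png">'
already_closed = '<input type="image" name="name" src="image.png" />'

# [^/] outside the group: the closing quote is matched but not captured.
broken = re.sub(r'<(input .*)[^/]>', r'<\1 />', tag)

# [^/] inside the group: the closing quote is part of \1.
fixed = re.sub(r'<(input .*[^/])>', r'<\1 />', tag)

# A tag that already ends in "/>" is left alone.
untouched = re.sub(r'<(input .*[^/])>', r'<\1 />', already_closed)

print(broken)     # <input type="image" name="name" src="image.png />
print(fixed)      # <input type="image" name="name" src="image.png" />
print(untouched)  # <input type="image" name="name" src="image.png" />
```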
The error is in your regex. The [^/] at the end consumes the last character before the > (here, the closing quote), and it sits outside the group. \1 represents the first capturing group, which is (input .*). In short, you are getting everything inside the tag except the last character. If you put the [^/] inside your group, your replace should work.
Also, you may run into issues if you have a / inside of one of your attribute values. For performance reasons, I would recommend using the following regex:
<(input [^/]*(/[^/]*)*)>
In this case, it does not have to backtrack if you have a / inside of one of your attributes. Your regex should capture everything you need though.

Why does this RegEx work the way I want it to?

I have a RegEx that is working for me but I don't know WHY it is working for me. I'll explain.
RegEx: \s*<in.*="(<?.*?>)"\s*/>\s*
Text it finds (it finds the white-space before and after the input tag):
<td class="style9">
<input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value="<?php echo $data[guarantor4]; ?>" /> </td>
</tr>
The part I don't understand:
<in.*=" <--- As I understand it, this should only find up to the first =" as in it should only find <input name="
It actually finds: <input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value=" which happened to be what I was trying to do.
What am I not understanding about this RegEx?
You appear to be using 'greedy' matching.
Greedy matching says "eat as much as possible to make this work".
Try with
<in[^=]*=
for starters; that will stop it matching the "=" as part of ".*".
In future, though, you might want to read up on the
.*?
and
.+?
notation, which stops at the first possible position that matches instead of the last.
The use of 'non-greedy' syntax would be better if you were trying to only stop when you saw TWO characters,
i.e.:
<in.*?=id
which would stop on the first '=id' regardless of whether or not there are '=' in between.
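To compare the three variants on the tag from the question, here is a minimal sketch in Python (my assumption, the question doesn't say which language is in use). Each pattern is the start of the original regex up to the =" part.

```python
import re

tag = ('<input name="guarantor4" id="guarantor4" size="50" type="text" '
       'tabindex="10" value="<?php echo $data[guarantor4]; ?>" />')

# Greedy: .* grabs everything, then backs off to the LAST ="
greedy = re.match(r'<in.*="', tag).group(0)

# Negated class: [^=]* cannot pass the first =
negated = re.match(r'<in[^=]*="', tag).group(0)

# Non-greedy: .*? stops at the FIRST ="
lazy = re.match(r'<in.*?="', tag).group(0)

print(greedy)   # ends with: tabindex="10" value="
print(negated)  # <input name="
print(lazy)     # <input name="
```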
.* is greedy. You want .*? to find up to only the first =.
.* is greedy, so it'll find up to the last =. If you want it non-greedy, add a question mark, like so: .*?
As I understand it, this should only
find up to the first =" as in it
should only find <input name="
You don't say what language you're writing in, but almost all regular expression systems are "greedy matchers" by default - that is, each quantifier matches the longest possible substring of the input. In your case, that means everything from the start of the input tag to the last equal-quote sequence.
Most regex systems have a way to specify that the pattern only match the shortest possible substring, not the longest - "non-greedy matching".
As an aside, don't assume the first parameter will be name= unless you have full control over the construction of the input. Both HTML and XML allow attributes to be specified in any order.
Your greedy approach is causing confusion. You want .*?
Consider the input 101000000000100.
Using 1.*1, .* is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
Using 1.*?1, .*? is non-greedy. It will match nothing at first, then take one extra character at a time until the following 1 can match, eventually matching 101.
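Both cases can be checked in a couple of lines. A minimal sketch in Python (any backtracking regex engine should give the same two results):

```python
import re

digits = "101000000000100"

# Greedy: runs to the end of the string, then backtracks to the last 1.
greedy = re.search(r"1.*1", digits).group(0)

# Non-greedy: expands one character at a time until a 1 follows.
lazy = re.search(r"1.*?1", digits).group(0)

print(greedy)  # 1010000000001
print(lazy)    # 101
```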