Why does this RegEx work the way I want it to? - regex

I have a RegEx that is working for me but I don't know WHY it is working for me. I'll explain.
RegEx: \s*<in.*="(<?.*?>)"\s*/>\s*
Text it finds (it finds the white-space before and after the input tag):
<td class="style9">
<input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value="<?php echo $data[guarantor4]; ?>" /> </td>
</tr>
The part I don't understand:
<in.*=" <--- As I understand it, this should only find up to the first =" as in it should only find <input name="
It actually finds: <input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value=" which happened to be what I was trying to do.
What am I not understanding about this RegEx?

You appear to be using 'greedy' matching.
Greedy matching says "eat as much as possible to make this work"
try with
<in[^=]*=
for starters, that will stop it matching the "=" as part of ".*"
but in future, you might want to read up on the
.*?
and
.+?
notation, which stops at the first possible condition that matches instead of the last.
The use of 'non-greedy' syntax would be better if you were trying to only stop when you saw TWO characters,
ie:
<in.*?=id
which would stop on the first '=id' regardless of whether or not there are '=' in between.
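To see the difference on the tag from the question, here is a minimal sketch. The question doesn't name a language, so Python's re module is an assumption, and the variable names are only illustrative.

```python
import re

# The input tag from the question.
tag = ('<input name="guarantor4" id="guarantor4" size="50" type="text" '
       'tabindex="10" value="<?php echo $data[guarantor4]; ?>" />')

# Greedy: .* runs to the end, then backs up to the LAST =" in the tag.
greedy = re.match(r'<in.*="', tag).group()

# Negated class: [^=]* cannot step over an "=", so it stops at the first one.
negated = re.match(r'<in[^=]*="', tag).group()

# Lazy: .*? grows one character at a time and stops at the FIRST =".
lazy = re.match(r'<in.*?="', tag).group()

print(greedy)
print(negated)
print(lazy)
```

The greedy version ends at value=" while the other two end at name=", which is the behaviour the question was expecting.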

.* is greedy. You want .*? to find up to only the first =.

.* is greedy, so it'll find up to the last =. If you want it non-greedy, add a question mark, like so: .*?

As I understand it, this should only
find up to the first =" as in it
should only find <input name="
You don't say what language you're writing in, but almost all regular expression systems are "greedy matchers" by default - that is, they match the longest possible substring of the input. In your case, that means everything from the start of the input tag to the last equal-quote sequence.
Most regex systems have a way to specify that the pattern only match the shortest possible substring, not the longest - "non-greedy matching".
As an aside, don't assume the first parameter will be name= unless you have full control over the construction of the input. Both HTML and XML allow attributes to be specified in any order.

Your greedy approach is causing confusion. You want .*?
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
Using 1.*?1, .*? is non-greedy. It will first match nothing, then take one extra character at a time until the following 1 can match, ending up with 101.
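The same example as a small runnable sketch, assuming Python's re module (any backtracking regex engine should behave the same way here):

```python
import re

digits = "101000000000100"

# Greedy: .* grabs everything, then backtracks to the last "1".
greedy_match = re.search(r"1.*1", digits).group()

# Lazy: .*? starts empty and expands only until the next "1".
lazy_match = re.search(r"1.*?1", digits).group()

print(greedy_match)  # 1010000000001
print(lazy_match)    # 101
```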

Related

Why does a regular expression find a match outside of its bounds?

I have the following regular expression, which I'm using to find <icon use="some-id" class="some-class" />:
(?:<icon )(?=(?:.*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:.*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)
This mostly works, except that if I don't specify a class, but do specify one on another element on the same line, it'll match that other elements class, even though the full match is reported as just being the icon element.
For example:
<icon use="search" /> <div class="test"></div>
$1 for that is search, and $2 is test, even though they're not part of the same element. $& is reporting <icon use="search" />.
I'm sure I'm missing something obvious about the way regular expressions work.
The .*? just before the match of class= will match ANYTHING it has to in order to make the rest of the regex match - including the end of the first tag and the start of the second one, and everything that might lie in between. The only restriction you've placed on it is that it can't cross a line boundary, as newlines are not matched by . by default. To make this work somewhat more reliably, you'd need to restrict that part of the regex so that it cannot cross a tag boundary: [^<]+? (one or more characters that aren't a left angle bracket, matching as few as possible) should do the job.
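A cut-down sketch of that effect, assuming Python's re module. The two patterns below are simplified stand-ins for the regex in the question (only the class lookahead is kept, and only double quotes are handled), not a drop-in replacement. The bounded version uses [^<]*? rather than [^<]+? so that class may also be the first attribute.

```python
import re

line = '<icon use="search" /> <div class="test"></div>'

# Lookahead with .*? may wander past the end of the <icon> tag.
loose = re.compile(r'<icon (?=(?:.*?class="(.*?)")?)[^>]*>')

# Lookahead with [^<]*? cannot get past the next "<", so it cannot
# reach into the following <div>.
bounded = re.compile(r'<icon (?=(?:[^<]*?class="([^"]*)")?)[^>]*>')

loose_match = loose.search(line)
bounded_match = bounded.search(line)

print(loose_match.group(0), loose_match.group(1))      # full match is the icon, group 1 is "test"
print(bounded_match.group(0), bounded_match.group(1))  # full match is the icon, group 1 is None
```

The full match is identical in both cases because a lookahead consumes nothing. Only the captured group shows that the lookahead looked too far.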

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(&lt;)\/a(>.*?&lt;)\/p(>)" replace="\1#\2##\3" byline="true" flags="g" />
As for "but it replaces what's between the tags with .*": I haven't seen .* used in a replacement/substitution expression. It is probably taken as the literal characters . and *.
As for </a>.*</p>, the greedy .* will not work when you have multiple occurrences of </a> and </p> on the same line, such as:
</a>1234abcdefg</P>abcde</a>123456789. </p> would be replaced as
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need to make the quantifier non-greedy by adding ?. See the Wiki for the use of .*? vs .*.
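The difference can be checked outside Ant. Below is a rough sketch in Python (an assumption, since the question is about Ant's replaceregexp; the pattern and replacement are the ones from the answer above, minus the unnecessary escaping of /). The IGNORECASE flag is my addition so that </P> is treated like </p>.

```python
import re

line = "</a>1234abcdefg</P>abcde</a>123456789. </p>"
replacement = r"\1#\2##\3"

# Lazy: each </a>...</p> pair on the line is handled separately.
lazy = re.sub(r"(<)/a(>.*?<)/p(>)", replacement, line, flags=re.IGNORECASE)

# Greedy: one match spans from the first </a> to the last </p>.
greedy = re.sub(r"(<)/a(>.*<)/p(>)", replacement, line, flags=re.IGNORECASE)

print(lazy)    # <#>1234abcdefg<##>abcde<#>123456789. <##>
print(greedy)  # <#>1234abcdefg</P>abcde</a>123456789. <##>
```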
Solution 1: You can try this
You store the match with parenthesis, and then replace it.
exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference: Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

How to change all title attribute's value in Title Case in sublime text

I have 500 HTML files in my project where the casing and quotes (" or ') of the title attribute vary across pages; see a few examples below:
<button title="Next" id="next"> Next</button>
<button title="next"> Next </buton>
<button title=""please go back">Check</button>
I want to change all title attributes to Title Case:
<button title="Next" id="next"> Next</button>
<button title="Next"> Next </buton>
<button title="Please Go Back">Check</button>#
I have tried Find and Replace with the Regular Expression and Case Sensitive buttons enabled:
Find What: title=(".*")\s
Replace With: title="\u$"
But I didn't have any success. Please tell me what I am doing wrong.
UPDATED: I also want to remove the extra ' or " (see the line marked with #).
To further my comment, first it's the issue of .* being 'greedy' instead of 'lazy', meaning it is matching as much as possible (i.e. Next"> Next</button><button title="Next in your example).
The quick fix is to use a 'lazy' quantifier instead, i.e. .*? (I also added a ? after \s to make the trailing space optional, since not all of your examples have one):
title=(".*?")\s?
To improve performance, you would use a negated class:
title=("[^"]+")\s?
Where [^"]+ matches any character except ".
And to cope with the different quotes, you can use:
title=("[^"]+"|'[^']+')\s?
Which basically means either "[^"]+" or '[^']+' for the part within the parentheses.
For the replace and consecutive quotes issue:
title=(?:"+([^"]+)"+|'+([^']+)'+)\s?
Replace with:
title="\u$1$2"
The only thing is that the last line will be <button title="Please go back">Check</button>, if that's not an issue...
EDIT: \G actually works. Use a second replace:
(?:(?<=title=")|(?<!^)\G)[^\s"]+\s?
Replace with:
\u$0
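If the files can be processed with a script instead of the editor, the same idea can be sketched in Python (an assumption; the question is about Sublime Text). Python's re has no \u in replacement strings, so a callback does the casing. The names TITLE_ATTR, title_case_attr and fix_titles are made up for this example, and the optional \s? is left out so the space before the next attribute is kept.

```python
import re

# Same alternation as above: one or more quotes of either kind around the value.
TITLE_ATTR = re.compile(r"""title=(?:"+([^"]+)"+|'+([^']+)'+)""")

def title_case_attr(match):
    # Exactly one of the two groups takes part in any given match.
    value = match.group(1) or match.group(2)
    # Upper-case only the first letter of each space-separated word.
    words = [w[:1].upper() + w[1:] for w in value.split(" ")]
    return 'title="' + " ".join(words) + '"'

def fix_titles(html):
    return TITLE_ATTR.sub(title_case_attr, html)

print(fix_titles('<button title=""please go back">Check</button>'))
```

This also normalizes single quotes and doubled quotes to a single pair of double quotes, which may or may not be what you want.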
(?<=title=('|")).+?(?=('|"))
This should give you the matches Next, next and please go back, which you can then use.
You can use the index of each match to locate it in the original string if you want to upper-case the lower-case letters.
Or use title=('|").+?(\1) to find any title attributes in your text, including the quotation marks.

Non-greedy regex acts greedily

Here's a simple example:
Text: <input name="zzz" value="18754" type="hidden"><input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
Regex: /<input.*?value="(18754|17138)".*?>/
When matches are replaced by an empty string, the result is an empty string. I expected the middle <input> to remain since I am using non-greedy matching (.*?). Anyone could explain why it is removed?
There are two matches:
<input name="zzz" value="18754" type="hidden">
<input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
In the second case, the first .*? matches name="zzz" value="18311" type="hidden"><input name="zzz". It's a match and it's non-greedy.
aix already explained why it matches the middle part.
To avoid this behaviour, get rid of the .*?, instead try this:
/<input[^>]*value="(18754|17138)"[^>]*>/
See it here on Regexr
Instead of matching any character, match any character except ">".
aix's answer is correct -- the second match includes the 2nd and 3rd input tags.
One possible fix for your regex would be to change . to [^>], like this:
/<input[^>]*?value="(18754|17138)"[^>]*?>/
That will cause it to match any character except >. But that has the obvious problem of breaking whenever > shows up inside a quoted literal. As everyone always says: Regexes aren't designed to work on HTML. Don't use them unless you have no other choice.
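Both behaviours are easy to reproduce. A short sketch, assuming Python's re module (the question's /.../ delimiters suggest another language, but the matching rules are the same for this pattern):

```python
import re

text = ('<input name="zzz" value="18754" type="hidden">'
        '<input name="zzz" value="18311" type="hidden">'
        '<input name="zzz" value="17138" type="hidden">')

# Lazy dot: the second match starts at the 2nd tag and runs into the 3rd.
with_dot = re.sub(r'<input.*?value="(18754|17138)".*?>', '', text)

# Negated class: a match can never extend past the ">" that closes a tag.
with_class = re.sub(r'<input[^>]*value="(18754|17138)"[^>]*>', '', text)

print(repr(with_dot))  # ''
print(with_class)      # <input name="zzz" value="18311" type="hidden">
```

Non-greedy only means "as short as possible from where the match started". It does not make the engine look for a later starting point that would give a shorter match.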

What Yahoo Pipes regex use in this case?

Do you have any ideas how to change this link in item.description in Yahoo Pipes
<img src="http://mysite.com/img/pc/image.gif" class="big" style="background-image:url(http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg);" alt="" title="">
to this
<img src="http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg"/>
using regex.
I don't know what variant of RegEx Pipes uses, so I'll go with the .NET variant and you can adjust for whatever syntax is needed. It should be pretty close.
Search for:
<img[^>]+url\(
([^\)]+)
\)[^>]+>
Replace with:
<img src="$1" />
Join the lines. Line 1 finds an image tag up to the url argument in the CSS style attribute. Line 2 matches the background image URL and captures it. Line 3 matches the rest of the image tag.
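Here is the joined pattern tried out in Python, as a sketch only. I can't confirm which regex flavour Pipes uses, so Python is an assumption, and the replacement is written as \1 because that is Python's syntax for what the answer writes as $1.

```python
import re

item = ('<img src="http://mysite.com/img/pc/image.gif" class="big" '
        'style="background-image:url(http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg);" '
        'alt="" title="">')

# Line 1 + line 2 + line 3 from above, joined into one pattern.
pattern = re.compile(r'<img[^>]+url\(([^\)]+)\)[^>]+>')

result = pattern.sub(r'<img src="\1" />', item)
print(result)
```

Because every wildcard is a negated class ([^>] or [^\)]), no part of the match can spill over into a neighbouring tag.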
Here is an extremely simple regex to accomplish what you're looking for, using Perl-style regexes:
<img.*background-image:url\((.*)\);.*>
Basically, here is the breakdown on how it matches:
It will start by matching the characters "<img"
It then matches any characters, between 0 and unlimited times.
Then it matches the string "background-image:url("
Then it matches any characters, between 0 and unlimited times, which is captured into backreference #1
Then it matches the characters ");"
Then it matches any characters, between 0 and unlimited times.
Then it matches the ">" character.
Note: You should replace the items that match any characters to something more specific, depending on the application that you're using the regex. This is why I've referred to this as "extremely simple".
Then, that gets replaced with:
<img src="$1">
Edit: Didn't see richardtallent's answer, pretty similar application just a different implementation.