Why does Regex Replace delete a quote? - regex

I'm trying to sanitize HTML tags, e.g. turn
<input type="image" name="name" src="image.png">
into the correct empty-element form
<input type="image" name="name" src="image.png" />
with a slash at the end.
I'm using Eclipse's Find/Replace with regular expressions like this:
Find: <(input .*)[^/]>
Replace with: <\1 />
But I end up with
<input type="image" name="name" src="image.png />
I.e. the last quote is missing.
Is that an error in my regex, or a bug in Eclipse?

The term [^/] is consuming the quote. Move it inside the captured group:
Find: <(input .*[^/])>
Replace: <\1 />
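To see the difference outside Eclipse, here is a minimal sketch using Python's re module as a stand-in (an assumption: Eclipse's engine treats these particular constructs the same way). The variable names are only for illustration.

```python
import re

tag = '<input type="image" name="name" src="image.png">'

# Original pattern: [^/] sits outside the group, so the character it
# matches (the closing quote) is dropped from the replacement.
broken = re.sub(r'<(input .*)[^/]>', r'<\1 />', tag)

# Fixed pattern: [^/] is inside the group, so the quote is kept.
fixed = re.sub(r'<(input .*[^/])>', r'<\1 />', tag)

# A tag that already ends in /> is left alone by the fixed pattern,
# because the character before > is a slash.
already_closed = '<input type="image" name="name" src="image.png" />'
untouched = re.sub(r'<(input .*[^/])>', r'<\1 />', already_closed)

print(broken)
print(fixed)
print(untouched)
```

The first line printed is missing the quote, the second has it, and the third is the already closed tag unchanged.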

The error is in your regex. The [^/] at the end matches the last character before the > (here, the closing quote), and it sits outside the capturing group. \1 refers to the first capturing group, which is (input .*). In short, you are getting everything inside the tag except the last character. If you put the [^/] inside your group, your replace should work.
Also, you may run into issues if you have a / inside of one of your attribute values. For performance reasons, I would recommend using the following regex:
<(input [^/]*(/[^/]*)*)>
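This version should need less backtracking when there is a / inside one of your attributes. Be aware that it also matches tags that already end in />, so run it only on tags that have not been converted yet. Your regex should capture everything you need, though.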

Related

regular expression exclude match that contains a string pattern

I'm trying to narrow down my RegEx to ignore form elements with type="submit". I only want to select the portion of elements up to the part class="*" but still ignore if type="submit" comes before or after the class.
My regular expression thus far:
(<(?:input|select|textarea){1}.*[^type="submit"]class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
Test case:
Line one should match up to the end of class, and line 2 ignored.
<input type="text" name="name" id="test" class="example-class" max-length="7" required="required">
<input type="submit" class="btn-primary" value="send">
Is this achievable?
Thanks for your comments. The answer was a negative look ahead.
Adding (?!.*type="submit.*) to the start of the regex appears to have given me my desired result.
Working Regex:
(?!.*type="submit.*)(<(?:input|select|textarea).*class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
An alternative that stays within a single tag:
(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)
This expression cannot match beyond the end of the tag it starts in.
It is better to avoid expressions like .* because they can run on and produce a match that begins inside one tag and ends inside another.
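Both patterns can be checked against the two test lines from the question. This is a rough sketch in Python (an assumption, since the question does not say which engine is used); the variable names are made up for the example.

```python
import re

# Pattern from the asker's fix: the leading negative lookahead rejects any
# start position that is followed, on the same line, by type="submit.
with_lookahead = re.compile(
    r'(?!.*type="submit.*)'
    r'(<(?:input|select|textarea).*class=")'
    r'(((?!form\-control)[a-zA-Z0-9_ -])*")'
)

# Tag-bound alternative: walks attribute by attribute and refuses to step
# over an attribute that starts with type="submit".
tag_bound = re.compile(
    r'(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)'
)

line_one = ('<input type="text" name="name" id="test" '
            'class="example-class" max-length="7" required="required">')
line_two = '<input type="submit" class="btn-primary" value="send">'

lookahead_hit = with_lookahead.search(line_one)
lookahead_miss = with_lookahead.search(line_two)
tag_hit = tag_bound.search(line_one)
tag_miss = tag_bound.search(line_two)

print(lookahead_hit.group(0))  # line one, up to the end of the class value
print(lookahead_miss)          # None: the submit line is ignored
print(tag_hit.group(0))        # the whole first tag
print(tag_miss)                # None
```

Note the two patterns capture different things: the first stops at the end of the class attribute, while the second matches the entire tag.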

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
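Try this (an Ant build file is XML, so a literal < inside an attribute value has to be written as &lt;):
<replaceregexp file="notTested.xml" match="(&lt;)\/a(>.*?&lt;)\/p(>)" replace="\1#\2##\3" byline="true" flags="g" />
Use flags="gi" if the tags can appear in upper case, as </P> does in your first example.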
As for your attempt that replaced what's between the tags with .*: I haven't seen .* used in a replacement/substitution expression. It is probably taken as the literal characters . and *.
As for </a>.*</p>, the greedy >.*< will not work when you have multiple occurrences of </a> and </p> on the same line. For example,
</a>1234abcdefg</P>abcde</a>123456789. </p> would be replaced with
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need the non-greedy quantifier, .*? instead of .*, so that the match stops at the first closing tag. See the wiki for the use of .*? vs .*.
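The greedy versus non-greedy difference on that line can be reproduced with a short sketch. Python's re module is used here instead of Ant (an assumption that the two engines agree on these constructs), and case-insensitive matching is switched on so that </P> counts as a closing tag.

```python
import re

line = '</a>1234abcdefg</P>abcde</a>123456789. </p>'
replacement = r'\1#\2##\3'

# Greedy: .* runs to the last </p> on the line, swallowing the tags in between.
greedy = re.sub(r'(<)/a(>.*<)/p(>)', replacement, line, flags=re.IGNORECASE)

# Non-greedy: .*? stops at the first closing tag, so each pair is handled.
lazy = re.sub(r'(<)/a(>.*?<)/p(>)', replacement, line, flags=re.IGNORECASE)

# Nothing between the tags still works, because .*? may match zero characters.
empty_middle = re.sub(r'(<)/a(>.*?<)/p(>)', replacement, '</a></p>',
                      flags=re.IGNORECASE)

print(greedy)
print(lazy)
print(empty_middle)
```

Without the case-insensitive flag, even the non-greedy pattern would skip over </P> and run on to the lowercase </p>.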
Solution 1: You can try this.
Capture the middle part with parentheses, then put it back in the replacement (C#):
exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference: Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2: see "Regex ignore middle part of capture".

Keeping Regex search to one line

I used Wget to scrape a site for migrating to new platform. I am trying to clean up the pages and remove all the viewstate code in them. I am using the following regex expression to do this:
<input type="hidden" name="__VIEWSTATE" value=.*/>
This works in programs like Dreamweaver. I like to use another application called Wild Edit, which is extremely fast at search and replace across a large number of files. When I use the same expression there, it matches up to the last /> on the page and removes a lot of good code. I have also tried <input type="hidden" name="__VIEWSTATE" value=.*/>$ with the same results.
How would I constrain this so that it stops at the first />?
Try
<input type="hidden" name="__VIEWSTATE" value=.*?/>
The ?, if it's supported, makes the search ungreedy, so it will only match up to the first /> rather than the last.
If that doesn't work, your best bet may be:
<input type="hidden" name="__VIEWSTATE" value=[^/]+/>
Note that this one fails if the value itself contains a /, which Base64-encoded view state often does.
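Here is a small sketch of how these variants behave, using Python's re module as a stand-in for the editor. The page content and the view state value are made up, and the value deliberately contains a /. The last pattern, which matches the quoted value with "[^"]*", is my own assumption about what a safer fallback could look like, not something either tool documents.

```python
import re

# Hypothetical one-line page: the hidden field followed by markup to keep.
page = ('<input type="hidden" name="__VIEWSTATE" value="dDwtMTA/abc+==" />'
        '<p>keep me</p><br />')

prefix = r'<input type="hidden" name="__VIEWSTATE" value='

greedy = re.sub(prefix + r'.*/>', '', page)            # runs to the last />
lazy = re.sub(prefix + r'.*?/>', '', page)             # stops at the first />
no_slash = re.sub(prefix + r'[^/]+/>', '', page)       # trips over the / in the value
quoted = re.sub(prefix + r'"[^"]*"\s*/>', '', page)    # assumed alternative

print(repr(greedy))
print(lazy)
print(no_slash == page)
print(quoted)
```

The greedy version wipes out the whole line, the lazy and quoted versions keep the trailing markup, and the [^/]+ version matches nothing here because of the slash inside the value.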
The regex is being too greedy. Try this:
<input type="hidden" name="__VIEWSTATE" value=.*?/>
By default, the regex engine tries to make each match as large as possible. For example, the regular expression a.*z will match az (some other middle stuff) az as one big match, since it does start with a and end with z.
The ? modifier tells the regular expression engine to be lazy rather than greedy: instead of grabbing the largest possible match, it grabs the smallest. In the previous example, the regex a.*?z will match just the two az substrings, because once it sees a z, it stops.
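The a.*z example can be tried directly. A minimal sketch in Python (any engine with lazy quantifiers should behave the same way for this case):

```python
import re

text = 'az (some other middle stuff) az'

# Greedy: one match that spans from the first a to the last z.
greedy_matches = re.findall(r'a.*z', text)

# Lazy: each match ends at the first z after its a.
lazy_matches = re.findall(r'a.*?z', text)

print(greedy_matches)
print(lazy_matches)
```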

Non-greedy regex acts greedily

Here's a simple example:
Text: <input name="zzz" value="18754" type="hidden"><input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
Regex: /<input.*?value="(18754|17138)".*?>/
When matches are replaced by an empty string, the result is an empty string. I expected the middle <input> to remain since I am using non-greedy matching (.*?). Could anyone explain why it is removed?
There are two matches:
<input name="zzz" value="18754" type="hidden">
<input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
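In the second case, the first .*? matches name="zzz" value="18311" type="hidden"><input name="zzz". That is a valid match, and it is still non-greedy: a lazy quantifier gives the shortest match from a given starting point, and the engine starts at the leftmost <input it can, then extends until the rest of the pattern succeeds.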
aix already explained why it matches the middle part.
To avoid this behaviour, get rid of the .*? and try this instead:
/<input[^>]*value="(18754|17138)"[^>]*>/
Instead of matching any character, match any character except >.
aix's answer is correct: the second match includes the 2nd and 3rd input tags.
One possible fix for your regex would be to change . to [^>], like this:
/<input[^>]*?value="(18754|17138)"[^>]*?>/
That will cause it to match any character except >. But that has the obvious problem of breaking whenever > shows up inside a quoted literal. As everyone always says: Regexes aren't designed to work on HTML. Don't use them unless you have no other choice.
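To make the two matches visible, here is a short sketch in Python (an assumption, since the question does not name the engine) that runs both the original pattern and the [^>] version on the text from the question.

```python
import re

text = ('<input name="zzz" value="18754" type="hidden">'
        '<input name="zzz" value="18311" type="hidden">'
        '<input name="zzz" value="17138" type="hidden">')

lazy_dot = re.compile(r'<input.*?value="(18754|17138)".*?>')
negated = re.compile(r'<input[^>]*value="(18754|17138)"[^>]*>')

# The lazy pattern finds two matches; the second one starts at the middle
# tag and runs on into the third tag to find an allowed value.
lazy_spans = [m.group(0) for m in lazy_dot.finditer(text)]
after_lazy = lazy_dot.sub('', text)

# [^>]* cannot cross a tag boundary, so the middle tag survives.
after_negated = negated.sub('', text)

for span in lazy_spans:
    print(span)
print(repr(after_lazy))
print(after_negated)
```

As noted above, the [^>] version is still only as reliable as the assumption that > never appears inside an attribute value.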

Escaping apostrophes in regex?

I'm trying to validate a form using a regular expression found here http://regexlib.com/. What I am trying to do is filter out all characters except a-z, commas and apostrophes. If I use this code:
<cfinput name="FirstName" type="text" class="fieldwidth" maxlength="90" required="yes" validateat="onsubmit,onserver" message="Please ensure you give your First Name and it does not contain any special characters except hyphens or apostrophes." validate="regular_expression" pattern="^([a-zA-Z'-]+)$" />
I get the following error: Unmatched [] in expression. I figured out this relates to the apostrophe, because it works if I use this code (but then apostrophes are not allowed):
<cfinput name="FirstName" type="text" class="fieldwidth" maxlength="90" required="yes" validateat="onsubmit,onserver" message="Please ensure you give your First Name and it does not contain any special characters except hyphens or apostrophes." validate="regular_expression" pattern="^([a-zA-Z-]+)$" />
So I'm wondering is there some special way to escape apostrophes when using regular expressions?
EDIT
I think I've found where the problem is being caused (thanks to xanatos), not sure how to fix it. Basically CF is generating a hidden field to validate the field as follows:
<input type='hidden' name='FirstName_CFFORMREGEX' value='^([a-zA-Z'-]+)$'>
Because it is using single apostrophes rather than speech marks round the value, it is interpreting the apostrophe as the end of the value.
I think there is a bug in the cfinput implementation. It probably uses the string you pass in pattern in a JavaScript regex, but quotes it with '. So it turns it into something like:
new RegExp('^([a-zA-Z'-]+)$')
Try replacing the quote with \x27 (the hex escape for the single quote).
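Separately, a hyphen inside a character class denotes a range when it sits between two characters. At the very end of the class, as here, most engines treat it as a literal, but putting the hyphen at the beginning is a good practice because it removes any ambiguity: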
^([-a-zA-Z']+)$
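The truncation and the \x27 workaround can be illustrated with a rough sketch. It is written in Python rather than ColdFusion or JavaScript, and extract_single_quoted_value is a hypothetical helper that only imitates how a single-quoted attribute value ends at the next apostrophe; it is not how CF or a browser is actually implemented.

```python
import re

def extract_single_quoted_value(html):
    """Hypothetical helper: read value='...' up to the next apostrophe."""
    return re.search(r"value='([^']*)'", html).group(1)

original = "^([a-zA-Z'-]+)$"        # contains a literal apostrophe
escaped = r"^([a-zA-Z\x27-]+)$"     # apostrophe written as the escape \x27

field = "<input type='hidden' name='FirstName_CFFORMREGEX' value='{}'>"
truncated = extract_single_quoted_value(field.format(original))
intact = extract_single_quoted_value(field.format(escaped))

# The truncated pattern has an unclosed character class and will not compile.
try:
    re.compile(truncated)
    truncated_compiles = True
except re.error:
    truncated_compiles = False

name_pattern = re.compile(intact)
hyphen_first = re.compile(r"^([-a-zA-Z']+)$")

print(truncated)
print(truncated_compiles)
print(bool(name_pattern.fullmatch("O'Brien")))
print(bool(name_pattern.fullmatch("Anne-Marie")))
print(bool(name_pattern.fullmatch("Bob3")))
```

With the apostrophe written as \x27 the attribute value is no longer cut short, and the resulting pattern still accepts apostrophes and hyphens.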