I used Wget to scrape a site that I am migrating to a new platform. I am trying to clean up the pages and remove all the viewstate code in them. I am using the following regular expression to do this:
<input type="hidden" name="__VIEWSTATE" value=.*/>
This works in programs like Dreamweaver. I like to use another application called Wild Edit, which is extremely fast at search and replace across a large number of files. When I use that same expression there, it matches up to the last /> on the page and removes a lot of good code. I have also tried <input type="hidden" name="__VIEWSTATE" value=.*/>$ with the same results.
How would I constrain this so that it stops at the first match of />?
Try
<input type="hidden" name="__VIEWSTATE" value=.*?/>
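The ?, if it's supported, makes the search ungreedy, so it will only match up to the first /> rather than the last.
If that doesn't work, your best bet may be a negated character class. Exclude > rather than /, because the Base64-encoded viewstate value can itself contain a /:
<input type="hidden" name="__VIEWSTATE" value=[^>]+/>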
The regex is being too greedy. Try this:
<input type="hidden" name="__VIEWSTATE" value=.*?/>
By default, the regex engine tries to make as large a match as possible. For example, the regular expression a.*z will match az (some other middle stuff) az as one big match, since it does start with a and end with z.
The ? modifier tells the regular expression engine to be lazy rather than greedy: instead of grabbing the largest possible match, it grabs the smallest. In the previous example, the regex a.*?z will match just the two az substrings, because it's being lazy: once it sees the z, it stops.
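The difference can be checked with a short sketch. This uses Python's re module rather than Wild Edit or Dreamweaver, so it only illustrates the greedy/lazy behaviour; the page fragment and the viewstate value are made up for the example.

```python
import re

text = "az (some other middle stuff) az"

# Greedy: .* runs to the end of the string and backs up to the last z.
greedy = re.findall(r"a.*z", text)
# Lazy: .*? stops at the first z it can reach.
lazy = re.findall(r"a.*?z", text)

# Hypothetical page fragment; the value attribute is an invented placeholder.
page = ('<form><input type="hidden" name="__VIEWSTATE" value="/wEPDwUKLTI2" />'
        '<p>Keep me</p><br /></form>')

# Greedy version swallows everything up to the last /> (the <br />).
greedy_clean = re.sub(
    r'<input type="hidden" name="__VIEWSTATE" value=.*/>', "", page)
# Lazy version stops at the first />, which closes the input tag itself.
lazy_clean = re.sub(
    r'<input type="hidden" name="__VIEWSTATE" value=.*?/>', "", page)

print(greedy)
print(lazy)
print(greedy_clean)
print(lazy_clean)
```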
My HTML has the following input element (it is intended to accept email addresses that end in ".com"):
<input type="email" name="p_email_ad" id="p_email_ad" value="" required="required" pattern="[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$" maxlength="64">
At some point in the past 2 months, Chrome has started returning the following JavaScript error (and preventing submission of the parent form) when validating that input:
Pattern attribute value
[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$
is not a valid regular expression: Uncaught SyntaxError: Invalid
regular expression:
/[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$/: Invalid escape
Regex101.com likes the regex pattern, but Chrome doesn't. What syntax do I have wrong?
Use
pattern="[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?"
The problem is that some chars that should not be escaped were escaped, like ' and ^ inside the character classes. Note that - inside a character class may be escaped, but does not have to be when it is at the start of the class.
Note also that HTML5 engines wrap the whole pattern inside ^(?: and )$ constructs, so there is no need to use the $ end-of-string anchor at the end of the pattern.
Test:
<form>
<input type="email" name="p_email_ad" id="p_email_ad" value="" required="required" pattern="[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?" maxlength="64">
<input type="Submit">
</form>
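The implicit anchoring can be imitated outside the browser. The sketch below is an assumption-laden emulation in Python: re.fullmatch stands in for the ^(?: ... )$ wrapping, the helper name html_pattern_matches is hypothetical, and Python's regex dialect is not the browser's, so it says nothing about which escapes Chrome accepts. The test strings follow the pattern as quoted, which uses # as the separator.

```python
import re

# The cleaned-up pattern from the answer, split across lines for readability.
PATTERN = (r"[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*"
           r"#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))"
           r"(:[0-9]{1,5})?")


def html_pattern_matches(pattern, value):
    """Emulate the whole-value match that the pattern attribute performs."""
    return re.fullmatch(pattern, value) is not None


ok_com = html_pattern_matches(PATTERN, "user.name#example.com")
ok_org = html_pattern_matches(PATTERN, "user#example.org")
ok_trailing = html_pattern_matches(PATTERN, "user#example.com trailing")
# An unanchored search would still find a match inside the longer string.
found_unanchored = re.search(PATTERN, "user#example.com trailing") is not None

print(ok_com, ok_org, ok_trailing, found_unanchored)
```

Because the whole value has to match, the trailing text is rejected even though the pattern carries no $ of its own.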
I was experiencing the same issue with my application but took a slightly different approach to a solution. My regex has the same issue that the accepted answer describes (special characters being escaped in character classes when they didn't need to be). However, the regex I'm dealing with comes from an external source, so I could not modify it. This kind of regex is usually fine in most languages (it passes validation in PHP), but as we have found out, it breaks with HTML5.
My simple solution: URL-encode the regex before applying it to the input's pattern attribute. That seems to satisfy the HTML5 engine, and it works as expected. JavaScript's encodeURIComponent is a good fit.
I'm sure I'm missing something, but I can't find the reason why this pattern doesn't work. The validator doesn't accept the format of the string I typed (i.e. 06201234567).
<input type="tel" pattern="06\d{7,9}" placeholder="06201234567">
I tried exactly the same code in w3schools' tryit editor, and there was no problem.
In HTML5 you can use <input type='tel'> and <input type='email'>
You can also specify a specific pattern like <input type='tel' pattern='[\+]\d{2}[\(]\d{2}[\)]\d{4}[\-]\d{4}' title='Phone Number (Format: +99(99)9999-9999)'>
Something like pattern='^\+?\d{0,13}' would give you an optional + and up to 13 digits.
The form is in a template in which I replace placeholder texts like {{example}} with other texts, and PHP's preg_replace() search expression {{.*?}} matched {7,9} in the pattern and replaced it.
With the use of ({{.*?}}) everything's OK.
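As a sanity check that the pattern itself accepts the sample number, and as one way to keep template substitution away from regex quantifiers, here is a minimal sketch in Python rather than the PHP used in the question. The render helper, the PLACEHOLDER regex and the label placeholder are all hypothetical; the idea is only that a placeholder regex restricted to {{word}} cannot touch {7,9}.

```python
import re

TEL_PATTERN = r"06\d{7,9}"

# Whole-value match, as the pattern attribute would perform it.
pattern_ok = re.fullmatch(TEL_PATTERN, "06201234567") is not None

# Hypothetical stricter placeholder syntax: two braces around one identifier.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace {{name}} with values[name]; unknown names are left untouched."""
    return PLACEHOLDER.sub(
        lambda m: values.get(m.group(1), m.group(0)), template)


template = r'<label>{{label}}</label><input type="tel" pattern="06\d{7,9}">'
rendered = render(template, {"label": "Phone"})

print(pattern_ok)
print(rendered)
```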
I'm trying to narrow down my RegEx to ignore form elements with type="submit". I only want to select the portion of elements up to the part class="*" but still ignore if type="submit" comes before or after the class.
My regular expression thus far:
(<(?:input|select|textarea){1}.*[^type="submit"]class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
Test case:
Line one should match up to the end of class, and line 2 ignored.
<input type="text" name="name" id="test" class="example-class" max-length="7" required="required">
<input type="submit" class="btn-primary" value="send">
Is this achievable?
Thanks for your comments. The answer was a negative look ahead.
Adding (?!.*type="submit.*) to the start of the regex appears to have given me my desired result.
Working Regex:
(?!.*type="submit.*)(<(?:input|select|textarea).*class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)
This expression is bound to the single tag.
It is better to avoid expressions like .* since they can run on and match a string that begins inside one tag and ends up inside another.
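A quick way to see both points is to run the tag-bound expression against the two test lines from the question, and a loose .* expression against two tags on one line. This is a sketch in Python's re module; the two_tags string is invented for the illustration.

```python
import re

# The tag-bound expression: each attribute is matched individually and
# rejected if it is type="submit".
TAG = re.compile(
    r'(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)')

lines = [
    '<input type="text" name="name" id="test" class="example-class" '
    'max-length="7" required="required">',
    '<input type="submit" class="btn-primary" value="send">',
]
results = [TAG.search(line) is not None for line in lines]

# Two tags on one line (made-up sample) show how .* leaks across tags.
two_tags = '<input type="text"><input type="submit" class="btn">'
loose = re.search(r'<input.*class="', two_tags).group(0)
bound = TAG.search(two_tags).group(0)

print(results)
print(loose)
print(bound)
```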
Yes I know, don't parse html with regex. That said:
I am trying to capture content between any tag with the word "Title" in the first tag.
I started with:
(?P<QUALIFY_TITLE><(.*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
Where the Named Group Capture is a known word/string I am looking for. I also capture for research sake the QUALIFY_TITLE Name group. I do this because I don't want the string/term unless I 'qualify' it in this way.
However, if I have part of an html that looks like this:
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title">KnownTermIWant</title>
Although I get the CAPTURE string I want (KnownTermIWant), the QUALIFY_TITLE string starts from the very first "<".
I am trying to have QUALIFY_TITLE start from the last "<" before the title, not the first. In other words, QUALIFY_TITLE should be:
<div id="divTitle
or even
<div id="divTitle" class="title">
but I am currently getting
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title"
The problem is that a regex-search will try to match at the first possible opportunity, and non-greedy quantifiers (*? instead of *) do not affect whether something is a match. For example, given the string abcd, the regex .*?d will match the whole thing, because .*? will still match as much as it needs to in order to ensure that the regex matches.
Do you see what I mean?
So you need to make your subexpressions more precise; for example, instead of <(.*?)(title)(.*?)>, you should write <([^>]*)(title)([^>]*)>.
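The effect of the more precise subexpression can be seen on a shortened version of the markup. This is a sketch in Python's re module, and the html string is a trimmed, made-up stand-in for the page in the question.

```python
import re

# Shortened stand-in for the markup in the question.
html = '<div class="wwm"><div id="divTitle" class="title">KnownTermIWant</div>'

# Lazy dots: the match still starts at the first "<", because the engine
# takes the leftmost position from which any match is possible.
lazy = re.search(r"<(.*?)(title)(.*?)>", html).group(0)

# Negated classes cannot cross a ">", so the match is confined to one tag.
precise = re.search(r"<([^>]*)(title)([^>]*)>", html).group(0)

print(lazy)
print(precise)
```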
The problem
There's only one problem here: you are matching exactly what you've asked for :)
The process
If you want to match only the last tag, ask yourself this question:
"What is inside every preceding tag, but not inside the one I want?"
The conclusion
The answer is the open/close tags themselves:
(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
                    ^^^^^
Your code was quite a big mess, but I'm going to answer the question in the title, in a much more simplified way:
In this sample code:
<div>Example text<div>Foo bar</div> Hello world <div>Lorem ipsum</div></div> hi
if you want to match from the first <div> to the last </div>, you could just use a greedy quantifier, such as + or *:
/<div>(.*)<\/div>/
That will match the whole string, until the very last </div>.
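For comparison, here is the same sample run through Python's re module (a sketch only; the original regex is written with JavaScript-style delimiters), once with the greedy quantifier and once with its lazy counterpart.

```python
import re

sample = ("<div>Example text<div>Foo bar</div> Hello world "
          "<div>Lorem ipsum</div></div> hi")

# Greedy: runs from the first <div> to the very last </div>.
greedy = re.search(r"<div>(.*)</div>", sample).group(0)
# Lazy: stops at the first </div> it reaches.
lazy = re.search(r"<div>(.*?)</div>", sample).group(0)

print(greedy)
print(lazy)
```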
If this doesn't answer your question, the complexity of the regular expression will rise very quickly (it gets basically exponentially more complex with each extra requirement), so, as you said in your very first line, just use a parser.
I'm trying to sanitize HTML tags, e.g. turn
<input type="image" name="name" src="image.png">
into the correct empty-element form
<input type="image" name="name" src="image.png" />
with a slash at the end.
I'm using Eclipse's Find/Replace with regular expressions like this:
Find: <(input .*)[^/]>
Replace with: <\1 />
But I end up with
<input type="image" name="name" src="image.png />
I.e. the last quote is missing.
Is that an error in my regex, or a bug in Eclipse?
The term [^/] is consuming the quote. Move it inside the captured group:
Find: <(input .*[^/])>
Replace: <\1 />
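The before-and-after behaviour can be reproduced outside Eclipse. The sketch below uses Python's re.sub, which accepts the same \1 backreference in the replacement; it is an illustration of the two patterns, not of Eclipse's own engine.

```python
import re

tag = '<input type="image" name="name" src="image.png">'
already_closed = '<input type="image" name="name" src="image.png" />'

# Original: [^/] sits outside the group, so the closing quote is consumed
# but never written back.
broken = re.sub(r"<(input .*)[^/]>", r"<\1 />", tag)

# Fixed: [^/] is part of the group, so the quote survives.
fixed = re.sub(r"<(input .*[^/])>", r"<\1 />", tag)

# A tag that already ends in /> does not match and is left alone.
untouched = re.sub(r"<(input .*[^/])>", r"<\1 />", already_closed)

print(broken)
print(fixed)
print(untouched)
```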
The error is in your regex. The [^/] at the end consumes the last character before the > (here, the closing quote), and it sits outside the capturing group. \1 represents the first capturing group, which is (input .*). In short, you are getting everything inside the tag except the last character. If you put the [^/] inside your group, your replace should work.
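Also, you may run into issues if you have a / inside one of your attribute values. For performance reasons, I would recommend a regex that cannot run past the closing > of the tag, such as:
<(input [^/>]*(/+[^/>]+)*)>
In this case, it does not have to backtrack across the whole line if you have a / inside one of your attributes. Each run of slashes must be followed by at least one character that is neither / nor >, so a tag that already ends in /> is skipped. Your regex should capture everything you need though.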