Regular expression to exclude matches that contain a string pattern

I'm trying to narrow down my regex so that it ignores form elements with type="submit". I only want to select each element up to and including its class="*" attribute, and the element should be ignored whether type="submit" comes before or after the class.
My regular expression thus far:
(<(?:input|select|textarea){1}.*[^type="submit"]class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
Test case:
Line 1 should match up to the end of the class attribute, and line 2 should be ignored.
<input type="text" name="name" id="test" class="example-class" max-length="7" required="required">
<input type="submit" class="btn-primary" value="send">
Is this achievable?

Thanks for your comments. The answer was a negative lookahead.
Adding (?!.*type="submit.*) to the start of the regex appears to have given me my desired result.
Working Regex:
(?!.*type="submit.*)(<(?:input|select|textarea).*class=")(((?!form\-control)[a-zA-Z0-9_ -])*")

(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)
This expression is bound to a single tag.
It is better to avoid constructs like .*, since they can run past the end of the tag and produce a match that begins inside one tag and ends inside another.
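
A minimal Python sketch of the difference, assuming the test lines from the question are searched with Python's re module (the regex flavor used in the question is not stated, and the names LINE_WIDE, TAG_BOUND, and matches are made up for illustration). It suggests that the lookahead version rejects a whole line as soon as any tag on it has type="submit", while the tag-bound version still finds the text input:

```python
import re

# Lookahead version from the question's follow-up: (?!.*type="submit.*)
# inspects everything up to the end of the line, not just the current tag.
LINE_WIDE = re.compile(
    r'(?!.*type="submit.*)(<(?:input|select|textarea).*class=")'
    r'(((?!form\-control)[a-zA-Z0-9_ -])*")'
)

# Tag-bound version: the lookahead is re-checked before every attribute,
# and [^"]* cannot leave the attribute value it started in.
TAG_BOUND = re.compile(
    r'(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)'
)

TEXT_INPUT = ('<input type="text" name="name" id="test" '
              'class="example-class" max-length="7" required="required">')
SUBMIT_INPUT = '<input type="submit" class="btn-primary" value="send">'


def matches(pattern, text):
    """Return every full match of pattern in text as a list of strings."""
    return [m.group(0) for m in pattern.finditer(text)]


both = TEXT_INPUT + SUBMIT_INPUT      # two tags on one line
print(matches(LINE_WIDE, TEXT_INPUT))  # text input up to the end of class="..."
print(matches(LINE_WIDE, both))        # empty: the later submit tag vetoes the line
print(matches(TAG_BOUND, both))        # only the text input
```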

Pattern attribute value is not a valid regular expression

My HTML has the following input element (it is intended to accept email addresses that end in ".com"):
<input type="email" name="p_email_ad" id="p_email_ad" value="" required="required" pattern="[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$" maxlength="64">
At some point in the past 2 months, Chrome has started returning the following JavaScript error (and preventing submission of the parent form) when validating that input:
Pattern attribute value
[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$
is not a valid regular expression: Uncaught SyntaxError: Invalid
regular expression:
/[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+(\.[\-a-zA-Z0-9~!$%\^&*_=+}{\'?]+)*#([a-zA-Z0-9_][\-a-zA-Z0-9_]*(\.[\-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?$/: Invalid escape
Regex101.com likes the regex pattern, but Chrome doesn't. What syntax do I have wrong?
Use
pattern="[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?"
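The problem is that the pattern escapes characters that do not need escaping inside a character class, such as ' and ^. Browsers compile the pattern attribute as a Unicode-mode regular expression, where an unnecessary escape such as \' is a syntax error instead of being silently accepted. Note that - inside a character class may be escaped, but does not have to be when it is at the start of the class.
Note also that HTML5 engines wrap the whole pattern inside ^(?: and )$ constructs, so there is no need for the $ end-of-string anchor at the end of the pattern.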
Test:
<form>
<input type="email" name="p_email_ad" id="p_email_ad" value="" required="required" pattern="[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))(:[0-9]{1,5})?" maxlength="64">
<input type="Submit">
</form>
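
The anchoring described above can be approximated outside the browser. The following is a rough Python sketch, not the browser's actual algorithm: html_pattern_matches is a hypothetical helper that wraps a pattern in ^(?: and )$ the way the answer describes. Python's re module is more lenient than Chrome about escapes, so this does not reproduce the "Invalid escape" error; it only illustrates why the trailing $ is redundant.

```python
import re

# Cleaned-up pattern from the answer, copied as-is (including the "#").
CLEANED = (r"[-a-zA-Z0-9~!$%^&*_=+}{'?]+(\.[-a-zA-Z0-9~!$%^&*_=+}{'?]+)*"
           r"#([a-zA-Z0-9_][-a-zA-Z0-9_]*(\.[-a-zA-Z0-9_]+)*\.([cC][oO][mM]))"
           r"(:[0-9]{1,5})?")

SAMPLES = ("john.doe#example.com", "john#example.com:8080",
           "john#example.org", "john#example.com/extra")


def html_pattern_matches(pattern, value):
    """Hypothetical helper: emulate the implicit ^(?:...)$ wrapping."""
    return re.search("^(?:" + pattern + ")$", value) is not None


for value in SAMPLES:
    # The verdict is the same with and without the explicit trailing "$".
    print(value,
          html_pattern_matches(CLEANED, value),
          html_pattern_matches(CLEANED + "$", value))
```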
I was experiencing the same issue with my application but took a slightly different approach to a solution. My regex has the same issue that the accepted answer describes (special characters being escaped in character classes when they didn't need to be). However, the regex I'm dealing with comes from an external source, so I could not modify it. This kind of regex is fine in most languages (it passes validation in PHP), but as we have found out, it breaks with HTML5.
My simple solution: URL-encode the regex before applying it to the input's pattern attribute. In my case that seemed to satisfy the HTML5 engine, and it worked as expected. JavaScript's encodeURIComponent is a good fit.

Regex not working in HTML5 pattern

I have this regex, which is intended to accept any text except strings that start with the sequence "34":
^(?!34)(?=([\w]+))
The regex is working fine for me in https://regex101.com/r/iN1yN3/2 , check the tests to see the intended behavior.
Any idea why it isn't working in my form?
<form>
<input pattern="^(?!34)(?=([\w]+))" type="text">
<button type="submit">Submit!</button>
</form>
The pattern attribute has to match the entire string. Assertions check for a match, but do not count towards the total match length. Changing the second assertion to \w+ will make the pattern match the entire string.
You can also skip the implied ^, leaving you with just:
<input pattern="(?!34)\w+" type="text">
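
To see why the original pattern can never validate, here is a small Python sketch. It assumes that wrapping the pattern in ^(?: and )$ is a fair stand-in for what the pattern attribute does, and is_valid is a made-up helper name. re.ASCII is used so that \w stays ASCII-only, as it is in JavaScript.

```python
import re

ORIGINAL = r"^(?!34)(?=([\w]+))"  # only assertions: consumes no characters
FIXED = r"(?!34)\w+"              # \w+ actually consumes the input


def is_valid(pattern, value):
    """Made-up helper: the whole value must match, as with the pattern attribute."""
    return re.search("^(?:" + pattern + ")$", value, re.ASCII) is not None


for value in ("hello", "1234", "3456", "34"):
    print(value, is_valid(ORIGINAL, value), is_valid(FIXED, value))
```

The original pattern is rejected for every value because, after the two zero-width assertions succeed, the implicit end-of-string anchor is still tested at position 0.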

Matching from the last occurrence of a character in a string with Regex

Yes I know, don't parse html with regex. That said:
I am trying to capture the content of any element whose opening tag contains the word "Title".
I started with:
(?P<QUALIFY_TITLE><(.*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
Here the named group CAPTURE is a known word/string I am looking for. I also capture the QUALIFY_TITLE named group for research purposes, because I don't want the string/term unless I can 'qualify' it in this way.
However, if I have part of an html that looks like this:
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title">KnownTermIWant</title>
Although I get the CAPTURE string I want (KnownTermIWant), the QUALIFY_TITLE string starts from the very first "<".
I am trying to have QUALIFY_TITLE start capturing from the last "<" before the title, not the first. In other words, QUALIFY_TITLE should be:
<div id="divTitle
or even
<div id="divTitle" class="title">
but I am currently getting
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title"
The problem is that a regex-search will try to match at the first possible opportunity, and non-greedy quantifiers (*? instead of *) do not affect whether something is a match. For example, given the string abcd, the regex .*?d will match the whole thing, because .*? will still match as much as it needs to in order to ensure that the regex matches.
Do you see what I mean?
So you need to make your subexpressions more precise; for example, instead of <(.*?)(title)(.*?)>, you should write <([^>]*)(title)([^>]*)>.
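
A short Python sketch of both points, using a cut-down version of the markup from the question (the SNIPPET string is a shortened stand-in, not the original page):

```python
import re

# A lazy quantifier still matches as much as it must for the overall match
# to succeed, and the search still starts at the leftmost possible position.
print(re.search(r".*?d", "abcd").group(0))

SNIPPET = '<div class="wwm"><div id="divTitle" class="title">KnownTermIWant</div>'

# Lazy dot: the match starts at the first "<" and stretches across tags.
lazy = re.search(r"<(.*?)(title)(.*?)>", SNIPPET).group(0)

# Negated class: [^>]* cannot cross a ">", so the match stays inside one tag.
precise = re.search(r"<([^>]*)(title)([^>]*)>", SNIPPET).group(0)

print(lazy)
print(precise)
```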
The problem
There's only one problem here: you are matching exactly what you've asked for :)
The process
If you want to match only the last tag, ask yourself this question:
"What is inside every preceding tag, but not inside the one I want?"
The conclusion
The answer is the open/close tags themselves:
(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
                    ^^^^^
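
As a rough check, the amended pattern can be run with Python's re module against a shortened version of the markup. Two assumptions to note: Python only accepts the (?P<name>...) spelling for named groups, so (?<CAPTURE> is written as (?P<CAPTURE> here, and the MARKUP string is an abbreviated stand-in for the page in the question.

```python
import re

PATTERN = re.compile(
    r"(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?"
    r"(?P<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)"
)

MARKUP = ('<div class="wwm"><div class="inbox">'
          '<input name="language-id" type="hidden" id="language-id" value="" />'
          '<div id="divTitle" class="title">KnownTermIWant</title>')

match = PATTERN.search(MARKUP)
# [^<>]*? cannot cross a tag boundary, so the match has to start at the
# last "<" before "title" instead of the first "<" in the markup.
print(match.group("QUALIFY_TITLE"))
print(match.group("CAPTURE"))
```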
Your code was quite a big mess, but I'm going to answer the question in the title in a much simpler way:
In this sample code:
<div>Example text<div>Foo bar</div> Hello world <div>Lorem ipsum</div></div> hi
if you want to match from the first <div> to the last </div>, you could just use a greedy quantifier, such as + or *:
/<div>(.*)<\/div>/
That will match the whole string, until the very last </div>.
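
For illustration, here is the same greedy match in Python, assuming the re module behaves like the engine used above. The lazy variant is shown only for contrast and is not part of the answer.

```python
import re

SAMPLE = ("<div>Example text<div>Foo bar</div> Hello world "
          "<div>Lorem ipsum</div></div> hi")

greedy = re.search(r"<div>(.*)</div>", SAMPLE).group(0)
lazy = re.search(r"<div>(.*?)</div>", SAMPLE).group(0)

print(greedy)  # runs from the first <div> to the last </div>
print(lazy)    # stops at the first </div>
```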
If this doesn't answer your question, the complexity of the regular expression will grow very quickly (basically exponentially with each extra requirement), so like you said in your very first line, just use a parser.

Keeping Regex search to one line

I used Wget to scrape a site for migration to a new platform. I am trying to clean up the pages and remove all the viewstate code in them. I am using the following regular expression to do this:
<input type="hidden" name="__VIEWSTATE" value=.*/>
This works in programs like Dreamweaver. I like to use another application called Wild Edit, which is extremely fast at search and replace across a large number of files. When I use that same expression there, it matches up to the last /> on the page and removes a lot of good code. I have also tried <input type="hidden" name="__VIEWSTATE" value=.*/>$ with the same results.
How would I constrain this so that it stops at the first />?
Try
<input type="hidden" name="__VIEWSTATE" value=.*?/>
The ?, if it's supported, makes the quantifier non-greedy, so it will only match until the first /> rather than the last.
If that doesn't work, your best bet may be:
<input type="hidden" name="__VIEWSTATE" value=[^/]+/>
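
A quick Python sketch comparing the three patterns. The PAGE strings are invented sample data, and Wild Edit's regex engine may behave differently from Python's re. One caveat the sketch makes visible: [^/]+ stops working when the value itself contains a slash, which Base64-encoded view state can.

```python
import re

GREEDY = r'<input type="hidden" name="__VIEWSTATE" value=.*/>'
LAZY = r'<input type="hidden" name="__VIEWSTATE" value=.*?/>'
NEGATED = r'<input type="hidden" name="__VIEWSTATE" value=[^/]+/>'

# Invented sample data: a view-state field followed by markup worth keeping.
PAGE = ('<input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />'
        '<img src="logo.png" />')
PAGE_WITH_SLASH = PAGE.replace("dDwtMTA4", "dDw/MTA4")


def remove_matches(pattern, page):
    """Remove every match of pattern from page."""
    return re.sub(pattern, "", page)


print(repr(remove_matches(GREEDY, PAGE)))   # everything up to the last /> is removed
print(repr(remove_matches(LAZY, PAGE)))     # only the view-state field is removed
print(repr(remove_matches(NEGATED, PAGE)))  # same result as LAZY for this input
print(repr(remove_matches(NEGATED, PAGE_WITH_SLASH)))  # no match: the field survives
```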
The regex is being too greedy. Try this:
<input type="hidden" name="__VIEWSTATE" value=.*?/>
By default, the regex engine tries to make as large a match as possible. For example, the regular expression a.*z will match az (some other middle stuff) az as one big match, since it does start with a and end with z.
The ? modifier tells the regular expression engine to, rather than be greedy, be lazy: instead of grabbing the largest possible match, grab the smallest. In the previous example, the regex a.*?z will just match the 2 az substrings, because it's being lazy: once it sees the z, it stops.
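
The same comparison as a small Python sketch, assuming Python's re module behaves like the engine being discussed and reusing the example string from this answer:

```python
import re

TEXT = "az (some other middle stuff) az"

greedy_matches = re.findall(r"a.*z", TEXT)
lazy_matches = re.findall(r"a.*?z", TEXT)

print(greedy_matches)  # greedy: one match spanning the whole string
print(lazy_matches)    # lazy: the two short "az" substrings
```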

Non-greedy regex acts greedily

Here's a simple example:
Text: <input name="zzz" value="18754" type="hidden"><input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
Regex: /<input.*?value="(18754|17138)".*?>/
When matches are replaced by an empty string, the result is an empty string. I expected the middle <input> to remain since I am using non-greedy matching (.*?). Could anyone explain why it is removed?
There are two matches:
<input name="zzz" value="18754" type="hidden">
<input name="zzz" value="18311" type="hidden"><input name="zzz" value="17138" type="hidden">
In the second case, the first .*? matches name="zzz" value="18311" type="hidden"><input name="zzz". It's a match and it's non-greedy.
aix already explained why it matches the middle part.
To avoid this behaviour, get rid of the .*?, instead try this:
/<input[^>]*value="(18754|17138)"[^>]*>/
Instead of matching any character, match any character except ">".
aix's answer is correct: the second match includes the 2nd and 3rd input tags.
One possible fix for your regex would be to change . to [^>], like this:
/<input[^>]*?value="(18754|17138)"[^>]*?>/
That will cause it to match any character except >. But that has the obvious problem of breaking whenever > shows up inside a quoted literal. As everyone always says: Regexes aren't designed to work on HTML. Don't use them unless you have no other choice.
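
A Python sketch of the behaviour discussed in this thread, assuming Python's re module matches the way the original engine does. TRICKY is an invented input that illustrates the quoted ">" caveat.

```python
import re

TEXT = ('<input name="zzz" value="18754" type="hidden">'
        '<input name="zzz" value="18311" type="hidden">'
        '<input name="zzz" value="17138" type="hidden">')

LAZY_DOT = r'<input.*?value="(18754|17138)".*?>'
NEGATED = r'<input[^>]*value="(18754|17138)"[^>]*>'

# Lazy dot: the second match starts at the middle tag and runs into the third.
lazy_matches = [m.group(0) for m in re.finditer(LAZY_DOT, TEXT)]
print(lazy_matches)
print(repr(re.sub(LAZY_DOT, "", TEXT)))

# Negated class: each match is confined to one tag, so the middle tag stays.
print(re.sub(NEGATED, "", TEXT))

# The caveat from the answer: a ">" inside a quoted value ends [^>]* early.
TRICKY = '<input name="a>b" value="18754" type="hidden">'
print(re.search(NEGATED, TRICKY))
```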