Can't write regexp [closed] - regex

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
Hi, I need to capture the value mjxjezhmgrutgevclt0qtyayiholcdctuxbwb into $submitkey. What's wrong with my code?
my $str = '<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>';
($submitkey) = $str =~ m/value="(.*?)" name="fr.submitKey"/;
print $submitkey;

Never use .*?. It's never what you are actually trying to do. Even if you get it to work, it's far too likely to create extremely bad performance when there is no match. In this case, use [^"]*

Your pattern matches from the first instance of value=" all the way to "fr.submitKey".
Take advantage of the fact that every value is contained within quotes; only look for non-quote characters as part of the value.
Additionally, it is cleaner to use the special capturing-group variables:
my $str = '<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>';
# Only read $1 after a successful match; otherwise it may hold a stale value.
if ($str =~ m/value="([^"]*)" name="fr\.submitKey"/) {
    my $submitkey = $1;
    print $submitkey;
}

.*? does not cause Perl to search for the shortest possible match inside the whole string. The regex engine tries start positions from left to right, so the text before the .*? matches at the first value=" in the string, and Perl is satisfied as soon as it finds a match starting there. .*? simply means that it matches as few characters as possible from that first point where the part before .*? matches.
As @ikegami said: use [^"]* instead in your particular case.
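
The difference between the two patterns can be checked directly. The sketch below is a Python translation (the thread itself uses Perl), run against a shortened copy of the question's markup; the variable names are illustrative assumptions.

```python
import re

# Shortened copy of the markup from the question (only the two hidden inputs).
html = (
    '<input type="hidden" value="set" name="fr.posted"></input>'
    '<input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" '
    'name="fr.submitKey"></input>'
)

# Lazy dot: the match still starts at the FIRST value=" in the string.
lazy = re.search(r'value="(.*?)" name="fr\.submitKey"', html)

# Negated class: the capture cannot cross a quote, so the first input is skipped.
strict = re.search(r'value="([^"]*)" name="fr\.submitKey"', html)

print(lazy.group(1))    # starts with: set" name="fr.posted">...
print(strict.group(1))  # only the submit key
```

The lazy version captures everything from set up to the key, which is the behaviour described above.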

Much better to use a real DOM parser for this task. I like Mojo::DOM, which is part of the Mojolicious tool suite. Note that use Mojo::Base -strict enables strict, warnings, utf8 and newer Perl features such as say. The at method returns the first element that matches a CSS selector.
#!/usr/bin/env perl
use Mojo::Base -strict;
use Mojo::DOM;
my $dom = Mojo::DOM->new(<<'END');
<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>
END
my $submit_key = $dom->at('[name="fr.submitKey"]')->{value};
say $submit_key;
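
For comparison, the same parser-based idea can be sketched with Python's standard-library html.parser, used here only as a stand-in for Mojo::DOM. The class name InputValueFinder is hypothetical, and the markup is shortened to the two hidden inputs.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Remembers the value of the first <input> with the wanted name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            self.value is None
            and tag == "input"
            and attributes.get("name") == self.wanted_name
        ):
            self.value = attributes.get("value")


finder = InputValueFinder("fr.submitKey")
finder.feed(
    '<input type="hidden" value="set" name="fr.posted"></input>'
    '<input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" '
    'name="fr.submitKey"></input>'
)
print(finder.value)
```

The lookup is by attribute name, so the order of attributes inside the tag no longer matters.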

Related

Perl regex remove whitespace before close tag [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 12 days ago.
I need to remove whitespace before the closing tag using a Perl regex.
From
<span class="inf">cranium </span>
<span class="inf">craniums </span>
<span class="inf">crania </span>
to
<span class="inf">cranium</span>
<span class="inf">craniums</span>
<span class="inf">crania</span>
Using:
find . -type f -exec perl -pi -w -e 's/(\s)([\<\/span>])/$2/' \{\} \;
What am I doing wrong?
One can capture a pattern and then put back only that in the replacement, effectively removing everything else that was matched. That is what the question attempts, and it can be done like so:
s{\s+(</span>)}{$1}g
We match whitespace† and the pattern </span> immediately following it, and replace all of that with what was captured in the first (left-most) set of parentheses ($1). That's it.‡
The /g modifier is there so that this is done throughout the whole string.
Or, using lookahead
s{\s+(?=</span>)}{}g
Now the </span> pattern isn't "consumed" out of the string but is only "asserted" to be there, following the spaces; so we don't need to "put it back", and the empty replacement removes only the whitespace.
There is also no need to escape {} in the find command.
† This includes all kinds of "whitespace." See, for instance, perlrecharclass.
‡ A comment on [] used in the question: that's a "character class." It matches any one of the characters listed inside, with some restrictions and modifications. See linked docs.
So [\<\/span>] matches either of the characters: <, /, s, p, a, n, >. The \ is used to escape < and / so it isn't matched itself. (However, escaping those would be unneeded, except for / if that is the delimiter for the whole regex.)
See perlrecharclass, and the tutorial perlretut. The full, top-level, reference is perlre.
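
Both substitutions, and the character-class pitfall from the footnote, can be tried out in a few lines. This is a Python sketch rather than the Perl one-liner; the "crania plural" sample line is made up to show where the character class misfires.

```python
import re

lines = '<span class="inf">cranium </span>\n<span class="inf">crania </span>\n'

# Capture the closing tag and put it back.
with_capture = re.sub(r'\s+(</span>)', r'\1', lines)

# Assert the closing tag with a lookahead; nothing needs to be put back.
with_lookahead = re.sub(r'\s+(?=</span>)', '', lines)

# The pattern from the question: [</span>] is a set of single characters,
# so a space followed by "p" matches before the closing tag is ever reached.
misfire = re.sub(
    r'(\s)([</span>])', r'\2',
    '<span class="inf">crania plural </span>',
    count=1,  # Perl's s/// without /g replaces only the first match
)

print(with_capture)
print(with_lookahead)
print(misfire)
```

On the made-up line the character-class version glues "crania" and "plural" together and leaves the space before the closing tag in place.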

regular expression exclude match that contains a string pattern

I'm trying to narrow down my RegEx to ignore form elements with type="submit". I only want to select the portion of elements up to the part class="*" but still ignore if type="submit" comes before or after the class.
My regular expression thus far:
(<(?:input|select|textarea){1}.*[^type="submit"]class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
Test case:
Line one should match up to the end of class, and line 2 ignored.
<input type="text" name="name" id="test" class="example-class" max-length="7" required="required">
<input type="submit" class="btn-primary" value="send">
Is this achievable?
Thanks for your comments. The answer was a negative look ahead.
Adding (?!.*type="submit.*) to the start of the regex appears to have given me my desired result.
Working Regex:
(?!.*type="submit.*)(<(?:input|select|textarea).*class=")(((?!form\-control)[a-zA-Z0-9_ -])*")
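
As a quick check of the lookahead version, the sketch below runs it in Python against the two test lines from the question, one line at a time. The third sample, with two tags on one line, is an added assumption to show a limitation.

```python
import re

pattern = re.compile(
    r'(?!.*type="submit.*)(<(?:input|select|textarea).*class=")'
    r'(((?!form\-control)[a-zA-Z0-9_ -])*")'
)

line_one = ('<input type="text" name="name" id="test" class="example-class" '
            'max-length="7" required="required">')
line_two = '<input type="submit" class="btn-primary" value="send">'
shared_line = '<input type="text" class="a"><input type="submit" class="b">'

first = pattern.search(line_one)
second = pattern.search(line_two)
shared = pattern.search(shared_line)

print(first.group(1))  # everything up to and including class="
print(first.group(2))  # the class list plus its closing quote
print(second)          # None: the lookahead rejects the submit input
print(shared)          # None: the text input is rejected as well
```

Because the lookahead scans to the end of the line, a text input that shares a line with a submit button is skipped too.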
(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)
This expression is confined to a single tag.
It is better to avoid expressions like .*, since they can run on and match a string that begins inside one tag and ends inside another.
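
A small Python sketch of the tag-bound expression, assuming it is applied to markup where several tags may share a line (the sample strings are made up):

```python
import re

tag_pattern = re.compile(
    r'(<(?:input|select|textarea)\s((?!type="submit")[\w\-]+\b="[^"]*"\s?)*>)'
)

# Two tags on one line: only the non-submit tag should be reported.
same_line = '<input type="text" class="a"><input type="submit" class="b">'
matches = [m.group(1) for m in tag_pattern.finditer(same_line)]

# The submit attribute is rejected even when it is not the first attribute.
class_first = tag_pattern.search('<input class="b" type="submit">')

print(matches)
print(class_first)
```

Each attribute is consumed one at a time up to the closing >, so a match cannot leak from one tag into the next.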

Keeping Regex search to one line

I used Wget to scrape a site for migration to a new platform. I am trying to clean up the pages and remove all the viewstate code in them. I am using the following regular expression to do this:
<input type="hidden" name="__VIEWSTATE" value=.*/>
This works in programs like Dreamweaver. I like to use another application called Wild Edit, which is extremely fast at search and replace across a large number of files. When I use the same expression there, it matches up to the last /> on the page and removes a lot of good code. I have also tried <input type="hidden" name="__VIEWSTATE" value=.*/>$ with the same results.
How would I constrain this so that it stops at the first />?
Try
<input type="hidden" name="__VIEWSTATE" value=.*?/>
The ?, if it's supported, makes the search ungreedy, so it will match only up to the first /> rather than the last.
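If that doesn't work, your best bet may be a negated character class. Exclude the quote rather than the slash, because a Base64-encoded view state can itself contain /:
<input type="hidden" name="__VIEWSTATE" value="[^"]*"\s*/>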
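
The candidate patterns can be compared side by side. The Python sketch below uses a made-up page fragment; the view-state value is a hypothetical Base64-style string that contains slashes.

```python
import re

# Hypothetical fragment; the value is invented for illustration only.
page = ('<input type="hidden" name="__VIEWSTATE" value="/wEPDwUK/abc=" />'
        '<p>keep me</p><br />')

prefix = r'<input type="hidden" name="__VIEWSTATE" value='

greedy = re.sub(prefix + r'.*/>', '', page)            # runs to the LAST />
lazy = re.sub(prefix + r'.*?/>', '', page)             # stops at the first />
no_slash = re.sub(prefix + r'[^/]+/>', '', page)       # stuck on "/" in the value
no_quote = re.sub(prefix + r'"[^"]*"\s*/>', '', page)  # bounded by the quotes

print(repr(greedy))
print(lazy)
print(no_slash == page)
print(no_quote)
```

A class that excludes the slash finds nothing on this sample, while the lazy quantifier and the quote-bounded class both remove just the hidden input.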
The regex is being too greedy. Try this:
<input type="hidden" name="__VIEWSTATE" value=.*?/>
By default, the regex engine tries to make as large a match as possible. For example, the regular expression a.*z will match az (some other middle stuff) az as one big match, since, well, it does start with a and end with z.
The ? modifier tells the regex engine to be lazy rather than greedy: instead of grabbing the largest possible match, grab the smallest. In the previous example, the regex a.*?z matches just the two az substrings, because it's being lazy: once it sees the z, it stops.
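
The a.*z example above behaves the same way in Python's re module, which makes for an easy check:

```python
import re

text = "az (some other middle stuff) az"

greedy_match = re.search(r'a.*z', text).group()  # one big match
lazy_matches = re.findall(r'a.*?z', text)        # each short match separately

print(greedy_match)
print(lazy_matches)
```

The greedy pattern returns the whole string, and the lazy pattern returns the two az substrings.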

I need a regular expression that can match ending tags [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 6 years ago.
I need a regular expression that can match ending tags such as </something> and any and ALL data after it. Please help!
Example:
$html = '
<div id="footer">
<div class="wrap">
<strong class="logo">College</strong>
<ul><li>Emergencies</li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>
li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>';
$html = preg_replace("#</html>.*#i", '', $html);
print ($html);
You're trying to parse HTML with regular expressions. Regular expressions are inadequate for parsing HTML safely. What you need is an HTML parser. Take a look at PHP's DOM module.
Tags can be hidden inside comments, CDATA, script and other places, and/or the markup could just be invalid. If you say it's not markup of any kind, you could do something like this:
/<\/something\s*>((?:(?!<\/something\s*>)[\S\s])+)/ then peel off capture group 1 in a global loop. You don't need to capture the tag unless it's an alternation such as (?:something|something_else|...)
EDIT
Your example doesn't work because you are not using the /s modifier. It works in Perl as $html =~ s/<\/html>.*//s;. Without the /s modifier, $html =~ s/<\/html>[\S\s]*//; also works.
Change yours to #</html>[\S\s]*#i or use the /s modifier. The dot . matches any character except a newline; with the /s modifier it matches newlines too.
I just tried it: use $html = preg_replace("#</html>.*#is", '', $html);
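
The effect of the /s modifier can be reproduced in Python, where the equivalent flag is re.DOTALL. This sketch uses a shortened, made-up version of the question's input rather than PHP's preg_replace.

```python
import re

html = "</div>\n</body></html>\nli>\n<li>Contact</li>\n"

# Without DOTALL the dot stops at the newline, so later lines survive.
without_s = re.sub(r'</html>.*', '', html, flags=re.IGNORECASE)

# With DOTALL (the /s modifier) the dot also matches newlines.
with_s = re.sub(r'</html>.*', '', html, flags=re.IGNORECASE | re.DOTALL)

# [\S\s] matches any character, newline included, without the flag.
char_class = re.sub(r'</html>[\S\s]*', '', html, flags=re.IGNORECASE)

print(repr(without_s))
print(repr(with_s))
print(repr(char_class))
```

Only the first variant leaves the trailing lines behind, which matches the behaviour described in the answer.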
@"</[\da-zA-Z]+>.*"
or, for a specific tag,
@"</myTag>.*"
Make sure to set the regex options to ignore case and to let . match newlines. Also check whether an XML parser would serve you better.
I don't think this will change your mind, but regexes probably aren't the best way to pull ending tags out of HTML anyway. Jeff Atwood wrote a great essay about why this is not the best approach to this particular problem:
Parsing Html The Cthulhu Way

Regular expression lookbehind problem

I use
(?<!value=\")##(.*)##
to match string like ##MyString## that's not in the form of:
<input type="text" value="##MyString##">
This works for the above form, but not for this: (It still matches, should not match)
<input type="text" value="Here is my ##MyString## coming..">
I tried:
(?<!value=\").*##(.*)##
with no luck. Any suggestions will be deeply appreciated.
Edit: I am using PHP preg_match() function
This is not perfect (that's what HTML parsers are for), but it will work for the vast majority of HTML files:
(^|>)[^<>]*##[^#]*##[^<>]*(<|$)
The idea is simple. You're looking for a string that is outside of tags. To be outside of tags, the closest preceding angled bracket to it must be closing (or there's no bracket at all), and the closest following one must be opening (or none). This assumes that angled brackets are not used in attribute values.
If you actually care that the attribute name be "value", then you can match for:
value\s*=\s*"([^\"]|\\\")*##[^#]*##([^\"]|\\\")*\"
... and then simply negate the match (!preg_match(...)).
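
The outside-of-tags pattern can be exercised on the question's samples. This is a Python sketch (the question uses PHP's preg_match()); the in_text and bare strings are made-up examples of the marker appearing outside a tag.

```python
import re

outside_tags = re.compile(r'(^|>)[^<>]*##[^#]*##[^<>]*(<|$)')

in_attribute = '<input type="text" value="Here is my ##MyString## coming..">'
in_text = '<p>Hello ##MyString## world</p>'
bare = '##MyString##'

print(outside_tags.search(in_attribute))         # None: marker sits inside a tag
print(outside_tags.search(in_text) is not None)  # marker sits between tags
print(outside_tags.search(bare) is not None)     # no tags at all
```

As the answer notes, this relies on angle brackets never appearing inside attribute values.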
@OP, you can do it simply without regex.
$text = '<input type="text" value=" ##MyString##">';
$text = str_replace(" ", "", $text);
if (strpos($text, 'value="##') !== FALSE) {
    $s = explode('value="##', $text);
    $t = explode("##", $s[1]);
    print "$t[0]\n";
}
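Here is a starting point, at least; it works for the given examples in engines that support variable-length lookbehind (such as .NET). PCRE, which PHP's preg_match() uses, does not allow the unbounded [^>]* inside a lookbehind.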
(?<!<[^>]*value="[^>"]*)##(.*)##