I have the following code that returns the first image of the post:
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i',
$post->post_content, $matches);
However, it returns any image. I need to ignore images in GIF format; how can I add this condition to the regular expression?
Easier to loop through the results and use a different regex.
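$output = preg_match_all('/<img[^>]+?src=[\'"](.+?)[\'"].*?>/i', $post->post_content, $matches);
$noGif = [];
// $matches[1] holds the captured src values ($matches[0] holds the full tags)
foreach ($matches[1] as $imgSrc)
{
    // keep every source that does not end in .gif
    if (!preg_match('/\.gif$/i', $imgSrc))
    {
        $noGif[] = $imgSrc;
    }
}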
It is easier to understand, and there won't be unexpected side effects like blocking valid pictures that happen to have the letters "gif" in the file name.
Note: be very careful when using .+ and .*. As it stands, your regex matches a LOT more than you think:
Try it on this, for instance:
<img whatever> whatever <img src="mypic.png"> <some other tag>
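A quick way to see the over-matching is to run the question's pattern against that sample. The sketch below uses Python's re module purely for illustration (the thread itself is PHP), and the tightened pattern is one possible fix, not the only one.

```python
import re

sample = '<img whatever> whatever <img src="mypic.png"> <some other tag>'

# Pattern from the question: .+ and .* are free to run across tag boundaries.
greedy = re.compile(r'<img.+src=[\'"]([^\'"]+)[\'"].*>', re.I)

# Tightened variant: [^>] keeps every piece inside a single tag.
bounded = re.compile(r'<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>', re.I)

greedy_match = greedy.search(sample).group(0)
bounded_match = bounded.search(sample).group(0)

print(greedy_match)   # spans from the first <img to the last > in the sample
print(bounded_match)  # only the tag that actually carries the src attribute
```

Both patterns capture mypic.png here, but the first one swallows three tags to do it, which matters as soon as the full match is used for anything.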
You should probably not be using regular expressions
HTML is not regular
Regexes may match today, but what about tomorrow?
Say you've got a file of HTML where you're trying to extract URLs from tags.
<img src="http://example.com/whatever.jpg">
So you write a regex like this (in Perl):
if ( $html =~ /<img src="(.+)"/ ) {
$url = $1;
}
In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:
<img src='http://example.com/whatever.jpg'>
or
<img src=http://example.com/whatever.jpg>
or
<img border=0 src="http://example.com/whatever.jpg">
or
<img
src="http://example.com/whatever.jpg">
or you start getting false positives from
<!-- <img src="http://example.com/outdated.png"> -->
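For comparison, here is a minimal sketch of the parser-based approach, written in Python with the standard library's html.parser rather than the Perl used above. The class name ImgSrcCollector and the a.jpg to e.jpg file names are made up for illustration; the markup variants are the ones listed in this answer.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with names lower-cased
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


html = """<img src="http://example.com/a.jpg">
<img src='http://example.com/b.jpg'>
<img src=http://example.com/c.jpg>
<img border=0 src="http://example.com/d.jpg">
<img
src="http://example.com/e.jpg">
<!-- <img src="http://example.com/outdated.png"> -->"""

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```

Quoting style, attribute order and line breaks are handled by the parser, and the commented-out tag is reported through a separate comment callback, so it never reaches handle_starttag.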
<img[^>]+src=[\'"]((?:[^\'"](?!\.gif))+)[\'"][^>]*>
Updated to have only one capture.
Fixed to include dot. Now would only fail on strange things like a.gif.jpg
Also added safety matches as suggested in comment.
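To make the a.gif.jpg caveat concrete, the following Python sketch compares a single-pattern negative lookahead of this kind (with the capture group wrapped around the whole src value) against the two-step "extract, then filter" approach from the other answer. The file names are invented test data.

```python
import re

html = '<img src="a.gif"> <img src="b.png"> <img src="c.gif.jpg">'

# One step: no character of the src value may be followed by ".gif".
lookahead = re.compile(
    r'<img[^>]+src=[\'"]((?:[^\'"](?!\.gif))+)[\'"][^>]*>', re.I
)
one_step = lookahead.findall(html)

# Two steps: capture every src, then drop the ones that END in ".gif".
any_src = re.compile(r'<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>', re.I)
two_step = [
    src for src in any_src.findall(html)
    if not re.search(r'\.gif$', src, re.I)
]

print(one_step)
print(two_step)
```

The lookahead version rejects any source containing ".gif" anywhere, so c.gif.jpg is dropped; the end-anchored filter keeps it.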
I have an HTML document as a string and would like to extract all its links with one regex command/pattern (for better performance), instead of searching for every tag separately (which is the only way I know to solve it).
HTML example:
<img src="..." data-full-resolution="..." />
<object data="..."/>
Please consider also that the image tag has two attributes that should be extracted (src and data-full-resolution).
The programming language is intentionally left out, as I need a 'raw' solution without HTML libraries.
(?:data-full-resolution|src|href|data)=\"(.*?)\"
Regex Explanation
(?: Non-capturing group
data-full-resolution|src|href|data One of data-full-resolution, src, href or data
) Close non-capturing group
=\" Match =" after an attribute name
( Capturing group
.*? Non-greedy capturing till the next quote
) Close group
\" Match the close quote
See regex demo
Python Example
import re
html = """
<img src="<link-src>" data-full-resolution="<link-data-full-resolution>" />
<object data="<link-data>"/>"""
print(re.findall(r"(?:data-full-resolution|src|href|data)=\"(.*?)\"", html)) # ['<link-src>', '<link-data-full-resolution>', '<link-data>']
Where re.findall returns the list of captured groups.
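If it matters which attribute each link came from, one small variation is to turn the non-capturing group into a capturing one, so that re.findall returns (attribute, value) pairs. This is only a sketch: the sample markup and file names below are invented, and like the pattern above it assumes double-quoted attribute values.

```python
import re

html = (
    '<a href="page.html">x</a>\n'
    '<img src="small.png" data-full-resolution="big.png" />\n'
    '<object data="movie.swf"/>'
)

# Two capture groups: the attribute name and its double-quoted value.
# The longer name is listed first so "data" cannot cut it short.
link_attr = re.compile(r'(data-full-resolution|src|href|data)="(.*?)"')

pairs = link_attr.findall(html)
for name, value in pairs:
    print(f"{name}: {value}")
```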
I need a regex pattern for the following requirement:
Sample string :
<html> a test of strength and <h1> valour </h1> for <<<NOT>>> faint hearted <b> BUT </b> protoganist having their characters <<<CARVED>>> out of gibralter <b> ROCK </b>
The above is a single string, from which I want to strip out every HTML tag while retaining markers of the form <<<xyz>>>.
My attempt:
(^|\n| )<[^>]*>(\n| |$)
Can someone please critically review this ?
This is what I've come up with. It uses lookaround assertions to identify HTML tags by what precedes and follows them, without including that context in the match. The point is to match < and > only when they are next to spaces or letters, not other < or > characters. Is this what you are after, or did I misread you?
(?=([ A-z]?))<{1}\/?[A-z1-6]+>{1}(?=[^>])
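The same idea can be tried out in Python. The pattern below is a hedged variation, not the one from this answer: it uses a negative lookbehind and a negative lookahead so that a tag at the very end of the string is also removed, and it assumes the tags carry no attributes, as in the sample.

```python
import re

text = (
    "<html> a test of strength and <h1> valour </h1> for <<<NOT>>> faint "
    "hearted <b> BUT </b> protoganist having their characters <<<CARVED>>> "
    "out of gibralter <b> ROCK </b>"
)

# A tag is "<" or "</", a name, then ">", provided the "<" is not preceded
# by another "<" and the ">" is not followed by another ">".
bare_tag = re.compile(r"(?<!<)</?[A-Za-z][A-Za-z0-9]*>(?!>)")

stripped = bare_tag.sub("", text)
print(stripped)
```

The <<<NOT>>> and <<<CARVED>>> markers survive because their innermost "<" is preceded by "<" and their innermost ">" is followed by ">".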
I'm looking to use Notepad++ to do a find and replace across a number of webpages I have.
I need to change the following code:
<OBJECT CLASSID="clsid:CA8A9780-280D-11CF-A24D-444553540000" WIDTH=800 HEIGHT=600> <PARAM NAME="SRC" VALUE="FILENAME.pdf"><EMBED SRC="FILENAME.pdf" HEIGHT=800 WIDTH=600> <NOEMBED> Your browser does not support embedded PDF files.</NOEMBED> </EMBED></OBJECT>
To this:
<meta http-equiv="refresh" content="0; url=FILENAME.pdf">
Unfortunately, FILENAME.pdf is different in every file I have. As such, I'd like to find that original string with whatever filename it shows, then use that filename in the new string.
There are two occurrences of the filename in the original string (they will be the same) - the value attribute of the param tag (<PARAM NAME="SRC" VALUE="FILENAME.pdf">) and the src attribute of the embed tag (<EMBED SRC="FILENAME.pdf" HEIGHT=800 WIDTH=600>). Otherwise, the entire original string should be identical to that listed above.
I think this should be straightforward with regex but I have no idea where to start.
Thanks in advance,
Find: <OBJECT CLASSID="clsid:CA8A9780-280D-11CF-A24D-444553540000" WIDTH=800 HEIGHT=600> <PARAM NAME="SRC" VALUE="([^"]+)"><EMBED SRC="([^"]+)" HEIGHT=800 WIDTH=600>
Replace with: <meta http-equiv="refresh" content="0; url=\1">
Here's one solution that will work in Notepad++ which is what you requested.
Find what: <OBJECT.*SRC="(.*)".*</OBJECT>
Replace with: <meta http-equiv="refresh" content="0; url=$1">
You can make the "Find what" more explicit as needed.
http://regex101.com is also a great place to experiment.
You could do one replace that turns the first half of the string, before the file name, into the desired opening portion, and then a second replace that turns the second half of the original string, after the file name, into the last few characters of the new one.
Archaic solution incoming...
Replace <OBJECT CLASSID="clsid:CA8A9780-280D-11CF-A24D-444553540000" WIDTH=800 HEIGHT=600> <PARAM NAME="SRC" VALUE=" with <meta http-equiv="refresh" content="0; url=
then a 2nd replace with
<EMBED SRC="FILENAME.pdf" HEIGHT=800 WIDTH=600> <NOEMBED> Your browser does not support embedded PDF files.</NOEMBED> </EMBED></OBJECT> replaced with nothing.
Worked for me.
EDIT: Note: this does not need regex, just a normal find/replace in Notepad++.
I would recommend a couple of replacements for what you are looking for...
Replace 1:
Find What: <OBJECT.*VALUE=
Replace With: <meta http-equiv="refresh" content="0; url=
Replace 2:
Find What: ><EMBED.*
Replace With: >
Hope that works for you.
Regards.
Try the following regex for search and replace:
Find what : .*EMBED\s*SRC="([^.]*\.pdf)".*
Replace with : <meta http-equiv="refresh" content="0; url=\1">
Here, ([^.]*\.pdf) captures the PDF file name in \1, so it can handle variable file names.
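For anyone who would rather script this than click through Notepad++, the same capture-and-reuse idea can be sketched in Python. The function name to_meta_refresh and the file name report.pdf are placeholders, and the pattern assumes every page follows the layout shown in the question.

```python
import re

# Capture the file name from the PARAM tag and consume everything up to
# the closing </OBJECT>, so nothing of the old block is left behind.
OBJECT_BLOCK = re.compile(
    r'<OBJECT\b[^>]*>\s*<PARAM NAME="SRC" VALUE="([^"]+)">.*?</OBJECT>',
    re.IGNORECASE | re.DOTALL,
)


def to_meta_refresh(html):
    """Replace each embedded-PDF OBJECT block with a meta refresh tag."""
    return OBJECT_BLOCK.sub(
        r'<meta http-equiv="refresh" content="0; url=\1">', html
    )


page = (
    '<OBJECT CLASSID="clsid:CA8A9780-280D-11CF-A24D-444553540000" '
    'WIDTH=800 HEIGHT=600> <PARAM NAME="SRC" VALUE="report.pdf">'
    '<EMBED SRC="report.pdf" HEIGHT=800 WIDTH=600> <NOEMBED> Your browser '
    'does not support embedded PDF files.</NOEMBED> </EMBED></OBJECT>'
)

converted = to_meta_refresh(page)
print(converted)
```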
Hi, I need to get the value mjxjezhmgrutgevclt0qtyayiholcdctuxbwb into $submitkey. What's wrong with my code?
my $str = '<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>';
($submitkey) = $str =~ m/value="(.*?)" name="fr.submitKey"/;
print $submitkey;
Never use .*?. It's never what you are actually trying to do. Even if you get it to work, it's far too likely to create extremely bad performance when there is no match. In this case, use [^"]*
You are matching from the first instance of value all the way until "fr.submitKey".
Take advantage of the fact that every value is contained within quotes; only look for non-quote characters as part of the value.
Additionally, it is cleaner to use the special capturing-group variables:
my $str = '<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>';
my $submitkey;
if ($str =~ m/value="([^"]*)" name="fr\.submitKey"/) {
    $submitkey = $1;    # $1 is only meaningful after a successful match
}
print $submitkey;
.*? does not cause Perl to search for the shortest possible match inside the whole string. Therefore the text before the .*? matches earlier in the string, and Perl is happy that it finds a match there. .*? simply means that it matches as few characters as possible from that first point where the part before .*? matches.
As @ikegami said: use [^"]* instead in your particular case.
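The difference is easy to reproduce. The sketch below uses Python's re module instead of Perl, with a shortened input and a made-up key value KEY123; the lazy and negated-class patterns are otherwise the ones discussed in this thread.

```python
import re

html = (
    '<input type="hidden" value="set" name="fr.posted">'
    '<input type="hidden" value="KEY123" name="fr.submitKey">'
)

# Lazy quantifier: the match still STARTS at the first value=" and then
# grows until the rest of the pattern fits.
lazy = re.search(r'value="(.*?)" name="fr\.submitKey"', html).group(1)

# Negated class: the capture cannot cross a quote, so the first input
# fails outright and the search moves on to the second one.
negated = re.search(r'value="([^"]*)" name="fr\.submitKey"', html).group(1)

print(repr(lazy))
print(repr(negated))
```

The lazy capture begins with set" and runs through the markup of the first input before ending in KEY123, while the negated class yields just KEY123.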
Much better to use a real DOM parser for this task. I like Mojo::DOM which is part of the Mojolicious tool suite. Note that use Mojo::Base -strict enables strict, warnings and utf8. The at method finds the first instance which matches using CSS3 selectors.
#!/usr/bin/env perl
use Mojo::Base -strict;
use Mojo::DOM;
my $dom = Mojo::DOM->new(<<'END');
<input type="hidden" value="set" name="fr.posted"></input><input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" name="fr.submitKey"></input><div class="form-actions form-actions__centrate"><button value="clicked" id="hook_FormButton_button_accept_request" onclick="className +=" button-loading"" class="button-pro form-actions__yes" type="submit" name="button_accept_request"><span class="button-pro_tx">Войти</span>
END
my $submit_key = $dom->at('[name="fr.submitKey"]')->{value};
say $submit_key;
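A rough equivalent without Mojolicious, sketched in Python with the standard library's html.parser: the class name InputValueFinder is invented for this example, and it only looks at <input> start tags rather than supporting CSS selectors.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Remembers the value of the first <input> with the wanted name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "input"
            and self.value is None
            and attributes.get("name") == self.wanted_name
        ):
            self.value = attributes.get("value")


finder = InputValueFinder("fr.submitKey")
finder.feed(
    '<input type="hidden" value="set" name="fr.posted"></input>'
    '<input type="hidden" value="mjxjezhmgrutgevclt0qtyayiholcdctuxbwb" '
    'name="fr.submitKey"></input>'
)
print(finder.value)
```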
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 6 years ago.
I need a regular expression that can match ending tags such as </something> and any and ALL data after it. Please help!
Example:
$html = '
<div id="footer">
<div class="wrap">
<strong class="logo">College</strong>
<ul><li>Emergencies</li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>
li>
<li>Contact</li>
<li>Copyright</li>
<li>Terms of Use</li>
<li>Member of The Colleges</li>
</ul><p>© 2010 College</p>
</div>
</div>
</body></html>';
$html = preg_replace("#</html>.*#i", '', $html);
print ($html);
You're trying to parse HTML with regular expressions. Regular expressions are inadequate for parsing HTML safely. What you need is an HTML parser. Take a look at PHP's DOM module.
Tags can be hidden inside comments, CDATA, script and other places, and/or the markup could simply be invalid. If you say it's not markup of any kind, you could do something like this:
/<\/something\s*>((?:(?!<\/something\s*>)[\S\s])+)/ then peel off capture group 1 in a global loop. Don't need to capture the tag unless its a (?:something|something_else|...)
EDIT
Your example doesn't work because you are not using the /s modifier. It works in Perl as $html =~ s/<\/html>.*//s;. This $html =~ s/<\/html>[\S\s]*//; works without the /s modifier.
Change yours to #</html>[\S\s]*#i or use the /s modifier. Dot . will match any character except newline. With /s modifier it will match newline too.
I just tried it; use $html = preg_replace("#</html>.*#is", '', $html);
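The effect of the /s modifier can be checked in a few lines. This sketch uses Python, where the corresponding flag is re.S (also spelled re.DOTALL), on a shortened version of the input.

```python
import re

html = "<p>kept</p>\n</body></html>\nli>\n<li>Contact</li>\n"

# Without DOTALL, "." stops at the newline, so only "</html>" is removed.
without_dotall = re.sub(r"</html>.*", "", html, flags=re.I)

# With DOTALL, ".*" runs to the end of the string.
with_dotall = re.sub(r"</html>.*", "", html, flags=re.I | re.S)

print(repr(without_dotall))
print(repr(with_dotall))
```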
@"</[\da-zA-Z]+>.*"
or for a specific tag
@"</myTag>.*"
Make sure to set the regex options to ignore case. Also check whether something that parses XML wouldn't be more helpful.
I don't think this will change your mind, but regexes probably aren't the best way to pull ending tags out of HTML anyway. Jeff Atwood wrote a great essay about why this is not the best approach for solving this particular issue.
Parsing Html The Cthulhu Way