I am trying to get all script tags and link tags for stylesheets only.
Right now I have this:
var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
Text:
<link rel="preload" href="fonts/roboto-v27-latin/roboto-v27-latin-700.woff2" as="font" type="font/woff2" crossorigin>
<link rel="preload" href="css/site.css" as="style">
<link rel="stylesheet" href="css/site.css" media="screen">
<script src="plugins/glider-js/glider.min.js"></script>
It matches the js file.
But how to get the tag for site.css too?
This is what I am trying to do:
var re = /<script\b[^>]*>([\s\S]*?)<\/script>|<link rel="stylesheet">/gm;
If you search for an = followed by stylesheet, you should find all of your links with few false positives. The negative lookahead makes sure that the match contains no </script> tag other than the closing one.
If there are nested script tags this will return the inner pair.
I'm assuming that all of your tags are lower case: <script> and not <SCRIPT>. We could write the regex to match both if there is a risk of missing valid links.
<script((?!(<(\/)script))[\s\S])*\=['\"]?(stylesheet)['\"]?((?!(<(\/)script))[\s\S])*<\/script>
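As a rough sketch of a simpler alternative (not the regex above), a single alternation can pick up both kinds of tag: a whole <script> block, or a <link> tag whose rel attribute is stylesheet. The function name findScriptsAndStylesheets is made up for this example, and the pattern assumes that attribute values never contain a > character.

```javascript
// Hypothetical helper: returns every <script>...</script> block and every
// <link ... rel="stylesheet" ...> tag found in the given HTML string.
// Assumption: attribute values do not contain ">".
function findScriptsAndStylesheets(html) {
  const re = /<script\b[^>]*>[\s\S]*?<\/script>|<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi;
  return html.match(re) || [];
}

// The sample text from the question.
const sample = [
  '<link rel="preload" href="fonts/roboto-v27-latin/roboto-v27-latin-700.woff2" as="font" type="font/woff2" crossorigin>',
  '<link rel="preload" href="css/site.css" as="style">',
  '<link rel="stylesheet" href="css/site.css" media="screen">',
  '<script src="plugins/glider-js/glider.min.js"></script>',
].join("\n");

console.log(findScriptsAndStylesheets(sample));
```

The two preload links are skipped because their rel value is not stylesheet. The i flag also covers upper-case tags such as <SCRIPT>.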
I am trying to use regular expressions, to remove all the content between two strings ...
Suppose this is my content:
<h2>Misrepresentation of the Facts</h2>
</script>
<!-- Articles - Leaderboard 728x90 -->
</iframe></ins></ins></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
<h2>Who Can Commit the Crime</h2>
I want to remove all content between </script>
<!-- Articles - Leaderboard 728x90 -->
</iframe></ins></ins></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
Any help would be most appreciated.
<\/script>(?:[^<]*(?!.)+<\/script>
I'm just guessing that these expressions being replaced by an empty string might work:
<\/script>[\s\S]*?<\/script>
<\/script>[\d\D]*?<\/script>
<\/script>[\w\W]*?<\/script>
Please see the demo here.
Escaping is just for demoing, and can be removed.
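To show the replacement in context, here is a minimal JavaScript sketch using the first of the three expressions. The helper name stripBetweenClosingScripts is invented for illustration, and the input is the content from the question.

```javascript
// Hypothetical helper: removes everything from one </script> up to and
// including the next </script>, using a lazy [\s\S]*? so the match stops
// at the nearest closing tag.
function stripBetweenClosingScripts(html) {
  return html.replace(/<\/script>[\s\S]*?<\/script>/g, "");
}

const content = [
  "<h2>Misrepresentation of the Facts</h2>",
  "</script>",
  "<!-- Articles - Leaderboard 728x90 -->",
  "</iframe></ins></ins></ins>",
  "<script>",
  "(adsbygoogle = window.adsbygoogle || []).push({});",
  "</script>",
  "<h2>Who Can Commit the Crime</h2>",
].join("\n");

console.log(stripBetweenClosingScripts(content));
```

Note that both closing tags are part of the match and are removed too. If the first </script> closes a script element you want to keep, that element would be left without its closing tag.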
I am reading an HTML document. So far I have been using HTML::TreeBuilder with HTML::Element and look_down, but now I am stuck with the content of a script <script>...</script>
<script language="JavaScript">
eval(function(p,a,c,k,e,r){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('7 F={f:"P+/=",Q:z(5){7 8="";7 s,k,l,v,t,h,j;7 i=0;5=F.I(5);G(i<5.B){s=5.m(i++);k=5.m(i++);l=5.m(i++);v=s>>2;t=((s&3)<<4)|(k>>4);h=((k&H)<<2)|(l>>6);j=l&u;o(J(k)){h=j=C}w o(J(l)){j=C}8=8+p.f.q(v)+p.f.q(t)+p.f.q(h)+p.f.q(j)}D 8},R:z(5){7 8="";7 s,k,l;7 v,t,h,j;7 i=0;5=5.K(/[^A-S-T-9\\+\\/\\=]/g,"");G(i<5.B){v=p.f.E(5.q(i++));t=p.f.E(5.q(i++));h=p.f.E(5.q(i++));j=p.f.E(5.q(i++));s=(v<<2)|(t>>4);k=((t&H)<<4)|(h>>2);l=((h&3)<<6)|j;8=8+b.d(s);o(h!=C){8=8+b.d(k)}o(j!=C){8=8+b.d(l)}}8=F.L(8);D 8},I:z(e){e=e.K(/\\r\\n/g,"\\n");7 a="";U(7 n=0;n<e.B;n++){7 c=e.m(n);o(c<x){a+=b.d(c)}w o((c>V)&&(c<W)){a+=b.d((c>>6)|X);a+=b.d((c&u)|x)}w{a+=b.d((c>>M)|N);a+=b.d(((c>>6)&u)|x);a+=b.d((c&u)|x)}}D a},L:z(a){7 e="";7 i=0;7 c=Y=y=0;G(i<a.B){c=a.m(i);o(c<x){e+=b.d(c);i++}w o((c>Z)&&(c<N)){y=a.m(i+1);e+=b.d(((c&10)<<6)|(y&u));i+=2}w{y=a.m(i+1);O=a.m(i+2);e+=b.d(((c&H)<<M)|((y&u)<<6)|(O&u));i+=3}}D e}}',62,63,'|||||input||var|output||utftext|String||fromCharCode|string|_keyStr||enc3||enc4|chr2|chr3|charCodeAt||if|this|charAt||chr1|enc2|63|enc1|else|128|c2|function||length|64|return|indexOf|Base64|while|15|_utf8_encode|isNaN|replace|_utf8_decode|12|224|c3|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|encode|decode|Za|z0|for|127|2048|192|c1|191|31'.split('|'),0,{}));
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
</script>
I want to get the text from the second eval(......) up to the closing </script> tag.
I tried
my ($var) = $response->decoded_content =~ /^eval(.*?)\/script/
but I get both evals, which is obvious.
EDIT : Added raw source
This program shows how you might go about it. /eval/ finds the first occurrence of eval, while /.*eval/ finds the last occurrence.
I have used an HTML document that is empty apart from a single <script> element in the <head> section.
The call to look_down will find all <script> elements with a language attribute equal to JavaScript and put them in the array @script. In this case there is only one, so I use $script[0]. Depending on your HTML you may need to select one of several elements.
A call to as_text ignores <script> and <style> elements, so I have to use content_list to get the text inside the <script> element. This text is put into $content, and everything from the last occurrence of eval onwards is copied to $eval.
I hope this helps.
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file(\*DATA);
my @script = $tree->look_down(_tag => 'script', language => 'JavaScript');
my ($content) = $script[0]->content_list;
my ($eval) = $content =~ /.*(eval.+\S)/s;
print $eval;
__DATA__
<html>
<head>
<script language="JavaScript">
eval(function(p,a,c,k,e,r){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('7 F={f:"P+/=",Q:z(5){7 8="";7 s,k,l,v,t,h,j;7 i=0;5=F.I(5);G(i<5.B){s=5.m(i++);k=5.m(i++);l=5.m(i++);v=s>>2;t=((s&3)<<4)|(k>>4);h=((k&H)<<2)|(l>>6);j=l&u;o(J(k)){h=j=C}w o(J(l)){j=C}8=8+p.f.q(v)+p.f.q(t)+p.f.q(h)+p.f.q(j)}D 8},R:z(5){7 8="";7 s,k,l;7 v,t,h,j;7 i=0;5=5.K(/[^A-S-T-9\\+\\/\\=]/g,"");G(i<5.B){v=p.f.E(5.q(i++));t=p.f.E(5.q(i++));h=p.f.E(5.q(i++));j=p.f.E(5.q(i++));s=(v<<2)|(t>>4);k=((t&H)<<4)|(h>>2);l=((h&3)<<6)|j;8=8+b.d(s);o(h!=C){8=8+b.d(k)}o(j!=C){8=8+b.d(l)}}8=F.L(8);D 8},I:z(e){e=e.K(/\\r\\n/g,"\\n");7 a="";U(7 n=0;n<e.B;n++){7 c=e.m(n);o(c<x){a+=b.d(c)}w o((c>V)&&(c<W)){a+=b.d((c>>6)|X);a+=b.d((c&u)|x)}w{a+=b.d((c>>M)|N);a+=b.d(((c>>6)&u)|x);a+=b.d((c&u)|x)}}D a},L:z(a){7 e="";7 i=0;7 c=Y=y=0;G(i<a.B){c=a.m(i);o(c<x){e+=b.d(c);i++}w o((c>Z)&&(c<N)){y=a.m(i+1);e+=b.d(((c&10)<<6)|(y&u));i+=2}w{y=a.m(i+1);O=a.m(i+2);e+=b.d(((c&H)<<M)|((y&u)<<6)|(O&u));i+=3}}D e}}',62,63,'|||||input||var|output||utftext|String||fromCharCode|string|_keyStr||enc3||enc4|chr2|chr3|charCodeAt||if|this|charAt||chr1|enc2|63|enc1|else|128|c2|function||length|64|return|indexOf|Base64|while|15|_utf8_encode|isNaN|replace|_utf8_decode|12|224|c3|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|encode|decode|Za|z0|for|127|2048|192|c1|191|31'.split('|'),0,{}));
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
</script>
</head>
<body> </body>
</html>
output
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
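The first-versus-last behaviour described above is not specific to Perl. As a small sketch in JavaScript (the helper names firstEval and lastEval and the shortened input are made up for this example), a greedy prefix in front of the capture group pushes the match to the last eval:

```javascript
// Without a greedy prefix the regex stops at the first "eval".
// The dot does not match line breaks here, so this returns the rest of that line.
function firstEval(content) {
  const match = content.match(/eval.*/);
  return match ? match[0] : null;
}

// A greedy [\s\S]* first swallows the whole string and then backtracks,
// so the capture group starts at the LAST "eval" and ends at the last
// non-whitespace character (the JavaScript counterpart of /.*(eval.+\S)/s).
function lastEval(content) {
  const match = content.match(/[\s\S]*(eval[\s\S]+\S)/);
  return match ? match[1] : null;
}

// Shortened stand-in for the text inside the <script> element.
const scriptText = "\neval(first());\neval(second())\n";

console.log(firstEval(scriptText)); // eval(first());
console.log(lastEval(scriptText));  // eval(second())
```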
Use regex pattern
\beval\(.*\S(?!.*eval)(?=\s*<\/script>)
or
\beval\(.*\K\beval\(.*\S(?=\s*<\/script>)
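For comparison, here is how the first pattern might be tried out in JavaScript. JavaScript's regex syntax does not support \K, so only the lookahead variant is shown. In this sketch the dot does not match line breaks (no s flag), so it assumes that each eval call sits on a single line, as in the raw source above. The function name and the shortened input are hypothetical.

```javascript
// Hypothetical helper: finds an eval(...) call that has no further "eval"
// on the same line and is followed only by whitespace before </script>.
function lastEvalBeforeScriptEnd(html) {
  const match = html.match(/\beval\(.*\S(?!.*eval)(?=\s*<\/script>)/);
  return match ? match[0] : null;
}

// Shortened stand-in for the script element from the question.
const page = [
  '<script language="JavaScript">',
  "eval(first());",
  "eval(second())",
  "</script>",
].join("\n");

console.log(lastEvalBeforeScriptEnd(page)); // eval(second())
```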
Just match it twice:
/^.*?eval\([^)]+\).*?(eval\([^)]+\))/
For now this one works for me
/eval\(function\(p,a,c,k,e,d\)\{.*\}\)\)/gmsi
Thank you all for your help. I made a mistake by not including the whole script content at the beginning.
The rules would be:
Delete all lines except the last line which contains: link and href=
Replace the contents of whatever is after: href= and before: .css with: hello-world
Must maintain no quotes, single quotes or double quotes around the file name
A few examples:
This is a source file with quotes:
<link rel="stylesheet" href="css/reset.css">
<link rel="stylesheet" href="css/master.css">
This is the new source file:
<link rel="stylesheet" href="hello-world.css">
This is a source file without quotes:
<link rel=stylesheet href=css/reset.css>
<link rel=stylesheet href=css/master.css>
This is the new source file:
<link rel=stylesheet href=hello-world.css>
It does not need to maintain the path of the file name. It however cannot use <> brackets or spaces to determine what needs to be edited because the template language which is writing that line might not use brackets or spaces. The only thing that would remain consistent is href=[filename].css.
My bash/sed/regex skills are awful but those tools seem like they will probably get the job done in a decent way? How would I go about doing this?
EDIT
To clarify, the end result would leave everything above and below the lines that contain link and href= alone. Imagine that the source file was an html file or any other template file like so:
<html>
<head>
<title>Hello</title>
<link rel="stylesheet" href="css/reset.css">
<link rel="stylesheet" href="css/master.css">
</head>
<body><p>...</p></body>
</html>
It would be changed to:
<html>
<head>
<title>Hello</title>
<link rel="stylesheet" href="css/hello-world.css">
</head>
<body><p>...</p></body>
</html>
The path of the CSS files might be anything too.
../foo/bar.css
http://www.hello.com/static/css/hi.css
/yep.css
ok.css
The new file's path would be supplied as an argument of the bash script so the regex should remove the path.
Following a discussion in chat, one solution using PHP as a command line script would look like this -
#! /usr/bin/php
<?php
$options = getopt("f:r:");
$inputFile = $options['f'];
$replacement = $options['r'];
// read entire contents of input file
$inputFileContents = file_get_contents($inputFile);
// setup the regex and execute the search
$pattern = '/.*link.*href=["|\']?(.*[\\\|\/]?.*)\.css["|\']?.*/';
preg_match_all($pattern, $inputFileContents, $matches);
// copy the matched lines, then drop the last occurrence from the list:
// the remaining lines get deleted below, and the last one is kept
$matchedLines = $matches[0];
array_pop($matchedLines);
// isolate the last css file name
$matchedFileName = array_pop($matches[1]);
// first substitution replaces all lines with <link> with
// an empty string (deletes them)
$inputFileContents = str_replace($matchedLines,'',$inputFileContents);
// second substitution replaces the matched file name
// with the desired string
$inputFileContents = str_replace($matchedFileName,$replacement,$inputFileContents);
//*/
// save to new file for debugging
$outputFileName = "output.html";
$outputFile = fopen($outputFileName,'w+');
fwrite($outputFile,$inputFileContents);
fclose($outputFile);
/*/
// save changes to original file
$origFile = fopen($inputFile,'w+');
fwrite($origFile,$inputFileContents);
fclose($origFile);
//*/
exit();
?>
You would execute this script from the command line like so -
$ php thisScript.php -f "input.html" -r "hello-world"
-f is the input file that we are parsing.
-r is the replacement string for the css file name (in this example "hello-world").
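The same steps (find the lines containing link and href=, delete all but the last one, then swap the file name) can also be sketched in JavaScript. This is not a port of the PHP script above: the function name replaceCssLinks is made up, and it assumes that the first ".css" after href= ends the file name, so a host name that itself contains ".css" would trip it up. It does not look at <> brackets or spaces.

```javascript
// Hypothetical helper: keeps only the last "link ... href=....css" line and
// replaces its path and file name with newName, preserving the quoting style.
function replaceCssLinks(text, newName) {
  const lines = text.split("\n");
  const isCssLink = (line) => line.includes("link") && /href=.*\.css/.test(line);

  // Index of the last matching line; -1 means there is nothing to change.
  const lastIndex = lines.map(isCssLink).lastIndexOf(true);
  if (lastIndex === -1) return text;

  return lines
    // Drop every matching line except the last one.
    .filter((line, i) => !isCssLink(line) || i === lastIndex)
    // Rewrite the href of the surviving link line; the optional quote
    // after href= is captured and written back unchanged.
    .map((line) =>
      isCssLink(line)
        ? line.replace(/(href=["']?).*?\.css/, (m, prefix) => prefix + newName + ".css")
        : line
    )
    .join("\n");
}

const quoted = [
  "<head>",
  '<link rel="stylesheet" href="css/reset.css">',
  '<link rel="stylesheet" href="css/master.css">',
  "</head>",
].join("\n");

console.log(replaceCssLinks(quoted, "hello-world"));
```

A replacer function is used instead of a "$1..." replacement string so that a new name containing $ or starting with a digit is inserted literally.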
Answer specifically, for this case:
Including the same CSS file twice does no harm as far as the user can see.
So you may just replace both css/reset.css AND css/master.css with css/hello-world.css.
There may be better ways, but I found this a quick way. It will work specifically for this case & not if you want to replace <script> or other tags.
Try including the first part of the file (everything before the CSS lines), then echo the correct CSS lines, and then include the rest of the file (everything below the CSS lines).
i was wondering if this setup would work. i have to crank out a batch of PDF from a bunch of variables i'm pushing into the $_SESSION via a form (duh...). the idea is to pass the template file to the dompdf engine and have the template populate from the $_SESSION then out to PDF. it seems to me that when the $template gets loaded it should do that, yes?
here's the basic code:
<?php
function renderToPDF($theTemplate = "template.php") // this is just to show the value
{
require_once("dompdf/dompdf_config.inc.php");
$content = file_get_contents($theTemplate);
if ($content !== false)
{
$dompdf = new DOMPDF();
$dompdf->load_html($content);
$dompdf->render();
$dompdf->stream("kapow_ItWorks.pdf");
}
}
?>
and this is the template.php file (basically... you don't want all 16 pages...)
<html>
<meta>
<head>
<link href="thisPage.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1><?php echo $_SESSION['someTitle'] ?></h1>
<h2>wouldn't it be nice, <?php echo $_SESSION['someName'] ?></h2>
</body>
</html>
so my thinking is that the template.php will pull the variables right out of the $_SESSION array without any intervention, looking like this:
BIG TITLE
wouldn't it be nice, HandsomeLulu?
i guess the nut of the question is: Do $_SESSION variables get evaluated when PHP files are loaded, but not rendered?
WR!
file_get_contents does not evaluate the PHP file, it simply gets its contents (the file as it is in the hard drive).
To do what you want, you need to use output buffering and include.
ob_start(); // Start output buffering
include $theTemplate; // include the file and evaluate it : all the code outside of <?php ?> is like doing an `echo`
$content = ob_get_clean(); // retrieve what was outputted and close the OB
for some reason, the code ON the page that calls the function ALSO gets dumped into the file. this was placed before the header. i understand now why: i wasn't referencing an external page, i was importing an external page. don't know why that didn't click.
anyway. as soon as i got rid of the page's extra stuff, it worked just fine. in retrospect, what dompdf needed to state was quite simply that NO HTML of ANY kind (echo, print, &c.) can be on the page that calls the function. at least that's what it appears to require at this level of my knowledge.
for those who, like me, are floundering in a miasma of 'everything but the answer', here's the bare bones code that did the job:
buildPDF.php:
<?php
session_start();
$_SESSION['someTitle'] = "BIG FAT TITLE";
$_SESSION['someName'] = "HandomeLu";
$theTemplate = 'template.php';
function renderToPDF($templateFile)
{
require_once("_dox/dompdf/dompdf_config.inc.php");
ob_start();
include $templateFile;
$contents = ob_get_clean();
if ($contents !== false)
{
$dompdf = new DOMPDF();
$dompdf->load_html($contents);
$dompdf->render();
$dompdf->stream("kapow_ItWorks.pdf");
}
}
renderToPDF($theTemplate);
?>
and this is the template.php:
<!DOCTYPE HTML>
<html>
<meta>
<head>
<meta charset="utf-8">
<link href="thisPage.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1><?php echo $_SESSION['someTitle'] ?></h1>
<p>wouldn't it be nice, <?php echo $_SESSION['someName'] ?></p>
</body>
</html>
also note that the external CSS file reads in just fine. so you can still keep the structure and style separate. also, the $_SESSION variables can be set anywhere, obviously, i just set them here to keep testing easy.
hope this is useful for those getting started with this GREAT class. if you're looking to get up and running cranking out PDF files, this kicks so much butt, it should have a trigger and a grip on it. :)
thanks to everyone who commented. you got me in the place i needed to be. :)
this site ROCKS.
WR!