Regex - How to remove white spaces and new lines in HTML code?
I would like to remove white space and new lines from a string that contains HTML markup.
Example: take the following string:
<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate">
<li class="list-group-item active">
<a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i> Overall</a>
</li>
<li class="list-group-item list-toggle">
<a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a>
<ul id="collapse-MoneyManage" class="collapse">
<li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa fa-level-down"></i> Big Invoice </a></li>
<li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa fa-cogs"></i> Big big big
Invoice 2 </a></li>
</ul>
</li>
</ul>
This is the desired result:
<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i>Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa fa-level-down"></i>Big Invoice</a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa fa-cogs"></i>Big big big Invoice 2</a></li></ul></li></ul>
As you can see:
Only one line, with no white space or new lines between "><" when there is no text between them.
Where there is text between ">" and "<", I would like it trimmed. Example: </i> Big Invoice </a> became </i>Big Invoice</a>.
And finally
</i> Big big big
Invoice 2 </a></li>
became </i>Big big big Invoice 2</a></li>: no new line in the middle of the text, and trimmed.
So far I have achieved the first step using the regex (>\s+<), but I don't know how to achieve steps 2 and 3. Is it possible? Any ideas?
Update:
After Adam's post, this is the final code:
//Put your HTML code here. Do not use double quotes " inside it; use single quotes instead.
$str =<<<eof
your dynamic HTML here.
eof;
$re = "/(?:\\s*([<>])\\s*|(\\s)\\s*)/im";
$subst = "$1$2";
$result = preg_replace($re, $subst, $str);
//If you want to use JSON
$arrToJSON = array(
"dataPHPtoJs"=>"yourData",
"htmlDyn"=>"$result"
);
$resultJSON= json_encode(array($arrToJSON));
This HTML string is now clean, so you can use it through AJAX, in JSON, or inside JavaScript.
In my case I am using it inside JavaScript code, with no AJAX and no JSON.
var htmlDyn="<?php echo $result; ?>";
//Do what you want to do with.
$('.someElementClass').append(htmlDyn);
Here is the solution:
(?:\s*([<>])\s*|(\s)\s*)
Substitution:
\1\2
You can try it here:
https://regex101.com/r/dL5gB5/1
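For anyone who wants to try the pattern outside PHP, here is a minimal Python sketch of the same find/replace. The function name squeeze_html is made up for illustration; the pattern and the "\1\2" substitution are the ones given above. One thing the sketch makes visible: the second group keeps the first whitespace character of a run, so a run that starts with a newline stays a newline. The pattern also does not know about attributes, so whitespace inside attribute values is collapsed as well.

```python
import re

# Pattern from the answer above; the name squeeze_html is hypothetical.
TAG_WS = re.compile(r"(?:\s*([<>])\s*|(\s)\s*)")


def squeeze_html(html: str) -> str:
    """Drop whitespace around < and >, and shrink every other whitespace
    run to its first character (the equivalent of substituting \\1\\2)."""
    # Only one of the two groups takes part in each match; the other is
    # None, so fall back to an empty string for it.
    return TAG_WS.sub(lambda m: (m.group(1) or "") + (m.group(2) or ""), html)


print(squeeze_html("</i> Big Invoice </a>"))
print(squeeze_html("<ul>\n  <li>\n  </li>\n</ul>"))
# A run that starts with a space collapses to a space,
# a run that starts with a newline collapses to a newline.
print(repr(squeeze_html("Big big big \n   Invoice 2")))
print(repr(squeeze_html("Big big big\n   Invoice 2")))
```

If the remaining newline matters, one option is to replace the second group with a literal space in the substitution instead of reusing the captured character.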
Some XML conversions if you please?
The following snippet is in PHP but could easily be adapted to work with e.g. Python as well.
<?php
$string = <<<EOF
<html>
<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate">
<li class="list-group-item active">
<a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i> Overall</a>
</li>
<li class="list-group-item list-toggle">
<a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a>
<ul id="collapse-MoneyManage" class="collapse">
<li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa fa-level-down"></i> Big Invoice </a></li>
<li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa fa-cogs"></i> Big big big
Invoice 2 </a></li>
</ul>
</li>
</ul>
</html>
EOF;
$xml = simplexml_load_string($string);
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;
$dom->loadXML($xml->asXML());
echo $dom->saveXML();
/* output:
<html><ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"/> Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage"><i class="fa fa-money"/> Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa fa-level-down"/> Big Invoice </a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa fa-cogs"/> Big big big
Invoice 2 </a></li></ul></li></ul></html>
*/
?>
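This eliminates the whitespace-only text between tags and is safer than using regular expressions on HTML tags. As the output shows, whitespace inside text content is kept (including the line break in "Big big big / Invoice 2"), and empty elements are serialized in self-closing form (<i .../>).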
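Since Python was mentioned, here is a rough sketch of the same parser-based idea with the standard library's xml.etree.ElementTree. It is not a port of the PHP code: the function name strip_blank_text is made up, and it assumes the input is well-formed XML (ET.fromstring raises ParseError otherwise, which real-world HTML often triggers). Like the PHP version, it only drops text that consists entirely of whitespace and leaves whitespace inside real text alone.

```python
import xml.etree.ElementTree as ET


def strip_blank_text(xml_source: str) -> str:
    """Parse well-formed XML and remove whitespace-only text between tags."""
    root = ET.fromstring(xml_source)
    for element in root.iter():
        # .text is the text before the first child element,
        # .tail is the text that follows the element's closing tag.
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    # short_empty_elements=False writes <i></i> rather than <i />.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


sample = """<ul>
  <li>
    <a id="x"><i class="fa"></i> Overall</a>
  </li>
</ul>"""
print(strip_blank_text(sample))
```

Trimming the text that remains (for example " Overall") would be an extra step on top of this, e.g. calling strip() on .text and .tail instead of only testing them.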
This will trim the whitespace adjacent to tags and remove newlines in the middle of content.
Find:
(?:\s*(<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)\s*|(?:\r?\n)+)
Replace:
$1
Output:
<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i>Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i>Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa fa-level-down"></i>Big Invoice</a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa fa-cogs"></i>Big big big Invoice 2</a></li></ul></li></ul>
Benchmark:
Regex1: (?:\s*(<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)\s*|(?:\r?\n)+)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 29
Elapsed Time: 6.75 s, 6749.58 ms, 6749576 µs
Related
bootstrap data toggle tab is working only one tab but the other tab is not working
<div class="widget-header" style="margin-top: 5%;"> <ul class="nav nav-tabs"> <li class="<c:if test="${tab != 'LETTERS'}">selected active </c:if>inline headerDivider"> <a class="<c:if test="${tab != 'LETTERS'}">active </c:if>header" href="action" data-toggle="tab"><Regulations</a> </li> <li class="<c:if test="${tab == 'LETTERS'}">selected active </c:if>inline headerDivider"> <a class="<c:if test="${tab == 'LETTERS'}">>selected active </c:if>header" href="action" data-toggle="tab">Letters</a> </li> <c:choose> <c:when test="${tab != 'LETTERS'}"> <a data-href="action" class="btn btn-small" data-toggle="modal" data-reload="regulationDiv">Manage Regulations</a> </c:when> <c:otherwise> <a data-href="action" class="btn btn-small" data-toggle="modal" data-reload="letterDiv">Add Letter</a> </c:otherwise> </c:choose> </ul> </div> <div style="overflow: auto; max-height:60vh;" class="tab-content"> <c:if test="${tab != 'LETTERS'}"> <div id="regulationDiv" data-url='action'> <jsp:include page="regulations.jsp"/> </div> </c:if> <c:if test="${tab == 'LETTERS'}"> <div id="letterDiv" data-url='action'> <jsp:include page="letters.jsp"/> </div> </c:if> </div>
The first tab works by default, but the second tab does not respond to clicks; it looks like a disabled tab. Am I missing something that should be included in the div elements? I have tried many approaches but have not found a solution yet.
RIDE Robot framework Select from dynamic list
I am trying to choose an element ("Classic") from a dynamic dropdown list. The problem is that two elements contain the word Classic. The HTML page is: <ul id="dynamic-14" class="results" role="list"> <li class="results-dept result"> <div dynamic-102" class="results" role="option"> <span class="match"/> </div> </li> <li class="results-dept result"> <div dynamic-12" class="results" role="option"> <span class="match"/> Classic </div> </li> <li class="results-dept result"> <div dynamic-1022" class="results" role="option"> <span class="match"/> Classic numbers </div> </li> I tried to do it with XPath using: //ul[@class="results"] //div[contains(.,'Classic')] but it returns two values, so Robot Framework can't choose the one I need.
Use the normalize-space() function to get rid of the leading and trailing whitespace: //ul[@class="results"] //div[normalize-space(.)='Classic']
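To see why the exact comparison picks only one element, here is a small Python sketch. Python's built-in ElementTree does not offer normalize-space() in its XPath subset, so the helper below roughly imitates it by hand. The markup is a hypothetical, well-formed version of the list from the question (the id attributes are assumed), and the names normalize_space and find_exact are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the markup from the question.
MARKUP = """
<ul id="dynamic-14" class="results">
  <li><div id="dynamic-12" class="results"><span class="match"/> Classic </div></li>
  <li><div id="dynamic-1022" class="results"><span class="match"/> Classic numbers </div></li>
</ul>
"""


def normalize_space(text: str) -> str:
    """Roughly mimic XPath normalize-space(): trim both ends and
    collapse inner whitespace runs to a single space."""
    return " ".join(text.split())


def find_exact(root: ET.Element, wanted: str) -> list:
    """Return the div elements whose whole text equals `wanted` once normalized."""
    return [
        div
        for div in root.iter("div")
        if normalize_space("".join(div.itertext())) == wanted
    ]


root = ET.fromstring(MARKUP)
# contains(., 'Classic') would match both divs; the exact comparison matches one.
print([div.get("id") for div in find_exact(root, "Classic")])
```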
perl regex for complex multiline search replace
I know there are many questions on this topic, but most are fairly trivial and I'm unable to find a solution for my case. I have a set of HTML files with many, many "media" items like the following, each of which is a "paragraph", separated by "\n\n". Here is a link to a sample file of the type I'm working on.
<li class="media"> <div class="media-left"> <a href="#"> <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="..."> </a> </div> <div class="media-body"> <h4 class="media-heading">Figure 4.17</h4> Association plot for the hair-color eye-color data. Left: marginal table, collapsed over gender; right: full table. </div> </li>
For each <img ...> tag, I need to find the src="file" value, and replace the href="#" on the previous line by href="file" class="fancybox", i.e., so that the item will then look like
<li class="media"> <div class="media-left"> <a href="4_17-HE-assoc.png" class="fancybox"> <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="..."> </a> </div> <div class="media-body"> <h4 class="media-heading">Figure 4.17</h4> Association plot for the hair-color eye-color data. Left: marginal table, collapsed over gender; right: full table. </div> </li>
I tried the following as a one-liner, but it has no effect, i.e., it doesn't make the changes.
perl -pi~ -e '$/ = "";s|<a href="#">\n(\s*<img class="media object") src=(".*png")|<a class="fancybox" href="\2">\n\1 src=\2|ms' ch03.html
Can someone help with this? I'd be happy with a simple script that I could use for this and modify for other similar modifications of a collection of web files.
edit: I'm aware of the advantages of using perl modules such as HTML::TreeBuilder to avoid having to parse HTML directly. If someone could give me a start, I could probably take it from there.
use XML::LibXML qw( );
my $qfn = 'ch03.html';
my $in_qfn = $qfn . "~";
my $out_qfn = $qfn;
# Keep the original as a backup (ch03.html~) and write the result to ch03.html.
rename($qfn, $in_qfn) or die("Can't rename \"$qfn\": $!\n");
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file($in_qfn);
# For every placeholder link, copy the src of the image it wraps into href.
for my $a_node ($doc->findnodes('//a[@href="#"]')) {
    my ($src_node) = $a_node->findnodes('img[1]/@src') or next;
    $a_node->setAttribute('href', $src_node->value());
    $a_node->setAttribute('class', 'fancybox');
}
my $html = $doc->toStringHTML();
open(my $fh, '>', $out_qfn) or die("Can't create \"$out_qfn\": $!\n");
print($fh $html);
Tested:
$ diff -u ch03.html{~,}
--- ch03.html~ 2016-01-20 12:41:30.809203040 -0800
+++ ch03.html 2016-01-20 12:41:31.009201042 -0800
@@ -1,7 +1,7 @@ -<div class="contents"> +<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> +<html><body><div class="contents"> <h1 class="tocpage">Chapter 3: Fitting and Graphing Discrete Distributions</h1> <hr class="tocpage"> - <div class="row"> <div class="col-md-6"> <!-- prelude-inserted -->
@@ -18,7 +18,7 @@ <div class="col-md-6"> <h3>Contents</h3> <dl class="chaptoc"> - <dd>3.1. Introduction to discrete distributions</dd> +<dd>3.1. Introduction to discrete distributions</dd> <dd>3.2. Characteristics of discrete distributions</dd> <dd>3.3. Fitting discrete distributions</dd> <dd>3.4. Diagnosing discrete distributions: Ord plots</dd>
@@ -27,8 +27,7 @@ <dd>3.7. Chapter summary</dd> <dd>3.8. Lab exercises</dd> </dl> - - </div> +</div> </div> <!-- more-content -->
@@ -38,11 +37,10 @@ <h3>Selected figures</h3> <a class="btn btn-primary" href="../../Rcode/ch03.R" role="button">view R code</a> <ul class="media-list"> - <li class="media"> +<li class="media"> <div class="media-left"> - <a href="#"> - <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families"> - </a> + <a href="saxony-barplot.png" class="fancybox"> + <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families"></a> </div> <div class="media-body"> <h4 class="media-heading">Figure 3.2</h4>
@@ -52,9 +50,8 @@ <li class="media"> <div class="media-left"> - <a href="#"> - <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions"> - </a> + <a href="dbinom2-plot2-1.png" class="fancybox"> + <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions"></a> </div> <div class="media-body"> <h4 class="media-heading">Figure 3.9</h4>
@@ -64,9 +61,8 @@ <li class="media"> <div class="media-left"> - <a href="#"> - <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions"> - </a> + <a href="dpois-xyplot2-1.png" class="fancybox"> + <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions"></a> </div> <div class="media-body"> <h4 class="media-heading">Figure 3.11</h4>
@@ -76,9 +72,8 @@ <li class="media"> <div class="media-left"> - <a href="#"> - <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram"> - </a> + <a href="Fed0-plots2-1.png" class="fancybox"> + <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram"></a> </div> <div class="media-body"> <h4 class="media-heading">Figure 3.15</h4>
@@ -89,9 +84,8 @@ <li class="media"> <div class="media-left"> - <a href="#"> - <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data"> - </a> + <a href="ordplot1-1.png" class="fancybox"> + <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data"></a> </div> <div class="media-body"> <h4 class="media-heading">Figure 3.18</h4>
@@ -100,9 +94,10 @@ </div> </li> - </ul> <!-- media-list --> - </div> <!-- col-md-12 --> + </ul> +<!-- media-list --> +</div> <!-- col-md-12 --> <!-- footer --> </div> <!-- row --> -</div> +</div></body></html>
I couldn't resist writing this one-off, super unstable, sends-me-to-parse-html-with-regex-hell sed command:
sed -i.bak '/<a href="#"/ {
  N
  /\n.*<img class=/ {
    s/^\( *<a href="\).*\(\n.*src="\)\([^"]*\)\(.*\)/\1\3" class="fancybox">\2\3\4/
  }
}' ch03.html
This looks for a line with href="#", appends the next line and then substitutes the filename and fancybox into the a tag. Diffing the result and the input file:
43c43
< <a href="#">
---
> <a href="saxony-barplot.png" class="fancybox">
55c55
< <a href="#">
---
> <a href="dbinom2-plot2-1.png" class="fancybox">
67c67
< <a href="#">
---
> <a href="dpois-xyplot2-1.png" class="fancybox">
79c79
< <a href="#">
---
> <a href="Fed0-plots2-1.png" class="fancybox">
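In the same quick-and-dirty spirit, the substitution can be sketched in Python with a single regular expression. This is only an illustration under the assumption that every placeholder link is written exactly as <a href="#"> and is directly followed by its <img> tag; the names LINK_BEFORE_IMG and link_images are made up, and a parser-based approach remains the more robust route.

```python
import re

# <a href="#"> followed (possibly on the next line) by an <img ...> tag that
# carries a src attribute. Group 1 is the img text up to and including the
# src value, group 2 is the file name itself.
LINK_BEFORE_IMG = re.compile(r'<a href="#">(\s*<img\b[^>]*?\bsrc="([^"]*)")')


def link_images(html: str) -> str:
    """Point each placeholder link at the image it wraps and add class="fancybox"."""
    return LINK_BEFORE_IMG.sub(r'<a href="\2" class="fancybox">\1', html)


sample = (
    '<a href="#">\n'
    '  <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">\n'
    '</a>'
)
print(link_images(sample))
```

Links that are not followed by an image are left untouched, because the pattern only matches when the <img> tag is present.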
Else not executing
login.php
$username = @mysql_real_escape_string($_SESSION['user_username']); if($username == ""){ echo'<span><a href="#" style="text-transform: capitalize;" role="button" class="dropdown-toggle" data-toggle="dropdown"></span> </li> <li> </li>'; } else{ echo '<li><a href="#" style="text-transform: capitalize;" role="button" class="dropdown-toggle" data-toggle="dropdown"> <i class="fa fa-user" style="color:white; margin-right:8px;"></i>Welcome <span style="color:#71F9FE; font-weight:bold;">'.$username.'</span> <a href="logout.php"style="color:red;"><i class="fa fa-sign-out" style="margin-right:5px;"></i>iLogout</a> </li>'; }?>
When I try to log in, the else part is not executing. I used the same code in my admin login, and there it works perfectly. I don't understand why my else part is not working.
Preg_match getting date out of content
Hi, I'm trying to get the date from this content: <div class="article-meta"> <h1>Kelkraščiu ir prieš eismą</h1> <div class="clear"></div> <ul> <li> <strong>Publikuota:</strong> 2012 spalio 8d. </li> <li> <strong>Autorius:</strong> Vardas, Pavardė </li> <li> <strong>Rubrika:</strong> Fotopolicija </li> </ul> <div class="clear"></div> </div> I need to get the value 2012 spalio 8d. into a variable. I was trying with preg_match but don't know how to complete the pattern. Can someone help me?
Try this: preg_match_all('/Publikuota:<\/[^>]+>\s*(\d+\s*\w+\s*\w+)/i', $html, $out); print_r($out[1]);
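For comparison, the same pattern can be tried in Python's re module, which makes it easy to see what the capture group returns. This is just a sketch: the html string is a shortened copy of the markup from the question and the variable names are made up. Note that the group stops at the last word character, so the trailing period after "8d" is not part of the result.

```python
import re

# Same pattern as in the preg_match_all call above, minus the PHP delimiters.
PUBLISHED = re.compile(r"Publikuota:</[^>]+>\s*(\d+\s*\w+\s*\w+)", re.IGNORECASE)

# Shortened copy of the markup from the question.
html = "<ul> <li> <strong>Publikuota:</strong> 2012 spalio 8d. </li> </ul>"

match = PUBLISHED.search(html)
date = match.group(1) if match else None
print(date)  # 2012 spalio 8d
```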