Perl script to search and replace multiple lines in multiple html files - regex

I have many html files in a folder. I need to somehow remove a <div id="user-info" ...>...</div> from all of them. As far as I know I need to use a Perl script for that, but I don't know Perl to do that. Could someone get it for me?
Here is how the "bad" code looks like:
<div id="user-info" class="logged-in">
<a class="icon icon-key-delete" href="https://test.dev/login.php?0,logout=1">Log Out</a>
<a class="icon icon-user-edit" href="https://test.dev/control.php">Control Center</a>
</div> <!-- end of div id=user-info -->
Thank you in advance!

Using XML::XSH2:
for { glob '*.html' } {
open :F html (.) ;
delete //div[#id="user-info" and #class="logged-in"] ;
save :b ;
}

perl -0777 -i.withdiv -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' test.html
-0777 means split on nothing, so slurp in whole file (instead of line by line, the default for -p
-i.withdiv means alter files in place, leaving original with extension .withdiv (default for -p is to just print).
-p means pass line by line (except we are slurping) to passed code (see -e)
-e expects code to run.
man perlrun or perldoc perlrun for more info.
Here's another solution, which will be slightly more familiar to people that know jquery, as the syntax is similar. This uses Mojolicious' ojo module to load up the html content into a Mojo::DOM object, transform it, and then print that transformed version:
perl -Mojo -MFile::Slurp -E 'for (#ARGV) { say x(scalar(read_file $_))->at("#user-info")->replace("")->root; }' test.html test2.html test*.html
To replace content directly:
perl -Mojo -MFile::Slurp -E 'for (#ARGV) { write_file( $_, x(scalar(read_file $_))->at("#user-info")->replace("")->root ); }' test.html
Note, this won't JUST remove the div, it will also re-write the content based on Mojo's Mojo::DOM module, so tag attributes may not be in the same order. Specifically, I saw <div id="user-info2" class="logged-in"> rewritten as <div class="logged-in" id="user-info2">.
Mojolicious requires at least perl 5.10, but after that there's no non-core requirements.

Related

perl one liner -add html tags around matched html element

I am using MS word to generate (Web page, filtered) html and have a code style which I want to wrap with <pre><code>content</pre></code> to have preformatted text (including tabs and spaces) so visitors can copy and paste code from my site.
I am aware of the cautions / dangers of using regex on html / xml, but, not an issue, since I control content.
Input html looks like this:
<table class=Code1 PLAIN_MULTILINE_TEXT_AND_FORMATTING</table>
Output html should look like this:
<pre><code>
<table class=Code1 PLAIN_MULTILINE_TEXT_AND_FORMATTING</table>
</pre></code>
here is my one-liner, in bash script, $1 is filename:
perl -pi -e 's|<table class=Code1 (.*?)</table>|<pre><code><table class=Code1 $1</table></pre></code>|sg' $1
Which has no effect. Appears to not match.
Questions:
1 - What is wrong?
2 - Do I need /s (multiline) modifier
3 - Is there a better way (ultimately, will add this to a pre-cache / rendering script, along with existing auto table of contents and auto popup definitions creation)?
site: www.rossco.org
Thanks;
Bill
Add this to your Perl command line flags to read the entire file instead of line by line: -0777
Why not to make the code a little bit more readable? And may be it is not bad idea to have backup of original file
perl -pi 'orig_*' -0777 -e 's|(<table class=Code1 .*?</table>)|<pre><code>$1</pre></code>|sg' filename

Why is my regex failing to select the correct elements, when it works on the online regex tester

I have a number of xml files, that has HTML embedded in a node . I need capture everything that is not the tags, add some non HTML tags (for moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping through xargs to sneakily rip out newlines and then piping through sed.
Heres a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\< which is grouping the text as expected. but when I run with sed it isn't doing any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang
vi}$vi{mlang}{mlang
zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang
en\}$1\{mlang\}\{mlang
vi\}#VI#\{mlang\}\{mlang
zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replace to put {mlang en}TEXT{mlang} around the text.
I ended up using perl as it supports the non-greedy format i was using.
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.

Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td> but the <\/th>\s*<td> works in the RegEx so why not in Perl?
Have tried substituting \s* for \n*, \r*, \t* and various combinations thereof but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.
The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0 option, like this
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
Perl evaluates a file one line at a time, in your example you're trying to match over two lines so perl never finds the end of the string it's looking for on the first line, and never finds the beginning of the line it's looking for on the second line.
You can either flatten file.html to a single line temporarily (which might work if the file's small / performance is not so important) or you'll need to write more sophisticated logic to keep track of lines it's found.
Try searching for 'multiline regex perl' :)
You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html

How to insert an arbitrary string after pattern with sed

It must be really easy, but somehow I don't get it… I want to process an HTML-file via a bash script and insert an HTML-String into a certain node:
org.html: <div id="wrapper"></div>
MYTEXT=$(phantomjs capture.js www.somesite.com)
# MYTEXT will look something like this:
# <div id="test" style="top: -1.9%;">Something</div>
sed -i "s/\<div id=\"wrapper\"\>/\<div id=\"wrapper\"\>$MYTEXT/" org.html
I always get this error: bad flag in substitute command: 'd' which is probably because sed interprets the content of $MYTEXT as a pattern as well – which is not what I want…
By the way: Duplicating \<div id=\"wrapper\"\> is probably also not necessary?
It seems the / in $MYTEXT's </div> part is interpreted indeed as the final / in the sed command. You can choose another delimiter, which does not appear in $MYTEXT, for instance:
sed -i "s|\<div id=\"wrapper\"\>|\<div id=\"wrapper\"\>$MYTEXT|" org.html

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up until the <div id="stupid">. I know there are whitespace characters so i used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -n -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
You need to tell the regex engine to treat your scalar as a multiline string with the /m option; otherwise it won't attempt to match across newlines.
perl -p
is working on the file on a line by line basis see perl.com
that means your regex will never see all lines to match, it will only match when it gets the line that starts with "<div id="footer">" and on the following lines it will not match anymore.