How to insert an arbitrary string after a pattern with sed

It must be really easy, but somehow I don't get it… I want to process an HTML-file via a bash script and insert an HTML-String into a certain node:
org.html: <div id="wrapper"></div>
MYTEXT=$(phantomjs capture.js www.somesite.com)
# MYTEXT will look something like this:
# <div id="test" style="top: -1.9%;">Something</div>
sed -i "s/\<div id=\"wrapper\"\>/\<div id=\"wrapper\"\>$MYTEXT/" org.html
I always get this error: bad flag in substitute command: 'd' which is probably because sed interprets the content of $MYTEXT as a pattern as well – which is not what I want…
By the way: Duplicating \<div id=\"wrapper\"\> is probably also not necessary?

The / in the </div> part of $MYTEXT is indeed interpreted as the closing / of the sed command. You can choose another delimiter that does not appear in $MYTEXT, for instance |. Also drop the backslashes before < and >: in GNU sed, \< and \> are word-boundary anchors, not literal angle brackets.
sed -i "s|<div id=\"wrapper\">|<div id=\"wrapper\">$MYTEXT|" org.html
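Changing the delimiter still leaves a few characters in $MYTEXT that sed treats specially in the replacement (the backslash, & and the delimiter itself). A minimal sketch of escaping them first, assuming $MYTEXT is a single line; the sample values below are hypothetical stand-ins for org.html and the phantomjs output:

```shell
# Hypothetical sample values standing in for org.html and the phantomjs output.
html='<div id="wrapper"></div>'
MYTEXT='<div id="test" style="top: -1.9%;">Something & more</div>'

# Escape backslash, the chosen delimiter (|) and & so sed inserts the text literally.
escaped=$(printf '%s' "$MYTEXT" | sed 's/[\\|&]/\\&/g')

# An unescaped & in the replacement stands for the matched text,
# so the opening tag does not have to be repeated.
result=$(printf '%s\n' "$html" | sed "s|<div id=\"wrapper\">|&$escaped|")
printf '%s\n' "$result"
```

The same substitution can be run against the file itself once the output looks right. Text containing newlines would need additional handling that this sketch does not cover.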

Related

Why is my regex failing to select the correct elements, when it works on the online regex tester

I have a number of XML files that have HTML embedded in a node. I need to capture everything that is not a tag and add some non-HTML tags (for Moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping through xargs to sneakily rip out newlines and then piping through sed.
Here's a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\< which groups the text as expected, but when I run it with sed it doesn't do any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang vi}$vi{mlang}{mlang zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang en\}$1\{mlang\}\{mlang vi\}#VI#\{mlang\}\{mlang zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replacement to put {mlang en}TEXT{mlang} around the text.
I ended up using Perl, as it supports the non-greedy quantifier I was using.
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.
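sed has no non-greedy .*? quantifier, and it writes capture groups as \1 rather than $1, which is why the pattern from the regex tester did nothing. A possible sed-only sketch uses a negated character class in place of the non-greedy match; the sample string is a hypothetical, shortened version of the content above:

```shell
# Hypothetical, shortened sample of the <content> markup.
sample='<h3>TITLE</h3><p>Some text</p><p></p>'

# [^<][^<]* matches one or more characters that are not "<", so each match
# stops at the next tag without needing a non-greedy quantifier.
result=$(printf '%s\n' "$sample" |
  sed 's/>\([^<][^<]*\)</>{mlang en}\1{mlang}</g')
printf '%s\n' "$result"
```

Note that this also wraps whitespace-only text between tags, so a cleanup step like the one in the full command above may still be needed.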

Use a RegEx expression for PayPal output

I'm having issues using a sed expression to get the data I would like. I've researched it a bit and tried a small tutorial, but I could use some help. I feel that I can't use any
The closest I've come to a similar thread was "How do i print word after regex but not a similar word?".
I'm trying to parse through this to get information:
<table cellpadding=""0"" cellspacing=""0"" border=""0""><tr><td>Product<br>Total: 9.99 CAD<br></td></tr><tr><td><br /> <table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""font-size:10px;""><tr><td colspan=""2""><b style=""color:#777; font size:12px;"">==Payer Info==</b></td></tr><tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr></table></td></tr></table>
Ideally, from this I'd like to get the person's first name. I need an expression that matches up to the > before the first name and then captures the name.
$ sed -n 's/^.*[Payer Info] -- grab name and stop when you hit </td>
I was misleading because I implied I was doing this in a terminal, which was my first goal. But now I need to use this regex in a Google Apps Script. I assumed that it would be similar, and it is not. Very sorry to all those I misled.
This might work (assuming the format is always exactly like in your example):
sed -e 's/^.*First Name<\/b> <\/td><td>\([^<]*\).*$/\1/g' sed_sample
This extracts the first name (Greg in your case):
sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
You can easily modify it to get other fields out.
Last name:
sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
Email:
sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_'
Inside a script you can use something like:
NAME=$(echo "$STRING" | sed xxx)
where you replace xxx with one of the sed expressions above. Note that bash does not allow spaces around the = in an assignment.
There are many other possibilities to capture the output of a process inside a script.
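Put together, capturing the fields in a script might look like the sketch below. The STRING value is a hypothetical, shortened fragment of the table above, written with ordinary quotes:

```shell
# Hypothetical, shortened fragment of the PayPal table.
STRING='<tr><td width="70"><b style="color:#777">First Name</b> </td><td>Greg</td></tr><tr><td><b style="color:#777">Last Name</b> </td><td>Allan</td></tr>'

# Skip from the label to the end of the closing </td>, skip the next <td>,
# then capture the letters that follow.
NAME=$(printf '%s\n' "$STRING" | sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')
LAST=$(printf '%s\n' "$STRING" | sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')

printf '%s %s\n' "$NAME" "$LAST"
```

The expressions depend on the markup staying exactly in this shape, so they are best treated as a quick extraction rather than a general HTML parser.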

Perl script to search and replace multiple lines in multiple html files

I have many HTML files in a folder. I need to remove a <div id="user-info" ...>...</div> from all of them. As far as I know I need a Perl script for that, but I don't know Perl well enough to write it. Could someone help me with it?
Here is what the "bad" code looks like:
<div id="user-info" class="logged-in">
<a class="icon icon-key-delete" href="https://test.dev/login.php?0,logout=1">Log Out</a>
<a class="icon icon-user-edit" href="https://test.dev/control.php">Control Center</a>
</div> <!-- end of div id=user-info -->
Thank you in advance!
Using XML::XSH2:
for { glob '*.html' } {
open :F html (.) ;
delete //div[@id="user-info" and @class="logged-in"] ;
save :b ;
}
perl -0777 -i.withdiv -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' test.html
-0777 means split on nothing, so the whole file is slurped in (instead of being read line by line, the default for -p).
-i.withdiv means alter files in place, leaving the original with the extension .withdiv (without -i, -p just prints to standard output).
-p means pass the input line by line (except that here we are slurping) to the given code (see -e) and print the result.
-e expects code to run.
See man perlrun or perldoc perlrun for more info.
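To check the effect before editing anything in place, the same substitution can be run without -i so that the result goes to a separate file. A small sketch on a hypothetical sample file created in a temporary directory:

```shell
# Build a hypothetical sample file in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.html" <<'EOF'
<body>
<div id="user-info" class="logged-in">
<a class="icon icon-user-edit" href="https://test.dev/control.php">Control Center</a>
</div> <!-- end of div id=user-info -->
<p>kept</p>
</body>
EOF

# No -i here: the original file is left untouched and the result is redirected.
perl -0777 -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsi' \
  "$tmpdir/test.html" > "$tmpdir/out.html"
cat "$tmpdir/out.html"
```

The non-greedy .*? stops at the first </div>, so this only works as long as the user-info div contains no nested div elements.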
Here's another solution, which will be slightly more familiar to people that know jquery, as the syntax is similar. This uses Mojolicious' ojo module to load up the html content into a Mojo::DOM object, transform it, and then print that transformed version:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { say x(scalar(read_file $_))->at("#user-info")->replace("")->root; }' test.html test2.html test*.html
To replace content directly:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { write_file( $_, x(scalar(read_file $_))->at("#user-info")->replace("")->root ); }' test.html
Note, this won't JUST remove the div, it will also re-write the content based on Mojo's Mojo::DOM module, so tag attributes may not be in the same order. Specifically, I saw <div id="user-info2" class="logged-in"> rewritten as <div class="logged-in" id="user-info2">.
Mojolicious requires at least perl 5.10, but beyond that there are no non-core requirements.

sed - what's wrong with my replacement?

I have a list of 300+ files that have to be edited, so I thought that the sed command in tandem with find and exec could help me.
Before doing something wrong (like overwriting files that I don't want to modify), I decided to use sed and print its result to the terminal instead of substituting in place.
The string I'm searching for is: <tr><td class="button" style=" background:#040404; color:#eee9dc; font-size: 23px; padding:5px 0; text-align:center;">Offerte & LastMinute</td></tr>
and I only want to replace the part <a href="../Z2/C24357-0/hhcm-LASTMINUTES.html with <a href="../../../special_offers.php?lang=it"
Since I'm a noob at regex, I had a look at a web page that introduces the topic in a decent way.
Now, I've tried something like this (on a single file, just to look at the output in a "safe" way):
sed s/\<a.*LAST.*html\"
/\<a href="\.\.\/\.\.\/\.\.\/special_offers\.php\?lang=en"/
C25030-9_3/hhcm-Solo_per_due.html
and I get this error: sed: -e expression #1, char 20: unterminated `s' command
It seems the sed expression isn't correct, but I don't know whether the error is in the replacement part or somewhere else.
Thanks in advance for your time and effort.
S.
Edit
Thanks to the answers, I resolved it with something like this:
sed 's/<a.*LAST.*html\"/\<a href="..\/..\/..\/special_offers.php?lang=en"/' C25030-9_3/hhcm-Solo_per_due.html
First, you need to put the whole sed expression (s/.../.../) in single quotes; second, you forgot the / at the end (see edit); and third, don't forget that the substitution only works if the whole pattern string is on one line.
EDIT: More info on s command (see here, for example): it's s/old/new/ and possible flags after the last /. Your command looks like s/something/somethingelse/somethingelse, it has more than it should, I think.
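A possible variant of the corrected command, tried on a single line first: using | as the delimiter means the slashes in the paths need no escaping, and [^"]* keeps the match inside one href value. The sample line is a hypothetical reconstruction of the row being edited:

```shell
# Hypothetical reconstruction of the table row that contains the link.
line='<tr><td class="button"><a href="../Z2/C24357-0/hhcm-LASTMINUTES.html">Offerte & LastMinute</a></td></tr>'

# Single quotes keep the shell from touching the expression;
# [^"]* cannot run past the closing quote of the href attribute.
result=$(printf '%s\n' "$line" |
  sed 's|<a href="[^"]*LASTMINUTES\.html"|<a href="../../../special_offers.php?lang=it"|')
printf '%s\n' "$result"
```

Because nothing is written back, this is safe to run across many files before switching to in-place editing.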

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up to the <div id="stupid">. I know there are whitespace characters, so I used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -p -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
Note that the /m option only changes where ^ and $ match. To let . match across newlines you need the /s option, and the whole input has to be in one string in the first place.
perl -p works on the file line by line (see perl.com). That means your regex never sees all the lines at once: it only matches on the line that starts with <div id="footer">, and on the following lines it no longer matches.