sed - what's wrong with my replacement? - regex

I have a list of 300+ files that need to be edited, so I thought that sed, in tandem with find and -exec, could help me.
Before doing something wrong (like overwriting files that I don't want to modify), I decided to have sed print its result to the terminal instead of substituting in place.
The string I'm searching for is: <tr><td class="button" style=" background:#040404; color:#eee9dc; font-size: 23px; padding:5px 0; text-align:center;">Offerte & LastMinute</td></tr>
and I only want to replace the part <a href="../Z2/C24357-0/hhcm-LASTMINUTES.html" with <a href="../../../special_offers.php?lang=it"
Since I'm new to regex, I took a look at a web page that gave me a decent introduction to the topic.
Now I've tried something like this (on a single file, just to look at the output in a safe way):
sed s/\<a.*LAST.*html\"/\<a href="\.\.\/\.\.\/\.\.\/special_offers\.php\?lang=en"/ C25030-9_3/hhcm-Solo_per_due.html
and I get this error: sed: -e expression #1, char 20: unterminated `s' command
so it seems the sed expression isn't correct, but I don't know whether the error is in the replacement part or somewhere else.
Thanks in advance for your time and effort.
S.
Edit
Thanks to the answers, I resolved it with something like this:
sed 's/<a.*LAST.*html\"/\<a href="..\/..\/..\/special_offers.php?lang=en"/' C25030-9_3/hhcm-Solo_per_due.html

First, you need to put the whole sed expression (s/.../.../) in single quotes; otherwise the shell splits it at the first unquoted space and sed only sees part of the command. Second, you forgot the / at the end (see edit). Third, don't forget that the substitution will only work if the whole pattern string is on one line.
EDIT: More info on the s command (see here, for example): the form is s/old/new/, with optional flags after the last /. Your command looks like s/something/somethingelse/somethingelse, which has more parts than it should, I think.
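The quoting point can be tried safely on a single line before touching any file. The following is only a sketch: the sample line is a hypothetical anchor built around the href quoted in the question, and the | delimiter is a choice made here so that the slashes in the paths need no escaping.

```shell
# Hypothetical one-line sample; only the href value comes from the question.
line='<td><a href="../Z2/C24357-0/hhcm-LASTMINUTES.html">Offerte</a></td>'

# Single quotes stop the shell from splitting or interpreting the script.
# [^>]* and [^"]* keep the match inside this one tag instead of using greedy .*
result=$(printf '%s\n' "$line" |
  sed 's|<a[^>]*LAST[^"]*html"|<a href="../../../special_offers.php?lang=it"|')

echo "$result"
```

Because nothing is written in place, the output can be inspected first; adding -i comes only after the result looks right.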

Related

Using ampersand in sed

I have a csv file full of lines like the following:
Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam
I want to create a sed pattern for an automated batch script that changes the info in this kind of formatting into the following formatting:
Aity Chel Jenni,Hendaland 30,2591 TE, Amsterdam
With a bit of research, I found out that I had to create a regex, then use an ampersand (&) character to have it change things around using the & to define the location of the regex.
I have tried the following:
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
And have been trying variants of that trying to get the regexes down, but it doesn't seem to change anything.
Am I making a mistake in the usage of the ampersand or is my regex wrong?
Reading through the internet I can't seem to wrap my head around this function, can someone give me any examples/explain to me how to properly do this?
You are saying
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
       ^
But you don't have to capture with () to use &. Instead, just say:
sed 's/[1-9] [A-Z]\{2\}/&,/' file
Note that in basic regular expressions (sed's default) you need to escape the braces of the { } quantifier, unless you switch to extended regular expressions with -r:
sed -r 's/[1-9] [A-Z]{2}/&,/' file
Try the following:
sed -r 's:[0-9] [A-Z]{2}\b:&,:' file > out
About your own pattern: you're missing the closing parenthesis. Also, in sed's default (basic) syntax an unescaped ( matches a literal parenthesis; you need to write \( and \) to make a group.
The -r option enables extended regex in sed, which lets you write the {2} quantifier (and grouping parentheses) without backslashes.
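To see what & does, here is a small sketch that runs both spellings of the quantifier on the sample line from the question. It assumes a sed that accepts -E for extended regex (GNU and BSD sed both do); the variable names are made up for the example.

```shell
line='Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam'

# Basic regex: the interval braces are escaped. & stands for the matched text.
bre=$(printf '%s\n' "$line" | sed 's/[0-9] [A-Z]\{2\}/&,/')

# Extended regex: same pattern, braces unescaped.
ere=$(printf '%s\n' "$line" | sed -E 's/[0-9] [A-Z]{2}/&,/')

echo "$bre"
echo "$ere"
```

In both cases the match is "1 TE", and the replacement puts it back followed by a comma.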

Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td> but the <\/th>\s*<td> works in the RegEx so why not in Perl?
I have tried replacing \s* with \n*, \r*, \t* and various combinations thereof, but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.
The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0 option, like this
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
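The effect of -0777 can be checked without editing a file. This sketch feeds the HTML from the question through stdin and drops -i, so nothing is modified; ${1} is written with braces only to make the boundary with NEWPASSWORD obvious.

```shell
# The four lines from the question, sent to perl on stdin.
result=$(printf '<tr>\n<th>Password</th>\n<td>root</td>\n</tr>\n' |
  perl -0777 -pe 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/${1}NEWPASSWORD$2/g')

echo "$result"
```

With -0777 the whole input is read as a single record, so \s* can match the newline between </th> and <td>.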
With -p, Perl processes the file one line at a time. In your example you're trying to match across two lines, so Perl never finds the end of the pattern on the first line and never finds its beginning on the second.
You can either flatten file.html to a single line temporarily (which might work if the file is small or performance is not so important), or you'll need to write more sophisticated logic that keeps track of the lines it has already seen.
Try searching for 'multiline regex perl' :)
You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html

How to insert an arbitrary string after pattern with sed

It must be really easy, but somehow I don't get it… I want to process an HTML-file via a bash script and insert an HTML-String into a certain node:
org.html: <div id="wrapper"></div>
MYTEXT=$(phantomjs capture.js www.somesite.com)
# MYTEXT will look something like this:
# <div id="test" style="top: -1.9%;">Something</div>
sed -i "s/\<div id=\"wrapper\"\>/\<div id=\"wrapper\"\>$MYTEXT/" org.html
I always get this error: bad flag in substitute command: 'd' which is probably because sed interprets the content of $MYTEXT as a pattern as well – which is not what I want…
By the way: Duplicating \<div id=\"wrapper\"\> is probably also not necessary?
Indeed, the / in the </div> part of $MYTEXT is interpreted as the final / of the sed command. You can choose another delimiter that does not appear in $MYTEXT, for instance:
sed -i "s|\<div id=\"wrapper\"\>|\<div id=\"wrapper\"\>$MYTEXT|" org.html
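Changing the delimiter still leaves two characters that sed treats specially in the replacement: & and the backslash. A possible way around this, sketched below, is to escape them (and the delimiter) before interpolating. The MYTEXT value is a hypothetical stand-in for the phantomjs output and is assumed to be a single line. The angle brackets are left unescaped on purpose, because GNU sed reads \< and \> as word-boundary anchors rather than literal characters.

```shell
# Hypothetical stand-in for the phantomjs output; assumed to be one line.
MYTEXT='<div id="test" style="top: -1.9%;">Something & more</div>'

# Put a backslash in front of every backslash, | and & so sed takes them literally.
escaped=$(printf '%s' "$MYTEXT" | sed 's/[\\|&]/\\&/g')

# & in the replacement re-emits the matched tag, so it is not typed twice.
result=$(printf '%s\n' '<div id="wrapper"></div>' |
  sed "s|<div id=\"wrapper\">|&$escaped|")

echo "$result"
```

This also answers the side question: the opening tag does not have to be duplicated in the replacement, since & stands for whatever the pattern matched.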

Use a RegEx expression for PayPal output

I'm having issues using a sed expression to get the data I would like. I've researched it a bit and tried a small tutorial, but I could use some help. I feel that I can't use any
The closest I've come to a similar thread was "How do i print word after regex but not a similar word?".
I'm trying to parse through this to get information:
<table cellpadding=""0"" cellspacing=""0"" border=""0""><tr><td>Product<br>Total: 9.99 CAD<br></td></tr><tr><td><br /> <table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""font-size:10px;""><tr><td colspan=""2""><b style=""color:#777; font size:12px;"">==Payer Info==</b></td></tr><tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr></table></td></tr></table>
Ideally, from this I'd like to get the person's first name. I have to make an expression that matches everything up to the > before the first name and then captures the name.
$ sed -n 's/^.*[Payer Info] -- grab name and stop when you hit </td>
I've been misleading because I implied I was doing it in terminal. Which was my first goal. But now I need to use this RegEx in a Google Apps Script. I assumed that it would be similar - and it is not. Very sorry for all those who I misled.
This might work (assuming the format is always exactly like in your example):
sed -e 's/^.*First Name<\/b> <\/td><td>\([^<]*\).*$/\1/g' sed_sample
This extracts the first name (Greg in your case):
sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
You can easily modify it to get other fields out.
Last name:
sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
Email:
sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_'
Inside a script you can use something like:
NAME=$(echo "$STRING" | sed xxx)
where you replace xxx with the commands from sed.
There are many other possibilities to capture the output of a process inside a script.
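Put together, capturing a field inside a script might look like the sketch below. STRING is a shortened, hypothetical excerpt of the sample in the question, and extract_field is a helper name invented here; the pattern assumes the markup always has exactly the shape "Label</b> </td><td>value".

```shell
# Shortened excerpt of the sample row (hypothetical; same markup shape).
STRING='<tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr>'

# extract_field LABEL TEXT: print the cell content that follows LABEL.
extract_field() {
  printf '%s\n' "$2" | sed -n "s_^.*$1</b> </td><td>\([^<]*\).*_\1_p"
}

first_name=$(extract_field 'First Name' "$STRING")
last_name=$(extract_field 'Last Name' "$STRING")

echo "$first_name $last_name"
```

Note that there are no spaces around = in the assignments; the shell would otherwise try to run NAME as a command.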

One more greedy sed question

I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever else it may be; it is wrong, but I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why: sed is greedy, and that obviously shows up in this case.
Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop on the FIRST .jpg?
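Use Perl, which supports non-greedy quantifiers. Note that in Perl groups are written with plain parentheses, a literal question mark must be escaped as \?, and the captured group is $1:
perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm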
You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches any character except a double quote, so the match cannot run past the end of the first href value.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm
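The difference between the greedy pattern and the negated character class can be seen side by side. The line below is a shortened, hypothetical version of the problem line from the question, with two occurrences of the same .jpg name.

```shell
# Shortened version of the problem line (hypothetical attributes trimmed).
line='<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top">'

# Greedy: .* runs to the LAST "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\(.*jpg\).*$/\1/p')

# Bounded: [^"]* cannot cross the closing quote of the first href.
bounded=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The ? after htm is a literal question mark here because sed uses basic regular expressions by default.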