One more greedy sed question - regex

I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever it may be, but it is wrong and I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop on the FIRST .jpg?

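Use Perl, which supports non-greedy quantifiers. Note that in Perl the grouping parentheses are written unescaped and the literal ? must be escaped:
perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm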

You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.

Use [^"] instead of . in the regular expression.
This matches any character except a double quote, so the match cannot run past the end of the first href.

sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
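To see the difference between the greedy and the bounded pattern, here is a minimal sketch. The sample line is a shortened, made-up version of the problematic line from the question, and the variable names are only for illustration.

```shell
# Shortened stand-in for the problematic line (two .jpg references on one line).
line='<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>'

# Greedy: .* runs on to the LAST "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\(.*jpg\).*$/\1/p')

# Bounded: [^"]* cannot cross the closing quote of the first href.
bounded=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The bounded form only helps as long as the file name itself never contains a double quote, which should hold for a quoted href value.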

GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm

Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td> but the <\/th>\s*<td> works in the RegEx so why not in Perl?
I have tried replacing \s* with \n*, \r*, \t* and various combinations thereof, but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.
The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0 option, like this
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
With -p, Perl processes the file one line at a time. In your example you're trying to match across two lines, so Perl never finds the end of the pattern on the first line and never finds its beginning on the second.
You can either flatten file.html to a single line temporarily (which might work if the file's small / performance is not so important) or you'll need to write more sophisticated logic to keep track of lines it's found.
Try searching for 'multiline regex perl' :)
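The line-at-a-time behaviour is easy to check on the snippet from the question. This is a small sketch that reads from a pipe instead of editing file.html in place; the variable names are illustrative only.

```shell
html='<tr>
<th>Password</th>
<td>root</td>
</tr>'

# Same substitution as in the question, kept in a variable for reuse.
re='s/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g'

# Default -p: the regex sees one line per iteration, so it never matches.
by_line=$(printf '%s\n' "$html" | perl -pe "$re")

# -0777 slurps the whole input as one record, so \s* can span the newline.
slurped=$(printf '%s\n' "$html" | perl -0777 -pe "$re")

printf '%s\n---\n%s\n' "$by_line" "$slurped"
```

Slurping loads the whole file into memory, which is usually fine for HTML pages of ordinary size.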
You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html
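Both sed versions rely on the n command, which prints the current line and loads the next one, so the substitution is applied to the line after the <th> line. A quick sketch on the question's snippet, without -i so nothing is overwritten (the extra ; before } is only there for portability to non-GNU sed):

```shell
html='<tr>
<th>Password</th>
<td>root</td>
</tr>'

# On the <th>Password</th> line: move on to the next line, then rewrite its <td> content.
result=$(printf '%s\n' "$html" | sed '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!;}')

printf '%s\n' "$result"
```

This assumes the <td> always sits on the line directly after the <th>; if both are on the same line, the Perl approach is the safer choice.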

How to insert an arbitrary string after pattern with sed

It must be really easy, but somehow I don't get it… I want to process an HTML-file via a bash script and insert an HTML-String into a certain node:
org.html: <div id="wrapper"></div>
MYTEXT=$(phantomjs capture.js www.somesite.com)
# MYTEXT will look something like this:
# <div id="test" style="top: -1.9%;">Something</div>
sed -i "s/\<div id=\"wrapper\"\>/\<div id=\"wrapper\"\>$MYTEXT/" org.html
I always get this error: bad flag in substitute command: 'd' which is probably because sed interprets the content of $MYTEXT as a pattern as well – which is not what I want…
By the way: Duplicating \<div id=\"wrapper\"\> is probably also not necessary?
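Indeed, the / in the </div> part of $MYTEXT is interpreted as the closing / of the sed command. You can choose another delimiter that does not appear in $MYTEXT. Also note that in GNU sed, \< and \> in the pattern are word-boundary anchors rather than literal angle brackets, so drop those backslashes. And you don't need to repeat the pattern: & in the replacement stands for the matched text. For instance:
sed -i "s|<div id=\"wrapper\">|&$MYTEXT|" org.html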
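Changing the delimiter still breaks if the inserted text happens to contain that delimiter, a backslash or an &. One possible way around this, sketched below, is to escape those characters first. The sample value of MYTEXT is made up, and the variable name escaped is only illustrative; text containing newlines would need more work than this.

```shell
# Made-up sample text containing characters that are special in a sed replacement.
MYTEXT='<div id="test" style="top: -1.9%;">Fish & Chips | 50%</div>'

# Put a backslash in front of every backslash, & and | (the delimiter used below).
escaped=$(printf '%s' "$MYTEXT" | sed 's/[\\&|]/\\&/g')

# & reuses the matched opening tag; the escaped text is appended after it.
result=$(printf '%s\n' '<div id="wrapper"></div>' | sed "s|<div id=\"wrapper\">|&$escaped|")

printf '%s\n' "$result"
```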

How to use regex for multiple line pattern in shell script

I want to write a bash script that finds a pattern in a html-file which is going over multiple lines.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of <td>-tag with the class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep seems not to be the command I can use, because...
It only returns the complete line or the complete result using -o and not only the result inside the round brackets (...).
It looks only in one line for a pattern
So how is it possible that I will get only a string with 13.05.2013 17:51?
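Maybe something like this? It prints only the lines between the opening and closing tags. Deleting the tag lines with d, rather than blanking the opening tag with s (which leaves an empty line to be printed), avoids a leading newline:
$ sed -n '/<td class="time">/,/<\/td>/{/^<td class="time">$/d;/^<\/td>$/d;p;}' file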
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\s*(.*?)\s*<\/td>/gs' regex.txt
13.05.2013 17:51
How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively, pre-process the file to strip all the newlines and maybe collapse the whitespace too (awkward in line-oriented sed; tr is simpler), and then go from there.
However, if it's an HTML file from somewhere else and you can't be sure of the format I'd consider using some other scripting language that has a library to parse XML, otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file
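The awk approach works by changing what counts as a record and a field: with RS set to < every tag starts a new record, and with FS set to > plus trailing whitespace the text after the tag ends up in $2. Here is a sketch on the sample from the question. It passes the variables with -v and spells the whitespace class as [ \t\n], an adaptation for awk versions that lack [[:space:]].

```shell
html='<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>'

# Records are split at "<", so the record of interest starts with: td class="time">
# Fields are split at ">" plus any whitespace after it, so $2 is the cell text.
result=$(printf '%s\n' "$html" |
  awk -v RS='<' -v FS='>[ \t\n]*' -v ORS= '/^td class="time">/{print $2}')

printf '%s\n' "$result"
```

Like the sed variants, this depends on the exact spelling of the opening tag, so it is still no substitute for a real HTML parser on input you don't control.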

Use a RegEx expression for PayPal output

I'm having issues using a sed expression to get the data I would like. I've researched it a bit and tried a small tutorial, but I could use some help. I feel that I can't use any
The closest I've come to a similar thread was "How do i print word after regex but not a similar word?".
I'm trying to parse through this to get information:
<table cellpadding=""0"" cellspacing=""0"" border=""0""><tr><td>Product<br>Total: 9.99 CAD<br></td></tr><tr><td><br /> <table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""font-size:10px;""><tr><td colspan=""2""><b style=""color:#777; font size:12px;"">==Payer Info==</b></td></tr><tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr></table></td></tr></table>
Ideally, from this I'd like to get the person's first name. I have to make an expression that matches everything up to the > before the first name and then grabs the name itself.
$ sed -n 's/^.*[Payer Info] -- grab name and stop when you hit </td>
I've been misleading because I implied I was doing it in terminal. Which was my first goal. But now I need to use this RegEx in a Google Apps Script. I assumed that it would be similar - and it is not. Very sorry for all those who I misled.
This might work (assuming the format is always exactly like in your example):
sed -e 's/^.*First Name<\/b> <\/td><td>\([^<]*\).*$/\1/g' sed_sample
This extracts the first name (Greg in your case):
sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
You can easily modify it to get other fields out.
Last name:
sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
Email:
sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_'
Inside a script you can use something like:
NAME=$(echo "$STRING" | sed xxx)
where you replace xxx with one of the sed expressions above. Note that the shell does not allow spaces around the = in an assignment.
There are many other possibilities to capture the output of a process inside a script.
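Put together, capturing the fields in a script could look like the sketch below. STRING holds a shortened, made-up excerpt of the PayPal table from the question, and the variable names are only for illustration.

```shell
# Shortened excerpt of the table; the doubled quotes are kept as in the question.
STRING='<tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr>'

# [^d]*d> skips to the end of the closing </td>, [^>]*> skips the next <td>,
# and the group captures the letters that follow.
NAME=$(printf '%s\n' "$STRING" | sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')
LAST=$(printf '%s\n' "$STRING" | sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')

printf '%s %s\n' "$NAME" "$LAST"
```

Names with hyphens, spaces or non-ASCII letters would need a wider character class than [A-Za-z], for example [^<].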

sed - what's wrong with my replacement?

I have a list of 300+ files that have to be edited, so I thought that the sed command in tandem with find and exec could help me.
Before doing something wrong (like overwriting files that I don't want to modify), I decided to run sed and print its result to the terminal instead of substituting in place.
The string I'm searching for is: <tr><td class="button" style=" background:#040404; color:#eee9dc; font-size: 23px; padding:5px 0; text-align:center;">Offerte & LastMinute</td></tr>
and I only want to replace that part <a href="../Z2/C24357-0/hhcm-LASTMINUTES.html with <a href="../../../special_offers.php?lang=it"
Since I'm a noob with regex, I took a look at a web page that introduced me to the topic in a decent way.
Now, I've tried something like this (on a single file, just to look at the output in a "safe" way):
sed s/\<a.*LAST.*html\"/\<a href="\.\.\/\.\.\/\.\.\/special_offers\.php\?lang=en"/ C25030-9_3/hhcm-Solo_per_due.html
and I get this error: sed: -e expression #1, char 20: unterminated `s' command
as if the sed expression isn't correct, but I don't know whether the error is in the replacement part or somewhere else.
Thanks in advance for your time and effort.
S.
Edit
Thanks to the answers, I resolved it by doing something like this:
sed 's/<a.*LAST.*html\"/\<a href="..\/..\/..\/special_offers.php?lang=en"/' C25030-9_3/hhcm-Solo_per_due.html
First, you need to put the whole sed expression (s/.../.../) in single quotes. Otherwise the shell splits it at the space after <a, and sed receives only the fragment up to that point, which is where the "unterminated" error at char 20 comes from. Second, you forgot the / at the end (see edit). Third, don't forget that the substitution only works if the whole pattern is on one line.
EDIT: More info on the s command: its form is s/old/new/, with optional flags after the last /. Your command looks like s/something/somethingelse/somethingelse; it has more parts than it should, I think.
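Since both the pattern and the replacement are full of slashes, a different delimiter saves most of the escaping. A minimal sketch on a made-up line that contains the link from the question, printing the result instead of editing any file:

```shell
# Made-up sample line containing the link that should be replaced.
line='<a href="../Z2/C24357-0/hhcm-LASTMINUTES.html">Offerte</a>'

# Single quotes keep the shell from splitting the expression at the space;
# with | as delimiter the slashes in the paths need no backslashes.
result=$(printf '%s\n' "$line" | sed 's|<a.*LAST.*html"|<a href="../../../special_offers.php?lang=it"|')

printf '%s\n' "$result"
```

Once the output looks right on one file, the same expression can be handed to find with -exec and sed -i; keeping a backup of the files first is a sensible precaution.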