How to use regex for multiple line pattern in shell script

I want to write a bash script that finds a pattern spanning multiple lines in an HTML file.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of the <td> tag with class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep doesn't seem to be the right command, because:
It returns either the complete line or, with -o, the complete match, not just the part inside the parentheses (...).
It only looks for a pattern within a single line.
So how can I get just the string 13.05.2013 17:51?

It's not quite there: it prints a leading blank line (the s command empties the opening-tag line, which p then prints), but maybe something like this?
$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\s*(.*?)\s*<\/td>/gs' regex.txt
13.05.2013 17:51

How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively, pre-process the file to strip all the newlines, and maybe collapse the whitespace too (awkward in sed, which works line by line; tr is simpler), and then go from there.
However, if it's an HTML file from somewhere else and you can't be sure of the format I'd consider using some other scripting language that has a library to parse XML, otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
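The pre-processing idea above (strip the newlines, then match) could be sketched roughly like this. It is only a sketch under assumptions: the temporary file stands in for the real HTML file, the variable name time_cell is made up for illustration, and the cell text is assumed to contain no < character.

```shell
#!/bin/sh
# Hypothetical stand-in for the real HTML file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
EOF

# tr joins all lines into one, so sed can see the opening tag, the text
# and the closing tag at once. The group starts and ends on a non-space
# character, which trims the padding left behind by the newlines.
time_cell=$(tr '\n' ' ' < "$tmp" |
  sed -n 's/.*<td class="time"> *\([^< ][^<]*[^< ]\) *<\/td>.*/\1/p')

echo "$time_cell"
rm -f "$tmp"
```

Because the whole file ends up on one line, this also covers the case where the tags and the text were on the same line to begin with.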

Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file
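To see why the record-separator trick works, here is a commented sketch of the first command run against a small sample. The temporary file and the variable name result are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical sample input.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
EOF

# RS='<' starts a new record at every tag, so each record looks like
#   td class="time">\n13.05.2013 17:51\n
# FS='>' then splits that record into the tag (field 1) and the text (field 2).
result=$(awk -v RS='<' -v FS='>' '
  /^td class="time">/ {
    gsub(/\n/, "")   # remove the newlines around the cell text
    print $2         # field 2 is whatever followed the tag
  }' "$tmp")

echo "$result"
rm -f "$tmp"
```

Since records are split on < rather than on newlines, the same command should work whether the cell text is on its own line or on the same line as the tags.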

Related

Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td> but the <\/th>\s*<td> works in the RegEx so why not in Perl?
I have tried replacing \s* with \n*, \r*, \t* and various combinations thereof, but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.
The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0 option, like this
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
With -p, Perl processes the file one line at a time. In your example you're trying to match across two lines, so Perl never finds the end of the pattern on the first line and never finds the beginning of it on the second.
You can either flatten file.html to a single line temporarily (which might work if the file's small / performance is not so important) or you'll need to write more sophisticated logic to keep track of lines it's found.
Try searching for 'multiline regex perl' :)
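The "keep track of lines" option mentioned above might look like the following awk sketch. It assumes the <td> to change always sits on the line directly after the Password header, as in the question; the variable names html, updated and seen are made up for illustration.

```shell
#!/bin/sh
# Sample taken from the question.
html='<tr>
<th>Password</th>
<td>root</td>
</tr>'

updated=$(printf '%s\n' "$html" | awk '
  # seen is set when the previous line held the Password header cell.
  seen { sub(/<td>[^<]*<\/td>/, "<td>NEWPASSWORD</td>"); seen = 0 }
  /<th>Password<\/th>/ { seen = 1 }
  { print }
')

printf '%s\n' "$updated"
```

The flag is cleared after one line, so only the cell immediately following the header is rewritten.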
You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html

Use a RegEx expression for PayPal output

I'm having issues using a sed expression to get the data I would like. I've researched it a bit and tried a small tutorial, but I could use some help. I feel that I can't use any
The closest I've come to a similar thread was "How do i print word after regex but not a similar word?".
I'm trying to parse through this to get information:
<table cellpadding=""0"" cellspacing=""0"" border=""0""><tr><td>Product<br>Total: 9.99 CAD<br></td></tr><tr><td><br /> <table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""font-size:10px;""><tr><td colspan=""2""><b style=""color:#777; font size:12px;"">==Payer Info==</b></td></tr><tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr></table></td></tr></table>
Ideally, from this I'd like to get the person's first name. I need an expression that matches everything up to the > before the first name and then captures the name.
$ sed -n 's/^.*[Payer Info] -- grab name and stop when you hit </td>
I've been misleading because I implied I was doing it in terminal. Which was my first goal. But now I need to use this RegEx in a Google Apps Script. I assumed that it would be similar - and it is not. Very sorry for all those who I misled.
This might work (assuming the format is always exactly like in your example):
sed -e 's/^.*First Name<\/b> <\/td><td>\([^<]*\).*$/\1/g' sed_sample
This extracts the first name (Greg in your case):
sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
You can easily modify it to get other fields out.
Second name:
sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
Email:
sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_'
Inside a script you can use something like:
NAME=$(echo "$STRING" | sed xxx)
where you replace xxx with one of the sed expressions above. Note that there must be no spaces around the = in a shell assignment.
There are many other possibilities to capture the output of a process inside a script.
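As one possible way to wrap this in a script, the sketch below puts the sed call into a small helper. The function name extract_field and the variable names are hypothetical, the sample string is a shortened version of the one in the question, and the label is assumed to contain no regex metacharacters.

```shell
#!/bin/sh
# Shortened sample from the question (doubled quotes kept as posted).
STRING='<tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr>'

# extract_field LABEL HTML
# Prints the text of the <td> that follows the cell holding LABEL.
extract_field() {
  printf '%s\n' "$2" |
    sed -n "s/^.*$1<\/b> *<\/td><td>\([^<]*\).*\$/\1/p"
}

FIRST=$(extract_field 'First Name' "$STRING")
LAST=$(extract_field 'Last Name' "$STRING")
EMAIL=$(extract_field 'E-Mail' "$STRING")

echo "$FIRST $LAST $EMAIL"
```

Using [^<]* for the captured text avoids having to list the allowed characters per field, at the cost of assuming the value never contains a < character.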

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up until the <div id="stupid">. I know there are whitespace characters, so I used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -p -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
Note that the /m option only changes where ^ and $ match; it is /s that lets . match a newline. Neither helps unless the whole multiline text is in the string being matched.
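The tr round trip described above could be sketched as follows. The sample HTML is hypothetical (the original file is not shown), and the sketch assumes that the ~ character never appears in the input and that the footer consists of exactly one inner line.

```shell
#!/bin/sh
# Hypothetical sample; the real file.html is not shown in the question.
html='<p>body</p>
<div id="footer">
  <div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
<p>after</p>'

# 1. tr turns every newline into ~ so the footer sits on a single line.
# 2. sed replaces the opening tag, one inner line and the closing tag.
# 3. tr turns the remaining ~ characters back into newlines.
result=$(printf '%s\n' "$html" |
  tr '\n' '~' |
  sed "s|<div id=\"footer\">[^~]*~[^~]*~</div>|<?php include 'footer.php';?>|" |
  tr '~' '\n')

printf '%s\n' "$result"
```

The sed script is double-quoted so that the single quotes in the PHP include survive; with a single-quoted script the shell would strip them.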
perl -p works on the file line by line (see perl.com).
That means your regex never sees all the lines at once: it only matches on the line that starts with <div id="footer">, and no longer matches on the following lines.

Return each instance of a regex

I've extensively googled and everyone keeps telling me how to return the LINE that the regex matches...
So let's say that I have a line like this in a text file:
<a href=http://google.com> Google </a>
I want to be able to return ONLY what occurs between > and < ("Google"). The problem is that I could have a file with thousands of lines like that and I only want to have sed/awk return the EXACT string that matches the regex.
I figured it would be something along the lines of :
sed 's/>.*</p'
but obviously that won't work...
It's killing me because I'm sure it's probably very simple, but I just can't find the right sed line. Can sed just not do it?
So I just want it to search through a file, match the regex I give it, and return the exact match (not the line).
Anyone have any ideas?
With sed:
sed -n 's/^.*>\([^<]*\)<.*$/\1/p'
If you have GNU grep, the -o option does what you want.
echo '<a href=http://google.com> Google </a><span>foo</span>' |
grep -oP '(?<=>).*?(?=<)'
returns
Google
foo
This assumes your html tags and their enclosed data are on one line. This would not match
<a href=...>
blah
</a>
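If grep -P is not available, a loop around awk's match() is one way to print every match on a line rather than just one. This is a sketch, not a drop-in replacement: the variable names line and matches are made up, and like the grep version it only handles tags and text that sit on the same line.

```shell
#!/bin/sh
line='<a href=http://google.com> Google </a><span>foo</span>'

matches=$(printf '%s\n' "$line" | awk '{
  s = $0
  # match() sets RSTART and RLENGTH for the leftmost match in s.
  while (match(s, />[^<>]+</)) {
    # Print the match without its surrounding ">" and "<".
    print substr(s, RSTART + 1, RLENGTH - 2)
    # Continue from the closing "<" so the next tag is still intact.
    s = substr(s, RSTART + RLENGTH - 1)
  }
}')

printf '%s\n' "$matches"
```

Surrounding spaces are kept as they are, so the first match here still carries the blanks around Google; empty matches between adjacent tags are skipped because [^<>]+ needs at least one character.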

One more greedy sed question

I'm doing an automated download of a number of images using an html frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever it may be; it is wrong, but I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop on the FIRST .jpg?
Use perl, which supports the non-greedy .*? quantifier:
perl -pe 's/^.*htm\?(.*?jpg).*$/$1/' concept.htm
You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches every character except a double quote.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
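The difference between the two patterns can be checked side by side with a quick sketch. The sample line below is a shortened, hypothetical version of the problem line from the question, and the variable names greedy and bounded are made up.

```shell
#!/bin/sh
# Shortened version of the problem line: the file name appears twice.
line='<td width="25%"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>'

# .* is greedy, so the capture runs on to the LAST "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\(.*jpg\).*$/\1/p')

# [^"]* cannot cross the closing quote of the first href, so the
# capture stops at the first file name.
bounded=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

Note that in sed's basic regular expressions the ? after htm is a literal question mark, which is why these patterns match the htm? in the URL at all.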
GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm