Regex doesn't produce match when containing a new line - regex

I'm trying to parse the page https://extensions.typo3.org/extension/tt_news/ for version numbers and corresponding dates with sed or grep.
More specifically, I'm interested in the following html section:
<tr>
<td class="align-middle">
<strong>3.6.0</strong> /
<span class="ter-ext-state-beta">beta</span>
<br />
<small>
April 06, 2014
</small>
</td>
<td class="align-middle">
tt_news for TYPO3 4.5 - 6.2 (compatibility update)
</td>
<td class="align-middle">
<strong>4.5.0 - 6.2.99</strong>
</td>
<td class="align-middle">
<a class="btn btn-primary" title="Size: 2.58MB" href="/extension/download/tt_news/3.6.0/zip/">
<strong>
Download ZIP Archive
</strong>
</a>
</td>
</tr>
I would like to get from each of these sections the version (between the strong tag) and the date (between the small tag).
All my attempts have failed so far and I can narrow down the problem to something very easy.
I have tested the following regex, which only tries to match a tr tag followed by whitespace and a td tag, on regex101.com, and there it works perfectly fine:
<tr>\s*<td
It gives me 5 matches which is correct. The following one also works fine:
<tr[^>]*>\s*<td
It produces 38 results because it includes those tr tags with a css class attribute.
However, I can't get this to work with either grep or sed. As soon as I include the \s, there are no matches anymore. Here is what it looks like:
cat tt_news_history | grep '<tr>\s*<td'
no hits.
cat tt_news_history | grep '<tr>'
6 hits.
cat tt_news_history | grep '<tr[^>]*>'
lots of hits (didn't count). Same thing with sed.
What am I doing wrong? Why can't I use a \s?
Thanks for any hint.

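GNU grep has a -z option that treats the input as NUL-terminated records instead of newline-terminated lines, so a file without NUL bytes is read as one record and \s can match the newlines in it, e.g.: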
cat tt_news_history | grep -z '<tr>\s*<td'
The relevant fragments from the info documentation:
‘-z’ ‘--null-data’
Treat input and output data as sequences of lines, each terminated
by a zero byte (the ASCII NUL character) instead of a newline.
Like the ‘-Z’ or ‘--null’ option, this option can be used with
commands like ‘sort -z’ to process arbitrary file names.
(...)
How can I match across lines?
Standard grep cannot do this, as it is fundamentally line-based.
Therefore, merely using the ‘[:space:]’ character class does not match
newlines in the way you might expect.
With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input “line” is
terminated by a null byte; *note Other Options::. Thus, you can match
newlines in the input, but typically if there is a match the entire
input is output, so this usage is often combined with
output-suppressing options like ‘-q’, e.g.:
printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'
If this does not suffice, you can transform the input before giving it
to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other utilities
that are designed to operate across lines.

Related

How to use regex for multiple line pattern in shell script

I want to write a bash script that finds a pattern in a html-file which is going over multiple lines.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of <td>-tag with the class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep seems not to be the command I can use, because...
It only returns the complete line or the complete result using -o and not only the result inside the round brackets (...).
It looks only in one line for a pattern
So how can I get just the string 13.05.2013 17:51?
It's not quite there, it prints a leading newline for some reason, but maybe something like this?
$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\s*(.*?)\s*<\/td>/gs' regex.txt
13.05.2013 17:51
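The leading blank line in the sed attempt comes from the opening-tag line: the substitution empties it, but p still prints the now-empty line. A sketch that deletes both tag lines instead, assuming the tags sit on their own lines as in the sample (extract_time is a hypothetical name):

```shell
extract_time() {
  # Within the range from the opening tag to the next </td>, drop the two
  # tag lines and print whatever lies between them.
  sed -n '/<td class="time">/,/<\/td>/{
    /<td class="time">/d
    /<\/td>/d
    p
  }' "$@"
}

# Usage: extract_time file
```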
How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p;}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively pre-process the file to strip all the newlines, maybe collapse the whitespace too (can't be done with sed apparently) and then go from there.
However, if it's an HTML file from somewhere else and you can't be sure of the format I'd consider using some other scripting language that has a library to parse XML, otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file

Sed grabbing tags & newlines (Mac OSX)

I have this text where I need to remove the page numbers:
<p class="p3">El gabinete se iba iluminando lentamente ... Por delante de las</p>
<p class="p5"><span class="s4"><i>32</i></span> grandes nubes de un color violeta obscuro...</p>
<p class="p3">
I need to remove
</p>
<p class="p5"><span class="s4"><i>32</i></span>
from it.
So far I have this
sed -E -i '' 's/</p>\n<p class="p[0-9]+"[^>]*><span class=".+">.+<\/span> / /g' Capítulo1.html
But that is not working. It works without the </p>\n part, but I really need to capture and replace the </p> too.
Note this is on Mac and sed seems to be a bit different from Linux.
Also, the paragraph classes can be anything starting with p followed by a number, and similarly the span class is s followed by a number. The italic tags may or may not be there, and the page number sits between them.
Unless the newlines really matter, you could try stripping them out first:
tr -d '\n' | sed ...
You missed escaping the forward slash of the closing paragraph tag. Try this:
's/<\/p>\r?\n<p class="p\d+"[^>]*><span class=".+">.+<\/span> / /g' Capítulo1.html
For a more complete match as you've described, try this:
's/<\/p>\r?\n<p class="p\d+"[^>]*?><span class="s\d+">(<i>)?\d+(<\/i>)?<\/span>/ /g' Capítulo1.html
This more specifically narrows down the span class matching, and adds non-greediness to stop any unexpected surprises when a huge chunk of data is removed between a span opening tag and the furthest matching span closing tag.

sed solaris 5.10

Hi, I am trying to write a script to parse some HTML files to make a job a bit easier, but I'm having no luck. I've tried reading other threads and manuals, to no avail. I seem to get stuck with parentheses.
I want to replace all appearances of:
$FORMTOP("2")$ with $FORMTOP("3")$
$WHITE*("5")$ with $WHITE*("10")$
</b> with </strong>
<tr><td with <tr> newline, tab <td
delete occurrences of <td></td>
In sed you will have to put a new line (put a "\" and hit enter) and tab spaces (press spacebar 8 times) manually in the replacement section.
[jaypal#MBP-13~/temp] sed 's/<tr><td/<tr>\
<td/g' test123
<tr>
<td
<tr>
<td
I can't say for certain that this will work on Solaris, as I don't have it available anymore, but I'm using Sun-Solaris std sed commands with nothing fancy, I think this should work.
{
cat <<-EOS
\$FORMTOP("2")$
\$WHITE*("5")$
</b>
<tr><td
EOS
} |sed '
s/\$FORMTOP("2")\$/\$FORMTOP("3")\$/g
s/\$WHITE\*("5")\$/\$WHITE\*("10")\$/g
s/<\/b>/<\/strong>/g
/<tr><td/{
s/<td//
a\
<td
}
'
#output
$FORMTOP("3")$
$WHITE*("10")$
</strong>
<tr>
<td
For this test harness, using { cat <<-EOS ... EOS }, I had to escape the '$' characters that were being interpreted as env vars by the shell. If you put the test data in a file, be sure to remove the '\'s in front of the '$'s.
EDIT Also, stuff that looks indented in sed, is indented with spaces except for the char just before your final <td.
Also, as you wrote 'I've tried reading other threads', you did find the S.O. number one post concerning fixing XML with sed, right?
I hope this helps.
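For reference, all five edits from the question, including the deletion of empty <td></td> pairs, can be sketched as one sed call. This assumes a POSIX-style sed and has not been checked on Solaris 5.10; fix_markup is a hypothetical name, and the tab is supplied through a shell variable rather than typed by hand.

```shell
# Hypothetical wrapper around the five edits listed in the question.
fix_markup() {
  tab=$(printf '\t')   # literal tab for the replacement text
  sed -e 's/\$FORMTOP("2")\$/$FORMTOP("3")$/g' \
      -e 's/\$WHITE\*("5")\$/$WHITE*("10")$/g' \
      -e 's/<\/b>/<\/strong>/g' \
      -e 's/<td><\/td>//g' \
      -e 's/<tr><td/<tr>\
'"$tab"'<td/g' "$@"
}

# Usage: fix_markup page.html > page.fixed.html
```

In the patterns, \$ and \* make the dollar sign and asterisk literal; parentheses are already literal in basic regular expressions, so they need no escaping. The backslash at the end of the last expression is what inserts the newline into the replacement.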

Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:
#using the '#' symbol as delimiter instead of '/'
#remove tags
s#<.*>\(.*\)</.*>#\1#g
#remove the nbsp
s#\(&nbsp;\)*##g
#add a newline before the address (actually typing a newline in the file)
s#\(123 street\)#\
\1#g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s#\(.*\)\n\(.*\)\n\(.*\)#\1 \2 \3#g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my#email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.
I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
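Concretely, when the /City/ block runs, N appends the next two lines in their raw form, and the tag-stripping command has already had its turn for that cycle. One way around this, sketched under the assumption that the input looks like the sample in the question, is to join the lines first and strip the tags afterwards (html_to_text is a hypothetical name):

```shell
html_to_text() {
  # Join the locality line with the next two raw lines first, then strip
  # tags, so the stripping sees everything N pulled into the pattern space.
  sed -e '/class="locality"/{
    N
    N
    s/\n/ /g
  }' -e 's/<[^>]*>//g' "$@"
}

# Usage: html_to_text file.html > output.txt
```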
See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
If you have only one data block per php file, try the following (using sed)
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g;}'
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000

One more greedy sed question

I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever it may be, but it is wrong and I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is: how do I get rid of this corner case? That is, how do I make the match non-greedy so that it stops at the FIRST .jpg?
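Use perl, which supports non-greedy quantifiers. Note that in Perl syntax the capture group uses unescaped parentheses and the literal question mark must be escaped:
perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm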
You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches any character except a double quote.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
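The difference between the two patterns can be seen side by side on the problem line. A small sketch, using a shortened copy of that line (the "bla bla" attributes are dropped, which is an assumption about what matters for the match):

```shell
line='<td width="25%"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>'

# Greedy .* runs on to the last "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\(.*jpg\).*$/\1/p')

# [^"]* cannot cross the closing quote, so the match ends at the first file name.
bounded=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p')

printf '%s\n' "$greedy" "$bounded"
```

The bounded form relies on the file name being enclosed in double quotes; an unquoted attribute value would need a different delimiter in the bracket expression.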
GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm