Hi, I am trying to write a script to parse some HTML files to make a job a bit easier, but I'm having no luck. I've tried reading other threads and manuals, to no avail. I seem to get stuck on the round brackets (parentheses).
I want to replace all appearances of:
$FORMTOP("2")$ with $FORMTOP("3")$
$WHITE*("5")$ with $WHITE*("10")$
</b> with </strong>
<tr><td with <tr> newline, tab <td
delete occurrences of <td></td>
In sed you will have to put a new line (put a "\" and hit enter) and tab spaces (press spacebar 8 times) manually in the replacement section.
[jaypal#MBP-13~/temp] sed 's/<tr><td/<tr>\
<td/g' test123
<tr>
<td
<tr>
<td
I can't say for certain that this will work on Solaris, as I don't have it available anymore, but I'm using Sun-Solaris std sed commands with nothing fancy, I think this should work.
{
cat <<-EOS
\$FORMTOP("2")$
\$WHITE*("5")$
</b>
<tr><td
EOS
} |sed '
s/\$FORMTOP("2")\$/\$FORMTOP("3")\$/g
s/\$WHITE\*("5")\$/\$WHITE\*("10")\$/g
s/<\/b>/<\/strong>/g
/<tr><td/{
s/<td//
a\
<td
}
'
#output
$FORMTOP("3")$
$WHITE*("10")$
</strong>
<tr>
<td
For this testing harness, using { cat <<-EOS ... EOS }, I had to escape the '$' characters that were being interpreted as env vars by the shell. If you put the test data in a file, be sure to remove the '\'s in front of the '$'s.
EDIT Also, stuff that looks indented in sed, is indented with spaces except for the char just before your final <td.
Also, as you wrote 'I've tried reading other threads',you did find the S.O. number one post concerning fixing XML with sed, right?
I hope this helps.
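The question also asks to delete <td></td>, which the script above does not cover. A minimal sketch of all five edits in one sed call might look like the following. The function name fix_html and the sample input are made up for illustration; the tab is generated with printf so it survives copy and paste, and quoting the here-document delimiter ('EOS') removes the need to escape '$' in the test data.

```shell
# fix_html: hypothetical helper name; reads HTML on stdin, writes the result to stdout.
fix_html() {
  tab=$(printf '\t')    # a literal tab for the replacement text
  sed -e 's/\$FORMTOP("2")\$/$FORMTOP("3")$/g' \
      -e 's/\$WHITE\*("5")\$/$WHITE*("10")$/g' \
      -e 's/<\/b>/<\/strong>/g' \
      -e 's/<td><\/td>//g' \
      -e 's/<tr><td/<tr>\
'"$tab"'<td/g'
}

# Quoting the delimiter stops the shell from expanding '$' inside the here-document.
fix_html <<'EOS'
$FORMTOP("2")$
$WHITE*("5")$
</b>
<tr><td>cell</td><td></td>
EOS
```

The empty cells are deleted before the <tr><td split so that a row consisting only of <tr><td></td> does not leave a dangling tab behind.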
I'm trying to parse the page https://extensions.typo3.org/extension/tt_news/ for version numbers and corresponding dates with sed or grep.
More specifically, I'm interested in the following html section:
<tr>
<td class="align-middle">
<strong>3.6.0</strong> /
<span class="ter-ext-state-beta">beta</span>
<br />
<small>
April 06, 2014
</small>
</td>
<td class="align-middle">
tt_news for TYPO3 4.5 - 6.2 (compatibility update)
</td>
<td class="align-middle">
<strong>4.5.0 - 6.2.99</strong>
</td>
<td class="align-middle">
<a class="btn btn-primary" title="Size: 2.58MB" href="/extension/download/tt_news/3.6.0/zip/">
<strong>
Download ZIP Archive
</strong>
</a>
</td>
</tr>
I would like to get from each of these sections the version (between the strong tag) and the date (between the small tag).
All my attempts have failed so far and I can narrow down the problem to something very easy.
I have tested the following regex, which only tries to match a tr tag followed by whitespace and a td tag, on regex101.com, and there it works perfectly fine:
<tr>\s*<td
It gives me 5 matches which is correct. The following one also works fine:
<tr[^>]*>\s*<td
It produces 38 results because it includes those tr tags with a css class attribute.
However, I can't get this to work with either grep or sed. As soon as I include the \s there aren't any matches anymore. Here is what it looks like:
cat tt_news_history | grep '<tr>\s*<td'
no hits.
cat tt_news_history | grep '<tr>'
6 hits.
cat tt_news_history | grep '<tr[^>]*>'
lots of hits (didn't count). Same thing with sed.
What am I doing wrong? Why can't I use a \s?
Thanks for any hint.
There is a -z option for the GNU grep that makes \s match newlines in the input, eg:
cat tt_news_history | grep -z '<tr>\s*<td'
The relevant fragments from the info documentation:
‘-z’ ‘--null-data’
Treat input and output data as sequences of lines, each terminated
by a zero byte (the ASCII NUL character) instead of a newline.
Like the ‘-Z’ or ‘--null’ option, this option can be used with
commands like ‘sort -z’ to process arbitrary file names.
(...)
How can I match across lines?
Standard grep cannot do this, as it is fundamentally line-based.
Therefore, merely using the ‘[:space:]’ character class does not match
newlines in the way you might expect.
With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input “line” is
terminated by a null byte; *note Other Options::. Thus, you can match
newlines in the input, but typically if there is a match the entire
input is output, so this usage is often combined with
output-suppressing options like ‘-q’, e.g.:
printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'
If this does not suffice, you can transform the input before giving it
to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other utilities
that are designed to operate across lines.
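The quoted documentation suggests turning to awk when -z is not enough. A minimal sketch of that route, assuming the layout shown in the question (the version is the first <strong> element of a row and the date sits on the line after <small>); the function name extract_versions is made up for illustration:

```shell
# extract_versions: hypothetical helper; prints "version<TAB>date" for each table row.
extract_versions() {
  awk '
    /<tr[ >]/ { want_version = 1 }                 # a new row starts
    want_version && match($0, /<strong>[^<]*<\/strong>/) {
      # keep the text between <strong> (8 chars) and </strong> (9 chars)
      version = substr($0, RSTART + 8, RLENGTH - 17)
      want_version = 0                             # ignore later <strong> tags in this row
    }
    in_small {                                     # the line right after <small>
      gsub(/^[ \t]+|[ \t]+$/, "")                  # trim surrounding blanks
      print version "\t" $0
      in_small = 0
      next
    }
    /<small>/ { in_small = 1 }
  ' "$@"
}

# Usage (file name taken from the question):
# extract_versions tt_news_history
```

Because awk keeps state between lines, no regex has to span a newline, which sidesteps the \s problem entirely. It will break if the page layout changes, so treat it as a starting point only.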
I have an HTML document which includes links to a hundred or so local files. I want to use either sed, awk or perl (in that preferred order) to remove the filename portion of the URL, that is, everything after the last forward slash. In the example below I'm only showing the portion of the HTML code that forms the path of the local file.
Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
After Processing Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
In testing I have tried different regular expression combinations to accomplish this however I'm only getting ".dmg" or it and everything to the left of .dmg and I really only want to remove the "SoftwarePackageName.dmg" portion. BTW In some cases it's "SoftwarePackageName.zip" and there may be a space in the "CompanyName" or "SoftwarePackageName.dmg" shown as "%20". I've also reviewed "Questions that may already have your answer" shown when creating this post.
EDIT: I appreciate the time taken to try and help, and I certainly understand the difficulty when, due to policy, I cannot provide more than the example I did. As such, I'll just manually edit the HTML document. I've already taken too much of my time and others' on this. I will just have to do more reading on regex for the next time. Thanks to all that contributed. :)
You could try the below sed command.
sed 's/\(<a href="[^."]*\/\)[^."\/]*\.[^."\/]*">/\1">/g' file
modded
I've deleted the previous sed regex (I have no way to test it).
Instead, I'm posting an expanded (verbose) regex that should help get you started.
# Unknown extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2
# Known extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2
# Replacement: $1$2
( # (1 start), Tag and Url part to keep
<a \s+ [^>]*? href \s* = \s*
( ["'] ) # (2), Quote
[^>]*?
/ # End of directories
) # (1 end)
( # (3 start), Throw away filename
[^/."'>]+ # - Filename (not /."'> chars)
\. # - Dot
# - Extension and parameters
# ----------------------------
# Use one of these lines (but not both)
# Known extensions ->
#dmg \b [^/."'>]*
# Unknown extensions ->
[^/."'>]+
) # (3 end)
\2 # Backref to Quote
Sed should not need a very different substitution structure (s///g).
You may have to escape the parenthesis metacharacters, and since sed has no lazy quantifiers such as *?, those parts would need rewriting with negated character classes.
These regexes are shown in their raw state.
Here they are used in a sample Perl program. It could easily be done using Perl from the command line.
use strict;
use warnings;
$/ = undef;
my $html = <DATA>; # slurp in the entire file
my $htmlcopy = $html;
$html =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2|$1$2|g;
print "Replaced using Unknown extensions:\n", $html, "\n";
$htmlcopy =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2|$1$2|g;
print "Replace using Known extensions:\n", $htmlcopy, "\n\n";
__DATA__
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Output >>
Replaced using Unknown extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/">
<a rel="nofollow" class="external free" href="http://www.ielts.org/">
<a href="/w/" title="IELTS">
<a href="/wiki/" class="image">
<a href="/w/" title="Edit section: IELTS characteristics">
<a href="/w/" class="new" title="Band score (page does not exist)">
<a href="/w/" title="Edit section: IELTS test structure">
<a href="/wiki/" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Replace using Known extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Try this:
sed 's|\(<a href="file:///[^>]*/\).*">|\1">|g'
Demo:
$ sed 's|\(<a href="file:///[^>]*/\).*\.\(dmg\|zip\)">|\1">|g' <<EOF
> <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
> foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg"> baz quux
> EOF
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/"> baz quux
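A variant that does not depend on the file extension, offered as a sketch only: it keeps everything up to the last slash inside the quoted file:/// URL and drops whatever follows, using only basic regular expressions. The function name strip_filename is made up for illustration, and it assumes the URL is wrapped in double quotes on a single line, as in the example.

```shell
# strip_filename: hypothetical helper; removes the last path component of file:/// hrefs.
strip_filename() {
  # \( ... \) keeps the part up to the last "/" before the closing quote;
  # [^"/]* then consumes the file name, whatever its extension.
  sed 's|\(href="file:///[^"]*/\)[^"/]*"|\1"|g' "$@"
}

# Usage (file name is an example):
# strip_filename Build_Out_Template.html
```

A URL that already ends in a slash is left unchanged, because the file-name part is allowed to be empty.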
First I want to say again how much I truly appreciate the time taken by those who tried to help! Secondly, it pains me to have to say that nothing that was presented worked in the real-world application and I, at the least, attribute that to you all not having the actual file I wanted to modify to work with and sorry I was not allowed to provide it. Yes your demos worked however unfortunately they were not at all representative of the actual html coding in the document and maybe because the "Generator" was "Cocoa HTML Writer" from a RTF Document this might have had something to do with it, just not sure at the moment. Even if I took just one complete line that included the Example code, placing it by itself in a file and then processing it nonetheless all solutions presented failed. I wish I could provide the file or take the time to figure out why in this real-world use it fails, however I can't.
Some background on the document: when it was originally created as an RTF document in TextEdit, the FQP of the target file was included because that version of OS X would open the target file, whereas later versions of OS X only open Finder to the location of the target file. As such there is no longer a need to use the FQP to the target file, just the path to its location. This actually makes it easier to update the RTF document over time. At times this RTF document is exported to an HTML document to be modified and then saved as an RTF document. As I mentioned earlier, maybe the "Generator" being "Cocoa HTML Writer" from an RTF document in TextEdit is in part to blame for why processing failed with the proposed solutions.
Anyway the reason for my long winded reply is to put this issue in proper perspective and also explain how I resolved this issue. As I had previously mentioned I was just going to manually edit the file however after the generous help already presented I wanted to find some automated solutions and I did.
The primary constant was the Example code previously presented so focusing only on it here is the command line I used to process the file.
grep -o 'file:///[^"]*' Build_Out_Template.html | rev | cut -d / -f 1 | rev | while read LINE; do sed -i "s/${LINE}//" Build_Out_Template.html; done
Using "grep -o 'file:///[^"]*'" enabled me to extract just the target portion of the lines in the document. I piped it through rev to reverse the character order and then through cut, which gave me only the portion up to the first slash in the reversed line (that is, after the last slash in the original line), and then had to pipe it through rev again to restore the original character order. It was then piped through a loop where sed used a very simple instruction instead of a complex regex, using literally just the file name (SoftwarePackageName.dmg etc.). While much more time was spent on this than on manually editing the file, I took it as a challenge, and I'll remember that sometimes the outside-the-box solution is faster and easier if I need it for some other application.
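As a side note, the rev | cut | rev step can be sketched with a single sed expression that deletes everything up to and including the last slash. The function name last_component is made up for illustration.

```shell
# last_component: hypothetical helper; prints what follows the last "/" on each line.
last_component() {
  sed 's|.*/||'
}

# Usage, mirroring the first half of the pipeline above:
# grep -o 'file:///[^"]*' Build_Out_Template.html | last_component
```

One caveat with the loop itself: the extracted name is used as a sed regex, so a "." in it matches any character. For names like SoftwarePackageName.dmg that is unlikely to matter in practice.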
Thanks again to all who tried to help, it's truly appreciated.
I have a file with the contents:
aaa.bbb.ccc ddd.eee.fff.ggg h.i.j.k
If I use the code:
awk '{sub(/\.$/, ""); print $1}' test.txt
returns aaa.bbb.ccc
awk '{sub(/\.$/, ""); print $3}' test.txt
Returns: h.i.j.k
I understand the sub function is used as: sub(regexp, replacement, target)
I dont understand this part .$/, from the sub function.
what is the .$?
thanks
UPDATE
Ok, i like your way of explaining things - thank you!
If i apply this to a real example,
/usr/bin/host 172.0.0.10
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
/usr/bin/host 172.0.0.10 | /bin/awk '{sub(/.$/, ""); print $5}'
gives: hostname.domain.com
/usr/bin/host 172.0.0.10| /bin/awk '{sub(/.$/, ""); print $1}'
gives: 10.0.0.172.in-addr.arpa
-The sub function will match to the end of the line as there is a "."
-what is the "" doing?
-I dont understand how awk is splitting things into columns?
sub(/regexp/, replacement, target)
sub(/\.$/, replacement, target)
Your regexp is \.$, not .$/
\ is the escape character. It escapes the character that follows it, thus stripping it from the regex meaning and processing it literally.
. in regex matches any single character. Unless it's escaped by \ like in your example, thus it just matches the dot character .
$ simply means the end of the line.
Putting this together, \.$ is an escaped dot at the end of the line. This would match for example any end of paragraph that ends in a period.
In your example, the sub doesn't substitute anything because there is no . at the end of the line (your input ends with .k). So your first awk just prints the 1st column, and the other one prints the 3rd column.
Update
For your updated question.
Awk splits a string in columns by whitespace by default. Thus in your input, columns are like this:
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
|----------$1-----------|--$2--|-$3-|--$4---|----------$5--------|
In your sub command, awk finds the dot at the end of the line and replaces it with "", which is the empty string (i.e. it just deletes it).
So your 1st command - {sub(/.$/, ""); print $5}, it prints the 5th column which is hostname.domain.com. after it replaces the . at the end with nothing (deletes it). It's worth noting that in this regex you don't escape the . anymore, so the pattern just matches any character at the end and deletes it (it happens to be a . in your input)
Your other command - {sub(/.$/, ""); print $1} deletes the character at the very end of the line and then just prints the first column 10.0.0.172.in-addr.arpa
You can also set custom column separators in awk, I recommend you read some introduction and tutorials on awk to have a better understanding of how it works. E.g. simple awk tutorial
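A small sketch of a custom separator, using awk's -F option to split on dots instead of whitespace. The function name fields_by_dot is made up for illustration.

```shell
# fields_by_dot: hypothetical helper; splits each input line on "." instead of whitespace.
fields_by_dot() {
  awk -F. '{ print $1 " | " $2 " | " $3 }'
}

printf '%s\n' 'hostname.domain.com.' | fields_by_dot
```

With "." as the separator, $1 is hostname, $2 is domain and $3 is com.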
sub(regexp, replacement, target)
So here we used the regex \.$, which matches the dot at the end. In sub(/\.$/, "") we didn't mention the target, so it takes $0, i.e. the whole line. If you specify a target, it removes the trailing dot only in that particular column.
awk '{sub(/\.$/, ""); print $1}' test.txt
Removes a dot only if it is present at the end of the line, and prints only column 1. If there is no dot at the end, no replacement occurs.
awk '{sub(/\.$/, ""); print $3}' test.txt
Removes the dot at the end of the line and prints only column 3. Because there is no dot at the end, it returns the third (last) column as it is.
Example:
$ cat file
aaa.bbb.ccc. ddd.eee.fff.ggg h.i.j.k.
$ awk '{sub(/\.$/, ""); print $1}' file
aaa.bbb.ccc.
$ awk '{sub(/\.$/, ""); print $3}' file
h.i.j.k
I had one table with this format
<table width="700" border="1" align="center" cellpadding="0" cellspacing="0" bordercolor="ffcc00" bgcolor="ffcc00">
<tbody>
<th colspan="7" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style2">
Exciter Power Supply</div>
</th>
</tr>
<tr>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">+ 3 V </div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1"> + 5 V</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
</tr>
</tbody>
When I get the value of + 3 V with:
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; print}'
I had the output OK' '; + 5 V
To remove the blank space and the text of the other field, I use sub() to replace the characters, plus tr to remove the remaining ones:
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; sub(/+ 5 V/, ""); print}' | tr " " " "
My output is only OK
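A sketch of the same idea that does not depend on the label and its status sharing a line: strip the tags, then print the first word of the first non-blank line after the label. The function name status_after is made up for illustration, and the replacement of &nbsp; is an assumption about what the real page contains. The label is compared with index() rather than a regex, so the leading "+" needs no escaping.

```shell
# status_after: hypothetical helper; $1 is the label, HTML is read from stdin.
status_after() {
  sed -e 's/<[^>]*>//g' -e 's/&nbsp;/ /g' |
  awk -v label="$1" '
    found && NF { print $1; exit }     # first non-blank line after the label
    index($0, label) { found = 1 }     # plain substring test, no regex
  '
}

# Usage (URL taken from the question):
# curl -s http://my-site/index.htm | status_after "+ 3 V"
```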
I want to write a bash script that finds a pattern in an HTML file where the pattern spans multiple lines.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of <td>-tag with the class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep seems not to be the command I can use, because...
It only returns the complete line or the complete result using -o and not only the result inside the round brackets (...).
It looks only in one line for a pattern
So how can I get just the string 13.05.2013 17:51?
It's not quite there: it prints a leading blank line, because the s command empties the opening-tag line but p still prints it. But maybe something like this?
$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\n (.*?)\n<\/td>/gs' regex.txt
13.05.2013 17:51
How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively pre-process the file to strip all the newlines, maybe collapse the whitespace too (can't be done with sed apparently) and then go from there.
However, if it's an HTML file from somewhere else and you can't be sure of the format I'd consider using some other scripting language that has a library to parse XML, otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file
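The sed range approach from earlier in this thread can also be written so that the tag lines are deleted rather than blanked, which avoids the stray empty line. This is a sketch under the same assumption as the question's sample (opening tag, value and closing tag each on their own line); the function name td_time is made up for illustration.

```shell
# td_time: hypothetical helper; prints the text inside <td class="time"> ... </td>.
td_time() {
  sed -n '/<td class="time">/,/<\/td>/{
    /<td class="time">/d
    /<\/td>/d
    s/^[[:space:]]*//
    p
  }' "$@"
}

# Usage (file name is an example):
# td_time regex.txt
```

Each d command ends processing of that line, so only the lines between the two tags reach p, with their leading indentation removed.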
I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever it may be, but it is wrong and I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop on the FIRST .jpg?
use perl:
perl -pe 's/^.*htm\?(.*?jpg).*$/$1/'
You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches any character except a double quote, so the match cannot run past the end of the first href.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm
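To see the negated character class at work on the corner case, here is a small sketch. The function name first_jpg is made up for illustration, and the sample line in the usage comment is abbreviated from the question.

```shell
# first_jpg: hypothetical helper; prints the image name that follows "htm?" on each matching line.
first_jpg() {
  # [^"]* cannot cross the closing quote of the first href, so the match
  # stops at the first .jpg even though sed has no non-greedy quantifier.
  sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p' "$@"
}

# Usage (file name taken from the question):
# first_jpg concept.htm
```

In a basic regular expression the "?" is an ordinary character, which is why it needs no backslash here.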