AWK sub function syntax

AWK sub function syntax - regex

I have a file with the contents:
aaa.bbb.ccc ddd.eee.fff.ggg h.i.j.k
If I use the code:
awk '{sub(/\.$/, ""); print $1}' test.txt
returns aaa.bbb.ccc
awk '{sub(/\.$/, ""); print $3}' test.txt
Returns: h.i.j.k
I understand the sub function is used as: sub(regexp, replacement, target)
I don't understand this part, .$/, from the sub function.
What is the .$?
Thanks
UPDATE
Ok, i like your way of explaining things - thank you!
If i apply this to a real example,
/usr/bin/host 172.0.0.10
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
/usr/bin/host 172.0.0.10 | /bin/awk '{sub(/.$/, ""); print $5}'
gives: hostname.domain.com
/usr/bin/host 172.0.0.10| /bin/awk '{sub(/.$/, ""); print $1}'
gives: 10.0.0.172.in-addr.arpa
-The sub function will match at the end of the line, as there is a "." there.
-What is the "" doing?
-I don't understand how awk is splitting things into columns.

sub(/regexp/, replacement, target)
sub(/\.$/, replacement, target)
Your regexp is \.$, not .$/
\ is the escape character. It escapes the character that follows it, stripping that character of its special regex meaning so that it is matched literally.
. in regex matches any single character, unless it is escaped by \ as in your example, in which case it matches only a literal dot.
$ simply means the end of the line.
Putting this together, \.$ is an escaped dot at the end of the line. This would match, for example, any paragraph that ends in a period.
In your example, the sub doesn't substitute anything because there is no . at the end of the line (your input ends with .k). So your first awk just prints the 1st column, and the other one prints the 3rd column.
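To see the difference between the escaped and the unescaped dot, here is a small sketch that runs both patterns on the sample line from the question. The variable names are only for illustration.

```shell
# Sample line from the question; note that it does not end in a dot.
line='aaa.bbb.ccc ddd.eee.fff.ggg h.i.j.k'

# Escaped dot: only a literal "." at the end of the line is removed,
# so this line comes back unchanged.
escaped=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, ""); print }')

# Unescaped dot: whatever character is last gets removed, here the "k".
unescaped=$(printf '%s\n' "$line" | awk '{ sub(/.$/, ""); print }')

printf '%s\n%s\n' "$escaped" "$unescaped"
```

The first printed line is identical to the input; the second ends in "h.i.j." because the final k was deleted.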
Update
For your updated question.
Awk splits a string in columns by whitespace by default. Thus in your input, columns are like this:
10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.
|----------$1-----------|--$2--|-$3-|--$4---|----------$5--------|
In your sub command, awk finds the dot at the end of the line and replaces it with "", the empty string (i.e. it just deletes it).
Your 1st command, {sub(/.$/, ""); print $5}, prints the 5th column, which is hostname.domain.com once the . at the end has been replaced with nothing (deleted). Note that in this regex you no longer escape the ., so the pattern matches any character at the end of the line and deletes it (it happens to be a . in your input).
Your other command, {sub(/.$/, ""); print $1}, deletes the character at the very end of the line and then just prints the first column, 10.0.0.172.in-addr.arpa.
You can also set custom column separators in awk. I recommend you read an introduction or tutorial on awk to get a better understanding of how it works, e.g. a simple awk tutorial.
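As a rough illustration of field splitting, the sketch below counts the default whitespace-separated fields of the host output and then uses -F to split on a different separator. The colon-separated record is made up for the example and is not from the question.

```shell
rec='10.0.0.172.in-addr.arpa domain name pointer hostname.domain.com.'

# Default: fields are separated by runs of whitespace.
field_count=$(printf '%s\n' "$rec" | awk '{ print NF }')
last_field=$(printf '%s\n' "$rec" | awk '{ print $NF }')

# Custom separator: -F sets FS, here to ":" (hypothetical passwd-style record).
name_and_shell=$(printf '%s\n' 'alice:x:1000:1000::/home/alice:/bin/sh' |
    awk -F: '{ print $1, $7 }')

printf '%s\n%s\n%s\n' "$field_count" "$last_field" "$name_and_shell"
```

NF holds the number of fields in the current line and $NF is the last field, so the first two lines printed are 5 and hostname.domain.com. (with its trailing dot, since no sub was applied).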

sub(regexp, replacement, target)
Here we used the regex \.$, which matches a dot at the end. In sub(/\.$/, "") we didn't specify a target, so it defaults to $0, i.e. the whole line. If you specify a target such as $1, the trailing dot is removed only from that field.
awk '{sub(/\.$/, ""); print $1}' test.txt
Removes a dot only if it is the last character of the line, then prints column 1. If the line does not end in a dot, no replacement occurs.
awk '{sub(/\.$/, ""); print $3}' test.txt
Removes the dot at the end of the line and prints column 3. Because there is no dot at the end of the line, it returns the third (last) column as it is.
Example:
$ cat file
aaa.bbb.ccc. ddd.eee.fff.ggg h.i.j.k.
$ awk '{sub(/\.$/, ""); print $1}' file
aaa.bbb.ccc.
$ awk '{sub(/\.$/, ""); print $3}' file
h.i.j.k
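The effect of the optional third argument can be sketched on the same sample line. This is only an illustration of the target parameter; the variable names are arbitrary.

```shell
line='aaa.bbb.ccc. ddd.eee.fff.ggg h.i.j.k.'

# No target: sub() works on $0, so only the dot that ends the line goes away.
whole=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, ""); print }')

# Target $1: the $ anchor now refers to the end of the first field instead.
first=$(printf '%s\n' "$line" | awk '{ sub(/\.$/, "", $1); print }')

printf '%s\n%s\n' "$whole" "$first"
```

In the first result the line ends in h.i.j.k while aaa.bbb.ccc. keeps its dot; in the second it is the other way round.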

I had one table with this format
<table width="700" border="1" align="center" cellpadding="0" cellspacing="0" bordercolor="ffcc00" bgcolor="ffcc00">
<tbody>
<th colspan="7" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style2">
Exciter Power Supply</div>
</th>
</tr>
<tr>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">+ 3 V </div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1"> + 5 V</div>
</th>
<th width="175" bordercolor="ffcc00" bgcolor="000000" scope="col">
<div align="center" class="style1">
OK</div>
</th>
</tr>
</tbody>
When I get the value of + 3 V with
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; print}'
I get the output OK&nbsp; + 5 V
To remove the blank space and the text of the other field, I use sub() to replace characters, plus tr to remove the remaining characters:
curl -s http://my-site/index.htm | sed -e 's/<[^>]*>//g' | awk '/+ 3 V/{getline; sub(/+ 5 V/, ""); print}' | tr " " " "
My output is only OK

Related

Regex doesn't produce match when containing a new line

I'm trying to parse the page https://extensions.typo3.org/extension/tt_news/ for version numbers and corresponding dates with sed or grep.
More specifically, I'm interested in the following html section:
<tr>
<td class="align-middle">
<strong>3.6.0</strong> /
<span class="ter-ext-state-beta">beta</span>
<br />
<small>
April 06, 2014
</small>
</td>
<td class="align-middle">
tt_news for TYPO3 4.5 - 6.2 (compatibility update)
</td>
<td class="align-middle">
<strong>4.5.0 - 6.2.99</strong>
</td>
<td class="align-middle">
<a class="btn btn-primary" title="Size: 2.58MB" href="/extension/download/tt_news/3.6.0/zip/">
<strong>
Download ZIP Archive
</strong>
</a>
</td>
</tr>
I would like to get from each of these sections the version (between the strong tag) and the date (between the small tag).
All my attempts have failed so far and I can narrow down the problem to something very easy.
I tested the following regex, which only tries to match a tr tag followed by whitespace and a td tag, on regex101.com, and there it works perfectly fine:
<tr>\s*<td
It gives me 5 matches which is correct. The following one also works fine:
<tr[^>]*>\s*<td
It produces 38 results because it includes those tr tags with a css class attribute.
However, I can't get this to work with either grep or sed. As soon as I include the \s there are no matches anymore. Here is what it looks like:
cat tt_news_history | grep '<tr>\s*<td'
no hits.
cat tt_news_history | grep '<tr>'
6 hits.
cat tt_news_history | grep '<tr[^>]*>'
lots of hits (didn't count). Same thing with sed.
What am I doing wrong? Why can't I use a \s?
Thanks for any hint.

GNU grep has a -z option that treats the input as NUL-terminated records instead of lines, which lets \s match newlines in the input, e.g.:
cat tt_news_history | grep -z '<tr>\s*<td'
The relevant fragments from the info documentation:
‘-z’ ‘--null-data’
Treat input and output data as sequences of lines, each terminated
by a zero byte (the ASCII NUL character) instead of a newline.
Like the ‘-Z’ or ‘--null’ option, this option can be used with
commands like ‘sort -z’ to process arbitrary file names.
(...)
How can I match across lines?
Standard grep cannot do this, as it is fundamentally line-based.
Therefore, merely using the ‘[:space:]’ character class does not match
newlines in the way you might expect.
With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input “line” is
terminated by a null byte; *note Other Options::. Thus, you can match
newlines in the input, but typically if there is a match the entire
input is output, so this usage is often combined with
output-suppressing options like ‘-q’, e.g.:
printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'
If this does not suffice, you can transform the input before giving it
to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other utilities
that are designed to operate across lines.
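A minimal sketch of the difference, assuming GNU grep (both -z and \s are GNU extensions) and a made-up two-line fragment in place of the real page:

```shell
# Hypothetical two-line fragment standing in for tt_news_history.
snippet='<tr>
  <td class="align-middle">'

# Line by line, <tr> and <td never share a line, so nothing matches.
# grep exits non-zero when it finds nothing, hence the "|| true".
per_line=$(printf '%s\n' "$snippet" | grep -c '<tr>\s*<td' || true)

# With -z the whole input is one record, so \s* can span the newline.
whole_input=$(printf '%s\n' "$snippet" | grep -zc '<tr>\s*<td' || true)

printf 'without -z: %s, with -z: %s\n' "$per_line" "$whole_input"
```

The -c option prints the number of matching records, which makes the comparison easy to read without dumping the whole input.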

SED Regex rewriting between two patterns using backreferences

I have text in many html files that is in this format:
<!-- BEGIN FOOTER -->
<div id="footer">
<p align="right"> Slogan<br />
5555 Street East <br />
City, State 99999 <br />
Call Us (555)555-5555 <br />
</p>
<div align="center">
<a class="footer" href="http://www.example.com" title="Site">Site</a>
</div>
<br>
</div>
<!--END FOOTER-->
I am using this:
sed -E -i 's/(<!-- BEGIN FOOTER -->)(.|\n)*(<!--END FOOTER-->)/\1 <br>REPLACE<br> \3 /m' file.html
but it fails to capture and backreference these:
\1 <!-- BEGIN FOOTER -->
\3 <!--END FOOTER-->
And insert this in between them:
REPLACE
So trying to end up with this:
<!-- BEGIN FOOTER -->
<br>REPLACE<br>
<!--END FOOTER-->

This might work for you (GNU sed):
sed '/<!-- BEGIN FOOTER -->/{:a;N;/<!--END FOOTER-->/!ba;s/\n.*\n/\n<br>REPLACE<br>\n/}' file
This gathers up lines between footers and replaces the lines between with the required string.
An alternative (similar to revo);
sed '/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/!b;/<!-- BEGIN FOOTER -->/b;/<!--END FOOTER-->/!d;i\<br>REPLACE<br>' file
Another way:
sed '/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/!b;/<!--END FOOTER-->/p' file | sed '/<!-- BEGIN FOOTER -->/p;/<!-- BEGIN FOOTER -->/,/<!--END FOOTER-->/c\<br>REPLACE<br>'

sed reads the input file one line at a time and writes the result to standard output. This means there is no \n character in the pattern space to match, except when certain commands (such as N) are used. You could use an address range along with the insert command i to achieve the desired output:
sed '/BEGIN FOOTER/,/END FOOTER/{ /END FOOTER/{i\<br>REPLACE<br>
b}; /BEGIN FOOTER/b; d;}' file
Watch the line break after the first line. i inserts text before the current line. b skips the remaining commands and starts a new cycle. /BEGIN FOOTER/,/END FOOTER/ defines the range of lines on which the other commands execute.
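The same range-plus-insert idea can be tried on a small made-up document. This sketch puts one command per line rather than using the one-liner form, which should make it less dependent on how a particular sed parses b followed by }. The sample text is an assumption, not the real footer.

```shell
# Hypothetical miniature page with a footer block.
sample='<p>before</p>
<!-- BEGIN FOOTER -->
<div id="footer">old footer</div>
<!--END FOOTER-->
<p>after</p>'

result=$(printf '%s\n' "$sample" | sed '
/BEGIN FOOTER/,/END FOOTER/{
/BEGIN FOOTER/b
/END FOOTER/!d
i\
<br>REPLACE<br>
}')

printf '%s\n' "$result"
```

Inside the range, the BEGIN line is passed through by b, every line that is not the END line is deleted, and only the END line reaches i, which prints the replacement just before it.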

You can first save the lines with BEGIN FOOTER and END FOOTER in beg and end variables:
beg=$(grep -n "BEGIN FOOTER" inputfile | cut -d: -f1)
end=$(grep -n "END FOOTER" inputfile| cut -d: -f1)
Then use sed's c\ command:
sed -i.bak "$((beg+1)),$((end-1))c\<br>REPLACE<br>" inputfile
Result:
<!-- BEGIN FOOTER -->
<br>REPLACE<br>
<!--END FOOTER-->
Alternatively use the single command:
sed "/BEGIN FOO/,/END FOO/c\<!--BEGIN FOOTER-->\n<br>REPLACE<br>\n<!--END FOOTER-->" inputfile

Find a regexp in awk

I have a file with a line like this:
<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;"> </span><span class="link" role="link"> </span></div>
The important bit I want to catch is the 159 in:
,6">159</div>
I can catch it fine with grep:
cat c |grep ',6\">[0-9]\+<'
Now, what I want to do, is actually catch the number itself (159) and print it out.
Note that the actual file I have has several of those lines. Ideally, only the numbers will print out.
I thought I could do it with awk:
cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '
But nope, nothing gets printed out.
Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?

This oneliner is an alternate way to do that (using an xpath expression which matches div elements containing a cellposition attribute value ending with ',6'):
# xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159

A pragmatic approach:
cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
-o causes grep to only report the matching part of each line.
awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
As for why your awk command didn't work:
awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
Even with that fixed, the command wouldn't work, because, by default, awk splits by whitespace, so $1 will simply report the 1st whitespace-separated token on each matching line, irrespective of the regular expression that caused the match.
The solution at the top even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:
cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'
The non-POSIX gensub() replaces regex matches and returns the replacement, while crucially also supporting backreferences, which the POSIX sub() and gsub() functions do not.
The above matches the entire line, then replaces it with the captured number only (via (escaped) backreference \1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.
While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.
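For reference, the POSIX-only route mentioned above can be sketched with match(), RSTART, RLENGTH and substr(). This is one possible shape rather than the only one, and the shortened sample line is made up for the illustration.

```shell
# Hypothetical, shortened stand-in for a line of the real file.
html='<div cellposition="15,2"><div cellposition="15,6">159</div></div><div cellposition="16,6">42</div>'

numbers=$(printf '%s\n' "$html" | awk '
{
    rest = $0
    # Repeatedly look for ,6"> followed by digits and a "<".
    while (match(rest, /,6">[0-9]+</)) {
        # Skip the 4-character prefix ,6"> and drop the trailing "<".
        print substr(rest, RSTART + 4, RLENGTH - 5)
        # Continue searching after the current match.
        rest = substr(rest, RSTART + RLENGTH)
    }
}')

printf '%s\n' "$numbers"
```

Because the loop keeps searching the remainder of the line, it also handles several matches on one line, which the gensub() one-liner does not.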

GREP data within multiple tags from cURL html

Getting rather desperate to understand how to get the data I want from a curl command.
I need a hand with generating a grep command to get the following html:
<title> timetable </t itle>< <h3>study table</h3> <p>< strong>biology <div> <table
width='100%' cellpadding='5' cellspacing='0'><tr><th colspan="3">Level 44 Building 1 <tr>
<td >monday</td> <td >1:30 – 2:30</td> <td >< a>Room number 22</a></td> <td > </td>
</tr> <tr><th colspan="2">body> </html>
I would like the output look like:
timetable
study table
Biology
Level 44 Building 1
Monday
1:30 - 2:30
Room Number 22
Currently I only know how to do a single grep such as :
grep 'href='

If you have GNU grep:
$ grep -Po '(?<=>) ?\K[^<&>]{2,}(?=<)' file
timetable
study table
biology
Level 44 Building 1
monday
1:30 – 2:30
Room number 22
Disclaimer: You should really use a proper parser for this.

Assuming your string is in the variable $data, you can:
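# Split on newlines only; the unquoted expansions below rely on this.
IFS=$'\n'
# Remove HTML entities such as &nbsp; (g: every occurrence, not just the first).
result=$(echo $data | sed 's/&[^;]*;//g')
# Turn each tag into a newline so every text fragment gets its own line
# (\n in the replacement requires GNU sed).
result=$(echo $result | sed 's/<[^>]*>/\n/g')
for string in $result; do
    # Skip fragments that are empty or contain only spaces.
    if [[ ! $string =~ ^\ *$ ]]; then
        echo "string=$string."
    fi
done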

sed solaris 5.10

Hi I am trying to write a script to parse some html files to make a job a bit easier, but I'm having no luck, I've tried reading other threads and manuals to no avail. I seem to get stuck with circular brackets.
I want to replace all appearances of:
$FORMTOP("2")$ with $FORMTOP("3")$
$WHITE*("5")$ with $WHITE*("10")$
</b> with </strong>
<tr><td with <tr> newline, tab <td
delete occurrences of <td></td>

In sed you will have to put a new line (put a "\" and hit enter) and tab spaces (press spacebar 8 times) manually in the replacement section.
[jaypal@MBP-13~/temp] sed 's/<tr><td/<tr>\
<td/g' test123
<tr>
<td
<tr>
<td

I can't say for certain that this will work on Solaris, as I don't have it available anymore, but I'm using Sun-Solaris std sed commands with nothing fancy, I think this should work.
{
cat <<-EOS
\$FORMTOP("2")$
\$WHITE*("5")$
</b>
<tr><td
EOS
} |sed '
s/\$FORMTOP("2")\$/\$FORMTOP("3")\$/g
s/\$WHITE\*("5")\$/\$WHITE\*("10")\$/g
s/<\/b>/\<\/strong>/g
/<tr><td/{
s/<td//
a\
<td
}
'
#output
$FORMTOP("3")$
$WHITE*("10")$
</strong>
<tr>
<td
For this testing harness, using { cat <<-EOS ... EOS }, I had to escape the '$' that were being interpreted as env vars by the shell. If you put the test data in a file, be sure to remove the '\'s in front of the '$'s.
EDIT Also, stuff that looks indented in sed, is indented with spaces except for the char just before your final <td.
Also, as you wrote 'I've tried reading other threads',you did find the S.O. number one post concerning fixing XML with sed, right?
I hope this helps.
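The question also asks to delete occurrences of <td></td>, which the harness above does not exercise. A minimal sketch, assuming the empty cell appears exactly as <td></td> with no attributes or whitespace inside; the sample row is made up, and like the rest of this answer it has not been checked on Solaris:

```shell
# Hypothetical table row containing two empty cells.
row='<tr><td>a</td><td></td><td>b</td><td></td></tr>'

# The slash in </td> must be escaped because / is the s command delimiter.
cleaned=$(printf '%s\n' "$row" | sed 's/<td><\/td>//g')

printf '%s\n' "$cleaned"
```

Only basic regular expression syntax is used here, so no GNU-specific features are involved.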
