Find a regexp in awk - regex

I have a file with a line like this:
<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;"> </span><span class="link" role="link"> </span></div>
The important bit I want to catch is the 159 in:
,6">159</div>
I can catch it fine with grep:
cat c |grep ',6\">[0-9]\+<'
Now, what I want to do, is actually catch the number itself (159) and print it out.
Note that the actual file I have has several of those lines. Ideally, only the numbers will print out.
I thought I could do it with awk:
cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '
But nope, nothing gets printed out.
Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?

This one-liner is an alternate way to do that (using an XPath expression which matches div elements whose cellposition attribute value ends with ',6'):
# xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159

A pragmatic approach:
cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
-o causes grep to only report the matching part of each line.
awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
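If your grep supports Perl-compatible regular expressions (GNU grep's -P option), a shorter sketch that skips the awk step entirely is to use \K so that -o only reports the digits:
grep -oP ',6">\K[0-9]+(?=<)' c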
As for why your awk command didn't work:
awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
Even with that fixed, the command wouldn't work, because, by default, awk splits fields by whitespace, so $1 simply reports the 1st whitespace-separated token on each matching line (here, the leading <div fragment), irrespective of the regular expression that caused the match.
The solution at the top even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:
cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'
The non-POSIX gensub() function performs the substitution and returns the resulting string, while crucially also supporting backreferences in the replacement, which the POSIX sub() and gsub() functions do not.
The above matches the entire line, then replaces it with the captured number only (via (escaped) backreference \1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.
While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
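For reference, a minimal sketch of such a POSIX-only approach (again assuming at most one match per line), using match() to locate the fragment and substr()/sub() to trim the delimiters:
awk 'match($0, /,6">[0-9]+</) {
  n = substr($0, RSTART, RLENGTH)    # e.g. ,6">159<
  sub(/^,6">/, "", n); sub(/<$/, "", n)
  print n
}' c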
Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.

Related

SED - remove attribute from HTML tag

I want to remove a specific attribute (name in my example) from the HTML tag; it might be in a different position on each line in my file
Example Input:
<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">
Expected output:
<img src="https://websiteurl.com/286.jpg" alt="img">
My code:
sed 's/name=".*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
but it only shows <img >; I am losing the src attribute as well
Notes:
name attribute might be in any position in a tag (not necessarily at the beginning)
you can use sed, awk, Perl, or anything you like, it should work on the command line
Your sed expression matches greedily: .* consumes everything up to the last " on the line. It should instead be:
sed 's/ name="[^"]*"//g'
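Applied to the sample line from the question, this prints the expected output:
sed 's/ name="[^"]*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
<img src="https://websiteurl.com/286.jpg" alt="img">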
With your shown samples, you could try the following, written and tested in GNU awk.
awk '/^<img/ && match($0,/src.*/){print substr($0,1,4),substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: using awk's sub() (substitute) function.
awk '/^<img/{sub(/name="[^"]*" /,"")} 1' Input_file
Explanation:
1st solution: uses awk's match() function to match from src to the end of the line, then prints the first 4 characters (<img) followed by the matched text.
2nd solution: checks whether the line starts with <img; if so, substitutes the name="..." attribute (including the trailing space) with nothing and prints the current line.

Sed substitute pattern within a line

How can I substitute characters only within a specific pattern, preferably in sed, but awk or something else if there's an easier option? I would like to replace the spaces in my HTML h3 ids with hyphens (-), but I don't want to hyphenate the entire line.
Eg, in my foo.html:
<p>This is a paragraph which likes its spaces.</p>
<h3 id="No spaces in this id please">Keep spaces in this title</h3>
<p>Here's another paragraph with spaces.</p>
<h3 id="Another id that should be spaceless">Spaces please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
What I want are h3s like this:
<h3 id="Another-id-that-should-be-spaceless">Spaces please!</h3>
I've tried
sed -e "/^<h3 id=\"/,/\">/s/ /-/g;" <foo.html >bar.html
But this greedily adds hyphens to lines (2nd p) and parts (h3 content) which shouldn't have hyphens! Bar.html:
<p>This is a paragraph which likes its spaces.</p>
<h3-id="No-spaces-in-this-id-please">Keep-spaces-in-this-title</h3>
<p>Here's-another-paragraph-with-spaces.</p>
<h3-id="Another-id-that-should-be-spaceless">Spaces-please!</h3>
<p>Yes I would like extra noodles in my soup.</p>
Note I'm using GNU sed. Thanks!
This sed command replaces one space at a time in the id value of the h3 tags. When a substitution succeeds, the t command loops back to the :a label to look for remaining spaces to replace:
sed -e ':a;s/\(<h3[^>]*id="[^"> ]*\) \(.*\)/\1-\2/;ta;' < foo.html > bar.html
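For the sample foo.html above, bar.html then comes out with only the id values hyphenated:
<p>This is a paragraph which likes its spaces.</p>
<h3 id="No-spaces-in-this-id-please">Keep spaces in this title</h3>
<p>Here's another paragraph with spaces.</p>
<h3 id="Another-id-that-should-be-spaceless">Spaces please!</h3>
<p>Yes I would like extra noodles in my soup.</p>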

sed - Include newline in pattern

I am still a noob with shell scripts but am trying hard. Below is a partially working shell script which is supposed to remove all JS from *.htm documents by matching tags and deleting their enclosed content, e.g. <script src="">, <script></script> and <script type="text/javascript">
find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done
The problem with this script is that, because sed reads its input line by line, it will not work as expected with newlines. Running it on:
<script>
//Foo
</script>
will remove the first script tag but leave the //Foo comment and the closing tag behind, which I don't want.
Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?
Assuming that you have <script> tags on different lines, e.g. something like:
foo
bar
<script type="text/javascript">
some JS
</script>
foo
the following should work:
sed '/<script/,/<\/script>/d' inputfile
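For the sample above, this leaves only the non-script lines:
foo
bar
foo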
This awk script looks for the <script*> tag, sets the inside variable, and reads the next line. When the closing </script*> tag is found, the variable is set back to zero. The final pattern prints all lines while inside is zero (note that in itself is a reserved word in awk, so it can't be used as the flag variable):
awk '/<script.*>/ { inside=1; next }
/<\/script.*>/ { if (inside) inside=0; next }
{ if (!inside) print }' $1
As you mentioned, the issue is that sed processes input line by line.
The simplest workaround is therefore to make the input a single line, e.g. replacing newlines with a character which you are confident doesn't exist in your input.
One would be tempted to use tr:
… |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'
However "currently tr fully supports only single-byte characters", and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
Fortunately, the same thing can be achieved with sed, using branching.
Back on our <script>…</script> example, this does work and would be (according to the previous link) cross-platform:
… |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'
Or in a more condensed form if you use GNU sed and don't need cross-platform compatibility:
… |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'
Please refer to the linked answer under "using branching" for details about the branching part (:a;N;$!ba;). The remaining part is straightforward:
s/\n/ˇ/g replaces all newlines with ˇ;
s~<script>.*</script>~~g removes what needs to be removed (beware that this needs hardening for actual use: as is, it will delete everything between the first <script> and the last </script>; also, note that I used ~ instead of / to avoid having to escape the slash in </script>: I could have used just about any single-byte character except a few reserved ones like \);
s/ˇ/\n/g re-adds the newlines.
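If GNU awk is available, another sketch sidesteps both the newline juggling and the greediness caveat by treating the literal closing tag as the record separator (this assumes the script elements are not nested and that the string </script> never appears inside the scripts themselves; input.htm stands in for one of your *.htm files):
gawk -v RS='</script>' '{
  i = index($0, "<script")            # start of the opening tag, if any
  if (i) $0 = substr($0, 1, i - 1)    # drop the tag and everything after it in this record
  printf "%s", $0                     # re-emit the record; the </script> terminator is discarded
}' input.htm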

Match inner pattern. Multiline

I have:
%{ lorem ipsum dolor
sit %{hello
world}%
amet}%
I want:
hello
world
That is, I want to keep the inner %{...}% of any number of nesting %{...}%s that may or may not span multiple lines.
Is there a sed or awk way?
This sed command:
sed -n -r 'H; ${g; s/([^}]|\}[^%])*%\{//; s/\}%([^%]|%[^{])*//; p}'
will gather the entirety of the input into the pattern space, then remove ...%{ (taking care to ensure that the ... doesn't contain }%) and }%... (taking care to ensure that the ... doesn't contain %{), and then print the result. So it's suitable for the case where you need just one block. The case with multiple blocks is trickier, but I'll think about it further, and update this answer if I get that working well.
Note that -r (to support Extended Regular Expressions, instead of Basic ones) is a GNU extension to sed, so if you're using a non-GNU sed that doesn't support it, let me know.
Edited to add: O.K., here's a version that supports multiple blocks:
sed -n -r 'H; ${g; s/^([^}]|\}[^%])*%\{//; s/\}%([^%]|%[^{])*$//; s/\}%([^%]|%[^{])*([^}]|\}[^%])*%\{/\n/g; p}'
It uses essentially the same approach as the previous, except that it only removes ...%{ at start-of-input and }%... at end-of-input, and that after it's done that, it proceeds to remove all instances of }%...%{ that do not contain %{...}%, replacing them with a newline.
The AWK way:
gawk '
/%{/ {
match($0,/%{.*/)
text=substr($0,RSTART+2,RLENGTH-2)
}
!/%{/ && !/}%/ {
text=text "\n" $0
}
/}%/ {
match($0,/}%/)
text=text "\n" substr($0,1,RSTART-1)
print text
exit
}'
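For the sample input from the question, this prints:
hello
world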
This won't work if there's more than one %{ or }% on the same line. In that case you need a minor modification: use the three-argument (array) form of match().
One possible TXR way:
Simply scan the input free-form (as one big line), collecting matches for a regular expression into the variable wanted, which #(coll) implicitly gathers into a list (also called wanted).
Then spit out the pieces, chopping two characters from the head and tail of each.
$ txr -c '#(freeform)
#(coll)#{wanted /\%{(~(.*(\%{|}\%).*))}\%/}#(end)
#(output)
#(rep)#{wanted [2..-2]}#(end)
#(end)' -
asdf asdf %{
%{ asdf
asdf
}% %{boo}% }%
[Ctrl-D][Enter]
asdf
asdf
boo
The regex ~ operator means complement. The variable wanted captures text which consists of %{ followed by the longest matching string which does not contain %{ or }% as a substring, followed by }%. TXR regexes support complement, intersection and difference. We have to write the % character as \% because % is the non-greedy zero-or-more operator.
The output for the example given in the question is:
hello
world
rather than output that preserves the original column alignment of the captured text. The author didn't clarify whether that is really needed. It would complicate the problem, because %{hello occurs somewhere in the middle of its line, so we would have to know the column position of the h in hello in order to know that the w in world is two spaces over.

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+). Either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
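A rough illustration of that special case (one <li...><a...>text</a> fragment per line, no nested markup) might look like this sketch:
awk 'match($0, /<li[^>]*><a[^>]*>[^<]*<\/a>/) {
  s = substr($0, RSTART, RLENGTH)     # the whole <li...><a...>text</a> fragment
  sub(/^<li[^>]*><a[^>]*>/, "", s)    # strip the opening tags
  sub(/<\/a>$/, "", s)                # strip the closing anchor
  print s
}' cities.html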
But in general, awk is the wrong tool for this job.
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Building on your script, you can get what you want like this (assuming the <li> and <a> tags are on one line):
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
The first one works with any awk; the second one requires GNU awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.