sed doesn't match characters inside parentheses - regex

I'm trying to come up with a greedy sed expression that ignores the stuff inside HTML quotes and matches ONLY the text of that element.
<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100
These are my attempts:
grep -E '(!?\")100(!?\")' html # this matches string as well as quotes
grep -E '[^\"]100[^\"]' html # this doesn't work either
Edit
Ok. I was trying to simplify the question but maybe that's wrong.
With a command like sed -r 's/?????/__replaced__/g' file I would need to see:
<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img>
<span alt="tel:100">__replaced__</span>

I don't think handling HTML with sed (or grep) is a good idea. Consider using Python, which has an HTML push parser in its standard library. That makes separating tags from data easy. Since you only want to handle the data between tags, it could look something like this:
#!/usr/bin/python
from HTMLParser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print data

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())
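The example above targets Python 2. Under Python 3 the module moved to html.parser and print is a function; a minimal sketch of the same idea, collecting matches in a list instead of reading a file (the inline HTML string is made up for illustration):

```python
from html.parser import HTMLParser  # Python 3 location of the module

class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hits = []  # collect matching text nodes

    def handle_data(self, data):
        # data is the text between tags; attribute values never reach here
        if data == "100":
            self.hits.append(data)

p = MyParser()
p.feed('<p alt="100">100</p><img src="100.jpg">100</img>')
print(p.hits)  # ['100', '100']
```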
Update for updated question: To edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data in a manner that reprints the parsed tags. For example:
#!/usr/bin/python
from HTMLParser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        stdout.write("<" + tag)
        for k, v in attrs:
            stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")

    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))

    def handle_data(self, data):
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

with open(argv[1], "r") as f:
    MyParser().feed(f.read())
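A Python 3 sketch of the same approach; writing to an io.StringIO buffer instead of stdout is my change, so the result is easy to inspect:

```python
import io
import re
from html.parser import HTMLParser

class ReplaceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = io.StringIO()  # buffer instead of stdout

    def handle_starttag(self, tag, attrs):
        self.out.write("<" + tag)
        for k, v in attrs:
            self.out.write(' {}="{}"'.format(k, v))
        self.out.write(">")

    def handle_endtag(self, tag):
        self.out.write("</{}>".format(tag))

    def handle_data(self, data):
        # only text between tags is rewritten; attribute values pass through
        self.out.write(re.sub("100", "__replaced__", data))

p = ReplaceParser()
p.feed('<img src="100.jpg">100</img>')
print(p.out.getvalue())  # <img src="100.jpg">__replaced__</img>
```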

First, a warning: parsing HTML with regular expressions is generally a bad idea; using an HTML parser is the better answer. Most scripting languages (Perl, Python, etc.) have HTML parsers.
See here for an example as to why: RegEx match open tags except XHTML self-contained tags
If you really must though:
/(?!\>)([^<>]+)(?=\<)/
DEMO
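You can sanity-check that pattern with Python's re, which supports the same lookarounds (the backslashes before < and > are unnecessary in Python and dropped here; the test strings are the question's samples):

```python
import re

# any run of non-angle-bracket characters that is followed by '<'
pattern = r'(?!>)([^<>]+)(?=<)'
print(re.findall(pattern, '<p alt="100">100</p>'))          # ['100']
print(re.findall(pattern, '<span alt="tel:100">100</span>'))  # ['100']
```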

You may try the below PCRE regex.
grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file
or
grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file
This matches the number 100 only when it is not inside double quotes.
DEMO
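Python's stdlib re has no (*SKIP)(*F) verbs, but the same effect can be emulated by matching the quoted strings first and discarding them; this is an illustrative workaround, not the grep answer itself:

```python
import re

line = '<span alt="tel:100">100</span>'
# alternation: a quoted string matches first and captures nothing;
# a bare 100 outside quotes lands in group 1
hits = [m.group(1)
        for m in re.finditer(r'"[^"]*"|\b(100)\b', line)
        if m.group(1)]
print(hits)  # ['100']
```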

Your question's gotten kinda muddy through its evolution, but is this what you're asking for?
$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100
If not please clean up your question to just show the latest sample input and expected output and explanation.
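For comparison, the same substitution expressed with Python's re, assuming one element per line as in the sample (count=1 mirrors sed without the /g flag):

```python
import re

lines = ['<p alt="100">100</p>',
         '<img src="100.jpg">100</img>',
         '<span alt="tel:100">100</span>']
for line in lines:
    # replace everything between the first '>' and the following '<'
    print(re.sub(r'>[^<]+<', '>__replaced__<', line, count=1))
```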

Related

Regex if then without else in ripgrep

I am trying to match some methods in a bunch of Python scripts if certain conditions are met. The first thing I am looking at is whether import re exists in a file; if it does, then find all cases of re.sub(something). I tried following the documentation here on how to use if-then-without-else regexes, but I can't seem to make it work with ripgrep, with or without PCRE2.
My next approach was to use groups, so rg -n "(^import.+re)|(re\.sub.+)" -r '$2', but the issue with this approach is that because the first import group matches, I get a lot of empty matches back in my output. The $2 is being handled correctly.
I am hoping to avoid doing an OR-group capture, and use the regex if option if possible.
To summarize, what I am hoping for is: if import re appears anywhere in a file, then search for re\.sub.+ and output only the matching files and lines using ripgrep. Using ripgrep is a hard dependency.
Some sample code:
import re
for i in range(10):
    re.match(something)
    print(i)
re.sub(something)
This can be accomplished pretty easily with a shell pipeline and xargs. The idea is to use the first regex as a filter for which files to search in, and the second regex to show the places where re.sub occurs.
Here are three Python files to test with.
import-without-sub.py has an import re but no re.sub:
import re
for i in range(10):
    re.match(something)
    print(i)
import-with-sub.py has both an import re and an re.sub:
import re

for i in range(10):
    re.match(something)
    print(i)

re.sub(something)
And finally, no-import.py has no import re but does have a re.sub:
for i in range(10):
    re.match(something)
    print(i)
re.sub(something)
And now here's the command to show only matches of re.sub in files that contain import re:
rg '^import\s+re$' --files-with-matches --null | xargs -0 rg -F 're.sub('
--files-with-matches and --null print out all matching file paths separated by a NUL byte. xargs -0 then reads those file paths and turns them into arguments to be given to rg -F 're.sub('. (We use --null and -0 in order to correctly handle file names that contain spaces.)
Its output in a directory with all three of the above files is:
import-with-sub.py
7:re.sub(something)
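The same two-pass idea can be sketched in plain Python for environments without rg; the temp-dir setup below just reproduces two of the sample files from the answer (file names and contents are the answer's, trimmed):

```python
import re
import tempfile
from pathlib import Path

def find_sub_lines(root):
    """Two-pass filter: search for re.sub( only in files containing an 'import re' line."""
    hits = []
    for path in sorted(Path(root).glob("*.py")):
        text = path.read_text()
        if re.search(r"^import\s+re$", text, re.M):  # pass 1: filter files
            for n, line in enumerate(text.splitlines(), 1):
                if "re.sub(" in line:                # pass 2: report matches
                    hits.append(f"{path.name}:{n}:{line}")
    return hits

with tempfile.TemporaryDirectory() as d:
    Path(d, "import-with-sub.py").write_text(
        "import re\n\nfor i in range(10):\n    re.match(x)\n    print(i)\n\nre.sub(x)\n")
    Path(d, "no-import.py").write_text("re.sub(x)\n")  # no import re, so ignored
    hits = find_sub_lines(d)
print(hits)  # ['import-with-sub.py:7:re.sub(x)']
```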

Parsing HTML page using bash

I have an HTML web page and I'm trying to parse it.
Source ::
<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac> </td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac> </td><td>0</td><td>0s</td><td></td></tr></table><p>
What I want is ::
172.29.219.17 L7TOUT in 1001ms
So what I'm trying right now is ::
grep redis index.html | grep 'a name=\"redis\/[0-9]*.*\"'
to extract the IP address.
But the regex doesn't seem to pick out only the first row; it returns both rows, whereas the IP is only in row 1.
I've double-checked the regex I'm using but it doesn't seem to work.
Any ideas?
Using XPath expressions in xmllint with its built-in HTML parser produces the output
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
172.29.219.17
and for the timeout value, I manually counted the td cell containing the value, which turned out to be 24:
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html
produces an output as
L7TOUT in 1001ms
Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')
Removing the whitespace and extracting only the needed part with awk:
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms
Put it in a variable as
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
Now you can print both the values together as
echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
version details,
xmllint --version
xmllint: using libxml version 20902
Also, there is a stray closing tag in your HTML input, a </table> at the end just before <p>, which xmllint reports as
htmlfile:147: HTML parser error : Unexpected end tag : table
Remove it before further testing.
Here is a list of command line tools that will help you parse different formats via bash; bash is extremely powerful and useful.
JSON utilize jq
XML/HTML utilize xq
YAML utilize yq
CSS utilize bashcss
I have tested all the other tools; please comment on this one.
If the code starts getting truly complex, you might consider the naive answer below, as coding languages with class support will assist.
naive - Old Answer
Parsing complex formats like JSON, XML, HTML, CSS, YAML, ...ETC is extremely difficult in bash and likely error prone. Because of this I recommend one of the following:
PHP
RUBY
PYTHON
GOLANG
because these languages are cross platform and have parsers for all the above listed formats.
If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:
<a
name="redis/172.29.219.17">
Some text
</a>
Anyway, let's solve the problem assuming that the a tags are on one line and name is the first attribute. This is what I could come up with:
sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
Explanation:
The first sed command makes sure that all <a name="redis text goes to a separate line.
Then the grep keeps only those lines that start with <a name="redis/ followed by digits and dots.
The last sed contains two expressions:
The first expressions removes the leading <a name="redis/ text
The last expression removes everything that comes after the closing "
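The same extraction can be sketched with Python's re under the same one-line assumption (the row string here is a trimmed fragment of the question's HTML):

```python
import re

row = '<td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td>'
# capture the dotted-digits part after name="redis/
m = re.search(r'<a name="redis/([0-9.]+)"', row)
print(m.group(1))  # 172.29.219.17
```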

grep text before string - regex

I have to extract a few fields from the below input HTML text using bash (only).
HTML input
SOMETEXT
I have to extract the id value and SOMETEXT from the above input.
I am hoping that grep with some regex should work out.
For id_value I am using following regex
"id=[0-9]*"
which is giving me correct results.
grep -o 'id=[0-9]*' index.html | head -n 5
But I am not sure what sort of regex I should use to grab the text up to the next </a>.
Thanks in advance.
(?<=>).*?(?=<)
You can use this with grep -P, since it uses lookarounds supported by Perl-compatible regexes. See the demo.
https://regex101.com/r/fM9lY3/21
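The lookaround pattern can be tried directly in Python, which supports the same Perl-style lookarounds (the anchor tag below is a made-up stand-in for the question's elided HTML):

```python
import re

html = '<a href="page?id=123">SOMETEXT</a>'
# text preceded by '>' and followed by '<'
print(re.findall(r'(?<=>).*?(?=<)', html))  # ['SOMETEXT']
```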
The regex you have in your OP ("id=[0-9]*") looks like it worked in your case, but a better approach is to home in on the anchor tags themselves.
Here is a regex to extract out the id value:
<a.*?id=(\d.*?)">
And here is a regex to extract out the contents inside the <a> tag:
<a.*?">(.*?)<\/a>
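Both patterns check out with Python's re; the href and id value below are made-up placeholders, since the question's HTML was elided:

```python
import re

a = '<a href="page?id=123">SOMETEXT</a>'
# extract the id value
print(re.search(r'<a.*?id=(\d.*?)">', a).group(1))  # 123
# extract the contents inside the <a> tag
print(re.search(r'<a.*?">(.*?)</a>', a).group(1))   # SOMETEXT
```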

sed/grep - get text between two strings (html)

I am trying to extract "pagename" from the following:
<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>
I tried to get it to work using "sed" but it only says invalid command code.
What line of code would you guys suggest to get the pagename? By the way: this is not a single line, as there is more content on the same line - but that should not make a difference, as it should only matter what is between the delimiters, right?
Thanks in advance for helping me out!
I would use awk for this:
awk -F"[/?]" '/timetable work/ {print $4}' file
pagename
It searches for a line containing timetable work, then prints the fourth field, using / or ? as the separator.
As you commented, if you want to extract "<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>" you can use the following regex:
<a class="timetable.*?<\/a>
Working demo
If you want to grab the content just surround the regex with capturing groups:
(<a class="timetable.*?<\/a>)
The match is:
MATCH 1
1. [9-80] `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`
I think this is what you want:
sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
Giving you exactly what you asked for using exactly the delimiters you told us to use:
$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename
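The same idea in Python's re, for readers who find the sed delimiters hard to scan (the pattern mirrors the first sed command above: skip the host part, capture up to a quote or ?):

```python
import re

line = '<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>'
# skip scheme and host ([^/]*), then capture the path up to '"' or '?'
m = re.search(r'href="https?://[^/]*/([^"?]*)', line)
print(m.group(1))  # pagename
```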
I know it may be tempting to handle this using a regular expression but here's an alternative.
You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:
use strict;
use warnings;
use feature qw(say);
use HTML::TokeParser::Simple;
use URI::URL;
my $filename = 'file.html';
my $parser = HTML::TokeParser::Simple->new($filename);
while (my $anchor = $parser->get_tag('a')) {
next unless defined(my $class = $anchor->get_attr('class'));
next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;
my $url = url $anchor->get_attr('href');
say substr($url->path, 1);
}
Parse the HTML using HTML::TokeParser::Simple, loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which, in your case, would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.
Output:
pagename
I know it's much longer than a single regex but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)

Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:
#using the '#' symbol as delimiter instead of '/'
#remove tags
s#<.*>\(.*\)</.*>#\1#g
#remove the nbsp
s#\(&nbsp;\)*##g
#add a newline before the address (actually typing a newline in the file)
s#\(123 street\)#\
\1#g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s#\(.*\)\n\(.*\)\n\(.*\)#\1 \2 \3#g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my#email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.
I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
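If you'd rather avoid the extra dependency, the standard library's html.parser can do the same job; a minimal sketch under the assumption that each field is its own element, with the field-joining step simplified for illustration:

```python
from html.parser import HTMLParser

class TextGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []  # one entry per non-empty text node

    def handle_data(self, data):
        # normalize non-breaking spaces and drop whitespace-only nodes
        text = data.replace("\xa0", " ").strip()
        if text:
            self.lines.append(text)

g = TextGrabber()
g.feed('<h1 class="fn" id="myname">My Name</h1>\n'
       '<span class="locality">City</span>\n'
       '<span class="region">Region</span>\n'
       '<span class="postal-code">1A1 A1A</span>')
# join the locality/region/postal-code fields onto one line
print([g.lines[0], " ".join(g.lines[1:])])  # ['My Name', 'City Region 1A1 A1A']
```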
If you have only one data block per file, try the following (using sed):
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000