Return each instance of a regex - regex

I've extensively googled and everyone keeps telling me how to return the LINE that the regex matches...
go lets say that i have a line like this in a text file:
<a href=http://google.com> Google </a>
I want to be able to return ONLY what occurs between > and < ("Google"). The problem is that I could have a file with thousands of lines like that and I only want to have sed/awk return the EXACT string that matches the regex.
I figured it would be something along the lines of :
sed 's/>.*</p'
but obviously that wont work...
Its killing me because im sure its probably very simple but i just cant find the right sed line. can sed just not do it?
So I just want it to search through a file, match the regex i give it, and return the exact match (not the line)
anyone have any ideas?

With `sed
sed -n 's/^.*>\([^<]*\)<.*$/\1/p'

If you have GNU grep, the -o option does what you want.
echo '<a href=http://google.com> Google </a><span>foo</span>' |
grep -oP '(?<=>).*?(?=<)'
returns
Google
foo
This assumes your html tags and their enclosed data are on one line. This would not match
<a href=...>
blah
</a>

Related

sed regex match and replace any last digit

I have lots of file containing following ipaddress, and i want to replace last digit of ip and look like i am having struggle to come up with correct regex
file1
IPADDR=10.30.2.26
NETMASK=255.255.0.0
GATEWAY=10.30.0.1
I want to replace 10.30.2.26 to 10.30.2.27 using sed but somehow i am missing something, i have tried following.
I have many file which i want to replace and last digit could be anything.
I have tried sed 's/[^IPADDR].$/7/g' file1
how do i match anything between ^IPADDR{anything}$ ?
In your regex, [^IPADDR] is a character class that search for any character except those listed between brackets. I'm not sure that's what you want.
You can use an address instead to find lines starting with IPADDR(/^IPADDR/) and apply the substitution command on it:
sed '/^IPADDR/s/[0-9]$/7/' file
You may use the following command:
sed -r 's/(^IPADDR=[0-9.]+)([0-9]$)/\17/g' file
Prints:
IPADDR=10.30.2.27
NETMASK=255.255.0.0
GATEWAY=10.30.0.1

print multiple patterns with sed

I try to print multiple patterns with sed.
Here's a typical string to process :
(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>
and I would like : (1.15)
For this, I tried :
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
but I get (1.)15</span>)</td></tr>
Anyone could see what's wrong ?
Thanks
If you are Chuck Norris, use regex, brainfuck or assembly. If you're not, don't use regex to parse HTML, instead, use a tool that support xpath, like xmllint. In 2014, it's a solved problem :
xmllint --html --xpath '//span[#class="arabic"]/text()' file_or_URL
Check the famous RegEx match open tags except XHTML self-contained tags
xmllint comes from libxml2-utils package (for debian and derivatives)
Reason why you are getting "(1.)15) as your output"
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
^^
the two characters "> needs to be placed before \([0-9]*\) since "> in your line is before the two digits (in this case). This way sed can find the pattern
The correct sed command
sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
^^
Correct Command line
echo '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'|sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
results using the command line above
(1.15)
If data is at the same place all the time, awk may be a simpler solution than sed:
awk -F"[<>]" '{print "("$3"."$7")"}' file
(1.15)
$ lynx -dump -nomargins file.htm
(1.15)

Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td> but the <\/th>\s*<td> works in the RegEx so why not in Perl?
Have tried substituting \s* for \n*, \r*, \t* and various combinations thereof but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.
The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0 option, like this
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
Perl evaluates a file one line at a time, in your example you're trying to match over two lines so perl never finds the end of the string it's looking for on the first line, and never finds the beginning of the line it's looking for on the second line.
You can either flatten file.html to a single line temporarily (which might work if the file's small / performance is not so important) or you'll need to write more sophisticated logic to keep track of lines it's found.
Try searching for 'multiline regex perl' :)
You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html

Bash/PHP extract URL from HTML via regex

Is there any easy way to extract this URL in bash/or PHP?
http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
From this HTML code?
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">
With perl you could do a match and a capture
perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'
This captures everything between image= and the next & and prints it $1.
For more on regular expressions, see perlre or http://www.regular-expressions.info/
In bash, you can try the following:
sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'
Update:
The solution above performs substitution rather than extraction. The line containing the pattern (required url) is replaced by the pattern itself. However, the substitution isn't in-place.
Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.
However, a pure regex match would look something like: m#image=(a-z0-9\:/\.\-)&#i. You can take that regex and put it wherever you want to get your result stored in $1. Despite what a lot of people think, you do not have to match the beginning of a line and the end of a line to match a result.
Try doing this :
xmllint --html --xpath '//a/#href' file://file.html |
grep -oP 'image=\Khttp://.*?\.png'
You can use an URL instead of a local file :
http://domain.tld/path
Or if you had already extracted the line to parse in the $string variable :
grep -oP 'image=\Khttp://.*?\.png' <<< "$string"

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up until the <div id="stupid">. I know there are whitespace characters so i used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -n -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
You need to tell the regex engine to treat your scalar as a multiline string with the /m option; otherwise it won't attempt to match across newlines.
perl -p
is working on the file on a line by line basis see perl.com
that means your regex will never see all lines to match, it will only match when it gets the line that starts with "<div id="footer">" and on the following lines it will not match anymore.