sed/grep - get text between two strings (html) - regex

I am trying to extract "pagename" from the following:
<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>
I tried to get it to work using "sed" but it only says invalid command code.
What line of code would you suggest to get the pagename? By the way: the tag is not alone on its line, there is more content on the same line. That should not make a difference, as only what is between the delimiters should matter, right?
Thanks in advance for helping me out!

I would use awk for this:
awk -F"[/?]" '/timetable work/ {print $4}' file
pagename
It searches for a line containing timetable work, then prints the fourth field, using / or ? as the separator.
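For comparison, here is a minimal Python sketch of the same field-splitting idea. The function name fourth_field is made up for illustration, and like the awk command it assumes no other / or ? appears earlier on the line.

```python
import re

# Sample line from the question.
line = '<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>'

def fourth_field(text):
    """Split on '/' or '?' and return the fourth field (awk's $4)."""
    fields = re.split(r"[/?]", text)
    # awk fields are 1-based, Python lists are 0-based.
    return fields[3] if len(fields) > 3 else None

# Only look at lines that contain the class string, as the awk pattern does.
if "timetable work" in line:
    print(fourth_field(line))
```

If the real line has more slashes before the link, the field index shifts, which is the main weakness of this approach.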

As you commented, if you want to extract "<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>" you can use the following regex:
<a class="timetable.*?<\/a>
If you want to grab the content just surround the regex with capturing groups:
(<a class="timetable.*?<\/a>)
The match is:
MATCH 1
1. [9-80] `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`

I think this is what you want:
sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'

Giving you exactly what you asked for using exactly the delimiters you told us to use:
$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename

I know it may be tempting to handle this using a regular expression but here's an alternative.
You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:
use strict;
use warnings;
use feature qw(say);
use HTML::TokeParser::Simple;
use URI::URL;
my $filename = 'file.html';
my $parser = HTML::TokeParser::Simple->new($filename);
while (my $anchor = $parser->get_tag('a')) {
next unless defined(my $class = $anchor->get_attr('class'));
next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;
my $url = url $anchor->get_attr('href');
say substr($url->path, 1);
}
Parse the HTML using HTML::TokeParser::Simple. Loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which in your case would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.
Output:
pagename
I know it's much longer than a single regex but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)
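The same parser-based approach can be sketched in Python using only the standard library. This is an illustrative equivalent, not a translation of the Perl modules: the names TimetableLinks and page_names are invented here, and lstrip("/") strips every leading slash rather than exactly one character.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class TimetableLinks(HTMLParser):
    """Collect the URL path of every <a> carrying both classes."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr.get("class") or "").split()
        if "timetable" in classes and "work" in classes and attr.get("href"):
            # urlparse separates the path from the query string.
            self.pages.append(urlparse(attr["href"]).path.lstrip("/"))

def page_names(html):
    parser = TimetableLinks()
    parser.feed(html)
    return parser.pages

sample = 'text <a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a> more'
print(page_names(sample))
```

Because the parser looks at attributes rather than raw text, attribute order and extra content on the same line do not matter.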

Related

Bash/PHP extract URL from HTML via regex

Is there any easy way to extract this URL in bash/or PHP?
http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
From this HTML code?
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">
With perl you could do a match and a capture
perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'
This captures everything between image= and the next & into $1 and prints it.
For more on regular expressions, see perlre or http://www.regular-expressions.info/
In bash, you can try the following:
sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'
Update:
The solution above performs substitution rather than extraction. The line containing the pattern (required url) is replaced by the pattern itself. However, the substitution isn't in-place.
Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.
However, a pure regex match would look something like: m#image=([a-z0-9:/._-]+)&#i. You can take that regex and put it wherever you want to get your result stored in $1. Despite what a lot of people think, you do not have to match the beginning of a line and the end of a line to match a result.
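Both ideas from this answer, splitting on delimiters and a plain regex capture, can be sketched in Python. The function names by_split and by_regex are made up for this example, and the character class is one reasonable reading of the pattern above rather than a definitive one.

```python
import re

# The HTML line from the question.
snippet = '''<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">'''

def by_split(text):
    """Take what follows '?image=' and cut it at the next '&'."""
    after = text.split("?image=", 1)[1]  # raises IndexError if the marker is absent
    return after.split("&", 1)[0]

def by_regex(text):
    """Capture URL-like characters between 'image=' and the next '&'."""
    m = re.search(r"image=([a-z0-9:/._-]+)&", text, re.IGNORECASE)
    return m.group(1) if m else None

print(by_split(snippet))
print(by_regex(snippet))
```

Neither version anchors to the start or end of the line; the search simply finds the first place where the pattern fits.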
Try doing this :
xmllint --html --xpath '//a/@href' file://file.html |
grep -oP 'image=\Khttp://.*?\.png'
You can use a URL instead of a local file:
http://domain.tld/path
Or if you had already extracted the line to parse in the $string variable :
grep -oP 'image=\Khttp://.*?\.png' <<< "$string"

Why is using string substitution to form a regex not working?

I have a regular expression for use with awk to find any of the specified words in a line of a file. It looks like this awk "/word1/||/word2/||/word3/" filename. As an alternative, I have been trying to specify the words like this WORDS="word1 word2 word3" and then use bash string substitution to form the regular expression to pass to awk.
I've tried numerous ways of doing this to no avail. awk simply dumps the contents of the entire file or spits out some complaint about the regex form.
Here's what I have:
#!/bin/bash
FILE="myfile"
WORDS="word1 word2 word3"
# use BASH string substitution to obtain the regex which should look like this:
# "/word1/||/word2/||/word3/"
REGEX=\"/${WORDS// //||/}/\"
awk ${REGEX} $FILE
I'm fairly sure it has to do with quoting. I've tried various methods using echo and backticks and can get it to look right (when echoed), but when I actually try to use it, it fails.
Try to replace:
REGEX=\"/${WORDS// //||/}/\"
with:
REGEX="/${WORDS// //||/}/"
Note that there is no need to escape double quotes since they are not really part of the regular expression.
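For comparison, here is a rough Python sketch of the same idea, building a pattern from a space-separated word list, where no shell quoting is involved. It uses a single alternation (word1|word2|word3) instead of awk's /a/||/b/ form, and the names build_pattern and matching_lines are invented for illustration.

```python
import re

def build_pattern(words):
    """Turn 'word1 word2 word3' into the alternation 'word1|word2|word3'."""
    # re.escape keeps any regex metacharacters in a word literal.
    return "|".join(re.escape(w) for w in words.split())

def matching_lines(lines, words):
    """Return the lines that contain at least one of the words."""
    pattern = re.compile(build_pattern(words))
    return [line for line in lines if pattern.search(line)]

lines = ["has word2 in it", "nothing here", "word3 at the start"]
print(matching_lines(lines, "word1 word2 word3"))
```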

Global Regex Substitution with Unique Arbitrary Values

I have one huge HTML file with many links, i.e. <a href="...">. I need to substitute each href with a unique arbitrary value. So, after substitution the first link will be <a href="http://link1">, the second link <a href="http://link2">, and so on.
Can we do this using a regex? Or, do I need to write a small script to scan over the file? Ideally, the solution will be a Perl or bash script (not something proprietary).
Thanks.
Perl is probably your best bet, but I wouldn't try to do it in one regex (might not even be possible). I think this is as short as you can make the script while still making it readable:
#!/usr/bin/perl
$link = 1;
while(<>) {
$link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ );
print;
}
Then call it like so:
./thatScript.pl inputFile.html > newInputFile.html
It will examine each line of input, and for each href="..." it finds, replaces it with a numbered link and increments the link number. There is also a negative lookahead to avoid replacing the same href continuously.
EDIT: Just for the hell of it, here's how you would compress the above into a single line of bash:
perl -pe '$link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ )' inFile.html > outFile.html
This makes use of Perl's amazing -p flag, which is documented in perlrun.
I definitely don't recommend this (tchrist is right, of course, it should be a script) but it does have the virtue of being terse and fulfilling the literal requirements in a deterministic/repeatable way without needing to save state/mapping.
perl -MDigest::MD5=md5_hex -MXML::LibXML -le '$d = XML::LibXML->load_html( location => shift || die "need location" ); for $a ( $d->findnodes("//\@href") ) { $a->setValue( md5_hex $a->value ) }; print $d->serialize' targeted.html
Digest::MD5
XML::LibXML
untested:
perl -pe 's{(href=")[^"]+}{$1 . "http://link" . ++$count}ge' filename > newfile
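The counter-based substitution can also be sketched in Python, using a replacement function in place of Perl's /e modifier. The function name renumber_hrefs and the sample markup are made up for illustration, and like the one-liners above it only handles double-quoted href values.

```python
import itertools
import re

def renumber_hrefs(html):
    """Replace each href value with http://link1, http://link2, ..."""
    counter = itertools.count(1)

    def replace(match):
        # group(1) is the literal 'href="' prefix, kept as-is.
        return f'{match.group(1)}http://link{next(counter)}"'

    # re.sub calls replace() once per match, left to right.
    return re.sub(r'(href=")[^"]*"', replace, html)

sample = '<a href="http://a.example/x">A</a> <a href="/y">B</a>'
print(renumber_hrefs(sample))
```

Since each match is visited exactly once, no lookahead is needed to avoid replacing an already numbered link.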

Is there a truly universal wildcard in Grep?

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -l -0777 -ne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
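To illustrate what the /s switch changes, here is a small Python sketch where re.DOTALL plays the same role. The sample text and the function name between are invented for this example, and the same caveat about real HTML applies.

```python
import re

# Hypothetical page: the body spans several lines, including a blank one.
page = "HEADER TEXT\nline one\n\nline two\nFOOTER TEXT\ntrailer"

def between(text, start="HEADER TEXT", end="FOOTER TEXT"):
    """Return every non-greedy match between start and end, newlines included."""
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    # re.DOTALL lets '.' match '\n', like Perl's /s.
    return re.findall(pattern, text, flags=re.DOTALL)

print(between(page))
# Without DOTALL the dot stops at the first newline, so nothing matches.
print(re.findall(r"HEADER TEXT(.+?)FOOTER TEXT", page))
```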
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
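The line-range behaviour of the sed command can be sketched in Python as a small state machine. This is an approximation under stated assumptions: it uses substring tests instead of regexes, the name lines_between is made up, and, as with sed, the end marker is only looked for on lines after the one that opened the range.

```python
def lines_between(lines, start="HEADER TEXT", end="FOOTER TEXT"):
    """Yield each line from a start line through the next end line."""
    inside = False
    for line in lines:
        if inside:
            yield line
            if end in line:
                inside = False  # range closed; wait for the next start line
        elif start in line:
            inside = True
            yield line

sample = ["a", "HEADER TEXT", "b", "FOOTER TEXT", "c"]
print(list(lines_between(sample)))
```

As with sed, whole lines are emitted, so any text before the header or after the footer on those lines is kept.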
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the (?s) switch:
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep will work for single-line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Going by your script (which assumes the <li> and <a> tags are on one line), you can get what you want with:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
The first one works in any awk; the second one needs GNU awk (for gensub).
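For comparison, printing only the parenthesised part of a match is straightforward in Python. The sketch below uses invented sample markup (the contents of cities.html are not shown in the question) and a made-up function name, and it carries the same one-line-per-item assumption as the awk commands.

```python
import re

# Capture the anchor text that directly follows an opening <li>.
LINK_TEXT = re.compile(r"<li[^>]*><a[^>]*>([^<]+)</a>")

def anchor_texts(html):
    """Return only the captured group for every match."""
    return LINK_TEXT.findall(html)

# Hypothetical sample data.
sample = (
    "<ul>\n"
    '<li class="c"><a href="/paris">Paris</a></li>\n'
    '<li><a href="/rome">Rome</a></li>\n'
    "</ul>"
)
print(anchor_texts(sample))
```

With a single capture group, findall returns the group rather than the whole match, which is the behaviour the question was looking for.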
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism in place for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ last unless $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.