I want to extract all anchor tags from HTML pages. I am using this on Linux:
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
but that is not working as expected, since the result contains unwanted output such as
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
I want just
<a href >...</a>
Is there a good way to do this?
If your grep has a -P option so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal (non-greedy) quantifier like *? helps. Also, you're getting the whole input line rather than just the match itself; if your grep has a -o option, it will print only the part that matches.
grep -Po '<a[^<>]*>'
If your grep doesn’t have those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
which also crosses line boundaries (-00 puts Perl in paragraph mode, so a match can span lines).
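To run it against the original page, a quick sketch piping lynx straight into the one-liner (untested against the live site):
lynx --source http://www.imdb.com | perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'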
To do a real parse of HTML requires regexes substantially more complex than you are apt to wish to enter on the command line. Here's one example, and here's another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.
Why can't you use options like --dump?
lynx --dump --listonly http://www.imdb.com
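If you only want the URLs themselves, a rough sketch that trims lynx's numbered reference list down to unique links (the field layout of lynx's "References" section is assumed here):
lynx --dump --listonly http://www.imdb.com | awk '$2 ~ /^https?:/ { print $2 }' | sort -u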
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But please read the answer that MAK linked to.
Here are some examples of why you should not use regex to parse HTML.
To extract the values of the href attribute of anchor tags, run:
$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print "\n".join(root.xpath("//a/@href"))' http://imdb.com | sort -u
Install the lxml module if needed: $ sudo apt-get install python-lxml.
Output
http://askville.amazon.com
http://idfilm.blogspot.com/2011/02/another-class.html
http://imdb.com
http://imdb.com/
http://imdb.com/a2z
http://imdb.com/a2z/
http://imdb.com/advertising/
http://imdb.com/boards/
http://imdb.com/chart/
http://imdb.com/chart/top
http://imdb.com/czone/
http://imdb.com/features/hdgallery
http://imdb.com/features/oscars/2011/
http://imdb.com/features/sundance/2011/
http://imdb.com/features/video/
http://imdb.com/features/video/browse/
http://imdb.com/features/video/trailers/
http://imdb.com/features/video/tv/
http://imdb.com/features/yearinreview/2010/
http://imdb.com/genre
http://imdb.com/help/
http://imdb.com/helpdesk/contact
http://imdb.com/help/show_article?conditions
http://imdb.com/help/show_article?rssavailable
http://imdb.com/jobs
http://imdb.com/lists
http://imdb.com/media/index/rg2392693248
http://imdb.com/media/rm3467688448/rg2392693248
http://imdb.com/media/rm3484465664/rg2392693248
http://imdb.com/media/rm3719346688/rg2392693248
http://imdb.com/mymovies/list
http://imdb.com/name/nm0000207/
http://imdb.com/name/nm0000234/
http://imdb.com/name/nm0000631/
http://imdb.com/name/nm0000982/
http://imdb.com/name/nm0001392/
http://imdb.com/name/nm0004716/
http://imdb.com/name/nm0531546/
http://imdb.com/name/nm0626362/
http://imdb.com/name/nm0742146/
http://imdb.com/name/nm0817980/
http://imdb.com/name/nm2059117/
http://imdb.com/news/
http://imdb.com/news/celebrity
http://imdb.com/news/movie
http://imdb.com/news/ni7650335/
http://imdb.com/news/ni7653135/
http://imdb.com/news/ni7654375/
http://imdb.com/news/ni7654598/
http://imdb.com/news/ni7654810/
http://imdb.com/news/ni7655320/
http://imdb.com/news/ni7656816/
http://imdb.com/news/ni7660987/
http://imdb.com/news/ni7662397/
http://imdb.com/news/ni7665028/
http://imdb.com/news/ni7668639/
http://imdb.com/news/ni7669396/
http://imdb.com/news/ni7676733/
http://imdb.com/news/ni7677253/
http://imdb.com/news/ni7677366/
http://imdb.com/news/ni7677639/
http://imdb.com/news/ni7677944/
http://imdb.com/news/ni7678014/
http://imdb.com/news/ni7678103/
http://imdb.com/news/ni7678225/
http://imdb.com/news/ns0000003/
http://imdb.com/news/ns0000018/
http://imdb.com/news/ns0000023/
http://imdb.com/news/ns0000031/
http://imdb.com/news/ns0000128/
http://imdb.com/news/ns0000136/
http://imdb.com/news/ns0000141/
http://imdb.com/news/ns0000195/
http://imdb.com/news/ns0000236/
http://imdb.com/news/ns0000344/
http://imdb.com/news/ns0000345/
http://imdb.com/news/ns0004913/
http://imdb.com/news/top
http://imdb.com/news/tv
http://imdb.com/nowplaying/
http://imdb.com/photo_galleries/new_photos/2010/
http://imdb.com/poll
http://imdb.com/privacy
http://imdb.com/register/login
http://imdb.com/register/?why=footer
http://imdb.com/register/?why=mymovies_footer
http://imdb.com/register/?why=personalize
http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb
http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/
http://imdb.com/search
http://imdb.com/search/
http://imdb.com/search/name?birth_monthday=02-12
http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude
http://imdb.com/sections/dvd/
http://imdb.com/sections/horror/
http://imdb.com/sections/indie/
http://imdb.com/sections/tv/
http://imdb.com/showtimes/
http://imdb.com/tiger_redirect?FT_LIC&licensing/
http://imdb.com/title/tt0078748/
http://imdb.com/title/tt0279600/
http://imdb.com/title/tt0377981/
http://imdb.com/title/tt0881320/
http://imdb.com/title/tt0990407/
http://imdb.com/title/tt1034389/
http://imdb.com/title/tt1265990/
http://imdb.com/title/tt1401152/
http://imdb.com/title/tt1411238/
http://imdb.com/title/tt1411238/trivia
http://imdb.com/title/tt1446714/
http://imdb.com/title/tt1452628/
http://imdb.com/title/tt1464174/
http://imdb.com/title/tt1464540/
http://imdb.com/title/tt1477837/
http://imdb.com/title/tt1502404/
http://imdb.com/title/tt1504320/
http://imdb.com/title/tt1563069/
http://imdb.com/title/tt1564367/
http://imdb.com/title/tt1702443/
http://imdb.com/tvgrid/
http://m.imdb.com
http://pro.imdb.com/r/IMDbTabNB/
http://resume.imdb.com
http://resume.imdb.com/
https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627
http://twitter.com/imdb
http://wireless.amazon.com
http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx
http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat
http://www.audible.com
http://www.boxofficemojo.com
http://www.dpreview.com
http://www.endless.com
http://www.fabric.com
http://www.imdb.com/board/bd0000089/threads/
http://www.imdb.com/licensing/
http://www.imdb.com/media/rm1037220352/rg261921280
http://www.imdb.com/media/rm2695346688/tt1449283
http://www.imdb.com/media/rm3987585536/tt1092026
http://www.imdb.com/name/nm0000092/
http://www.imdb.com/photo_galleries/new_photos/2010/index
http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude
http://www.imdb.com/sections/indie/
http://www.imdb.com/title/tt0079470/
http://www.imdb.com/title/tt0079470/quotes?qt0471997
http://www.imdb.com/title/tt1542852/
http://www.imdb.com/title/tt1606392/
http://www.imdb.de
http://www.imdb.es
http://www.imdb.fr
http://www.imdb.it
http://www.imdb.pt
http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php
http://www.movingimagesource.us/articles/un-tv-20110210
http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist
http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html
http://www.shopbop.com/welcome
http://www.smallparts.com
http://www.twinpeaks20.com/details/
http://www.twitter.com/imdb
http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103
http://www.warehousedeals.com
http://www.withoutabox.com
http://www.zappos.com
To extract the values of the href attribute of anchor tags you may also use xmlstarlet, after converting HTML to XHTML using HTML Tidy (Mac OS X version released on 25 March 2009):
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
On Mac OS X you may also use the command line tool linkscraper:
linkscraper http://www.imdb.com
see: http://codesnippets.joyent.com/posts/show/10772
Related
I would like to extract and remove some text from the output below.
[root@test]# du -k ./[a-zA-Z0-9] --max-depth=1 | sort -hr
Before
7789696 ./b/bklee
946792 ./a
796588 ./b/bluecyn
477860 ./b/bborikun
473652 ./b/bluechiper
220780 ./a/ara316
144244 ./a/aceload
131088 ./b/belivart
79108 ./a/athlon85
78644 ./b/beschur512
66264 ./b/bogdanov
52460 ./A
After
796588 bluecyn
477860 bborikun
473652 bluechiper
220780 ara316
144244 aceload
131088 belivart
79108 athlon85
78644 beschur512
66264 bogdanov
What I want is to remove the repetitive "./a/" (or "./b/") prefix and to drop the lines that contain only a bare directory, like "./a".
I am trying to figure it out, but since I am a beginner at awk and sed, I need some help.
Thanks!
Like this, using GNU sed (the first expression deletes the lines that end in a bare ./x directory; the second strips the leading ./x/ prefix):
sed -E '/\s\.\/\w$/d; s!\./\w/?!!' file
7789696 bklee
796588 bluecyn
477860 bborikun
473652 bluechiper
220780 ara316
144244 aceload
131088 belivart
79108 athlon85
78644 beschur512
66264 bogdanov
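If awk reads more naturally to you, a rough equivalent sketch: it keeps only the rows whose path has a subdirectory component and prints the size together with the last component.
du -k ./[a-zA-Z0-9] --max-depth=1 | sort -hr | awk 'split($2, a, "/") == 3 { print $1, a[3] }'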
du -k ./[a-zA-Z0-9] --max-depth=1 | sort -hr | sed -e 's,\./[a-z]/,,; /\.\/[A-Za-z]/d'
7789696 bklee
796588 bluecyn
477860 bborikun
473652 bluechiper
220780 ara316
144244 aceload
131088 belivart
79108 athlon85
78644 beschur512
66264 bogdanov
I've got a directory with a bunch of files. Instead of describing the filenames and extensions, I'll just show you what is in the directory:
P01_1.atag P03_3.tgt P05_6.src P08_3.atag P10_5.tgt
P01_1.src P03_4.atag P05_6.tgt P08_3.src P10_6.atag
P01_1.tgt P03_4.src P06_1.atag P08_3.tgt P10_6.src
P01_2.atag P03_4.tgt P06_1.src P08_4.atag P10_6.tgt
P01_2.src P03_5.atag P06_1.tgt P08_4.src P11_1.atag
P01_2.tgt P03_5.src P06_2.atag P08_4.tgt P11_1.src
P01_3.atag P03_5.tgt P06_2.src P08_5.atag P11_1.tgt
P01_3.src P03_6.atag P06_2.tgt P08_5.src P11_2.atag
P01_3.tgt P03_6.src P06_3.atag P08_5.tgt P11_2.src
P01_4.atag P03_6.tgt P06_3.src P08_6.atag P11_2.tgt
P01_4.src P04_1.atag P06_3.tgt P08_6.src P11_3.atag
P01_4.tgt P04_1.src P06_4.atag P08_6.tgt P11_3.src
P01_5.atag P04_1.tgt P06_4.src P09_1.atag P11_3.tgt
P01_5.src P04_2.atag P06_4.tgt P09_1.src P11_4.atag
P01_5.tgt P04_2.src P06_5.atag P09_1.tgt P11_4.src
P01_6.atag P04_2.tgt P06_5.src P09_2.atag P11_4.tgt
P01_6.src P04_3.atag P06_5.tgt P09_2.src P11_5.atag
P01_6.tgt P04_3.src P06_6.atag P09_2.tgt P11_5.src
P02_1.atag P04_3.tgt P06_6.src P09_3.atag P11_5.tgt
P02_1.src P04_4.atag P06_6.tgt P09_3.src P11_6.atag
P02_1.tgt P04_4.src P07_1.atag P09_3.tgt P11_6.src
P02_2.atag P04_4.tgt P07_1.src P09_4.atag P11_6.tgt
P02_2.src P04_5.atag P07_1.tgt P09_4.src P12_1.atag
P02_2.tgt P04_5.src P07_2.atag P09_4.tgt P12_1.src
P02_3.atag P04_5.tgt P07_2.src P09_5.atag P12_1.tgt
P02_3.src P04_6.atag P07_2.tgt P09_5.src P12_2.atag
P02_3.tgt P04_6.src P07_3.atag P09_5.tgt P12_2.src
P02_4.atag P04_6.tgt P07_3.src P09_6.atag P12_2.tgt
P02_4.src P05_1.atag P07_3.tgt P09_6.src P12_3.atag
P02_4.tgt P05_1.src P07_4.atag P09_6.tgt P12_3.src
P02_5.atag P05_1.tgt P07_4.src P10_1.atag P12_3.tgt
P02_5.src P05_2.atag P07_4.tgt P10_1.src P12_4.atag
P02_5.tgt P05_2.src P07_5.atag P10_1.tgt P12_4.src
P02_6.atag P05_2.tgt P07_5.src P10_2.atag P12_4.tgt
P02_6.src P05_3.atag P07_5.tgt P10_2.src P12_5.atag
P02_6.tgt P05_3.src P07_6.atag P10_2.tgt P12_5.src
P03_1.atag P05_3.tgt P07_6.src P10_3.atag P12_5.tgt
P03_1.src P05_4.atag P07_6.tgt P10_3.src P12_6.atag
P03_1.tgt P05_4.src P08_1.atag P10_3.tgt P12_6.src
P03_2.atag P05_4.tgt P08_1.src P10_4.atag P12_6.tgt
P03_2.src P05_5.atag P08_1.tgt P10_4.src
P03_2.tgt P05_5.src P08_2.atag P10_4.tgt
P03_3.atag P05_5.tgt P08_2.src P10_5.atag
P03_3.src P05_6.atag P08_2.tgt P10_5.src
I have a file that is just outside of this directory that I need to copy to all of the files that end with "_1.src" inside the directory.
I'm working with unix in the Terminal app, so I tried writing this for loop, but it rejected my regular expression:
for .*1.src in ./
> do
> cp ../1.src
> done
I've only written regular expressions in Python before and have minimal experience, but I was under the impression that .* would match any combination of characters. However, I got the following error message:
-bash: `.*1.src': not a valid identifier
I then tried the same for loop with the following regular expression:
^[a-zA-Z0-9_]*1.src$
But I got the same error message:
-bash: `^[a-zA-Z0-9_]*1.src$': not a valid identifier
I tried the same regular expression with and without quotation marks, but it always gives the same 'not a valid identifier' error message.
Tested on Bash 4.4.12, the following is possible:
$ for i in ./*_1.src; do echo "$i" ; done
This will echo every file ending with _1.src to the screen, so moving them is possible as well:
$ mkdir tmp
$ for i in ./*_1.src; do mv "$i" tmp/.; done
I've tested with the following data:
$ touch P{1,2}{0,1,2}_{0..6}.{src,tgt,atag}
$ ls
P10_0.atag P10_5.src P11_3.tgt P12_2.atag P20_0.src P20_5.tgt P21_4.atag P22_2.src
P10_0.src P10_5.tgt P11_4.atag P12_2.src P20_0.tgt P20_6.atag P21_4.src P22_2.tgt
P10_0.tgt P10_6.atag P11_4.src P12_2.tgt P20_1.atag P20_6.src P21_4.tgt P22_3.atag
P10_1.atag P10_6.src P11_4.tgt P12_3.atag P20_1.src P20_6.tgt P21_5.atag P22_3.src
P10_1.src P10_6.tgt P11_5.atag P12_3.src P20_1.tgt P21_0.atag P21_5.src P22_3.tgt
P10_1.tgt P11_0.atag P11_5.src P12_3.tgt P20_2.atag P21_0.src P21_5.tgt P22_4.atag
P10_2.atag P11_0.src P11_5.tgt P12_4.atag P20_2.src P21_0.tgt P21_6.atag P22_4.src
P10_2.src P11_0.tgt P11_6.atag P12_4.src P20_2.tgt P21_1.atag P21_6.src P22_4.tgt
P10_2.tgt P11_1.atag P11_6.src P12_4.tgt P20_3.atag P21_1.src P21_6.tgt P22_5.atag
P10_3.atag P11_1.src P11_6.tgt P12_5.atag P20_3.src P21_1.tgt P22_0.atag P22_5.src
P10_3.src P11_1.tgt P12_0.atag P12_5.src P20_3.tgt P21_2.atag P22_0.src P22_5.tgt
P10_3.tgt P11_2.atag P12_0.src P12_5.tgt P20_4.atag P21_2.src P22_0.tgt P22_6.atag
P10_4.atag P11_2.src P12_0.tgt P12_6.atag P20_4.src P21_2.tgt P22_1.atag P22_6.src
P10_4.src P11_2.tgt P12_1.atag P12_6.src P20_4.tgt P21_3.atag P22_1.src P22_6.tgt
P10_4.tgt P11_3.atag P12_1.src P12_6.tgt P20_5.atag P21_3.src P22_1.tgt P10_5.atag
P11_3.src P12_1.tgt P20_0.atag P20_5.src P21_3.tgt P22_2.atag
Apparently, my previous answer didn't work. But this seems to:
$ for x in `echo ./P[01][012]_1.src`; do echo "$x"; done
./P01_1.src
./P02_1.src
So, when you run this echo alone, this pattern gets expanded into many names:
$ echo ./P[01][012]_1.src # note that the 'regex' is not enclosed in quotes
./P01_1.src ./P02_1.src
And then you can iterate over these names in a loop.
BTW, as noted in the comments, you don't even need that echo, so you can plug the pattern right into the loop:
for x in ./P[01][012]_1.src; do echo "$x"; done
Please correct me if your goal is something other than
"overwrite many existing files sharing a common suffix with the contents of a single file"
find /path/to/dest_dir -type f -name "*_1.src" | xargs -n1 cp /path/to/source_file
Note that without the -maxdepth 1 option, find will recurse through your destination directory.
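If recursion is not what you want, a sketch with that option added (the paths are placeholders):
find /path/to/dest_dir -maxdepth 1 -type f -name "*_1.src" | xargs -n1 cp /path/to/source_file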
Thanks to everyone; this is what ended up working:
for x in `echo ./P[0-9]*_1.src`
> do
> cp ../1.src "$x"
> done
This loop allowed me to copy the contents of the one file to all of the files in the subdirectory that ended with "_1.src".
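As noted in the comments above, the backticked echo is unnecessary; the same loop works with the glob used directly:
for x in ./P[0-9]*_1.src; do cp ../1.src "$x"; done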
I am new to bash and am having trouble understanding how to get this done: check the domains of all the "To:" field email addresses and collect the unique domains into a variable, to compare them against the From domain.
I get the "from address" domain by using
grep -m 1 "From: " filename | cut -f 2 -d '#' | cut -d ">" -f 1
when reading a mail stored in file filename.
For "to address" domain there can be multiple To: addresses and having multiple domains. I am not sure how to get unique domains from "to address field".
Example to address line will be like this:
To: user@domain.com, user2@domain.com,
User Name <sample@domaintest.com>, test@domainname.com
grep -m 1 "^To: " filename | cut -f 2 -d '#' | cut -d ">" -f 1
but emails come in different formats, so I am not sure whether grep is the right tool or whether I should reach for awk or something else.
I need to get the unique domain list from the "To:" field address(es) into a variable in a bash script.
Desired output for above example:
domain.com,domaintest.com,domainname.com
If you are hellbent on doing this with line-oriented utilities, there is a utility formail in the Procmail distribution which can normalize things for you somewhat.
bash$ formail -czxTo: <<\==test==
> From: me <sender@example.com>
> To: you <first@example.org>,
> them <other@example.net>
> Subject: quick demo
>
> Very quick, innit.
> ==test==
first@example.org, other@example.net
So with that you have input which you can actually pass to grep or Awk ... or sed.
todoms=$(formail -czxTo: <message | tr ',' '\n' | sed 's/.*@//')
The From: address will not be normalized by formail -czxFrom: but you can use a neat trick: make formail generate a reply back to the From: address, and then extract the To: header from that.
fromdom=$(formail -rtzcxTo: <message | sed 's/.*@//')
In some more detail, -r says to create a new reply to whoever sent you the message, and then we do -zcxTo: on that.
(The -t option may or may not do what you want. In this case, I would perhaps omit it. http://www.iki.fi/era/procmail/formail.html has (vague) documentation for what it does; see also the section just before http://www.iki.fi/era/procmail/mini-faq.html#group-writable and sorry for the clumsy link -- there doesn't seem to be a good page-internal anchor to link to.)
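Putting the pieces together, a rough sketch of the comparison the question asks for (untested; assumes the message sits in a file named message, and omits -t per the caveat above):
todoms=$(formail -czxTo: <message | tr ',' '\n' | sed 's/.*@//' | sort -u)
fromdom=$(formail -rzcxTo: <message | sed 's/.*@//')
for dom in $todoms; do
    [ "$dom" = "$fromdom" ] && echo "To: domain $dom matches the From: domain"
done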
Email address normalization is tricky because there are so many variants to choose from.
From: Elvis Parsley <king@graceland.example.com>
From: king@graceland.example.com
From: "Parsley, Elvis" <king@graceland.example.com> (kill me, I have to use Outlook)
From: "quoted@string" <king@graceland.example.com> (wait, he is already dead)
To: This could fold <recipient@example.net>,
over multiple lines <another@example.org>
I would turn to a more capable language with proper support for parsing all of these formats. My choice would be Python, though you could probably also pull this off in a few lines of Ruby or Perl.
The email library was revamped in Python 3.6, so this assumes you have at least that version. The email.headerregistry class, which is new in 3.6, is particularly convenient here.
#!/usr/bin/env python3
from email.policy import default
from email import message_from_binary_file
import sys

# With no arguments, read the message from standard input
if len(sys.argv) == 1:
    sys.argv.append('-')

for arg in sys.argv[1:]:
    if arg == '-':
        handle = sys.stdin.buffer  # the parser wants bytes, not text
    else:
        handle = open(arg, 'rb')
    message = message_from_binary_file(handle, policy=default)
    from_dom = message.get('From').address.domain
    to_doms = set()
    for addr in message.get('To').addresses:
        dom = addr.domain
        if dom == from_dom:
            continue
        to_doms.add(dom)
    print(','.join([from_dom] + list(to_doms)))
    if arg != '-':
        handle.close()
This simply produces a comma-separated list of domain names; you might want to do the rest of the processing in Python too instead, or change this so that it prints something in a slightly different format.
You'd save this in a convenient place (say, /usr/local/bin/fromto) and mark it as executable (chmod 755 /usr/local/bin/fromto). Now you can call this from the shell like any other utility like grep.
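For illustration, a hypothetical run against the formail test message from earlier in this thread (the ordering of the To: domains may vary, since they come out of a set):
$ fromto message
example.com,example.org,example.net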
My VCS has these tags
0.0.3.156-alpha+2
0.0.3.154
0.0.3.153
build-.139
build-.140
build-.142
build-0.0.1.28
build-0.0.1.29
build-0.0.1.30
build-0.0.1.32
I want to git describe --match "<regex>" to get the latest tag of the form number.number.number.number (so it's 0.0.3.154 in this case)
I have tried git describe --match "[0-9]*.[0-9]*.[0-9]*.[0-9]*$" but it doesn't result in anything, and neither do these patterns:
"[0-9]*.[0-9]*.[0-9]*.[0-9]+"
"[0-9]*.[0-9]*.[0-9]*.[0-9]{1,}"
I need to get the latest tag in order to bump the version for the next release, so I'm thinking of doing this automatically. Please let me know if I missed anything.
Thanks
UPDATE:
In my build.gradle file I have a function to get the tag like this (following @Marc's reply):
version getVersionFromTag()
def getVersionFromTag() {
def stdout = new ByteArrayOutputStream()
exec {
commandLine 'git', 'tag', '|' , 'grep', '^\([0-9]\+\.\?\)\+$', '|', 'sort' , '-nr', '|', 'head', '-1'
standardOutput = stdout
}
return stdout.toString().trim()
}
Here it gives the error Unexpected Char '\' for the regex above, so I removed the backslashes so that it becomes '^([0-9]+.?)+$'. Then it runs fine, but my final artifact does not have the version appended to its name (i.e. helloword.jar instead of helloword-0.0.3.154.jar).
=> My question is: how do I put @Marc's suggested command into the Gradle function correctly?
For testing I've put your tag list in a file. (Note that git describe --match takes glob(7) patterns, not regexes, which is why your attempts match nothing.) This will do:
cat file | grep '^\([0-9]\+\.\?\)\+$' | sort -nr | head -1
0.0.3.154
Suppose you've created some irregularly formatted tags and you want to use those as well (like your build- tags) for finding the highest tag:
sed -E 's/^[^0-9.]*//' file | grep '^\([0-9]\+\.\?\)\+$' | sort -nr | head -1
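Regarding the update: Gradle's exec runs the command directly rather than through a shell, so the pipe characters are passed to git as literal arguments and the pipeline never runs. A hedged sketch of the shell side; from Gradle you would hand this whole string to a shell, e.g. commandLine 'bash', '-c', '<the pipeline below>':
git tag | grep '^\([0-9]\+\.\?\)\+$' | sort -nr | head -1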
I would like to download the latest source code of some software (WRF) from a URL and automate the installation process thereafter. A sample URL is given below:
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
In the above URL, the version number may change from time to time as the developers release new versions. I would like the main script to download the latest available version. I tried the following:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV'
With the above code, I can pull all available versions of the software. The output of the above code is below:
http://www2.mmm.ucar.edu/wrf/src/WRFV2.0.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Var-do-not-use.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3_OVERLAY_3.0.1.1.TAR.gz
However, I am unable to go further and filter out only the latest version from the list.
Usually I recommend Perl tools for processing HTML pages, but because this is a directory-index listing, it can (probably) be done with bash tools like grep, sed and such.
The following code is divided into several smaller bash functions, for easy changes.
#!/bin/bash
#getdata - should output html source of the page
getdata() {
#use wget with output to stdout or curl or fetch
curl -s "http://www2.mmm.ucar.edu/wrf/src/"
#cat index.html
}
#filter_rows - get the filename and the date columns
filter_rows() {
sed -n 's:<tr><td.*href="\([^"]*\)">.*>\([0-9].*\)</td>.*</td>.*</td></tr>:\2#\1:p' | grep "${1:-.}"
}
#sort_by_date - probably don't need comment... sorts the lines by date... ;)
sort_by_date() {
while IFS=# read -r date file
do
echo "$(date --date="$date" +%s)#$file"
done | sort -gr
}
#MAIN
file=$(getdata | filter_rows WRFV | sort_by_date | head -1 | cut -d# -f2)
echo "You want download: $file"
prints
You want download: WRFV3-Chem-3.6.1.TAR.gz
What about adding a numeric sort and taking the top line:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV[0-9]*[0-9]\.[0-9]' | sort -r -n | head -1
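To actually fetch the winner, a sketch that feeds the top URL to wget (assumes GNU sort; -V does version-aware ordering, which handles multi-part version numbers more reliably than -n):
url=$(curl -s "http://www2.mmm.ucar.edu/wrf/src/" | grep -o 'http:[^"]*\.gz' | grep 'WRFV[0-9.]*\.TAR\.gz$' | sort -V | tail -1)
wget "$url"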