I can query the information about an RPM package with
rpm -qi <rpm-package-name>
Example result of a query:
tfaa004:/sm/bin # rpm -qi expect-5.45-16.1.3.i586
Name : expect
Version : 5.45
Release : 16.1.3
Architecture: i586
Install Date: Di 27 Jun 2017 15:31:08 CEST
Group : Development/Languages/Tcl
Size : 674166
License : SUSE-Public-Domain
Signature : RSA/SHA256, Do 25 Sep 2014 11:42:26 CEST, Key ID b88b2fd43dbdc284
Source RPM : expect-5.45-16.1.3.src.rpm
Build Date : Do 25 Sep 2014 11:42:16 CEST
Build Host : cloud120
Relocations : (not relocatable)
Packager : http://bugs.opensuse.org
Vendor : openSUSE
URL : http://expect.nist.gov
Summary : A Tool for Automating Interactive Programs
Description :
Expect is a tool primarily for automating interactive applications,
such as telnet, ftp, passwd, fsck, rlogin, tip, and more. Expect
really makes this stuff trivial. Expect is also useful for testing
these applications. It is described in many books, articles, papers,
and FAQs. There is an entire book on it available from O'Reilly.
Distribution: openSUSE 13.2
But I only want to query the Description. Is that possible?
The reason is that I want to process this information (the Description) in a C++ program (I do this with popen()).
Maybe something like this:
rpm -qi -Description expect-5.45-16.1.3.i586
This is the correct solution:
rpm -q --queryformat '%{DESCRIPTION}\n' expect-5.45-16.1.3.i586
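If you want to capture just that field in a shell variable first (a small sketch along the same lines; "expect" is the package from the question), you can do:

desc=$(rpm -q --queryformat '%{DESCRIPTION}' expect)
echo "$desc"

rpm --querytags lists all the tag names that can be used in a --queryformat string, and the same command string can be handed to popen() unchanged.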
[EDIT for openSUSE rpm output]:
rpm -qi package_name | sed '1,/Description/d;/Distribution/,$d'
This will print only the lines between "Description" and "Distribution".
[The below cmds work for RHEL distros]
I do not believe the "rpm" utility has a flag to only print out the "Description" field, but it's as simple as using a pipe :)
You could do:
rpm -qi openssh-server-5.3p1-104.el6.x86_64 | awk '/Description/, 0'
Which will print every line after the pattern "Description" is found.
Or, if you're more inclined to use "grep":
rpm -qi openssh-server-5.3p1-104.el6.x86_64 | grep -A20 'Description'
the "-A n" flag tells grep to print n lines After the pattern is found.
***Edit: you may also use "sed":
rpm -qi openssh-server-5.3p1-104.el6.x86_64 | sed -e '1,/Description/ d'
Hope this helps.
When following their documentation and running ./build_packages --board=lakitu, I get the following error.
Using Ubuntu 16.04. It looks like a sed syntax error? Am I missing a variable? Does sed work differently on different operating systems, or is something wrong with their documentation/scripts? I followed their documentation to a T and didn't add or configure anything. Waiting for a successful run first.
Looking at similar questions, they all appear to be syntax errors...
* Package: sys-boot/shim-14.0.20180308-r4
* Repository: lakitu
* USE: abi_x86_64 amd64 elibc_glibc kernel_linux userland_GNU
* FEATURES: network-sandbox sandbox splitdebug userpriv usersandbox
* Running stacked hooks for pre_pkg_setup
* sysroot_build_bin_dir ... [ ok ]
* Running stacked hooks for post_pkg_setup
* python_eclass_hack ... [ ok ]
* Running stacked hooks for pre_src_unpack
* python_multilib_setup ... [ ok ]
>>> Unpacking source...
>>> Unpacking shim-14.0.20180308.tar.gz to /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work
>>> Source unpacked in /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work
* Running stacked hooks for post_src_unpack
* asan_init ... [ ok ]
>>> Preparing source in /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a ...
>>> Source prepared.
>>> Configuring source in /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a ...
>>> Source configured.
>>> Compiling source in /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a ...
make -j8 ARCH=x86_64 CROSS_COMPILE=x86_64-cros-linux-gnu- EFI_INCLUDE=/build/lakitu//usr/include/efi EFI_PATH=/build/lakitu//usr/lib64 ARCH_LDFLAGS=--no-experimental-use-relr COMMITID=79cdb2a215de2ace7d1bf0a294165a04b726c70a DEFAULT_LOADER=\\\\grub-lakitu.efi shimx64.efi
sed -e "s,##VERSION##,14," \
-e "s,##UNAME##,Linux x86_64 Intel Xeon E312xx (Sandy Bridge, IBRS update) GenuineIntel GNU/Linux," \
-e "s,##COMMIT##,79cdb2a215de2ace7d1bf0a294165a04b726c70a," \
< /build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a/version.c.in > version.c
sed: -e expression #2, char 60: unknown option to `s'
make: *** [Makefile:183: version.c] Error 1
* ERROR: sys-boot/shim-14.0.20180308-r4::lakitu failed (compile phase):
* emake failed
*
* If you need support, post the output of `emerge --info '=sys-boot/shim-14.0.20180308-r4::lakitu'`,
* the complete build log and the output of `emerge -pqv '=sys-boot/shim-14.0.20180308-r4::lakitu'`.
* The complete build log is located at '/build/lakitu/tmp/portage/logs/sys-boot:shim-14.0.20180308-r4:20190531-002217.log'.
* For convenience, a symlink to the build log is located at '/build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/temp/build.log'.
* The ebuild environment file is located at '/build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/temp/environment'.
* Working directory: '/build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a'
* S: '/build/lakitu/tmp/portage/sys-boot/shim-14.0.20180308-r4/work/shim-79cdb2a215de2ace7d1bf0a294165a04b726c70a'
There's a "," after "Bridge", and "," is also the delimiter of that s command, so sed treats everything after the extra comma as flags and fails:
-e "s,##UNAME##,Linux x86_64 Intel Xeon E312xx (Sandy Bridge, IBRS update) GenuineIntel GNU/Linux," \
Change the delimiter to a character that appears in neither the pattern nor the replacement (note that "#" won't do, because the pattern itself contains "##"), for example:
-e "s|##UNAME##|Linux x86_64 Intel Xeon E312xx (Sandy Bridge, IBRS update) GenuineIntel GNU/Linux|" \
I'm trying to match a given string against a package version in a /bin/sh script:
if test "x$version" = "x"; then
version="latest";
info "Version parameter not defined, assuming latest";
else
info "Version parameter defined: $version";
info "Matching version to package version"
case "$version" in
[^4.0.]*)
$package_version='1.0.1'
;;
[^4.1.]*)
$package_version='1.1.1'
;;
[^4.2.]*)
$package_version='1.2.6'
;;
*)
critical "Unable to match requested version to package version"
exit 1
;;
esac
fi
However, when I run it I get an error:
23:38:47 +0000 INFO: Version parameter defined: 4.0.0
23:38:47 +0000 INFO: Matching Puppet version to puppet-agent package version (See http://docs.puppetlabs.com/puppet/latest/reference/about_agent.html for more details)
23:38:47 +0000 CRIT: Unable to match requested puppet version to puppet-agent version - Check http://docs.puppetlabs.com/puppet/latest/reference/about_agent.html
23:38:47 +0000 CRIT: Please file a bug report at https://github.com/petems/puppet-install-shell/
23:38:47 +0000 CRIT:
23:38:47 +0000 CRIT: Version: 4.0.0
I'm using the same regex that worked for me in another part of the script, and it seems to work there:
if test "$version" = 'latest'; then
apt-get install -y puppet-common puppet
else
case "$version" in
[^2.7.]*)
info "2.7.* Puppet deb package tied to Facter < 2.0.0, specifying Facter 1.7.4"
apt-get install -y puppet-common=$version-1puppetlabs1 puppet=$version-1puppetlabs1 facter=1.7.4-1puppetlabs1 --force-yes
;;
*)
apt-get install -y puppet-common=$version-1puppetlabs1 puppet=$version-1puppetlabs1 --force-yes
;;
esac
fi
What am I missing?
Full version of the script is here: https://github.com/petems/puppet-install-shell/blob/fix_puppet_agent_install/install_puppet_agent.sh
case ... esac in a POSIX shell script uses (glob-style) patterns, not regular expressions (while the two are distantly related, there are fundamental differences).
To get true regex matching in a sh script, you'd have to use expr with :, though it's probably not needed here.
To test for a prefix match, use <prefix>* in a case branch - case branches are always matched against the entire argument - no need for anchoring (which patterns don't support).
As an aside, what you're attempting would not even work for prefix matching as a regex. E.g., [^4.0.] is the same as [^.04] - i.e., a negated character class: it matches one character if it is neither . nor 0 nor 4.
When assigning to a variable in a POSIX shell script, do not use $.
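A quick illustration of both points (a sketch; the version string is just an example value):

version=4.0.0
case "$version" in
    [^4.0.]*) echo "negated-class pattern matched" ;;   # never fires for 4.0.0: its first character is 4
    4.0.*)    echo "prefix glob matched" ;;              # this branch fires
    *)        echo "no match" ;;
esac
#$package_version='1.0.1'    # wrong: the shell would try to run "='1.0.1'" as a command
package_version='1.0.1'      # right: no $ on the left-hand side of an assignment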
To put it all together:
#!/bin/sh
if [ "$version" = "" ]; then
version="latest";
info "Version parameter not defined, assuming latest"
else
info "Version parameter defined: $version";
info "Matching version to package version"
case "$version" in
4.0.*)
package_version='1.0.1'
;;
4.1.*)
package_version='1.1.1'
;;
4.2.*)
package_version='1.2.6'
;;
*)
critical "Unable to match requested version to package version"
exit 1
;;
esac
fi
I would like to download the latest source code of a piece of software (WRF) from a URL and automate the installation process thereafter. A sample URL is given below:
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
In the above URL, the version number may change from time to time as the developers release new versions. Now I would like to download the latest available version from the main script. I tried the following:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV'
With the above code, I can pull all the available versions of the software. The output of the above code is below:
http://www2.mmm.ucar.edu/wrf/src/WRFV2.0.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV2.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.0.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Chem-3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3-Var-do-not-use.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.0.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.2.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.4.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.5.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.1.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.6.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3.TAR.gz
http://www2.mmm.ucar.edu/wrf/src/WRFV3_OVERLAY_3.0.1.1.TAR.gz
However, I am unable to go further and filter out only the latest version from the list.
Usually, for processing HTML pages, I recommend Perl tools, but because this is a directory-index listing, it can (probably) be done with bash tools like grep, sed, and such...
The following code is divided into several smaller bash functions, for easy changes:
#!/bin/bash
#getdata - should output the html source of the page
getdata() {
    #use wget with output to stdout, or curl, or fetch
    curl -s "http://www2.mmm.ucar.edu/wrf/src/"
    #cat index.html
}
#filter_rows - get the filename and the date columns
filter_rows() {
    sed -n 's:<tr><td.*href="\([^"]*\)">.*>\([0-9].*\)</td>.*</td>.*</td></tr>:\2#\1:p' | grep "${1:-.}"
}
#sort_by_date - probably doesn't need a comment... sorts the lines by date... ;)
sort_by_date() {
    while IFS=# read -r date file
    do
        echo "$(date --date="$date" +%s)#$file"
    done | sort -gr
}
#MAIN
file=$(getdata | filter_rows WRFV | sort_by_date | head -1 | cut -d# -f2)
echo "You want download: $file"
prints
You want download: WRFV3-Chem-3.6.1.TAR.gz
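To actually fetch the file picked above, you could then add (the base URL is the one from the question):

wget "http://www2.mmm.ucar.edu/wrf/src/$file"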
What about adding a numeric sort and taking the top line:
wget -k -l 0 "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; cat index.html | grep -o 'http:[^"]*.gz' | grep 'WRFV[0-9]*[0-9]\.[0-9]' | sort -r -n | head -1
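If your sort supports -V (GNU coreutils does), version sort handles dotted version numbers more reliably than a plain numeric sort. A possible variant, restricted to the plain WRFVx.y.z tarballs (a sketch along the same lines as above):

wget -q "http://www2.mmm.ucar.edu/wrf/src/" -O index.html ; grep -o 'http:[^"]*\.gz' index.html | grep 'WRFV[0-9.]*\.TAR\.gz$' | sort -V | tail -1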
I have a string in a bash script that contains a line of a log entry such as this:
Oct 24 12:37:45 10.224.0.2/10.224.0.2 14671: Oct 24 2012 12:37:44.583 BST: %SEC_LOGIN-4-LOGIN_FAILED: Login failed [user: root] [Source: 10.224.0.58] [localport: 22] [Reason: Login Authentication Failed] at 12:37:44 BST Wed Oct 24 2012
To clarify: the first IP listed there, "10.224.0.2", is the machine that submitted this log entry of a failed login attempt. Someone tried to log in, and failed, from the machine at the 2nd IP address in the log entry, "10.224.0.58".
I wish to replace the first occurrence of the IP address "10.224.0.2" with the host name of that machine; as you can see, presently it is "IPADDRESS/IPADDRESS", which is useless, having the same info twice. So here, I would like to grep (or similar) out the first IP and then pass it to something like the host command to get the reverse host name and replace it in the log output.
I would like to repeat this for the 2nd IP, "10.224.0.58". I would like to find this IP and also replace it with the host name.
It's not just those two specific IP addresses, though, but any IP address. So I want to search for four numbers of one to three digits each, separated by three full stops '.'.
Is regex the way forward here, or is that overcomplicating the issue?
Many thanks.
Replace a fixed IP address with a host name:
$ cat log | sed -r 's/10\.224\.0\.2/example.com/g'
Replace all IP addresses with a host name:
$ cat log | sed -r 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/example.com/g'
If you want to call an external program, it's easy to do that using Perl (just replace host with your lookup tool):
$ cat log | perl -pe 's/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/`host $1`/ge'
Hopefully this is enough to get you started.
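For example, a rough bash sketch of the lookup-and-replace loop (it assumes getent hosts can reverse-resolve the addresses on your system; swap in host or dig -x if you prefer):

line='Oct 24 12:37:45 10.224.0.2/10.224.0.2 14671: ... [Source: 10.224.0.58] ...'
for ip in $(grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' <<<"$line" | sort -u); do
    name=$(getent hosts "$ip" | awk '{print $2; exit}')   # second column is the resolved name
    [ -n "$name" ] && line=${line//"$ip"/$name}           # leave the IP as-is if there is no PTR record
done
echo "$line"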
There are various ways to find the IP addresses; here's one. Just replace "printf '<<<%s>>>' " with "host" or whatever your command name is in this GNU awk script:
$ cat tst.awk
{
subIp = gensub(/\/.*$/,"","",$4)
srcIp = gensub(/.*\[Source: ([^]]+)\].*/,"\\1","")
"printf '<<<%s>>>' " subIp | getline subName
"printf '<<<%s>>>' " srcIp | getline srcName
gsub(subIp,subName)
gsub(srcIp,srcName)
print
}
$
$ gawk -f tst.awk file
Oct 24 12:37:45 <<<10.224.0.2>>>/<<<10.224.0.2>>> 14671: Oct 24 2012 12:37:44.583 BST: %SEC_LOGIN-4-LOGIN_FAILED: Login failed [user: root] [Source: <<<10.224.0.58>>>] [localport: 22] [Reason: Login Authentication Failed] at 12:37:44 BST Wed Oct 24 2012
I googled this one-line command together, but was unable to pass the found IP address to the ssh command:
sed -n 's/\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}/\nip&\n/gp' test | grep ip | sed 's/ip//' | sort | uniq
the "test" is the file the sed command is searching for for the pattern
I want to extract all anchor tags from HTML pages. I am doing this on Linux.
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
but that is not working as expected, since the result contains unwanted output such as:
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
I want just
<a href >...</a>
Any good way?
If you have a -P option in your grep so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal quantifier like *? helps. Also, you’re getting the whole input line, not just the match itself; if you have a -o option to grep, it will list only the part that matches.
grep -Po '<a[^<>]*>'
If your grep doesn’t have those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
Which now crosses line boundaries.
To do a real parse of HTML requires regexes substantially more complex than you are apt to wish to enter on the command line. Here's one example, and here's another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.
Why can't you use options like --dump?
lynx --dump --listonly http://www.imdb.com
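If you only want the bare URLs, the numbered "References" list that lynx prints can be trimmed further, for example (a sketch, assuming the usual "  1. http://..." layout of lynx's link list):

lynx --dump --listonly http://www.imdb.com | awk '$2 ~ /^https?:/ {print $2}' | sort -u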
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But please read the answer that MAK linked to.
Here are some examples of why you should not use regex to parse HTML.
To extract values of 'href' attribute of anchor tags, run:
$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print "\n".join(root.xpath("//a/@href"))' http://imdb.com | sort -u
Install lxml module if needed: $ sudo apt-get install python-lxml.
Output
http://askville.amazon.com
http://idfilm.blogspot.com/2011/02/another-class.html
http://imdb.com
http://imdb.com/
http://imdb.com/a2z
http://imdb.com/a2z/
http://imdb.com/advertising/
http://imdb.com/boards/
http://imdb.com/chart/
http://imdb.com/chart/top
http://imdb.com/czone/
http://imdb.com/features/hdgallery
http://imdb.com/features/oscars/2011/
http://imdb.com/features/sundance/2011/
http://imdb.com/features/video/
http://imdb.com/features/video/browse/
http://imdb.com/features/video/trailers/
http://imdb.com/features/video/tv/
http://imdb.com/features/yearinreview/2010/
http://imdb.com/genre
http://imdb.com/help/
http://imdb.com/helpdesk/contact
http://imdb.com/help/show_article?conditions
http://imdb.com/help/show_article?rssavailable
http://imdb.com/jobs
http://imdb.com/lists
http://imdb.com/media/index/rg2392693248
http://imdb.com/media/rm3467688448/rg2392693248
http://imdb.com/media/rm3484465664/rg2392693248
http://imdb.com/media/rm3719346688/rg2392693248
http://imdb.com/mymovies/list
http://imdb.com/name/nm0000207/
http://imdb.com/name/nm0000234/
http://imdb.com/name/nm0000631/
http://imdb.com/name/nm0000982/
http://imdb.com/name/nm0001392/
http://imdb.com/name/nm0004716/
http://imdb.com/name/nm0531546/
http://imdb.com/name/nm0626362/
http://imdb.com/name/nm0742146/
http://imdb.com/name/nm0817980/
http://imdb.com/name/nm2059117/
http://imdb.com/news/
http://imdb.com/news/celebrity
http://imdb.com/news/movie
http://imdb.com/news/ni7650335/
http://imdb.com/news/ni7653135/
http://imdb.com/news/ni7654375/
http://imdb.com/news/ni7654598/
http://imdb.com/news/ni7654810/
http://imdb.com/news/ni7655320/
http://imdb.com/news/ni7656816/
http://imdb.com/news/ni7660987/
http://imdb.com/news/ni7662397/
http://imdb.com/news/ni7665028/
http://imdb.com/news/ni7668639/
http://imdb.com/news/ni7669396/
http://imdb.com/news/ni7676733/
http://imdb.com/news/ni7677253/
http://imdb.com/news/ni7677366/
http://imdb.com/news/ni7677639/
http://imdb.com/news/ni7677944/
http://imdb.com/news/ni7678014/
http://imdb.com/news/ni7678103/
http://imdb.com/news/ni7678225/
http://imdb.com/news/ns0000003/
http://imdb.com/news/ns0000018/
http://imdb.com/news/ns0000023/
http://imdb.com/news/ns0000031/
http://imdb.com/news/ns0000128/
http://imdb.com/news/ns0000136/
http://imdb.com/news/ns0000141/
http://imdb.com/news/ns0000195/
http://imdb.com/news/ns0000236/
http://imdb.com/news/ns0000344/
http://imdb.com/news/ns0000345/
http://imdb.com/news/ns0004913/
http://imdb.com/news/top
http://imdb.com/news/tv
http://imdb.com/nowplaying/
http://imdb.com/photo_galleries/new_photos/2010/
http://imdb.com/poll
http://imdb.com/privacy
http://imdb.com/register/login
http://imdb.com/register/?why=footer
http://imdb.com/register/?why=mymovies_footer
http://imdb.com/register/?why=personalize
http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb
http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/
http://imdb.com/search
http://imdb.com/search/
http://imdb.com/search/name?birth_monthday=02-12
http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude
http://imdb.com/sections/dvd/
http://imdb.com/sections/horror/
http://imdb.com/sections/indie/
http://imdb.com/sections/tv/
http://imdb.com/showtimes/
http://imdb.com/tiger_redirect?FT_LIC&licensing/
http://imdb.com/title/tt0078748/
http://imdb.com/title/tt0279600/
http://imdb.com/title/tt0377981/
http://imdb.com/title/tt0881320/
http://imdb.com/title/tt0990407/
http://imdb.com/title/tt1034389/
http://imdb.com/title/tt1265990/
http://imdb.com/title/tt1401152/
http://imdb.com/title/tt1411238/
http://imdb.com/title/tt1411238/trivia
http://imdb.com/title/tt1446714/
http://imdb.com/title/tt1452628/
http://imdb.com/title/tt1464174/
http://imdb.com/title/tt1464540/
http://imdb.com/title/tt1477837/
http://imdb.com/title/tt1502404/
http://imdb.com/title/tt1504320/
http://imdb.com/title/tt1563069/
http://imdb.com/title/tt1564367/
http://imdb.com/title/tt1702443/
http://imdb.com/tvgrid/
http://m.imdb.com
http://pro.imdb.com/r/IMDbTabNB/
http://resume.imdb.com
http://resume.imdb.com/
https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627
http://twitter.com/imdb
http://wireless.amazon.com
http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx
http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat
http://www.audible.com
http://www.boxofficemojo.com
http://www.dpreview.com
http://www.endless.com
http://www.fabric.com
http://www.imdb.com/board/bd0000089/threads/
http://www.imdb.com/licensing/
http://www.imdb.com/media/rm1037220352/rg261921280
http://www.imdb.com/media/rm2695346688/tt1449283
http://www.imdb.com/media/rm3987585536/tt1092026
http://www.imdb.com/name/nm0000092/
http://www.imdb.com/photo_galleries/new_photos/2010/index
http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude
http://www.imdb.com/sections/indie/
http://www.imdb.com/title/tt0079470/
http://www.imdb.com/title/tt0079470/quotes?qt0471997
http://www.imdb.com/title/tt1542852/
http://www.imdb.com/title/tt1606392/
http://www.imdb.de
http://www.imdb.es
http://www.imdb.fr
http://www.imdb.it
http://www.imdb.pt
http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php
http://www.movingimagesource.us/articles/un-tv-20110210
http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist
http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html
http://www.shopbop.com/welcome
http://www.smallparts.com
http://www.twinpeaks20.com/details/
http://www.twitter.com/imdb
http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103
http://www.warehousedeals.com
http://www.withoutabox.com
http://www.zappos.com
To extract values of 'href' attribute of anchor tags you may also use xmlstarlet after converting HTML to XHTML using HTML Tidy (Mac OS X version released on 25 March 2009):
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
On Mac OS X you may also use the command line tool linkscraper:
linkscraper http://www.imdb.com
see: http://codesnippets.joyent.com/posts/show/10772