I am trying to parse the source of a downloaded web-page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:
$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3
This seems to leave out parts of the URL from some of the page names.
$ awk 'BEGIN{
    RS="</a>"
    IGNORECASE=1
}
{
    for(o=1;o<=NF;o++){
        if($o ~ /href/){
            gsub(/.*href=\042/,"",$o)
            gsub(/\042.*/,"",$o)
            print $o
        }
    }
}' file.html
This gets all of the URLs, but I do not want to include links that are anchor links. I also want to be able to restrict matches to domain.org/folder/.
If you are only parsing something like <a> tags, you could just match the href attribute like this:
$ grep -oE 'href="([^"#]+)"' file.html | cut -d'"' -f2 | sort -u
That will ignore the anchor and also guarantee that you have uniques. This does assume that the page has well-formed (X)HTML, but you could pass it through Tidy first.
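For example, a minimal sketch of that pre-cleaning step, assuming HTML Tidy is installed (the 2>/dev/null just hides Tidy's diagnostics, which go to stderr):
$ tidy -q -asxhtml file.html 2>/dev/null | grep -oE 'href="([^"#]+)"' | cut -d'"' -f2 | sort -u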
lynx -dump http://www.ibm.com
And look for the string 'References' in the output. Post-process with sed if you need to.
Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.
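A sketch of that post-processing, assuming a lynx build that supports -listonly (it prints just the References section; the sed strips the leading " 1. " numbering):
lynx -dump -listonly http://www.ibm.com | sed -n 's/^ *[0-9][0-9]*\. //p'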
I'm trying to make a nice Gource video of our software development project. Using Gource I can generate a combined git log of all repos with:
first gource --output-custom-log ../logs/repo1.txt for each repo, then
cat *.txt | sort -n > combined.txt
This generates combined.txt, a pipe-delimited file like:
1551272464|John|A|repo1/file1.txt
1551272464|john_doe|A|repo1/folder/file9.py
1551272464|Doe, John|A|repo2/filex.py
So it's: EPOCH|committer name|A, D, or C|committed file
The actual problem I want to solve is that my developers have used different git clients with different committer names, so I'd like to replace all of their name variants with a single version. I do not mind setting up a separate sed call per person.
So find "John", "john_doe" and "Doe, John" and replace it with "John Doe". And it should be done on my MacBook.
So I tried sed -i -r "s/John/user_john/g" combined.txt, but the problem here is that it finds both "John" and "Doe, John" and replaces just the "John" part, so I need something that matches any variant and replaces the whole column.
Who can help me get the correct regex?
A regex would almost certainly be the wrong approach for this: you'd get false matches unless you were extremely careful, and it's inefficient.
Just create an aliases file containing a line for each name you want in your output, followed by all the names that should be mapped to it. Then you can change them all clearly, simply, robustly, portably, and efficiently in one call to awk:
$ cat tst.awk
BEGIN { FS="[|]" ; OFS="|" }
NR==FNR {                          # first file: the aliases
    for (i=2; i<=NF; i++) {
        alias[$i] = $1             # map every variant to the canonical name
    }
    next
}
$2 in alias { $2 = alias[$2] }     # second file: rewrite known committer names
{ print }
$ cat aliases
John Doe|John|john_doe|Doe, John
Susan Barker|Susie B|Barker, Susan
$ cat file
1551272464|John|A|repo1/file1.txt
1551272464|Susie B|A|repo2/filex.py
1551272464|john_doe|A|repo1/folder/file9.py
1551272464|Doe, John|A|repo2/filex.py
1551272464|Barker, Susan|A|repo2/filex.py
$ awk -f tst.awk aliases file
1551272464|John Doe|A|repo1/file1.txt
1551272464|Susan Barker|A|repo2/filex.py
1551272464|John Doe|A|repo1/folder/file9.py
1551272464|John Doe|A|repo2/filex.py
1551272464|Susan Barker|A|repo2/filex.py
As @WiktorStribizew mentioned, you can do:
sed -i -r "s/Doe, John|john_doe|John/user_john/g" combined.txt
And with that, you can even do:
sed -i -r -e "s/Doe, John|john_doe|John/user_john/g" -e "s/Wayne, Bruce|bruce_wayne|Bruce/user_bruce/g" combined.txt
and chain more replacements with additional -e options:
-e script, --expression=script
add the script to the commands to be executed
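One caveat, since the question mentions a MacBook: the BSD sed that ships with macOS requires an explicit (possibly empty) backup suffix after -i and uses -E rather than -r, so the equivalent call there would be something like:
sed -i '' -E "s/Doe, John|john_doe|John/user_john/g" combined.txt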
Try GNU sed:
sed -E "s/^(\w+\|)(john([\s_]doe)?|doe,\s*john)/\1John Doe/i" combined.txt
Add the -i option once you have checked the output, to edit the file in place: sed -Ei...
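Against the three sample lines above, this should produce (assuming GNU sed):
1551272464|John Doe|A|repo1/file1.txt
1551272464|John Doe|A|repo1/folder/file9.py
1551272464|John Doe|A|repo2/filex.py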
Suppose I have a file containing a list of links to webpages.
www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas
I know that doing curl www.xyz.com/asdd will fetch me the HTML of that webpage. I want to fetch some data from it.
So the scenario is: use curl to hit all the links in the file one by one, extract some data from each webpage, and store it somewhere else. Any ideas or suggestions?
As indicated in the comments, this will loop through your_file and curl each line:
while IFS= read -r line
do
curl "$line"
done < your_file
To get the <title> of a page, you can grep something like this:
grep -iPo '(?<=<title>).*(?=</title>)' file
So all together you could do
while IFS= read -r line
do
curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
Note that curl -s enables silent mode. See an example with the Google page:
$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
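Putting it together with the "store somewhere else" part of the question, here is a sketch (assuming each <title> sits on one line, and writing to a hypothetical titles.csv):
while IFS= read -r url
do
    title=$(curl -s "$url" | grep -iPo '(?<=<title>).*?(?=</title>)' | head -n 1)
    printf '%s,%s\n' "$url" "$title"
done < your_file > titles.csv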
You can accomplish this in just one line with xargs. Let's say you have a file in the working directory with all your URLs (one per line) called sitemap
xargs -I{} curl -s {} <sitemap | grep title
This would extract any lines with the word "title" in them. To extract just the title tags, you'll want to change the grep a little; the -o flag ensures that only the matched text is printed:
xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"
A couple of things to note:
If you want to extract certain data, you may need to escape characters with \.
For HTML attributes, for example, you should match both single and double quotes and escape them, like [\"\'].
Sometimes, depending on the character set, you may get some unusual curl output with special characters. If you detect this, you'll need to convert the encoding with a utility like iconv.
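For instance, a sketch that also strips the tags themselves (same sitemap file as above, using only portable sed syntax):
xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>" | sed -e 's/<title>//' -e 's/<\/title>//'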
I am trying to create a list of URLs and names from a file. The links are displayed like this:
<table class="list">
<tr><th valign="top">I</th><td><a href="main.asp">link45.php</a>, <a href="link.html">link</a>, <a href="link8.asp">link8</a>, <a href="link2.html">link 2</a></td></tr>
<tr><th valign="top">I</th><td><a href="main.asp">link45.php</a>, <a href="link.html">link</a>, <a href="link8.asp">link8</a>, <a href="link2.html">link 2</a></td></tr>
</table>
(There are probably some other tr and table tags in there also.)
I need the output to be in a CSV-like format, but I am unsure how to do this with grep:
"linktoblah.html", "name of link"
I have a working grep which pulls out all of the links, but I am not sure how I would pull out the name next to each one.
cat list.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
Thanks
The line you showed can be extracted with
sed -e 's/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/'
Example:
echo '<a href="linktoblah.html">name of link </a>.' | sed -e 's/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/'
produces
"linktoblah.html", "name of link "
Depending on what else is in your file, you may be able to replace the grep command with a selector in sed, like this:
sed -n -e '/href=/ s/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/p'
where the /href=/ can be any regular expression that matches only the lines you want. The p at the end of the command means "and print the result", and the -n flag suppresses sed's automatic printing of every line. The combination of the two makes the separate grep unnecessary.
I have found a way in a different post, using Perl's HTML::TableExtract:
Get contents between table tags in everyfile in directory output to one file
Many thanks to choroba for his input.
A new awk
Not sure if this is what you are looking for, but here is what I get from the new data. Setting RS="href=\"" makes every href=" start a new record, and the field separator [\"<>] then leaves the URL in $1 and the link text in $3:
awk -F"[\"<>]" -v RS="href=\"" 'NR>1 {print "\""$1"\",\""$3"\""}' file
"main.asp","link45.php"
"link.html","link"
"link8.asp","link8"
"link2.html","link 2"
"main.asp","link45.php"
"link.html","link"
"link8.asp","link8"
"link2.html","link 2"
I'd like to parse different kinds of Java archive with the sed command line tool.
Archives can have the following extensions:
.jar, .war, .ear, .esb
What I'd like to get is the name without the extension, e.g. for Foobar.jar I'd like to get Foobar.
This seems fairly simple, but I cannot come up with a solution that works and is also robust.
I tried something along the lines of sed s/\.+(jar|war|ear|esb)$//, but could not make it work.
You were nearly there:
sed -E 's/\.+(jar|war|ear|esb)$//' file
You just needed to add the -E flag so sed interprets the extended regular expression, and of course to respect the sed 's/something/new/' syntax, quoting included.
Test
$ cat a
aaa.jar
bb.war
hello.ear
buuu.esb
hello.txt
$ sed -E 's/\.+(jar|war|ear|esb)$//' a
aaa
bb
hello
buuu
hello.txt
Using sed:
s='Foobar.jar'
sed -r 's/\.(jar|war|ear|esb)$//' <<< "$s"
Foobar
Or better, do it in bash itself (anchored at the end so it also covers .esb; the extended pattern needs extglob):
shopt -s extglob
echo "${s%.@(jar|war|ear|esb)}"
Foobar
You need to escape the | and the ( ), and also quote the script, if you do not use an option like -r or -E:
echo "test.jar" | sed 's/\.\(jar\|war\|ear\|esb\)$//'
test
The + is also not needed, since you normally have only one . in the name.
On traditional UNIX (tested with AIX/ksh):
File='Foobar.jar'
echo ${File%.*}
From a list containing only your kind of file:
YourList | sed 's/\....$//'
From a list of all kinds of files:
YourList | sed -n 's/\.[jew]ar$//p
t
s/\.esb$//p'
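Combining the two ideas, a minimal POSIX-sh sketch that strips only the listed extensions and leaves everything else alone (the sample names are made up):
for f in Foobar.jar app.war svc.esb notes.txt; do
    case $f in
        *.jar|*.war|*.ear|*.esb) echo "${f%.*}" ;;   # strip the known extension
        *)                       echo "$f" ;;        # pass anything else through
    esac
done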
I have a collection of words on one side, and a file on the other side. I need their intersection, i.e. the words that appear at least once in the file.
I am able to extract the matching lines with
sed -rn 's/(word1|word2|blablabla|wordn)/\1/p' myfile.txt
but I cannot get any further.
Thank you for helping, Olivier
Perhaps grep may work here?
grep -o -E 'word1|word2|word3' file.txt | sort -u
You can do it using grep and sort:
grep -o 'word1\|word2\|word3' myfile.txt | sort -u
The -o switch makes grep output only the matching string, not the complete line. sort -u sorts the matching words and removes duplicates.
If I understood you correctly, you just need to pipe the sed results to uniq:
sed -rn 's/.*(word1|word2|blablabla|wordn).*/\1/p' myfile.txt | uniq
Also, you need to match the whole line in sed in order to get just the desired words as output; that's why I've placed .* in front of and after the group. Note that uniq only collapses adjacent duplicates, so sort the output first if the same word can show up on non-adjacent lines.
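If the word list itself lives in a file (one word per line), a sketch using grep's -f option (words.txt is a hypothetical name; -w restricts matches to whole words, -F treats the list as fixed strings):
grep -owFf words.txt myfile.txt | sort -u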