Grep Link and Link name to create CSV file - regex

I am trying to create a list of urls and names from a file. The links are displayed like this:
<table class="list">
<tr><th valign="top">I</th><td>link45.php, link, link8, link 2</td></tr>
<tr><th valign="top">I</th><td>link45.php, link, link8, link 2</td></tr>
</table>
(There are probably some additional tr and table tags in there as well; please ignore the spaces at the start of the tags.)
I need the output to be in a csv like format, but I am unsure how to do this with grep:
"linktoblah.html", "name of link"
I have a working grep which pulls out all of the link URLs, but I am not sure how I would pull out the name next to each one.
cat list.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
Thanks

The line you showed can be extracted with
sed -e 's/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/'
Example:
echo '< a href="linktoblah.html">name of link < /a>.' | sed -e 's/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/'
produces
"linktoblah.html", "name of link "
Depending on what else is in your file, you may be able to replace the grep command with a selector in sed, like this:
sed -n -e '/href=/ s/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/p'
where the
/href=/
can be any regular expression that matches only the lines you want. The p at the end of the string means "and print", and the -n flag means "don't do anything unless there is a match". The combination of the two makes the separate grep unnecessary.
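As a quick check, the -n/p combination can be run against a couple of made-up sample lines (the anchor tag below is invented for illustration):

```shell
# Two sample lines, only one of which contains a link (made-up data).
# -n suppresses the default output; the /href=/ address restricts the
# substitution to matching lines, and the trailing p prints only those.
printf '%s\n' '<p>no link here</p>' '<a href="linktoblah.html">name of link</a>' |
sed -n -e '/href=/ s/.*=\(".*"\)>\(.*\)<.*$/\1, "\2"/p'
# -> "linktoblah.html", "name of link"
```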

I have found a way on a different post using PERL HTML::TableExtract.
Get contents between table tags in everyfile in directory output to one file
Many thanks to choroba for his input.

A new awk
Not sure if this is what you are looking for, but here is what I get from the new data:
awk -F"[\"<>]" -v RS="href=\"" 'NR>1 {print "\""$1"\",\""$3"\""}' file
"main.asp","link45.php"
"link.html","link"
"link8.asp","link8"
"link2.html","link 2"
"main.asp","link45.php"
"link.html","link"
"link8.asp","link8"
"link2.html","link 2"

Related

Apply regex on matched substring

I have a few thousand text lines like this:
go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>
My desired output replaces everything from the first tagged word (rome) to the last one (italy) with a single tag:
go to <ADDRESS>rome italy</ADDRESS>
I can match the portion of the text line which is tagged with:
<.*>
This will greedily select all text from first < to last >. I would like then the tags removed and put <ADDRESS> and </ADDRESS> around the matched portion.
The possible tags are: <STREETNUM>, <STREET>, <CITY>, <STATE>, <ZIP> and <COUNTRY>. Any subset of these tags can appear and in any order. The tags are never nested.
I have searched SO and googled to no avail. Perhaps I can use a named capturing group and then apply a search/replace regex on it, but I don't know how. Any help would be appreciated.
This sed line will do it:
sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'
For example:
sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g' <<< "go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>"
It prints:
go to <ADDRESS>rome italy</ADDRESS>
It basically captures what is inside the CITY tag and inside the COUNTRY tag, then replaces the matched span with the captured group values enclosed in the ADDRESS tag.
If you're using GNU sed (common on Linux), you can avoid escaping the group parentheses by using the -E flag:
sed -E 's/<CITY>(.*)<\/CITY>.*<COUNTRY>(.*)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'
UPDATE:
To achieve the expected result you could use several commands in the following order of operation:
Remove the go to text: sed 's/go to //g'
Remove all the tag characters: tr -d '</>'
Once all tag chars are removed, you can safely delete the words STREETNUM, STREET, CITY, STATE, ZIP and COUNTRY from the input:
sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g'
Take the output generated from the previous commands concatenation and output it inside the <ADDRESS></ADDRESS> tags:
xargs -i echo "go to <ADDRESS>{}</ADDRESS>"
The final command is the following, here $LINE should contain the line to process:
sed 's/go to //g' <<< "$LINE" | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -i echo "go to <ADDRESS>{}</ADDRESS>"
An example:
Running:
sed 's/go to //g' <<< "go to <STATE>Bolivar</STATE> <COUNTRY>Venezuela</COUNTRY> <STREETNUM>5</STREETNUM> " | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -i echo "go to <ADDRESS>{}</ADDRESS>"
Will print:
go to <ADDRESS>Bolivar Venezuela 5 </ADDRESS>
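An alternative single-command sketch, assuming (as stated) that the tags are never nested: first wrap everything from the first opening tag to the last closing tag in <ADDRESS>, then strip the known inner tags:

```shell
# Step 1: greedily wrap the span from the first <TAG> to the last </TAG>.
# Step 2: delete the known tag pairs, leaving only <ADDRESS>...</ADDRESS>.
sed -E '
  s/(<[A-Z]+>.*<\/[A-Z]+>)/<ADDRESS>\1<\/ADDRESS>/
  s/<\/?(STREETNUM|STREET|CITY|STATE|ZIP|COUNTRY)>//g
' <<< "go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>"
# -> go to <ADDRESS>rome italy</ADDRESS>
```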

How to cut a string from a string

My script gets this string for example:
/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
let's say I don't know how long the string is before the /importance part.
I want a new variable that will keep only the /importance/lib1/lib2/lib3/file from the full string.
I tried to use sed 's/.*importance//' but it's giving me the path without the importance....
Here is the command in my code:
find <main_path> -name file | sed 's/.*importance//'
I am not familiar with the regex, so I need your help please :)
Sorry my friends, I got my question slightly wrong:
I don't need the output /importance/lib1/lib2/lib3/file but /importance/lib1/lib2/lib3, with no /file in the output.
Can you help me?
I would use awk:
$ echo "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file" | awk -F"/importance/" '{print FS$2}'
/importance/lib1/lib2/lib3/file
Which is the same as:
$ awk -F"/importance/" '{print FS$2}' <<< "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
/importance/lib1/lib2/lib3/file
That is, we set the field separator to /importance/, so that the first field is what comes before it and the 2nd one is what comes after. To print /importance/ itself, we use FS!
All together, and to save it into a variable, use:
var=$(find <main_path> -name file | awk -F"/importance/" '{print FS$2}')
Update
I don't need the output /importance/lib1/lib2/lib3/file but
/importance/lib1/lib2/lib3 with no /file in the output.
Then you can use something like dirname to get the path without the name itself:
$ dirname $(awk -F"/importance/" '{print FS$2}' <<< "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file")
/importance/lib1/lib2/lib3
Instead of substituting all until importance with nothing, replace with /importance:
~$ echo $var
/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
~$ sed 's:.*importance:/importance:' <<< $var
/importance/lib1/lib2/lib3/file
As noted by @lurker, if importance can appear inside some directory name, you could add the surrounding /s to be safe:
~$ sed 's:.*/importance/:/importance/:' <<< "/dir1/dirimportance/importancedir/..../importance/lib1/lib2/lib3/file"
/importance/lib1/lib2/lib3/file
With GNU sed:
echo '/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file' | sed -E 's#.*(/importance.*)#\1#'
Output:
/importance/lib1/lib2/lib3/file
pure bash
kent$ a="/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
kent$ echo ${a/*\/importance/\/importance}
/importance/lib1/lib2/lib3/file
external tool: grep
kent$ grep -o '/importance/.*' <<<$a
/importance/lib1/lib2/lib3/file
I tried to use sed 's/.*importance//' but it's giving me the path without the importance....
You were very close. All you had to do was substitute back in importance:
sed 's/.*importance/importance/'
However, I would use Bash's built-in parameter expansion. It's faster, since it avoids spawning an external process.
The expansion ${foo##pattern} takes the shell variable ${foo} and removes the longest match of the glob pattern from the left side of its value. Since that also strips the importance part itself, prepend it back:
file_name="/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
file_name=/importance${file_name##*/importance}
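The difference between # (shortest match) and ## (longest match) can be seen directly; the path below is a made-up example, and the last line shows one way to keep the /importance prefix that ## would otherwise strip:

```shell
path="/dir1/dir2/dir3/importance/lib1/lib2/lib3/file"

echo "${path#*/}"    # shortest '*/' match removed -> dir1/dir2/dir3/importance/lib1/lib2/lib3/file
echo "${path##*/}"   # longest '*/' match removed  -> file

# Strip everything through /importance, then put the prefix back:
echo "/importance${path##*/importance}"
# -> /importance/lib1/lib2/lib3/file
```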
Removing the /file at the end, as you asked:
echo '<path>' | sed -r 's#.*(/importance.*)/[^/]*#\1#'
Input /dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
Returns: /importance/lib1/lib2/lib3
See this "Match groups" tutorial.

Regex with sed to parse archive name

I'd like to parse different kinds of Java archive with the sed command line tool.
Archives can have the following extensions:
.jar, .war, .ear, .esb
What I'd like to get is the name without the extension, e.g. for Foobar.jar I'd like to get Foobar.
This seems fairly simple, but I cannot come up with a solution that works and is also robust.
I tried something along the lines of sed s/\.+(jar|war|ear|esb)$//, but could not make it work.
You were nearly there:
sed -E 's/\.+(jar|war|ear|esb)$//' file
Just needed to add the -E flag to sed to interpret the expression. And of course, respect the sed 's/something/new/' syntax.
Test
$ cat a
aaa.jar
bb.war
hello.ear
buuu.esb
hello.txt
$ sed -E 's/\.+(jar|war|ear|esb)$//' a
aaa
bb
hello
buuu
hello.txt
Using sed:
s='Foobar.jar'
sed -r 's/\.(jar|war|ear|esb)$//' <<< "$s"
Foobar
OR better do it in BASH itself:
echo "${s/.[jwe]ar/}"
Foobar
If you do not add an option like -r or -E, you need to escape the | and the ( ), and also quote the expression:
echo "test.jar" | sed 's/\.\(jar\|war\|ear\|esb\)$//'
test
A * (or +) after the dot is also not needed, since you normally have only one .
On traditional UNIX (tested with AIX/KSH)
File='Foobar.jar'
echo ${File%.*}
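For comparison, % removes the shortest matching suffix and %% the longest, which matters when a name contains more than one dot (the filename here is invented):

```shell
File='Foobar.tar.jar'
echo "${File%.*}"    # shortest '.*' suffix removed -> Foobar.tar
echo "${File%%.*}"   # longest '.*' suffix removed  -> Foobar
```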
from a list having only your kind of file
YourList | sed 's/\....$//'
from a list of all kinds of files
YourList | sed -n 's/\.[jew]ar$//p
t
s/\.esb$//p'

Using awk sed or grep to parse URLs from webpage source

I am trying to parse the source of a downloaded web-page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:
This seems to leave out parts of the URL from some of the page names.
$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3
This gets all of the URL's but I do not want to include links that have/are anchor links. Also I want to be able to specify the domain.org/folder/:
$ awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
for(o=1;o<=NF;o++){
if ( $o ~ /href/){
gsub(/.*href=\042/,"",$o)
gsub(/\042.*/,"",$o)
print $(o)
}
}
}' file.html
If you are only parsing <a> tags, you could just match the href attribute like this:
$ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq
That will ignore the anchor and also guarantee that you have uniques. This does assume that the page has well-formed (X)HTML, but you could pass it through Tidy first.
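To also honour the asker's request to restrict results to domain.org/folder/ (the placeholder path from the question), a second grep can be appended to the pipeline; the sample anchors below are invented:

```shell
# Extract hrefs, drop pure-anchor links, then keep only URLs under
# domain.org/folder/ and de-duplicate.
printf '%s\n' \
  '<a href="http://domain.org/folder/a.html">A</a>' \
  '<a href="http://other.org/b.html">B</a>' \
  '<a href="#top">top</a>' |
grep -o 'href="[^"#]*"' | cut -d'"' -f2 |
grep '^https\?://domain\.org/folder/' | sort -u
# -> http://domain.org/folder/a.html
```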
lynx -dump http://www.ibm.com
And look for the string 'References' in the output. Post-process with sed if you need to.
Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.

Filter apache log file using regular expression

I have a big Apache log file and I need to filter it, leaving only (in a new file) the entries from a certain IP: 192.168.1.102.
I try using this command:
sed -e "/^192.168.1.102/d" < input.txt > output.txt
But "/d" removes those entries, and I needt to leave them.
Thanks.
What about using grep?
cat input.txt | grep -e "^192.168.1.102" > output.txt
EDIT: As noted in the comments below, escaping the dots in the regex is necessary to make it correct. Escaping in the regex is done with backslashes:
cat input.txt | grep -e "^192\.168\.1\.102" > output.txt
sed -n 's/^192\.168\.1\.102/&/p'
sed is faster than grep on my machines
I think using grep is the best solution but if you want to use sed you can do it like this:
sed -e '/^192\.168\.1\.102/b' -e 'd'
The b command will skip all following commands if the regex matches and the d command will thus delete the lines for which the regex did not match.
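A small demonstration on two invented log lines:

```shell
# Line 1 matches the address, so b branches past the d command and the
# line is auto-printed; line 2 falls through to d and is deleted.
printf '%s\n' '192.168.1.102 GET /index.html' '10.0.0.5 GET /other.html' |
sed -e '/^192\.168\.1\.102/b' -e 'd'
# -> 192.168.1.102 GET /index.html
```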