I'm trying to take Google's HTML and parse out the links. I use curl to obtain the HTML, then pass it to gawk. In gawk I use the match() function, and it works, but it only returns a small number of links, maybe 10 at most. If I test my regex on regex101.com with the g (global) modifier, it returns 51 links. How can I do the same in gawk to obtain all the links (relative and absolute)?
#!/bin/bash
html=$(curl -L "http://google.com")
echo "${html}" | gawk '
BEGIN {
RS=" "
IGNORECASE=1
}
{
match($0, /href=\"([^\"]*)/, array);
if (length(array[1]) > 0) {
print array[1];
}
}'
Instead of awk you can also use grep -oP (in PCRE, \K discards everything matched so far, so only the URL itself is printed):
curl -sL "http://google.com" | grep -iPo 'href="\K[^"]+'
However, this also fetches only 31 links for me. This may vary with your browser/location, because google.com serves a different page for different locations and signed-in users.
match() only matches the leftmost occurrence, so you need to consume the line past each match and call it again in a loop. Try:
curl -sL "http://google.com" | gawk '{while(match($0, /href=\"([^\"]+)/, array)){
$0=substr($0,RSTART+RLENGTH);print array[1]}}'
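If your gawk is version 4 or newer, here is a minimal alternative sketch using patsplit(), which collects every non-overlapping match in one call (assumes gawk >= 4.0):
curl -sL "http://google.com" | gawk '
{
    # patsplit() fills hits[] with every non-overlapping match of the regex
    n = patsplit($0, hits, /href=\"[^\"]+/)
    for (i = 1; i <= n; i++)
        print substr(hits[i], 7)    # drop the leading href="
}'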
I want to run ack or grep on HTML files that often have very long lines. I don't want to see very long lines that wrap repeatedly. But I do want to see just that portion of a long line that surrounds a string that matches the regular expression. How can I get this using any combination of Unix tools?
You could use the grep options -oE, possibly in combination with changing your pattern to ".{0,10}<original pattern>.{0,10}" in order to see some context around it:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e., force grep to behave as egrep).
For example (from @Renaud's comment):
grep -oE ".{0,10}mysearchstring.{0,10}" myfile.txt
Alternatively, you could try -c:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
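For example, to just count how many lines match instead of printing them:
grep -c "mysearchstring" myfile.txt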
Pipe your results through cut. I'm also considering adding a --cut switch so you could say --cut=80 and only get 80 columns.
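For example, until such a switch exists, you can truncate each result to the first 100 columns:
ack mysearchstring | cut -c 1-100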
You could use less as a pager for ack and chop long lines: ack --pager="less -S". This retains the long line but leaves it on one line instead of wrapping. To see more of the line, scroll left and right in less with the arrow keys.
I have the following alias setup for ack to do this:
alias ick='ack -i --pager="less -R -S"'
grep -oE ".\{0,10\}error.\{0,10\}" mylogfile.txt
In the unusual situation where you cannot use -E, use lowercase -e instead.
Explanation:
cut -c 1-100
keeps characters 1 through 100 of each line.
The Silver Searcher (ag) supports this natively via the --width NUM option. It will replace the rest of longer lines with [...].
Example (truncate after 120 characters):
$ ag --width 120 '@patternfly'
...
1:{"version":3,"file":"react-icons.js","sources":["../../node_modules/#patternfly/ [...]
In ack3, a similar feature is planned but currently not implemented.
Taken from: http://www.topbug.net/blog/2016/08/18/truncate-long-matching-lines-of-grep-a-solution-that-preserves-color/
The suggested approach ".{0,10}<original pattern>.{0,10}" is perfectly good, except that the highlighting color is often messed up. I've created a script with similar output, but the color is preserved:
#!/bin/bash
# Usage:
# grepl PATTERN [FILE]
# How many characters around the search keyword should be shown?
context_length=10
# What is the length of the color control characters before and after
# the matching string?
# This is mostly determined by the environment variable GREP_COLORS.
control_length_before=$(($(echo a | grep --color=always a | cut -d a -f '1' | wc -c)-1))
control_length_after=$(($(echo a | grep --color=always a | cut -d a -f '2' | wc -c)-1))
grep -E --color=always "$1" $2 |
grep --color=none -oE \
".{0,$(($control_length_before + $context_length))}$1.{0,$(($control_length_after + $context_length))}"
Assuming the script is saved as grepl, then grepl pattern file_with_long_lines should display the matching lines, but with only 10 characters around each matching string.
I put the following into my .bashrc:
grepl() {
$(which grep) --color=always "$@" | less -RS
}
You can then use grepl on the command line with any arguments that are available for grep. Use the arrow keys to see the tail of longer lines. Use q to quit.
Explanation:
grepl() {: Define a new function that will be available in every (new) bash console.
$(which grep): Get the full path of grep. (Ubuntu defines an alias for grep that is equivalent to grep --color=auto. We don't want that alias but the original grep.)
--color=always: Colorize the output. (--color=auto from the alias won't work since grep detects that the output is put into a pipe and won't color it then.)
"$@": Put all arguments given to the grepl function here.
less: Display the lines using less
-R: Show colors
-S: Don't break long lines
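For example (logs/ is just a placeholder directory here; any grep arguments pass straight through, so flags like -r, -i and -n all work):
grepl -rin "error" logs/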
Here's what I do:
function grep () {
tput rmam;
command grep "$#";
tput smam;
}
In my .bash_profile, I override grep so that it automatically runs tput rmam before and tput smam after, which disables line wrapping and then re-enables it.
ag can also take the regex trick, if you prefer it:
ag --column -o ".{0,20}error.{0,20}"
I loop through URLs in my script and, for each one, extract a piece of the HTML with Apache Tika for further processing.
while read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000; done < ~/file_with_urls.txt
Where urls are for example:
http://dailycurrant.com/2014/01/02/marijuana-overdoses-kill-37-in-colorado-on-first-day-of-legalization/
http://www.sott.net/article/271748-Father-sentenced-to-6-months-in-jail-for-paying-too-much-child-support
http://www.sunnyskyz.com/blog/79/The-27-Naughtiest-Cats-In-The-World-And-I-Can-t-Stop-Laughing
In a shell script I would like to do the following: skip or delete everything that comes in the form [image: some text] or [bookmark: some text].
[image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here
Out of the above I would only need "Plain text starts here".
Can I accomplish this with a regex, using GNU grep with support for the -P option (which enables PCRE, Perl-Compatible Regular Expressions, support), something like what is recommended here:
while read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000 | grep -Po '_regex that will do the trick_'; done < ~/file_with_urls.txt
You can use this awk:
str='[image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here'
awk 'BEGIN{FS="\\[[^]]*\\] *"} {for (i=1; i<=NF; i++) if ($i) print $i}' <<< "$str"
Plain text starts here
Here $str represents your long string given above.
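To answer the grep -Po part directly: a hedged sketch, assuming the bracketed tags always come at the start of the line before the text you want, is to use \K to discard them from the printed match as the final pipe stage:
grep -Po '^(\[[^\]]*\]\s*)*\K.+'
Alternatively, sed can delete the bracketed tags wherever they occur on the line:
sed 's/\[[^]]*\] *//g'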
What's the best way to get user@host.com combinations from a large fileset?
I assume that sed/awk can do this, but I'm not very familiar with regexp.
We have a file, e.g. Staff_data.txt, that houses more than just emails, and we would like to strip the rest of the data and gather only the email addresses (e.g. h@south.com).
I figured the easiest way would be via sed/awk in a terminal, but seeing as how complex regexp can be, I'd appreciate some guidance.
Thanks.
Here's a somewhat embarrassing but apparently working script I wrote a few years ago to do this job:
# Get rid of any Message-Id line like this:
# Message-ID: <AANLkTinSDG_dySv_oy_7jWBD=QWiHUMpSEFtE-cxP6Y1@mail.gmail.com>
#
# Change any character that can't be in an email address to a space.
#
# Print just the character strings that look like email addresses.
#
# Drop anything with multiple "@"s and change any domain names (i.e.
# the part after the "@") to all lower case as those are not case-sensitive.
#
# If we have a local mail box part (i.e. the part before the "@") that's
# a mix of upper/lower and another that's all lower, keep them both. Ditto
# for multiple versions of mixed case since we don't know which is correct.
#
# Sort uniquely.
cat "$#" |
awk '!/^Message-ID:/' |
awk '{gsub(/[^-_.@[:alnum:]]+/," ")}1' |
awk '{for (i=1;i<=NF;i++) if ($i ~ /.+@.+[.][[:alpha:]]+$/) print $i}' |
awk '
BEGIN { FS=OFS="#" }
NF != 2 { printf "Badly formatted %s skipped.\n",$0 | "cat>&2"; next }
{ $2=tolower($2); print }
' |
tr '[A-Z]' '[a-z]' |
sort -u
It's not pretty, but it seems to be robust.
You want grep here, not sed or awk. For example, to display all emails from the domain south.com:
grep -o '[^ ]*@south\.com' file
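If you want every address regardless of domain, a rough sketch of a one-liner (the character classes below only approximate common address syntax, not full RFC 5322):
grep -Eo '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' Staff_data.txt | sort -u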
I know this is an ugly script, but it does the job.
What I am facing now is adding a few more extensions, which would clutter the script even more.
How can I make it more modular?
Specifically, how can I write this long regular expression (of source file extensions) across multiple lines, say one extension per line? I guess I am doing something wrong with string concatenation, but I'm not quite sure what exactly.
Here's the original file:
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVN='svn add'
$COMMAND | grep -E '(\.m|\.mat|\.java|\.js|\.php|\.cpp|\.h|\.c|\.py|\.hs|\.pl|\.xml|\.html|\.sh|\.asm|\.s|\.tex|\.bib|\.Makefile|\.jpg|\.gif|\.png|\.css)$' | awk '{ print $2 }' | $XARGS $SVN
and here's roughly what I am aiming at
...code code
'(\.m|
\.mat|
\.js|
.
.
.
\.css)'
..more code here
Anybody?
I know this doesn't answer the question directly, but from a readability perspective, I think most developers would agree that a single-line regex is the most common way to do things and therefore the most maintainable approach. Further, I'm not sure why you're repeating the escaped period in every alternative; it only needs to appear once, before the group of extensions.
I wrote this little script to automatically add all images to svn. You should be able to simply add extensions between the pipes in the regex to add or remove different file types. Note that it only adds files that are unrecognized by svn, by making sure each line starts with a "?" (^\?) and ends with a dot plus one of the extensions (\.(extensions)$). Hope it's helpful!
#!/bin/bash
svn st | grep -E "^\?.*\.(png|jpg|jpeg|tiff|bmp|gif)$" > /tmp/svn-auto-add-img
while read -r output; do
    FILE=$(echo "$output" | awk '{ print $2 }')
    svn add "$FILE"
done < /tmp/svn-auto-add-img
exit 0
How about this:
PATTERNS="
\.foo
\.bar
\.baz"
# Put them into one list separated by or ("|").
PATTERNS=`echo $PATTERNS |sed 's/\s\+/|/g'`
$COMMAND | grep -E "($PATTERNS)"
(Note that this would not work if you put quotes around $PATTERNS in the call to echo -- echo is taking care of stripping whitespace and converting newlines to spaces for us.)
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVNADD='svn add'
pats=
pats+=' \.m'
pats+=' \.mat'
pats+=' \.java'
pats+=' \.js'
# add your 'or-able' sub patterns here
# build the full pattern
pattern='('
for pat in $pats; do pattern+="$pat|"; done
pattern=${pattern%\|}')$'
# run grep with the generated pattern
files=$($COMMAND | grep -E ${pattern} | awk ' { print $NF } ')
if [ " $files" != " " ]
then
$COMMAND | grep -E ${pattern} | awk ' { print $NF } ' | $XARGS $SVNADD
fi
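Yet another variant, sketched with a bash array so the extension list stays easy to edit (the list below is abbreviated; adjust to taste):
#!/bin/bash
# One extension per array element; add or remove entries freely.
exts=(m mat java js php cpp h c py hs pl xml html sh asm s tex bib css)
# Join the array with "|" in a subshell so the caller's IFS is untouched.
pattern="\\.($(IFS='|'; echo "${exts[*]}"))\$"
svn status | grep -E "$pattern" | awk '{ print $2 }' | xargs svn add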
I want to flag places in my Python unit tests where I have been lazy and deactivated tests.
But I also have conditional executions that are not laziness; they are motivated by performance or system conditions at the time of testing. Those are the skipUnless ones, and I want to ignore them entirely.
Let's take some inputs that I have put in a file, test_so_bashregex.txt, with some comments.
!ignore this, because skipUnless means I have an acceptable conditional flag
@unittest.skipUnless(do_test, do_test_msg)
def test_conditional_function():
xxx
!catch these 2, lazy test-passing
#unittest.skip("fb212.test_urls_security_usergroup Test_Detail.test_related fails with 302")
def sometest_function():
xxx
@unittest.expectedFailure
def test_another_function():
xxx
!bonus points... ignore things that are commented out
# @unittest.expectedFailure
Additionally, I can't just use a grep -v skipUnless in a pipe, because I really want to use egrep -A 3 xxx *.py to give some context, as in:
grep -A 3 "@unittest\." *.py
test_backend_security_meta.py: @unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py- def test_storage(self):
test_backend_security_meta.py- with getMultiDb() as mdb:
test_backend_security_meta.py-
What I have tried:
Trying it on https://www.debuggex.com/:
I tried @unittest\.(.+)(?!(Unless\()) and that didn't work, as it matches the first 3.
Ditto @unittest\.[a-zA-Z]+(?!(Unless\())
@unittest\.skip(?!(Unless\()) worked partially, on the 2 with skip.
All of those do partial matches despite the presence of Unless.
On bash egrep, which is where this is going to end up, things don't look much better:
jluc@explore$ egrep '@unittest\..*(?!(Unless))' test_so_bashregex.txt
egrep: repetition-operator operand invalid
You could try this regex:
(?<!#\s)@unittest\.(?!skipUnless)(skip|expectedFailure).*
If you don't care whether 'skip' or 'expectedFailure' appears, you could simplify it:
(?<!#\s)@unittest\.(?!skipUnless).*
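For what it's worth, GNU grep can evaluate those lookarounds itself when it was built with PCRE support; a hedged sketch combining that with the -A 3 context flag:
grep -P -A 3 '(?<!#\s)@unittest\.(?!skipUnless)' *.py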
How about something like this - grep seems a bit restrictive
items=$(find . -name "*.py")
for item in $items; do
    awk '
        /^@unittest.*expectedFailure/ { seen_skip = 1 }
        /^@unittest.*skip/            { seen_skip = 1 }
        /^def/ {
            if (seen_skip == 1)
                print "Being lazy at " $2
            seen_skip = 0
        }
    ' "$item"
done
OK, I'll put up what I found with sweaver2112's help, but if someone has a good single-stage grep-ready regex, I'll take it.
bash's egrep/grep doesn't like ?! (ref grep: repetition-operator operand invalid), so that's the end of the story there.
What I have done instead is pipe through some extra filters: a negative grep -v skipUnless and another one to strip leading comments. These two strip out the unwanted lines. Their output is then piped back into another grep looking for @unittest, again with the -A 3 flag.
If the negative greps have cleared out a line, it won't show up in the last pipe stage, so it drops out of the input. If not, I get my context right back.
egrep -A 3 -n '@unittest\.' test_so_bashregex.txt | egrep -v "^\s*#" | egrep -v "skipUnless\(" | grep @unittest -A 3
output:
7:@unittest.skip("fb212.test_urls_security_usergroup Test_Detail.test_related fails with 302")
8-def sometest_function():
9- xxx
10:@unittest.expectedFailure
11-def test_another_function():
12- xxx
And my actual output from running it on the real *.py files, rather than my test.txt file:
egrep -A 3 -n '@unittest\.' *.py | egrep -v "\d:\s*#" | egrep -v "skipUnless\(" | grep @unittest -A 3
output:
test_backend_security_meta.py:77: @unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py-78- def test_storage(self):
test_backend_security_meta.py-79- with getMultiDb() as mdb:
test_backend_security_meta.py-80-
--
test_backend_security_meta.py:98: @unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py-99- def test_get_li_tag_for_object(self):
test_backend_security_meta.py-100- li = self.mgr.get_li_tag()
test_backend_security_meta.py-101-