grep and sed regular expressions meaning - extracting urls from a web page

grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
After trawling the internet for an answer to my homework question, I finally arrived at the command above. But I don't completely understand the two regular expressions used with grep and sed. Can somebody please shed some light on them? Thanks in advance.

The grep command looks for any lines that include a match to
'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
which is
<a the characters <a
[^>] followed by one character that is not a closing '>'
\+ the last thing one or more times, i.e. a run of non-'>' characters
(without it, exactly one such character, like the space in '<a href', would have to sit between '<a' and 'href')
href followed by the string 'href'
[ ]* followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
= followed by the equals sign
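[ \t]* meant as zero or more spaces or tabs ("white space"); note that grep treats a backslash inside [] literally, so this really matches space, backslash or the letter t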
" followed by open quote (but only a double quote...)
\( open bracket (grouping)
ht characters 'ht'
\| or
f character f
\) close group (of the either-or)
tp characters 'tp'
s\? optionally followed by s
Note - the last few lines combined means 'http or https or ftp or ftps'
: character :
[^"]\+ one or more characters that are not a double quote
this is "everything until the next quote"
Does that get you started? You can do the same for the next bit...
Note, to confuse you: the backslash is used to change the meaning of some special characters like ( ) and +. Just to keep everyone on their toes, whether these have special meaning with or without the backslash is not defined by regular expression syntax as such, but by the command in which you use them (and its options). For example, sed changes the meaning of these characters depending on whether you use the -E flag.
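
To see the whole pipeline at work, here is a small sketch run on a made-up line of HTML. The function name extract_urls and the sample URLs are illustrative, not from the question, and the \+, \? and \| operators assume GNU grep and GNU sed. The sed stage is the "next bit": on each fragment that grep prints, it greedily skips to the last pair of double quotes and keeps only what sits between them.

```shell
# Hypothetical wrapper around the pipeline from the question (GNU grep/sed assumed).
extract_urls() {
  # grep -o prints each matching '<a ... href="URL"' fragment on its own line;
  # sed then reduces every fragment to the text inside its last pair of quotes.
  grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' |
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
}

# Made-up sample input with two links on one line.
printf '%s\n' '<p><a class="x" href="https://example.com/a">A</a> <a href="ftp://example.org/f">F</a></p>' |
  extract_urls
```

On that sample line this should print the two URLs, one per line.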

Related

Replace "advanced" pattern in sed

I can't figure out how to change this:
\usepackage{scrpage2}
\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
to this using sed only
REPLACED
REPLACED REPLACEDREPLACEDREPLACED
REPLACED
I'm trying stuff like sed 's!\\.*\([.*]\)\?{.\+}!REPLACED!g' FILE
but that gives me
REPLACED
REPLACED
REPLACED
I think .* gets used and everything else in my pattern is just ignored, but I can't figure out how to go about this.
After I learned how to format a regex like that, my next step would be to change it to this:
\usepackage{scrpage2}
\usepackage{pgf}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
So I would appreciate any pointers in that direction too.

Here's some code that happens to work for the example you gave:
sed 's/\\[^\\[:space:]]\+/REPLACED/g'
I.e. match a backslash followed by one or more characters that are not whitespace or another backslash.
To make things more specific, you can use
sed 's/\\[[:alnum:]]\+\(\[[^][]*\]\)\?{[^{}]*}/REPLACED/g'
I.e. match a backslash followed by one or more alphanumeric characters, followed by an optional [ ] group, followed by a { } group.
The [ ] group matches [, followed by zero or more non-bracket characters, followed by ].
The { } group matches {, followed by zero or more non-brace characters, followed by }.
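
A quick way to check the more specific pattern is to feed it the middle line from the question. The sketch below assumes GNU sed (for \+, \? and \n in the replacement), and the function names are made up for illustration. The second function is one possible take on the follow-up wish of putting each \usepackage on its own line; it is not part of the original answer.

```shell
# Hypothetical helper names; both read LaTeX source on stdin. GNU sed assumed.
replace_commands() {
  # backslash + command name + optional [ ] group + { } group -> REPLACED
  sed 's/\\[[:alnum:]]\+\(\[[^][]*\]\)\?{[^{}]*}/REPLACED/g'
}

split_packages() {
  # Swallow whitespace before each \usepackage and start a new line there,
  # then drop the empty first line this creates at the start of the input line.
  sed 's/[[:space:]]*\(\\usepackage\)/\n\1/g; s/^\n//'
}

line='\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}'
printf '%s\n' "$line" | replace_commands
printf '%s\n' "$line" | split_packages
```

The first call should print REPLACED REPLACEDREPLACED; the second should print the three \usepackage commands on separate lines.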

Perl to the rescue! It features the "frugal quantifiers":
perl -pe 's!\\.*?\.?{.+?}!REPLACED!g' FILE
Note that I removed the capturing group as you didn't use it anywhere. Also, [.*] matches either a dot or an asterisk, but you probably wanted to match a literal dot instead.

What's the best way to replace text in round brackets with the same text in square brackets?

I'm trying to do a global find/replace of strings like ('id') with ['id'] using sed on a Mac. I'm having trouble putting together the correct regex to correctly match the brackets without causing syntax errors. I'm also not necessarily interested in using sed, it just seemed like the best way to do it.
I've tried the following code:
sed -i "" "s/(['].*['])/[\1]/g" file.txt
and
sed -i "" "s/[(]['].*['][)]/[\1]/g" file.txt
How should I approach this?

Assuming there are no ' in between (' and ') you may use
sed "s/(\('[^']*'\))/[\1]/g"
The point is that the capturing groups in BRE POSIX regex patterns must be declared with \(...\), while ( and ) denote literal ( and ) symbols. [^']* matches zero or more symbols other than '.
POSIX BRE pattern details:
( - a literal ( symbol
\('[^']*'\) - a capturing group matching:
' - a single quote
[^']* - a negated bracket expression matching zero or more (*) chars other than ' and then
' - a single quote
) - a literal ) symbol.
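
As a small check of the breakdown above, the command can be run on a made-up line (the function name and sample text are only for illustration). It uses only POSIX BRE features, so it should behave the same with the BSD sed on a Mac and with GNU sed.

```shell
# Hypothetical wrapper; reads text on stdin.
to_square() {
  # ( and ) are literal in BRE, \( \) capture the quoted string in between.
  sed "s/(\('[^']*'\))/[\1]/g"
}

printf '%s\n' "foo('id') + bar('name')" | to_square
```

This should print foo['id'] + bar['name'].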

How robust do you need the script to be? Are all of the examples a single set of parentheses, or are some nested? Nested parentheses may be possible to handle in practice, but are provably hard to do robustly in sed. Should we account for parentheses inside strings and leave those alone? If so, you've got quite a rabbit hole to go down, and it may be impossible.
Here's a reasonably simple one that assumes the simplest case:
sed 's/(\([^)]*\))/[\1]/g' test.tmp
Explanation:
sed 's/<find>/<replace>/g'
The sed substitute command searches for a regular expression within each line and replaces it as specified. The g option indicates a 'global' replacement, meaning it replaces all occurrences on a line, not just the first.
(\([^)]*\))
The outside parentheses match those you're hoping to replace. The inside escaped parentheses, \( and \), create a group around the text you want to keep. [^)] matches any character that is not a ), while the following * tells us to match 0+ such characters.
[\1]
The \1 represents the contents of the first (and only) group we formed earlier, and we then place the desired square brackets around it.
Any text not matched by the regular expression remains untouched.
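
To illustrate both the simple case and the nesting caveat, here is a sketch that reads from stdin instead of test.tmp (the function name and input are made up):

```shell
# Hypothetical wrapper around the command above; reads text on stdin.
round_to_square() {
  sed 's/(\([^)]*\))/[\1]/g'
}

# One simple pair of parentheses and one nested pair.
printf '%s\n' 'f(a) g(b(c))' | round_to_square
```

The simple pair becomes f[a], but the nested one comes out as g[b(c]) because [^)]* stops at the first closing parenthesis, which is the kind of breakage the questions above are about.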

Perl not matching regex?

I'm trying to remove all the comments in a bunch of SGF files, and have come up with the following perl command:
perl -pi -e 's/P?C\[(?:[^\]\\]++|\\.)*+\]//gm' *.sgf
I'm trying to match and remove a C or PC followed by a left bracket, then characters that aren't right brackets (if they are they have to be escaped with a \) and then a right bracket.
I'm trying to match the following examples:
C[HelloBot9 [-\]: GTP Engine for HelloBot9 (white): HelloBot version 0.6.26.08]
PC[IA [-\]: GTP Engine for IA (black): GNU Go version 3.7.11
]
C[person [-\]: \\\]]
C[AyaMC [3k\]: GTP Engine for AyaMC (black): Aya version 6.61 : If you pass, AyaMC
will pass. When AyaMC does not, please remove all dead stones.]
And some examples that shouldn't be matched:
XYZ[Other stuff \]]
C[stuff\]
PC[stuff\\\]
The regex works in several online regex testers (including a few that state they are perl regex testers), but for some reason doesn't work on the command line. Help is appreciated.

You need to run perl with -0777 option to make sure that contents spanning across lines and matching the pattern can be found. So, using perl -0777pi -e instead of perl -pi -e will solve the issue.
I would also suggest optimizing the pattern a bit by unrolling the alternation group, thus making the matching process "linear":
s/P?C\[[^]\\]*(?:\\.[^]\\]*+)*]//sg
Note that if PC should be matched as a whole word, add \b before P.
Pattern details:
P?C\[ - either PC[ or C[ literal char sequence
[^]\\]* - zero or more chars other than \ and ]
(?:\\.[^]\\]*+)* - zero or more sequences of:
\\. - a literal \ and then any char (.)
[^]\\]*+ - 0+ chars other than ] and \ (matched possessively, no backtracking into the pattern)
] - a literal ] symbol (note it does not have to be escaped outside the character class to denote a literal closing bracket)

grep for words ending in 'ing' immediately after a comma

I am trying to grep files for lines with a word ending in 'ing' immediately after a comma, of the form:
... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
but not:
... this is a great place, we are having a good time ...
I would like to find instances where the 'ing' word is the first word after a comma. It seems like this should be very doable in grep, but I haven't figured out how, or found a similar example.
I have tried
grep -e ", .*ing"
which matches multiple words after the comma. Commands like
grep -i -e ", [a-z]{1,}ing"
grep -i -e ", [a-z][a-z]+ing"
don't do what I expect--they don't match phrases like my first two examples. Any help with this (or pointers to a better tool) would be much appreciated.

Try ,\s*\S+ing
Matches your first two phrases, doesn't match in your third phrase.
\s means 'any whitespace', * means 0 or more of that, \S means 'any non-whitespace' (capitalizing the letter is conventional for inverting the character set in regexes - works for \b \s \w \d), + means 'one or more' and then we match ing.
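
On the command line this could look like the sketch below. It assumes GNU grep, which understands \s and \S, and it needs -E because + is only a quantifier in extended syntax; the same goes for {1,}, which would also explain why the [a-z]{1,}ing and [a-z][a-z]+ing attempts in the question matched nothing without -E. The function name is made up.

```shell
# Hypothetical wrapper; prints the input lines in which a word ending in
# 'ing' comes directly after a comma (GNU grep assumed for \s and \S).
find_ing() {
  grep -E ',\s*\S+ing'
}

printf '%s\n' \
  'we gave the dog a bone, showing great generosity' \
  'this man, having no home' \
  'this is a great place, we are having a good time' |
  find_ing
```

Only the first two sample lines should be printed.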

You can use the \b token to match on word boundaries (see this page).
Something like the following should work:
grep -e ".*, \b\w*ing\b"
EDIT: Except now I realised that the \b is unnecessary, and .*,\s*\w*ing would work, as Patashu pointed out. My regex-fu is rusty.

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?

There are several issues here.
parens in vim regexen are not for capturing -- you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (+ means "1 or more of the previous")
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
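
Outside of Vim, the same substitution can be tried from a shell, since GNU sed's default syntax treats \( \) and \+ the same way. This is only a rough parallel under that assumption, with a made-up function name and input:

```shell
# Hypothetical sed analogue of the Vim command above (GNU sed assumed for \+).
dup_with_prefix() {
  sed 's/shop_\(.\+\)/shop_\1 wp_\1/'
}

printf '%s\n' 'shop_items' | dup_with_prefix
```

This should print shop_items wp_items.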

If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)

If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/

@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.
