grep for words ending in 'ing' immediately after a comma - regex

I am trying to grep files for lines with a word ending in 'ing' immediately after a comma, of the form:
... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
but not:
... this is a great place, we are having a good time ...
I would like to find instances where the 'ing' word is the first word after a comma. It seems like this should be very doable in grep, but I haven't figured out how, or found a similar example.
I have tried
grep -e ", .*ing"
which matches multiple words after the comma. Commands like
grep -i -e ", [a-z]{1,}ing"
grep -i -e ", [a-z][a-z]+ing"
don't do what I expect--they don't match phrases like my first two examples. Any help with this (or pointers to a better tool) would be much appreciated.

Try ,\s*\S+ing
Matches your first two phrases, doesn't match in your third phrase.
\s means 'any whitespace', * means 0 or more of that, \S means 'any non-whitespace' (capitalizing the letter is conventional for inverting the character set in regexes - works for \b \s \w \d), + means 'one or more' and then we match ing.
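A minimal sketch of that pattern run against the three sample lines from the question. It spells \s and \S as the POSIX classes [[:space:]] and [^[:space:]] so it does not depend on GNU extensions. The variable names are only for illustration. It also shows why the attempts in the question matched nothing: without -E, grep reads {1,} and + as literal characters.

```shell
sample='... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
... this is a great place, we are having a good time ...'

# POSIX-class spelling of ,\s*\S+ing (needs -E because of the +)
matches=$(printf '%s\n' "$sample" | grep -E ',[[:space:]]*[^[:space:]]+ing')

# The attempt from the question, unchanged: in a basic regex {1,} is literal text
unescaped=$(printf '%s\n' "$sample" | grep -i -e ', [a-z]{1,}ing' || true)

# The same attempt with -E, so {1,} is read as "one or more"
fixed=$(printf '%s\n' "$sample" | grep -i -E ', [a-z]{1,}ing')

printf '%s\n' "$matches"
```

Both the first and the third command should print only the "showing" and "having" lines.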

You can use the \b token to match on word boundaries (see this page).
Something like the following should work:
grep -e ".*, \b\w*ing\b"
EDIT: Except now I realised that the \b is unnecessary, and .*,\s*\w*ing would work, as Patashu pointed out. My regex-fu is rusty.
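One caveat, sketched under the assumption of GNU grep (which supports \w and \b): the leading \b is indeed redundant after ", ", but the trailing \b is what makes the word end in 'ing'. The fourth sample line is a made-up addition to show the difference.

```shell
sample='we gave the dog a bone, showing great generosity
this man, having no home
this is a great place, we are having a good time
the crowd cheered, singers took the stage'

# Without a trailing boundary, "ing" may sit in the middle of a word (singers)
loose=$(printf '%s\n' "$sample" | grep -E ', \w*ing')

# With the trailing \b, the word after the comma has to end in "ing"
strict=$(printf '%s\n' "$sample" | grep -E ', \w*ing\b')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```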

how to exclude comma in my regular expression

My input string is
-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,0000feaa-0000-1000-8000-00805f9b34fb
I want output like
-68 -79 -73
and my regular expression is
[-][0-9]{2}[^0-9]
and the result is
-68, -79,
I want to exclude the comma from each match.
How can I solve this?
Thank you for your help.
Based on your regex and your results, I assume you are finding multiple matches and then putting spaces between each match. Let me break down what your regex is doing:
[-] matches the negative sign
[0-9]{2} matches two digits
[^0-9] matches any non-digit character, including a comma. So the commas are part of your match
If you want to exclude the commas from your match, but still assert that they are there, you need to use a positive lookahead. This is done like so:
[-][0-9]{2}(?=[^0-9])
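A sketch of the lookahead on the string from the question, assuming a grep built with PCRE support (the -P option of GNU grep). -o prints each match on its own line and paste joins them with spaces. The variable names are only for illustration.

```shell
line='-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,0000feaa-0000-1000-8000-00805f9b34fb'

# (?=[^0-9]) requires a non-digit after the two digits without consuming it,
# so the comma stays out of the match and "-0000" in the UUID is skipped.
result=$(printf '%s\n' "$line" | grep -oP '[-][0-9]{2}(?=[^0-9])' | paste -sd ' ' -)

printf '%s\n' "$result"
```

Writing the sign as [-] also keeps grep from reading the pattern as an option.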
Already said this in the comments but will post answer just for the sake of completion.
The solution to this isn't exactly regex. It's the replace function of whatever tool you're using. All you have to do is replace each , with a space.
For example, in python .replace(',', ' ') is sufficient
which language are you using?
For example:
sed
echo "-34,-35,-34" | sed 's/,/ /g'
awk
echo "-34,-35,-34" | awk '{gsub(/,/, " ", $0); print $0}'

How to grep an exact string with slash in it?

I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
I want to grep only the bolded words.
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?
You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.
grep '/superman\>'
\> is the "end of word marker", and for "superman3", the end of word is not following "man"
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character
In the case where the line is /batman/superman,
the pattern superman/ does not appear
the pattern /superman is:
at the end of the line, which is OK, but
is preceded by the character "n" which is a word constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
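A sketch of both points on the list from the question: the unescaped | is taken literally in a basic regex, and -w with the bare word superman keeps exactly the wanted lines. The variable names are only for illustration.

```shell
paths='/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2'

# Without -E the | is a literal character, and no line contains "superman/|/superman"
literal_bar=$(printf '%s\n' "$paths" | grep 'superman/|/superman' || true)

# -w: the match must be delimited by non-word characters or line boundaries
whole_word=$(printf '%s\n' "$paths" | grep -w 'superman')

printf '%s\n' "$whole_word"
```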

Grep for a string that ends with specific character

Is there a way to use extended regular expressions to find a specific pattern that ends with a string.
I mean, I want to match first 3 lines but not the last:
file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
I know that in grep, metacharacter $ matches the end of a line but I'm not interested in matching a line end but string end. Groups in grep are very odd, I don't understand them well yet.
I tried with group matching, actually I have a similar REGEX but it does not work with grep -E
(\w+).pdf$
Is there a way to do string ending match in grep/egrep?
Your example works with matching the space after the string also:
grep -E '\.pdf ' input.txt
What you call "string" is similar to what grep calls "word". A word is a run of alphanumeric characters. The nice thing with words is that you can match a word end with the special \>, which matches the end of a word as a zero-length match. That also matches at the end of a line. But the set of word characters cannot be changed and does not contain punctuation, so we cannot use it here.
If you need to match at the end of line too, where there is no space after the word, use:
grep -E '\.pdf |\.pdf$' input.txt
To include cases where the character after the file name is not a space character ' ', but other whitespace, like a tab, \t, or where the name is directly followed by a comment starting with #, use:
grep -E '\.pdf[[:space:]#]|\.pdf$' input.txt
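A quick check of that last command against the sample lines from the question, fed through a pipe instead of input.txt. The final line, report.pdf, is a made-up addition that exercises the end-of-line branch.

```shell
input='file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
report.pdf'

# .pdf followed by whitespace or #, or .pdf at the very end of the line
matches=$(printf '%s\n' "$input" | grep -E '\.pdf[[:space:]#]|\.pdf$')

printf '%s\n' "$matches"
```

The .pdf.zip line is left out because the character after .pdf is a dot.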
I will illustrate the matching of word boundaries too, because that would be the perfect solution, except that we can not use it here because we can not change the set of characters that are seen as parts of a word.
The input contains foo as separate word, and as part of longer words, where the foo is not at the end of the word, and therefore not at a word boundary:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n'
foo bar
foo.bar
foobar
foo_bar
foo
Now, to match the boundaries of words, we can use \< for the beginning, and \> to match the end:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n' | grep 'foo\>'
foo bar
foo.bar
foo
Note how _ is matched as a word char - but otherwise, wordchars are only the alphanumerics, [a-zA-Z0-9].
Also note how foo at the end of a line is matched - in the line containing only foo. We do not need a special case for the end of line.
You can use \> operator
grep 'word\>' fileName
You need to escape the . in your regex. This regex will match anything that ends in .pdf (and only things that end in .pdf):
.*\.pdf$
Positive lookaheads are the most suited for this kinda stuff. Have a try :
grep -P "(^\w+\.pdf)(?=\s)" file
I assume filenames will always be on the start of the line.

grep and sed regular expressions meaning - extracting urls from a web page

grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
After trawling the internet finding the answer to my homework question, I finally got the above. But I don't completely understand the meaning of the two regular expressions used with sed and grep. Can somebody please shed some light on me? Thanks in advance.
The grep command looks for any lines that include a match to
'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
which is
<a the characters <a
[^>] one character that is not a closing '>'
\+ the previous item one or more times, so [^>]\+ covers everything between <a and href without leaving the tag
(without the \+, exactly one character, usually a space, could separate <a from href)
href followed by the string 'href'
[ ]* followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
= followed by the equals sign
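[ \t]* followed by zero or more spaces or tabs ("white space"); note that grep does not treat \ as an escape inside [], so this really matches a space, a backslash or the letter t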
" followed by open quote (but only a double quote...)
\( open bracket (grouping)
ht characters 'ht'
\| or
f character f
\) close group (of the either-or)
tp characters 'tp'
s\? optionally followed by s
Note - the last few lines combined means 'http or https or ftp or ftps'
: character :
[^"]\+ one or more characters that are not a double quote
this is "everything until the next quote"
Does that get you started? You can do the same for the next bit...
Note to confuse you - the backslash is used to change the meaning of some special characters like ()+; just to keep everyone on their toes, whether these have special meaning with or without the backslash is not something that is defined by the regular expression syntax, but rather by the command in which you use it (and its options). For example, sed changes the meaning of things depending on whether you use the -E flag.
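For "the next bit", here is a sketch of the whole pipeline on a few made-up lines of HTML, assuming GNU grep and GNU sed (\+, \? and \| are GNU extensions in basic regexes). The sample markup and variable names are only for illustration.

```shell
html='<p><a class="nav" href="https://example.com/a">A</a></p>
<a href="ftp://files.example.com/b.txt">B</a> <a href="/relative">C</a>
<a href = "http://example.org/">D</a>'

# grep -o prints only the matched part, e.g. <a class="nav" href="https://example.com/a"
# sed then keeps what sits between the last pair of double quotes:
#   ^.*"          everything up to the opening quote of the URL (greedy)
#   \([^"]\+\)    the URL itself, captured as group 1
#   ".*$          the closing quote and anything after it
#   \1            replace the whole line with the captured URL
urls=$(printf '%s\n' "$html" |
  grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' |
  sed -e 's/^.*"\([^"]\+\)".*$/\1/g')

printf '%s\n' "$urls"
```

The relative link /relative is dropped because the pattern insists on http, https, ftp or ftps right after the opening quote.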

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?
There's several issues here.
parens in vim regexen are not for capturing -- you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal )". You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (+ means "1 or more of the previous").
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)
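sed uses the same escaped-parenthesis convention in its basic regexes, so the substitution can be tried outside vim. A sketch with a hypothetical one-word line, shop_items; the -E form assumes a sed that accepts that flag (GNU and BSD sed do).

```shell
# Hypothetical input; on a longer line the group would capture everything up to the end of the line
line='shop_items'

# Basic regex: capture groups are written \( \), as in vim's default mode
bre=$(printf '%s\n' "$line" | sed 's/shop_\(.*\)/shop_\1 wp_\1/')

# Extended regex: plain ( ) capture, comparable to vim's \v mode
ere=$(printf '%s\n' "$line" | sed -E 's/shop_(.+)/shop_\1 wp_\1/')

printf '%s\n%s\n' "$bre" "$ere"
```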
If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(.*)/shop_\1 wp_\1/
@Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.