Bash Regex: Search for a maximum of 3 consecutive vowels - regex

I am trying to search for a maximum of 3 consecutive vowels.
I tried
grep -E "([AEIOUaeiou]{3})" gpl3.txt
and got the results
What I want is to NOT get the (aaaaaaaaa) that you see in the first line of output. All other output is correct.
Any help is appreciated

If you want to avoid the -P option and lookaheads, you can use something like the following.
grep -iE '(^|[^aeiou])[aeiou]{3}([^aeiou]|$)' gpl3.txt
It just matches
the start of the line or a non-vowel
three vowels
a non-vowel or the end of the line
A test run:
IT070137 ~/tmp $ cat gpl3.txt
aaaaaaaaaaaaaaa
asdaiosd
aa
aaa
aaaa
this is a righteous queue
IT070137 ~/tmp $ grep -E '(^|[^aeiou])[aeiou]{3}([^aeiou]|$)' gpl3.txt
asdaiosd
aaa
this is a righteous queue

If you want to find all occurrences of exactly three vowels (no more, no less), then you can try this pattern:
grep -iP '(?<![aeiou])[aeiou]{3}(?![aeiou])'
Using option -P makes grep use the Perl library for regular expressions which is more feature-rich than the standard regexp library. For instance, it knows the patterns (?<!something) (?!something) which mean "must not be preceded by something" and "must not be followed by something", respectively. Using this I express the following:
»Find stuff which is three vowels long and not preceded by a vowel and not followed by a vowel.« This is another way of saying »exactly three vowels long«.
Concerning portability: this requires a grep that supports Perl-compatible regular expressions. The -P option is a GNU grep extension and is not part of POSIX grep. Today I guess this won't be an issue, but if you happen to code for historical machines, you need to check this first.
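A minimal sketch of the lookaround pattern in action, assuming GNU grep with -P support. The sample lines are the ones from the test run in the other answer; the variable names are only for illustration. Adding -o prints just the matched runs, which makes it easy to see what counts as "exactly three":

```shell
# Sample input reused from the test run shown earlier in this thread.
sample=$(printf '%s\n' aaaaaaaaaaaaaaa asdaiosd aa aaa aaaa 'this is a righteous queue')

# -o prints only the matched text, -i ignores case, -P enables lookarounds.
runs=$(printf '%s\n' "$sample" | grep -ioP '(?<![aeiou])[aeiou]{3}(?![aeiou])')

printf '%s\n' "$runs"
```

The "eou" in "righteous" is reported, while the four vowels "ueue" in "queue" are not, because no three-vowel window inside them is free of a neighbouring vowel.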

Try using a negative lookahead which asserts that four or more vowels do not appear consecutively:
grep -P '^(?!.*[AEIOUaeiou]{4,}).*$' gpl3.txt
We need to run this in Perl mode to use negative lookaheads.
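If -P is not available, a rough equivalent is to invert a plain extended-regex match, since the lookahead only rejects lines that contain four or more consecutive vowels. This is a sketch on made-up sample input, not a drop-in replacement for every case:

```shell
# Hypothetical sample input; the real file in the question is gpl3.txt.
sample=$(printf '%s\n' aaaaaaaaaaaaaaa asdaiosd aa aaa aaaa 'this is a righteous queue')

# -v inverts the match: keep lines with NO run of four or more vowels.
kept=$(printf '%s\n' "$sample" | grep -Ev '[AEIOUaeiou]{4,}')

printf '%s\n' "$kept"
```

Note two consequences of this line-level filter: "aa" is kept even though it has no run of three vowels, and the "righteous queue" line is dropped because "queue" contains four vowels in a row.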

Related

How do I find words with three or more vowels (of the same kind) with regex using back referencing?

How can I find words with three or more vowels of the same kind with a regular expression using back referencing?
I'm searching in text with a 3-column tab format "Word+PoS+Lemma".
This is what I have so far:
ggrep -P -i --colour=always '^\w*([aeioueöäüèéà])\w*?\1\w*?\1\w*?\t' filename
However, this gives me words with three vowels but not of the same kind.
I'm confused, because I thought the back referencing would refer to the same vowel it found in the brackets? I solved this problem by changing the .*? to \w*.
Thanks for the help!
Your regex looks too complicated. I'm not sure what you're trying to accomplish with the .*?, but the usage looks suspect. I'd use something like:
([aeioueöäüèéà])\1\1
i.e. match a vowel as a capture group, then say you need two more.
I didn't realise you wanted to allow other letters between the vowels. In that case, just allow zero or more "word" letters between the backreferences:
([aeioueöäüèéà])(\w*\1){2}
I suggest with GNU grep:
grep -E --colour=always -i '\b\w*([aeioueöäüèéà])(\w*\1){2,}\w*'
See: The Stack Overflow Regular Expressions FAQ
Using grep
$ grep -E '([aeioueöäüèéà])(\w*\1){2,}' input_file
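To check that a back-reference really pins the match to the same vowel, here is a small sketch on a made-up word list. It assumes a grep that accepts back-references with -E (GNU grep does), and it uses only ASCII vowels and [[:alpha:]] instead of \w so that it does not depend on the locale:

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' banana education indivisible queue)

# \1 must repeat the vowel captured by the first group, so this only
# matches words in which one particular vowel occurs at least three times.
same_vowel=$(printf '%s\n' "$words" | grep -E '([aeiou])([[:alpha:]]*\1){2}')

printf '%s\n' "$same_vowel"
```

"education" has five vowels but no vowel three times, so it is not reported; "banana" and "indivisible" are.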

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret)". I think the ".+" means "one or more of anything", but I'm not sure if that's correct and, if it is correct, what distinguishes it from ".*".
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceded by "//".
Thank you.
The first line tells you that the problem is from bash (your shell). Bash finds the ! and attempts to inject into your command the last command you entered that begins with \/\/. To avoid this you need to escape the ! or use single quotes. For an example of !, try !cat: it will execute the last command beginning with cat that you entered.
You don't need to escape /, it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name, e.g.:
grep -v "^//" bar.txt | grep "#include"
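If you're really hung up on using regular expressions, then a simple one would look like this (match start of line ^, any amount of whitespace [[:space:]]*, exactly two forward slashes /{2}, any number of any characters .*, followed by #include). Note that this pattern selects the #include lines that are commented out, so it finds the lines to exclude rather than the ones to keep: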
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using a negative lookahead, which is a PCRE feature and requires the -P option.
Your negative lookahead won't work without a start anchor.
This will of course require GNU grep.
You must use single quotes to use ! in your regex; otherwise history expansion is attempted with the text after the !, which is the reason for the !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h matches a single horizontal whitespace character, so \h* matches zero or more of them.
Without -P, or with a non-GNU grep, you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. The regex will also pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
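As a quick check of the #include regex, here is a sketch run against a few made-up source lines (the header names are placeholders, not taken from any real project):

```shell
# Hypothetical source lines covering the cases discussed above.
src=$(printf '%s\n' \
    '#include <stdio.h>' \
    '// #include <old.h>' \
    '  #  include "local.h"' \
    'int x; // not an include' \
    '#include MACRO_NAME')

# Start of line, optional spaces, #, optional spaces, include, then " or <.
includes=$(printf '%s\n' "$src" |
    grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]')

printf '%s\n' "$includes"
```

Only the first and third lines are printed: the commented-out include is skipped because the line does not start with optional whitespace followed by #, and the MACRO_NAME variant is skipped because of the trailing ["<].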
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (Perl) option, and anchor the lookahead at the start of the line:
grep -P '^(?!//)' bar.txt
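For the lines not beginning with "//", you could use ^([^/]|/[^/]|/?$) with grep -E. A shorter pattern such as (^[^/]{2}.*$) also rejects lines like "/x" and lines shorter than two characters.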
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a'
while with awk:
awk '/a/ && /b/'
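A small sketch of how the awk conditions compose, using made-up input lines. The second command simply adds the "does not start with //" test from earlier as one more condition:

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' 'ab' 'ba' 'xa' 'b' 'cab' '// ab')

# Lines containing both a and b, in either order.
both=$(printf '%s\n' "$lines" | awk '/a/ && /b/')

# Same condition, additionally skipping lines that start with //.
both_uncommented=$(printf '%s\n' "$lines" | awk '/a/ && /b/ && !/^\/\//')

printf '%s\n' "$both_uncommented"
```

With grep alone, the second case would need either a pipeline of two greps or a longer alternation.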

How to use regular expression in Linux to output number?

How do you output only the number of words in /usr/share/dict/words that begin with any letter, let's say j?
I was hoping to use egrep 'J*' /usr/share/dict/words, but it does not work well.
If your words are one on each line, then your solution is very close.
grep -ci '^j' /usr/share/dict/words
The ^ symbol means "start of line". -i flag means case insensitive search, -c means only report the count.
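To see why 'J*' does not help, here is a sketch on a tiny made-up word list instead of /usr/share/dict/words (which may not exist on every system). J* means "zero or more J", which matches the empty string and therefore every line:

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' jam Java apple joke major)

# 'J*' also matches the empty string, so every line counts.
count_all=$(printf '%s\n' "$words" | grep -c 'J*')

# Anchored, case-insensitive match: only words that start with j or J.
count_j=$(printf '%s\n' "$words" | grep -ci '^j')

printf 'J* matched %s lines, ^j matched %s lines\n' "$count_all" "$count_j"
```

"major" contains a j but does not start with one, so the anchored count is 3 while the unanchored 'J*' count is 5.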

How can I match zero or more instances of a pattern in bash?

I'm trying to loop through a bunch of file prefixes looking for a single line matching a given pattern from each file. I have extracted and generalized a couple examples and have used them below to illustrate my question.
I searched for a line that may have some spaces at the beginning, followed by the number 1234, with maybe some more spaces, and then the number 98765. I know the file of interest begins with l76.logsheet and I want to extract the line from the file that ends with one or more numbers. However, I want to make sure I exclude files ending with anything else (of which there are too many options to reasonably use the grep --exclude option). Here's how I did it from the tcsh shell:
tcsh% grep -E '^\s{0,}1234\s+98765' l76.logsheet[0-9]{0,}
l76.logsheet10:1234 98765 y 13:02:44 2
And here's another example where I was again searching for 98765, but with a different number out front and a different file prefix:
tcsh% grep -E '^\s{0,}4321\s+98765' k43.logsheet[0-9]{0,}
k43.logsheet1: 4321 98765 y 13:06:38 14
Works great and returns just what I need.
My problem is with the bash shell. Repeating the same command returns a rather interesting result. With the first line, there are no problems:
bash$ grep -E '^\s{0,}1234\s+98765' l76.logsheet[0-9]{0,}
which returns:
l76.logsheet10:1234 98765 y 13:02:44 2
But the result for the second example only has one digit at the end of the filename. This causes bash to throw an error before providing the correct result:
bash$ grep -E '^\s{0,}4321\s+98765' k43.logsheet[0-9]{0,}
grep: k43.logsheet[0-9]0: No such file or directory
k43.logsheet1: 4321 98765 y 13:06:38 14
My question is, how do I search for files ending in zero or more of the previous pattern from the bash shell? I have a work around, but I'm looking for an actual answer to this question, which may save me (and hopefully others) time in the future.
First, make sure that extglob is set:
shopt -s extglob
Now, we can match zero or more of any pattern with *(...). For example, let's create some files and match them:
$ touch logsheet logsheet2 logsheet23 logsheet234
$ echo logsheet*([0-9])
logsheet logsheet2 logsheet23 logsheet234
Documentation
According to man bash, bash offers the following features with extglob:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
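Applied to the file names from the question, a sketch might look like the following. It assumes bash; the temporary directory and the extra .bak and 1a files are made up to show what gets excluded. The pattern is kept in a variable here so that it is expanded at run time, after extglob has been switched on:

```shell
shopt -s extglob   # bash only; must be on before the pattern is expanded

workdir=$(mktemp -d)
touch "$workdir"/k43.logsheet "$workdir"/k43.logsheet1 "$workdir"/k43.logsheet10 \
      "$workdir"/k43.logsheet.bak "$workdir"/k43.logsheet1a

# *([0-9]) means "zero or more digits"; the unquoted $pattern is globbed.
pattern='k43.logsheet*([0-9])'
matched=$(cd "$workdir" && printf '%s\n' $pattern)

printf '%s\n' "$matched"
rm -r "$workdir"
```

At an interactive prompt the same pattern can be written directly on the command line, for example as the file argument to the grep command from the question.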

Grep regex to unscramble a word

I want to unscramble a word using the grep command.
I am using below code. I know there are other ways to do it, but I think I'm missing something here:
grep "^[yxusonlia]\{9\}$" /usr/share/dict/words
should produce one output:
anxiously
but it produces:
annulosan
innoxious
and many more. Basically I can't find how I should specify that characters
can only be matched once, so that I get only one output.
I apologise if it seems very simple but I tried a lot and can't find anything.
You can use grep -P (PCRE regex) with negative lookahead
grep -P '^(?:([yxusonlia])(?!.*?\1)){9}$' /usr/share/dict/words
anxiously
Explanation:
This grep regex uses the negative lookahead (?!.*?\1) for each character matched by group #1, i.e. \1. Each character is matched if and only if it is not followed by the same character again anywhere in the rest of the string.
You can use lookaheads to make sure that each letter is matched exactly one time. It is verbose and requires a version of grep that supports lookaheads (e.g. via -P). It may be better to build the search string programmatically.
grep -P "^(?=.*y)(?=.*x)(?=.*u)(?=.*s)(?=.*o)(?=.*n)(?=.*l)(?=.*i)(?=.*a)[yxusonlia]{9}$" /usr/share/dict/words
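Building that search string programmatically could look roughly like this. It is a sketch that assumes the scrambled letters are all distinct (a repeated letter would need a different check) and that grep -P is available for the final search; the variable names are only for illustration:

```shell
letters='yxusonlia'

# Turn every letter into a lookahead: y -> (?=.*y), x -> (?=.*x), ...
lookaheads=$(printf '%s' "$letters" | sed 's/./(?=.*&)/g')

# Anchor the match and require exactly as many characters as letters.
pattern="^${lookaheads}[${letters}]{${#letters}}\$"

printf '%s\n' "$pattern"
# Usage (assumes GNU grep and a word list at this path):
#   grep -P "$pattern" /usr/share/dict/words
```

The printed pattern is the same one as in the hand-written command above, so only the letters variable has to change for another word.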