Searching for multiple matches on one line using Grep and Regex

I'm trying to use Grep with wc -l to print out the number of words in a text file that have 3 or more vowels in a row.
Right now, I'm inputting:
grep -i -E '\<.*[aeiou]{3}.*\>' file.txt | wc -l
but this is not returning the correct number of words, because on some lines there are multiple words that have 3 vowels in a row.
If file.txt contains this:
beautiful courteous
beautiful
courteous
My desired output would be 4 rather than 3, and currently I can only get 3.
I've been looking online for a while for a solution but I just can't seem to figure it out. Can anyone assist?

To get each matching word on a separate line, use the -o option:
$ grep -iEo '[[:alnum:]]*[aeiou]{3}[[:alnum:]]*' file.txt
beautiful
courteous
beautiful
courteous
$ grep -iEo '[[:alnum:]]*[aeiou]{3}[[:alnum:]]*' file.txt | wc -l
4
[[:alnum:]]*[aeiou]{3}[[:alnum:]]* matches words with three consecutive vowels. -o ensures that each matching word is printed on a separate line.
If you want to be stricter about the definition of a word, you may want instead to use [[:alpha:]]*[aeiou]{3}[[:alpha:]]*.
Documentation
From man grep:
-o, --only-matching Print only the matched (non-empty)
parts of a matching line, with each such part on a separate output
line.
Discussion
Consider:
\<.*[aeiou]{3}.*\>
In the above, note that . matches any character and .* is greedy: it matches the longest possible match. Thus, \<.*[aeiou]{3} will match from the beginning of the first word on a line to the last occurrence on the line of three vowels in a row. The final .*\> will match from there to the end of the last word on the line. This is not what you need.
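The difference between counting lines and counting matches can be checked directly on the sample input. The following is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep both do); the temporary file and the variable names are made up for illustration.

```shell
# Recreate the sample file.txt from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'beautiful courteous' 'beautiful' 'courteous' > "$tmp"

# -c counts matching lines, so two words on one line count once.
lines=$(grep -ciE '[aeiou]{3}' "$tmp")

# -o prints every match on its own line, so wc -l counts words.
words=$(grep -oiE '[[:alpha:]]*[aeiou]{3}[[:alpha:]]*' "$tmp" | wc -l)
words=$((words))   # strip any padding some wc implementations add

echo "lines: $lines, words: $words"
rm -f "$tmp"
```

On the sample input this should report 3 lines but 4 words.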

You should do it in 2 steps...
First you split the file into words:
tr -s '[[:punct:][:space:]]' '\n' < file.txt > wordsFile.txt
and then you count the matching words:
grep -i -E '.*[aeiou]{3}.*' wordsFile.txt | wc -l
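The two steps can also be chained with a pipe so that no intermediate file is needed. This is only a sketch of that variation: count_vowel_words is a hypothetical helper name, and grep -c is used in place of wc -l to do the counting.

```shell
# Hypothetical helper: count words with three consecutive vowels in a file.
count_vowel_words() {
  # Turn every run of punctuation or whitespace into one newline,
  # then count the resulting one-word lines that match.
  tr -s '[:punct:][:space:]' '\n' < "$1" | grep -ciE '[aeiou]{3}'
}

tmp=$(mktemp)
printf '%s\n' 'beautiful courteous' 'beautiful' 'courteous' > "$tmp"
count=$(count_vowel_words "$tmp")
echo "$count"
rm -f "$tmp"
```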

Related

How can I get a list of the words that have six or more consonants in a row using the grep command?

I want to find a list of words that contain six or more consonants in a row from a number of text files.
I'm pretty new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]{6}"
I use the cat command here because it will otherwise include the file names in the next pipe. I use the second pipe to get a list of all the words in the text files.
The problem is the last pipe: I want it to grep for six consonants in a row, which don't need to be the same letter. I know one way of solving the problem, but it would create a command longer than this entire post.
For the last grep you also need the -E switch - or you need to escape the curly braces:
cat *.txt | grep -Eo "\w+" | grep -Ei "[^AEOUIaeoui]{6}"
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]\{6\}"
Regarding "I use the cat command here because it will otherwise include the file names in the next pipe":
you can disable the file names with the -h flag:
grep -hEo "\w+" *.txt | grep -Ei "[^AEOUIaeoui]{6}"
You can use
grep -hEio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' *.txt
Regex details
[[:alpha:]]* - zero or more letters
[b-df-hj-np-tv-z]{6} - six English consonant letters in a row
[[:alpha:]]* - zero or more letters.
The grep options make the search case-insensitive (-i), print only the matched text (-o), and suppress the filenames (-h). The -E option enables POSIX ERE syntax; without it, you would need to escape {6} as \{6\}.
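One practical difference between the answers: a word-splitting step based on \w+ also returns tokens made of digits and underscores, and [^aeiou] accepts those characters as "non-vowels", whereas the explicit consonant class does not. A small sketch with made-up input, assuming GNU or BSD grep; [[:alnum:]_]+ stands in for \w+, and LC_ALL=C is set only to keep the bracket ranges predictable.

```shell
tmp=$(mktemp)
printf '%s\n' 'catchphrase 1234567 banana' > "$tmp"

# Negated class: any six non-vowels in a row, so the digit run is reported too.
negated=$(LC_ALL=C grep -oE '[[:alnum:]_]+' "$tmp" | LC_ALL=C grep -iE '[^aeiou]{6}')

# Explicit consonant class: only letters can satisfy the pattern.
explicit=$(LC_ALL=C grep -oiE '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' "$tmp")

printf 'negated:\n%s\nexplicit:\n%s\n' "$negated" "$explicit"
rm -f "$tmp"
```

Here the negated class reports both catchphrase and 1234567, while the explicit class reports only catchphrase.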
Use this Perl one-liner:
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Example:
cat > in_file.txt <<EOF
the abcdfghi aBcdfghi.
ABCDFGHI234
abcdEfgh
EOF
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Output:
abcdfghi
aBcdfghi
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The regex uses these modifiers:
/g : Multiple matches.
/i : Case-insensitive matches.
/\b([a-z]+)\b/ig : Match words that consist of 1 or more letters only ([a-z]+), with a word boundary \b on both sides. This way, ABCDFGHI234 does not match, but all 3 words in line 1 (the, abcdfghi, aBcdfghi) match. This may be important for some applications. Note that not all answers in this thread use word boundaries around letters, and thus do not make the distinction shown in this example.
/[^aeoui]{6}/i : Match 6 consecutive non-vowels, which selects every word containing 6 or more of them in a row. Non-vowels here resolve exactly to consonants, because the previous regex selected words made of letters only, that is, vowels and consonants.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
Get all words containing 6 or more consonants in a row in a given directory
cat *.txt | grep -Eo "\w+" | grep -E "[^AEOUIaeoui]{6,}"
We can use grep -Eo (-E Extended regex, -o output ONLY matching)
cat *.txt will output all of the data from all txt files in the current directory
grep -Eo "\w+" will output all of the words from an input in the form of one word per line
We can use Regex to search for strings that contain a pattern:
[^LISTOFCHARACTERS] Any character but LISTOFCHARACTERS
{6,} 6 or more

Parsing only first regex match in a line with several matches

Is it possible to have a regex that parses only a1bcdea1 from this line a1bcdea1ABCa1DEFa1 ?
This grep command does not work:
$ cat txtfile
a1bcdea1ABCa1DEFa1
$ grep -oE "[A-Z,a-z]1.*?[A-Z,a-z]1" txtfile
a1bcdea1ABCa1DEFa1
I want the output of grep to be only a1bcdea1.
EDIT:
It is obvious that I can just use grep -o "a1bcdea1" for the above line, but consider if one has several thousands of lines and the goal is to match FIRST [A-Z,a-z]1.*?[A-Z,a-z]1 for each single line.
How about using a ^ start anchor and restricting character set used:
grep -o '^[A-Za-z]1[A-Za-z]*1'
If you expect more digits or other characters in between, go with this
grep -oP '^[A-Za-z]1.*?[A-Za-z]1'
Lazy matching requires Perl-compatible mode (-P). If the match does not begin at the start of the line, go with this
grep -oP '^.*?\K[A-Za-z]1.*?[A-Za-z]1'
\K resets beginning of the reported match and is a PCRE feature as well.
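Where grep -P is not available (it is a GNU extension and is not present in every build), the same "first, shortest" match can be approximated by searching twice with POSIX awk's match() function. This is only a sketch under that assumption; first_pair is a hypothetical helper name, and the sample lines other than the one from the question are made up.

```shell
# Hypothetical helper: print, per line, the text from the first "letter + 1"
# token through the next one, mimicking [A-Za-z]1.*?[A-Za-z]1.
first_pair() {
  awk '{
    if (!match($0, /[A-Za-z]1/)) next          # no opening token on this line
    start = RSTART; len = RLENGTH
    rest = substr($0, start + len)             # text after the opening token
    if (match(rest, /[A-Za-z]1/))              # nearest closing token
      print substr($0, start, len + RSTART + RLENGTH - 1)
  }' "$@"
}

result=$(printf '%s\n' 'a1bcdea1ABCa1DEFa1' 'xx b1yz9c1d1' 'no match here' | first_pair)
printf '%s\n' "$result"
```

Lines without two such tokens print nothing, which mirrors grep -o.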
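Here is a GNU awk solution using the split function:
awk '(n = split($0, a, /[a-zA-Z]1/, b)) > 2 {print b[1] a[2] b[2]}' file
a1bcdea1
This awk command splits each line on the regex /[a-zA-Z]1/ and stores the split tokens in array a and the delimiters in array b (the fourth argument to split is a gawk extension). The condition n > 2 makes sure the line contains at least two delimiters, so that b[2] exists.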

How to find only the lines that contain two consecutive vowels

How can I find lines that contain consecutive vowels? I tried:
$ (filename) | sed '/[a*e*i*o*u]/!d'
To find lines that contain consecutive vowels you should consider using
sed -n '/[aeiou]\{2,\}/p' file
Here, [aeiou]\{2,\} pattern matches 2 or more occurrences (\{2,\} is an interval quantifier with the minimum occurrence number set to 2) and [aeiou] is a bracket expression matching any char defined in it.
The -n suppresses output, and the p command prints specific lines only (that is, -n with p only outputs the lines that match your pattern).
Or, you may get the same functionality with grep:
grep '[aeiou]\{2,\}' file
grep -E '[aeiou]{2,}' file
Here is a demo:
s="My boomerang
Text here
Koala there"
sed -n '/[aeiou]\{2,\}/p' <<< "$s"
Output:
My boomerang
Koala there
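Note that [aeiou] only covers lowercase vowels. If uppercase vowels should count as well, grep's -i option is one way to get that. The sketch below uses made-up sample text to show the difference.

```shell
s='My boomerang
Text here
Each day'

# Lowercase-only class: "Ea" in "Each" is not a pair of lowercase vowels.
lower_only=$(printf '%s\n' "$s" | grep -E '[aeiou]{2,}')

# -i makes the bracket expression match both cases.
any_case=$(printf '%s\n' "$s" | grep -iE '[aeiou]{2,}')

printf 'lower only:\n%s\nany case:\n%s\n' "$lower_only" "$any_case"
```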

Regex character repeats n or more times in line with grep

I need a regular expression that finds a character repeated 4 or more times with grep.
I know that the quantifier is {n,}, so to find lines where, for example, the character "g" repeats 4 or more times, the command according to the grep man page should be:
grep "g{4,}" textsamplefile
But it doesn't work. Any help?
The repeated characters may be separated by other letters. For example, valid matches are:
gexamplegofgvalidgmatchg
gothergvalidgmatchgisghereg
ggggother
You should change your grep command to:
grep -E 'g{4,}' input_file # prints only the lines containing a run of 4 or more consecutive g's
If you want all the lines that contain a run of 4 or more identical characters, your regex becomes:
grep -E '(.)\1{3,}' input_file
If you do not need the characters to be consecutive, but only want lines where g appears 4 or more times:
grep -E '([^g]*g){4}' input_file
You can generalize this to any character appearing 4 or more times by using:
grep -E '(.)(.*\1){3}' input_file
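The difference between "consecutive" and "anywhere on the line" is easy to see on input like the question's. A minimal sketch, using two of the question's sample lines plus a made-up non-matching line (gag):

```shell
tmp=$(mktemp)
printf '%s\n' 'gexamplegofgvalidgmatchg' 'ggggother' 'gag' > "$tmp"

# Lines with a run of four or more consecutive g characters.
consecutive=$(grep -E 'g{4,}' "$tmp")

# Lines with at least four g characters anywhere.
anywhere=$(grep -E '([^g]*g){4}' "$tmp")

printf 'consecutive:\n%s\nanywhere:\n%s\n' "$consecutive" "$anywhere"
rm -f "$tmp"
```

Only ggggother has a consecutive run, while both of the question's lines contain at least four g characters.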

How can I use grep and regex to match a word with specific length?

I'm using Linux's terminal and I've got a wordlist which has words like:
filers
filing
filler
filter
finance
funky
fun
finally
futuristic
fantasy
fabulous
fill
fine
And I want to use grep with a regex to find the words that start with the two letters "fi" and only show a word if it's 6 characters long in total.
I've tried:
cat wordlist | grep "^fi"
This shows the words beginning with fi.
I've then tried:
cat wordlist | grep -e "^fi{6}"
cat wordlist | grep -e "^fi{0..6}"
and plenty more, but it's not bringing back any results. Can anyone point me in the right direction?
It's fi and four more characters:
grep '^fi....$'
or shorter
grep '^fi.\{4\}$'
or
grep -E '^fi.{4}$'
$ matches at the end of line.
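To see why the end anchor matters, the patterns can be run against the wordlist from the question. This is a sketch only; the temporary file stands in for the asker's wordlist.

```shell
wordlist=$(mktemp)
printf '%s\n' filers filing filler filter finance funky fun finally \
  futuristic fantasy fabulous fill fine > "$wordlist"

# Anchored at both ends: exactly "fi" plus four more characters.
exact=$(grep -E '^fi.{4}$' "$wordlist")

# No end anchor: any word starting with "fi" that has six or more characters.
at_least=$(grep -E '^fi.{4}' "$wordlist")

printf 'exact:\n%s\nat least:\n%s\n' "$exact" "$at_least"
rm -f "$wordlist"
```

Without the $, the seven-letter words finance and finally are reported as well.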
Solution:
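cat wordlist | grep -E '^fi.{4}$'
Your try:
cat wordlist | grep -e "^fi{6}"
With extended syntax, fi{6} means f followed by six i's, because the quantifier applies only to the character right before it. The dot added above matches any character, so the pattern reads as fi followed by any four characters. I've also added a $ to mark the end of the line. Note the uppercase -E: the lowercase -e only introduces a pattern and, in basic syntax, leaves {4} as literal text.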
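Try this:
grep -P '^fi.{4}$'
Note that since you already have "fi", you need exactly 4 more characters.
. denotes any character, {4} matches it exactly 4 times, and $ anchors the end of the line. Without the $ (or with {4,}), longer words such as finance would match too.
If you write grep -e "^fi{6}" as you did in your example, the quantifier applies only to the i, so (with extended syntax enabled) you're trying to match strings beginning with f followed by six i's.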