Using Regex to search for several occurrences of a word

How do I search for x or more occurrences of a word using regular expressions and grep in a .txt file in a Linux terminal? For example, find all lines with 4 or more "and"s in Sample.txt.

Try this:
egrep "and(.*?and){3}" data.txt
And to match "and" regardless of case ("And" or "AND", ...), but skip an "and" that is a part of another word (or name), try:
egrep -i "\band\b(.*?\band\b){3}" data.txt
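The -i makes it ignore case, and the word boundaries, \b, exclude an "and" that sits inside a longer word, as in "Anand" and "Anderson". Note that egrep (grep -E) has no lazy quantifiers, so the ? after .* does not make it non-greedy here; for a whole-line test like this one it makes no difference, and plain .* works just as well.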
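To try this without touching a real file, here is a small sketch. The sample lines and the function name lines_with_four_ands are made up for illustration, and it assumes GNU grep, since \b is a GNU extension rather than part of POSIX extended regular expressions.

```shell
# Hypothetical sample data; not the asker's Sample.txt.
sample=$(mktemp)
printf '%s\n' \
  'salt and pepper and oil and vinegar and bread' \
  'Anand and Anderson and sand and band' \
  'AND one And two and three aNd four' > "$sample"

# Print lines holding at least four standalone "and"s, ignoring case:
# one "and", then three more repetitions of "anything followed by and".
lines_with_four_ands() {
  grep -Ei '\band\b(.*\band\b){3}' "$1"
}

lines_with_four_ands "$sample"
```

The second sample line has only three standalone "and"s (the ones inside Anand, Anderson, sand and band do not count), so it should not be printed. For "x or more" occurrences, the number in the braces would be x minus 1.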

If you need to match "and" but not "bandit", use word boundaries, for example:
egrep '\band\b(.+?\band\b){3}' Sample.txt

Related

Highlight all keys that look like '&name=' in a text with grep console [duplicate]

I want to grep the shortest match and the pattern should be something like:
<car ... model=BMW ...>
...
...
...
</car>
... means any character and the input is multiple lines.
You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions, add the modifier ? after the quantifier. For example, change .* to .*?.
By default grep doesn't support non-greedy modifiers, but you can use grep -P to get Perl syntax.
Actually, .*? only works in Perl-style regular expressions. I am not sure what the equivalent in grep's extended regexp syntax would be. Fortunately you can use Perl syntax with grep, so grep -P would work, but grep -E, which is the same as egrep, would not (it would be greedy).
See also: http://blog.vinceliu.com/2008/02/non-greedy-regular-expression-matching.html
For a non-greedy match in grep you can use a negated character class. In other words, try to avoid wildcards.
For example, to fetch all links to JPEG files from the page content, you'd use:
grep -o '"[^" ]\+\.jpg"'
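Here is a rough side-by-side of the wildcard and the negated class. The page fragment and the two function names are invented for the example, and \+ assumes GNU grep (it is an extension in basic regular expressions).

```shell
# Hypothetical one-line page fragment with three quoted attribute values.
page='<img src="a.jpg"> <img src="b.png"> <a href="dir/c.jpg">'

# Greedy wildcard: runs from the first quote to the last .jpg" on the line.
greedy_links() { printf '%s\n' "$1" | grep -o '".*\.jpg"'; }

# Negated class: [^" ] cannot cross a quote or a space, so each match
# stays inside a single quoted value.
jpg_links() { printf '%s\n' "$1" | grep -o '"[^" ]\+\.jpg"'; }

jpg_links "$page"
```

Calling greedy_links "$page" instead should give one long match that swallows the b.png tag as well, which is the behaviour the negated class avoids.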
To deal with multiple lines, pipe the input through xargs first. For performance, use ripgrep.
My grep that works after trying out stuff in this thread:
echo "hi how are you " | grep -shoP ".*? "
Just make sure you append a space to each one of your lines
(Mine was a line by line search to spit out words)
Sorry I am 9 years late, but this might work for the viewers in 2020.
So suppose you have a line like "Hello my name is Jello".
Now you want to find the words that start with 'H' and end with 'o', with any number of characters in between. We don't want whole lines, just the words. For that we can use the expression:
grep -o "H[^ ]*o" file
This returns all such words; the -o is needed because grep otherwise prints the entire matching line. It works because [^ ] allows any character except a space in between, so a match can never run across several words on the same line.
You can replace the space with whatever character separates your words.
Suppose the initial line was "Hello-my-name-is-Jello"; then you can get the words using the expression:
grep -o "H[^-]*o" file
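A quick way to check this from the terminal, as a sketch: the function name words_h_to_o and the second test sentence are my own, and -o is used so that only the matched text is printed.

```shell
# Greedy wildcard: stretches from the first H to the last o on the line.
printf '%s\n' 'Hello my name is Jello' | grep -o 'H.*o'

# Negated class: [^ ] stops at the first space, so a match is one word.
words_h_to_o() { grep -o 'H[^ ]*o'; }

printf '%s\n' 'Hello my name is Jello' | words_h_to_o
printf '%s\n' 'Hello there Hugo' | words_h_to_o
```

One caveat: the pattern has no word boundaries, so it can also pick up the start of a longer word (the "Ho" in "Hobbit", for instance). Adding -w should restrict it to whole words.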
The short answer is to use the following regular expression:
(?s)<car .*? model=BMW .*?>.*?</car>
(?s) - lets . match newlines too, so a match can span multiple lines
.*? - matches any character, any number of times, lazily (minimal match)
A (slightly) more complicated answer is:
(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>
This makes it possible to match car1 and car2 in the following text:
<car1 ... model=BMW ...>
...
...
...
</car1>
<car2 ... model=BMW ...>
...
...
...
</car2>
(..) represents a capturing group
\1 in this context matches the same text as most recently matched by capturing group number 1
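The answer gives only the regular expression, so here is one way it might be run through grep. This is a sketch under assumptions: it needs GNU grep built with PCRE support (-P), it uses -z so that the whole file is treated as one record (grep otherwise works line by line and (?s) has nothing to span), and the sample file and the function name bmw_blocks are invented. Inside the opening tag I swapped .*? for [^>]*, because a lazy .*? can still start in a non-BMW tag and run forward until it finds model=BMW in a later one.

```shell
# Hypothetical multi-line input.
cars=$(mktemp)
cat > "$cars" <<'EOF'
<car id=1 model=BMW color=red>
  <wheel/>
</car>
<car id=2 model=Audi color=blue>
  <wheel/>
</car>
<car id=3 model=BMW color=black>
  <seat/>
</car>
EOF

# -z: read the file as a single NUL-terminated record.
# -o: print only the matched blocks; tr turns the NUL separators into newlines.
bmw_blocks() {
  grep -Pzo '(?s)<car [^>]*model=BMW[^>]*>.*?</car>' "$1" | tr '\0' '\n'
}

bmw_blocks "$cars"
```

With this input the first and third blocks should be printed and the Audi block skipped.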
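I know that it's a bit of a dead post, but I just noticed that this works. It removed both clean-up and cleanup from my output. The hyphen needs no backslash; \? makes it optional (a GNU extension in basic regular expressions).
> grep -v -e 'clean-\?up'
> grep --version
grep (GNU grep) 2.20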

Match a string using grep

I want to match the below string using a regular expression in grep command.
File name is test.txt,
Unknown Unknown
Jessica Patiño
Althea Dubravsky 45622
Monique Outlaw 49473
April Zwearcan 45758
Tania Horne 45467
I want to match the lines containing special characters alone from the above list of lines; the line which I exactly need is 'Jessica Patiño', which contains a non-ASCII character.
I used,
grep '[^0-9a-zA-Z]' test.txt
But it returns all lines.
The following command should return the lines you want:
grep -v '^[0-9a-zA-Z ]*$' test.txt
Explanation
[0-9a-zA-Z ] matches a space or any alphanumeric character.
Adding the asterisk matches any string containing only these characters.
Prepending the pattern with ^ and appending it with $ anchors the string to the beginning and end of line so that the pattern matches only the lines which contain only the desired characters.
Finally, the -v or --invert-match option to grep inverts the sense of matching, i.e., select non-matching lines.
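To see the difference between the asker's attempt and the inverted match, here is a small sketch on a copy of the sample data. The temporary file and the function name odd_lines are mine, and LC_ALL=C is my addition so that a-z means only ASCII letters regardless of the locale.

```shell
# Recreate part of the question's test.txt in a temporary file.
names=$(mktemp)
printf '%s\n' \
  'Unknown Unknown' \
  'Jessica Patiño' \
  'Althea Dubravsky 45622' \
  'Monique Outlaw 49473' > "$names"

# The asker's attempt: a space already counts as "not alphanumeric",
# so every line matches. -c prints the number of matching lines.
LC_ALL=C grep -c '[^0-9a-zA-Z]' "$names"

# Inverted whole-line match: drop lines made only of the allowed characters.
odd_lines() { LC_ALL=C grep -v '^[0-9a-zA-Z ]*$' "$1"; }

odd_lines "$names"
```

The count from the first command covers all four lines, while odd_lines leaves only the line with the ñ.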
The provided answers should work for the example text given. However, you're likely to come across people with hyphens or apostrophes in their names, etc. To search for all non-ASCII characters, this should do the trick:
grep -P "[\x00-\x1F\x7F-\xFF]" test.txt
-P enables "Perl" mode and allows character-code escapes. \x00-\x1F are control characters, and \x7F-\xFF covers DEL (127) and everything above it.
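A commonly used variant is to negate the ASCII range instead of listing the high values. That is a swapped-in pattern, not the one from the answer above, and the sketch assumes GNU grep with PCRE support; the sample names and the function name non_ascii_lines are made up. One reason to prefer it: in a UTF-8 locale an upper bound of \xFF may miss characters beyond U+00FF, such as ł.

```shell
# Hypothetical sample: one plain line, one with ñ, one with ł (U+0142).
people=$(mktemp)
printf '%s\n' 'Tania Horne 45467' 'Jessica Patiño' 'Paweł Nowak' > "$people"

# Negated ASCII range: matches any character outside \x00-\x7F.
non_ascii_lines() { grep -P '[^\x00-\x7F]' "$1"; }

non_ascii_lines "$people"
```

Unlike the control-character range, this does not flag lines that merely contain a tab.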
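I would use:
grep -P '[^0-9a-zA-Z\s]' test.txt
Or, even better:
grep -iP '[^\da-z\s]' test.txt
Both need -P, because \s and \d are not recognized inside a bracket expression in grep's basic or extended syntax, and the pattern is quoted so the shell does not try to expand the brackets.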

using Regex and linux commands(grep or egrep?) to find specific strings

Note: I am not sure that my regex's are correct since my textbook at school does not explain/teach regex's of this form but only of the math form such as for DFA's/NFA
I would appreciate any suggestions or hints
Question:
(a) find all occurrences of three letter words in text that begin with `a' and end with 'e';
(b) find all occurrences of words in text that begin with `m' and end with 'r';
My Approach:
a) ^[a][a-zA-Z][e]$ (how to distinguish between 3 letter words and all words?)
b) ^[m][a-zA-Z][r]$
Also I want to use these regex's in linux so would the following command work?:
grep '^[a][a-zA-Z][e]$' 'usr/dir/.../text.txt'
or should I use egrep in this way:
find . -name "*.txt" -print0 | xargs -0 egrep '^[a][a-zA-Z][e]$'
You can use grep -w with an alternation of regex for both the matches:
grep -w 'a[a-zA-Z]e\|m[a-zA-Z]*r' file.txt
You can use the word boundary \b to match the start and the end of a word:
a) find all occurrences of three letter words in text that begin with `a' and end with 'e';
grep -o '\ba[a-zA-Z]e\b'
The pattern matches a word boundary, then an a, any single letter, an e, and another word boundary.
b) find all occurrences of words in text that begin with `m' and end with 'r';
grep -o '\bm[a-zA-Z]*r\b'
The pattern matches a word boundary, an m, zero or more letters (through the * quantifier), an r, and a word boundary again.
I'm also using the option -o, which outputs every match on its own line rather than the whole input line that contains a match.
By the way, thanks to the option -w, which matches only whole words, you can simplify the above patterns to:
a)
grep -wo 'a[a-zA-Z]e'
and b)
grep -wo 'm[a-zA-Z]*r'
Thanks to @anubhava!
You asked for egrep. egrep can't help to simplify or optimize the patterns. grep is absolutely fine.
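For anyone who wants to try the -wo versions, a minimal sketch follows. The sample sentences and the function names are invented; the patterns are the ones from this answer.

```shell
# Hypothetical sample text.
text=$(mktemp)
printf '%s\n' \
  'the ape ate an apple' \
  'my mother met a miller near the manor' > "$text"

# (a) three-letter words: a, any one letter, e
three_letter_a_e() { grep -wo 'a[a-zA-Z]e' "$1"; }

# (b) words of any length: m, zero or more letters, r
words_m_r() { grep -wo 'm[a-zA-Z]*r' "$1"; }

three_letter_a_e "$text"
words_m_r "$text"
```

"apple" is skipped by (a) because it is longer than three letters, and "my" and "met" are skipped by (b) because they do not end in r.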
In your examples, you're only going to match lines that consist of exactly three characters, with the letters you expect.
The '^' indicates the beginning of the line
The '$' indicates the end of the line
In order to pull out only three-letter words you're going to have to match on some whitespace. For instance:
grep ' a[a-zA-Z]e ' 'usr/dir/.../text.txt'
However, this will miss three-letter words at the beginning or end of a line.
A related question covers using egrep and grep to match whitespace or the start of a line.
First of all, egrep is extended grep and is the same as calling grep with option -E. Secondly, you don't need to use find and xargs in many cases as there is -r option that will search recursively in files within specified path.
Your regular expression fits basic (not extended) regular expression language supported by grep, therefore egrep is not needed.
I would simplify this to
grep -r '^a[a-zA-Z]e$' /usr/share/dict/
and this
grep -r '^m[a-zA-Z]*r$' /usr/share/dict/

How to match only odd occurrences of a character at the end of the line using grep

For example, I'm matching odd occurrences of 'a'.
So "helloaaa" should match while "helloaaaa" should not match.
I've also tried "(aa)*a$" with and without -E option on bash.
Your problem is that helloaaaa matches because of the last three a's:
helloaaaa
      ===
To avoid this you need to make sure that the previous character is not an a:
grep -E '[^a](aa)*a$' filename
Here I'm assuming that the line isn't made up entirely of a's. If the entire line can be a's, then you can use this regular expression instead:
grep -E '(^|[^a])(aa)*a$' filename
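A quick check of the second pattern on a few made-up lines (the function name odd_trailing_a and the test words are mine):

```shell
# Keep only lines that end in an odd number of a's.
# (^|[^a]) pins the run of a's to the start of the line or to a non-a,
# (aa)* consumes pairs, and the final a makes the count odd.
odd_trailing_a() { grep -E '(^|[^a])(aa)*a$'; }

# Trailing a counts: 3, 4, 3, 2, 0
printf '%s\n' helloaaa helloaaaa aaa aa hello | odd_trailing_a
```

Only helloaaa and aaa should come through; aaa is the case the (^|...) alternative was added for.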
