grep egrep multiple-strings - regex

Suppose I have several strings: str1 and str2 and str3.
How to find lines that have all the strings?
How to find lines that can have any of them?
And how to find lines that have str1 and either of str2 and str3 [but not both?]?

This looks like three questions. The easiest way to put these sorts of expressions together is with multiple pipes. There's no shame in that, particularly because a regular expression (using egrep) would be ungainly since you seem to imply you want order independence.
So, in order:
grep str1 file | grep str2 | grep str3
egrep '(str1|str2|str3)' file
grep str1 file | egrep '(str2|str3)'
Note that the last one also matches lines that contain both str2 and str3. You can do the "and" form in an order-independent way using egrep, but I think you'll find it easier to remember to do order-independent ANDs using piped greps and order-independent ORs using regular expressions.
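A minimal sketch of the three cases on made-up data, including one possible reading of "but not both" (drop the lines that contain both str2 and str3). The sample lines and the variable names all, any and xor are assumptions for illustration only.

```shell
# Hypothetical sample data; str1, str2, str3 stand in for the real search strings.
tmp=$(mktemp)
printf '%s\n' 'str1 str2 str3' 'str3 str1' 'str2 only' 'nothing here' 'str1 alone' > "$tmp"

# All three strings, in any order: each grep filters the output of the previous one.
all=$(grep str1 "$tmp" | grep str2 | grep str3)

# Any of the strings: extended-regex alternation.
any=$(grep -E 'str1|str2|str3' "$tmp")

# str1 plus str2 or str3, but not both: remove the lines that contain both.
xor=$(grep str1 "$tmp" | grep -E 'str2|str3' | grep -v 'str2.*str3' | grep -v 'str3.*str2')

printf 'all:\n%s\nany:\n%s\nxor:\n%s\n' "$all" "$any" "$xor"
rm -f "$tmp"
```

With this input, all holds only the first line and xor holds only the line "str3 str1".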

You can't reasonably do the "all" or "this plus either of those" cases in a single pattern, because grep's basic and extended regular expressions don't support lookahead. Use Perl. For the "any" case, it's egrep '(str1|str2|str3)' file.
The unreasonable way to do the "all" case is:
egrep '(str1.*str2.*str3|str1.*str3.*str2|str2.*str1.*str3|str2.*str3.*str1|str3.*str1.*str2|str3.*str2.*str1)' file
i.e. you build out the permutations. This is, of course, a ridiculous thing to do.
For the "this plus either of those", similarly:
egrep '(str1.*(str2|str3)|(str2|str3).*str1)' file
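As a quick sanity check, the single-regex form written out with all six orderings of three strings should select the same lines as the piped form. This is only a sketch on made-up sample lines; the variable names are illustrative.

```shell
tmp=$(mktemp)
printf '%s\n' 'str3 x str1 y str2' 'str2 str3 str1' 'str1 str2' 'str3 only' > "$tmp"

# Three strings have 3! = 6 orderings, and each ordering needs its own branch.
perms='str1.*str2.*str3|str1.*str3.*str2|str2.*str1.*str3|str2.*str3.*str1|str3.*str1.*str2|str3.*str2.*str1'

by_regex=$(grep -E "$perms" "$tmp")
by_pipes=$(grep str1 "$tmp" | grep str2 | grep str3)

# Report whether both approaches picked the same lines.
if [ "$by_regex" = "$by_pipes" ]; then
    echo "same lines selected"
fi
rm -f "$tmp"
```

A fourth string would already need 24 branches, which is why the piped form scales better.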

grep -E --color "string1|string2|string3...."
For example, to find out whether the system uses an AMD (svm) or Intel (vmx) processor and whether it is 64-bit (lm stands for long mode, i.e. 64-bit):
command example:
grep -E --color "lm|svm|vmx" /proc/cpuinfo
-E is needed so that the unescaped | is treated as alternation between the strings.
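A possible refinement, sketched on a made-up stand-in for /proc/cpuinfo so that it runs anywhere: print only the flags that matched instead of whole lines. The -o and -w options are available in GNU and BSD grep but are not required by POSIX, so treat their presence as an assumption.

```shell
# Hypothetical stand-in for /proc/cpuinfo.
tmp=$(mktemp)
printf '%s\n' 'model name : Example CPU' 'flags : fpu vme lahf_lm lm svm sse2' > "$tmp"

# -o prints only the matched text; -w requires whole-word matches, so the
# "lm" inside "lahf_lm" is not counted; sort -u collapses per-core repeats.
found=$(grep -E -o -w 'lm|svm|vmx' "$tmp" | sort -u)
printf '%s\n' "$found"
rm -f "$tmp"
```

On a real system the same pipeline would be pointed at /proc/cpuinfo instead of the temporary file.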

Personally, I do this in perl rather than trying to cobble together something with grep.
For instance, for the first one:
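#!/usr/bin/perl
use strict;
use warnings;

# Print only the lines that match all three patterns, in any order.
while (<>) {    # reads the files named on the command line, or stdin
    next if !m/pattern1/;
    next if !m/pattern2/;
    next if !m/pattern3/;
    print $_;
}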

Related

Egrep command hangs when passed a file for Regex patterns

NB: I'm using Cygwin.
Passing in a file into the egrep command to use patterns is running incredibly slowly (to the point where after the 4th word match, it was more than 5 minutes before I gave up).
The command I'm trying to run is:
cat words.txt | egrep ^"[A-Z]" | egrep -f words9.txt
words.txt is a dictionary (390K words), and words9.txt is a file (36,148 words) I created that contains all lowercase 9-letter words from words.txt.
This command should find any 10+ letter words that contain a 9-letter word from words9.txt.
I am new to regex and shell commands so it may be simply that this file dependency is an incredibly inefficient method, (having to search 36148 words for every word in words.txt). Is there a better way of tackling this?
If words9.txt doesn't contain regexes, try a fixed-string search (fgrep or grep -F) instead of the extended regex search (egrep).
cat words.txt | egrep "^[A-Z]" | fgrep -f words9.txt
So you want to improve on egrep ^"[A-Z]" words.txt | egrep -f words9.txt
Your words9.txt is not a file of regex patterns, it's only fixed strings, so treating it as such (grep -F) will generally be much faster, as @KurzedMetal said.
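A small sketch of the fixed-string approach, using tiny made-up stand-ins for words.txt and words9.txt (the sample words are not from the real files). LC_ALL=C is an added assumption: it keeps [A-Z] from matching lowercase letters in locales where ranges follow collation order.

```shell
tmp=$(mktemp -d)
# Tiny hypothetical stand-ins for the two word lists.
printf '%s\n' 'Reinterviewed' 'uninterceptable' 'Zebra' 'Uninterceptable' > "$tmp/words.txt"
printf '%s\n' 'intercept' 'interview' > "$tmp/words9.txt"

# No cat needed: the first grep reads the file directly.
# -F treats every line of words9.txt as a literal string, not as a regex.
hits=$(LC_ALL=C grep '^[A-Z]' "$tmp/words.txt" | LC_ALL=C grep -F -f "$tmp/words9.txt")
printf '%s\n' "$hits"
rm -rf "$tmp"
```

Only the capitalized words that contain one of the listed 9-letter strings are printed.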
Mind you, if its contents had a lot of overlapping near-duplicates, you could manually merge them by constructing regexes. Here's how you'd do that.
First, get a list of all 9-letter words (using the Unix built-in word list):
awk 'length($0)==9' /usr/share/dict/words
Now say you wanted to merge all 9-letter words starting with the 5 characters 'inter' into one regex. First let's get them as a list: grep "^inter" words9.txt | paste -sd ',' - gives:
interalar,interally,interarch,interarmy,interaxal,interaxis,interbank,interbody,intercale,intercalm,intercede,intercept,intercity,interclub,intercome,intercrop,intercurl,interdash,interdict,interdine,interdome,interface,interfere,interflow,interflux,interfold,interfret,interfuse,intergilt,intergrow,interhyal,interject,interjoin,interknit,interknot,interknow,interlace,interlaid,interlake,interlard,interleaf,interline,interlink,interloan,interlock,interloop,interlope,interlude,intermaze,intermeet,intermelt,interment,intermesh,intermine,internals,internist,internode,interpage,interpave,interpeal,interplay,interplea,interpole,interpone,interpose,interpour,interpret,interrace,interroad,interroom,interrule,interrupt,intersale,intersect,intershop,intersole,intertalk,interteam,intertill,intertone,intertown,intertwin,intervale,intervary,intervein,intervene,intervert,interview,interweld,interwind,interwish,interword,interwork,interwove,interwrap,interzone
The regex would start with: inter(a(l(ar|ly)|r(ch|my)|x(al|is))|b(...)|c(...)|...). We're implementing a tree structure from L-to-R (there are other ways but this is the obvious way).
Testing it: grep "^inter" words9.txt | egrep '^intera(l(ar|ly)|r(ch|my)|x(al|is))'
interalar
interally
interarch
interarmy
interaxal
interaxis
Yay! But it may still be faster to just have a plain list of fixed strings. Also, this regex will be harder to maintain and more brittle, and there is no easy way to filter or remove specific strings from it. Anyway, you get the point. PS: I'm sure there are automated tools out there that construct regexes for such wordlists.
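To illustrate that the merged regex and a plain fixed-string list select the same words, here is a small sketch. The file names list.txt and patterns.txt are made up, and only the "intera" branch of the regex is exercised.

```shell
tmp=$(mktemp -d)
# The six "intera" words plus two words that must not match.
printf '%s\n' interalar interally interarch interarmy interaxal interaxis interbank intercede > "$tmp/list.txt"
printf '%s\n' interalar interally interarch interarmy interaxal interaxis > "$tmp/patterns.txt"

# Hand-built tree-shaped regex, anchored at both ends.
merged=$(grep -E '^intera(l(ar|ly)|r(ch|my)|x(al|is))$' "$tmp/list.txt")

# The same selection with fixed strings; -x only accepts whole-line matches.
fixed=$(grep -F -x -f "$tmp/patterns.txt" "$tmp/list.txt")

if [ "$merged" = "$fixed" ]; then
    echo "both forms select the same words"
fi
rm -rf "$tmp"
```

The fixed-string list is also the easier of the two to edit when a word has to be added or removed.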

How to use regex OR in grep in Cygwin?

I need to return results for two different matches from a single file.
grep "string1" my.file
correctly returns the single instance of string1 in my.file
grep "string2" my.file
correctly returns the single instance of string2 in my.file
but
grep "string1|string2" my.file
returns nothing
in regex test apps that syntax is correct, so why does it not work for grep in cygwin ?
Using the | character without escaping it in a basic regular expression will only match the | literal. For instance, if you have a file with contents
string1
string2
string1|string2
Using grep "string1|string2" my.file will only match the last line
$ grep "string1|string2" my.file
string1|string2
In order to use the alternation operator |, you could:
Use a basic regular expression (just grep) and escape the | character in the regular expression
grep "string1\|string2" my.file
Use an extended regular expression with egrep or grep -E, as Julian already pointed out in his answer
grep -E "string1|string2" my.file
If it is two different patterns that you want to match, you could also specify them separately in -e options:
grep -e "string1" -e "string2" my.file
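The options above can be compared side by side on the three-line example file. This is a sketch; the extra line "other" and the variable names are made up, and the escaped form relies on GNU grep treating \| as alternation in basic regular expressions.

```shell
tmp=$(mktemp)
printf '%s\n' 'string1' 'string2' 'string1|string2' 'other' > "$tmp"

literal=$(grep 'string1|string2' "$tmp")            # BRE: | is an ordinary character
escaped=$(grep 'string1\|string2' "$tmp")           # GNU BRE: \| is alternation
extended=$(grep -E 'string1|string2' "$tmp")        # ERE: | is alternation
separate=$(grep -e 'string1' -e 'string2' "$tmp")   # two patterns, either may match

printf 'literal:\n%s\nextended:\n%s\n' "$literal" "$extended"
rm -f "$tmp"
```

The last three commands print the same three lines, while the first prints only the line that contains a literal |.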
You might find the following sections of the grep reference useful:
Basic vs Extended Regular Expressions
Matching Control, where it explains -e
You may need to use either egrep or grep -E. The pipe OR symbol is part of extended regular expressions; in a basic regular expression an unescaped | only matches a literal pipe character.
If you stay with plain grep, you need to escape the pipe symbol (\|).
The best and most clear way I've found is:
grep -e REG1 -e REG2 -e REG3 _FILETOGREP_
I never use pipe as it's less evident and very awkward to get working.
You can find this information by reading the fine manual: grep(1), which you can find by running 'man grep'. It describes the difference between grep and egrep, and between basic and extended regular expressions, along with a lot of other useful information about grep.

grep regex multiple replacements

I will probably have done it "manually" by the time I get an answer for this.
I have two variables (varA, varB) I want to replace with (a, b) respectively, this currently requires two separate find and replaces.
With a regex grep I know how to do the two searches at once using
varA | varB
but there is no replace function that will similarly do the respective replacements,
unless you know better? Thanks for any insight.
grep is for searching for a pattern in a given input. You should use sed for text replacements. For multiple replacements in a single sed command, just use it like this:
sed -e 's/varA/foo/g' -e 's/varB/bar/g' file.txt
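Applied to the replacements from the question (varA to a, varB to b), a minimal sketch on a made-up input string looks like this:

```shell
# Hypothetical input line; in practice sed would read file.txt instead.
input='x = varA + varB; y = varA'

# Each -e expression is applied in order to every line of input.
output=$(printf '%s\n' "$input" | sed -e 's/varA/a/g' -e 's/varB/b/g')
printf '%s\n' "$output"
# prints: x = a + b; y = a
```

Because the expressions run in order, a later expression sees the result of the earlier ones. That matters when one replacement text can be matched by a following pattern, for example when swapping two names.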

Grep regular expression to find words in any order

Context: I want to find a class definition within a lot of source code files, but I do not know the exact name.
Question: I know a number of words which must appear on the line I want to find, but I do not know the order in which they will appear. Is there a quick way to look for a number of words in any order on the same line?
For situations where you need to search on a large number of words, you can use awk as follows:
awk "/word1/&&/word2/&&/word3/" *.c
(If you are a cygwin user, the command is gawk.)
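Unlike grep over several files, the awk command above does not say which file a line came from. A small sketch that adds the file name and line number, using the standard awk variables FILENAME and FNR (the two .c files are made-up sample data):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'int x;' 'class word3 word1 word2 {' > "$tmp/a.c"
printf '%s\n' 'word1 word2' > "$tmp/b.c"

# Print file:line:text for every line that contains all three words.
found=$(cd "$tmp" && awk '/word1/ && /word2/ && /word3/ { print FILENAME ":" FNR ":" $0 }' *.c)
printf '%s\n' "$found"
rm -rf "$tmp"
```

Only the second line of a.c contains all three words, so that is the only line reported.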
If you're trying to find foo, bar, and baz, you can just do:
grep foo *.c | grep bar | grep baz
That will find anything that has all three in any order. You can use word boundaries if you use egrep, otherwise that will match substrings.
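One way to get whole-word matching without writing boundaries by hand is the -w option, sketched below on made-up lines. -w is available in GNU and BSD grep but is not required by POSIX, so its presence is an assumption.

```shell
tmp=$(mktemp)
printf '%s\n' 'foo bar baz' 'foobar baz' 'baz, bar; foo!' > "$tmp"

# Plain substring matching: "foobar baz" counts as containing foo and bar.
substr=$(grep foo "$tmp" | grep bar | grep baz)

# -w only accepts matches that are delimited by non-word characters.
words=$(grep -w foo "$tmp" | grep -w bar | grep -w baz)

printf 'substring:\n%s\nwhole words:\n%s\n' "$substr" "$words"
rm -f "$tmp"
```

The substring version reports all three lines; the whole-word version drops "foobar baz".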
While this is not an exact answer to your grep question, you should check the "ctags" command for generating a tags file from the source code. For source code objects this should help you much more than a simple grep. See: http://ctags.sourceforge.net/ctags.html
To match lines that contain any of the indicated words (not necessarily all of them), case-insensitively, in the .c files of the current directory, using a basic regex with GNU grep's \| alternation:
grep -r -i 'word1\|word2\|word3' ./*.c
Using standard extended regex:
grep -r -i -E 'word1|word2|word3' ./*.c
You can also use perl regex:
grep -r -i -P 'word1|word2|word3' ./*.c
If you need to search with a single grep command (for example, you are searching for multiple pattern alternatives on stdin), you could use:
grep -e 'word1.*word2' -e 'word2.*word1' -e 'alternative-word'
This would find anything which has word1 and word2 in either order, or alternative-word.
(Note that this method gets complicated quickly: the number of patterns grows factorially with the number of words in arbitrary order.)
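When the number of words grows, one way around the permutation blow-up is to filter once per word in a loop. The sketch below is only an illustration: grep_all is a hypothetical helper name, not a standard command, and it treats each word as a fixed string (-F) rather than as a regex.

```shell
# grep_all FILE WORD... : print the lines of FILE that contain every WORD.
# (grep_all is a made-up helper for this example.)
grep_all() {
    file=$1
    shift
    lines=$(cat "$file")
    for word in "$@"; do
        # Keep only the lines that also contain this word; stop when none are left.
        lines=$(printf '%s\n' "$lines" | grep -F -e "$word") || return 1
    done
    printf '%s\n' "$lines"
}

tmp=$(mktemp)
printf '%s\n' 'word3 word1 word2' 'word2 word1' 'word1 word2 word3 word4' > "$tmp"
both=$(grep_all "$tmp" word1 word2 word3)
printf '%s\n' "$both"
rm -f "$tmp"
```

The cost is one grep per word instead of one pattern per ordering.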

Unix grep regex containing 'x' but not containing 'y'

I need a single-pass regex for unix grep that contains, say alpha, but does not contain beta.
grep 'alpha' <> | grep -v 'beta'
The other answers here show some ways you can contort different varieties of regex to do this, although I think it does turn out that the answer is, in general, “don’t do that”. Such regular expressions are much harder to read and probably slower to execute than just combining two regular expressions using the boolean logic of whatever language you are using. If you’re using the grep command at a unix shell prompt, just pipe the results of one to the other:
grep "alpha" | grep -v "beta"
I use this kind of construct all the time to winnow down excessive results from grep. If you have an idea of which result set will be smaller, put that one first in the pipeline to get the best performance, as the second command only has to process the output from the first, and not the entire input.
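A short sketch on made-up lines shows the pipeline and confirms that the order of the two filters changes only the amount of work, not the result:

```shell
tmp=$(mktemp)
printf '%s\n' 'alpha one' 'alpha beta' 'beta only' 'gamma alpha' > "$tmp"

# Keep lines with alpha, then drop the ones that also contain beta.
piped=$(grep 'alpha' "$tmp" | grep -v 'beta')

# Same filters in the opposite order.
reversed=$(grep -v 'beta' "$tmp" | grep 'alpha')

printf '%s\n' "$piped"
rm -f "$tmp"
```

Both variables end up holding the lines "alpha one" and "gamma alpha".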
Well as we're all posting answers, here it is in awk ;-)
awk '/x/ && !/y/' infile
I hope this helps.
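^((?!beta).)*alpha((?!beta).)*$ would do the trick I think, though the negative lookahead needs a Perl-compatible regex engine (for example grep -P), not plain grep or egrep.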
I'm pretty sure this isn't possible with true regular expressions. The [^y]*x[^y]* example would match yxy, since the * allows zero or more non-y matches.
EDIT:
Actually, this seems to work: ^[^y]*x[^y]*$. It basically means "match any line that starts with zero or more non-y characters, then has an x, then ends with zero or more non-y characters".
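The difference the anchors make can be seen on a few made-up lines. Note that this bracket-expression trick only works for single characters such as x and y; it does not extend to multi-character strings such as alpha and beta.

```shell
tmp=$(mktemp)
printf '%s\n' 'x' 'axb' 'yxy' 'xy' 'abc' > "$tmp"

# Without anchors the pattern can match just the "x", so any line with an x matches.
unanchored=$(grep '[^y]*x[^y]*' "$tmp")

# With ^ and $ the whole line must be free of y.
anchored=$(grep '^[^y]*x[^y]*$' "$tmp")

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
rm -f "$tmp"
```

The unanchored pattern reports x, axb, yxy and xy; the anchored one reports only x and axb.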
Try using a negated bracket expression: [^y]*x[^y]*
Q: How to match x but not y in grep without pipe if y is a directory
A: grep x --exclude-dir='y'
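A minimal sketch of the directory case, on a made-up directory tree. --exclude-dir is a GNU grep option, so this assumes GNU grep; the -l option is added here only to list file names instead of matching lines.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/keep" "$tmp/y"
echo 'x here' > "$tmp/keep/a.txt"
echo 'x there' > "$tmp/y/b.txt"

# -r recurses into subdirectories; --exclude-dir skips any directory named y.
found=$(cd "$tmp" && grep -r -l --exclude-dir='y' 'x' .)
printf '%s\n' "$found"
rm -rf "$tmp"
```

Only the file under keep/ is listed, although y/b.txt also contains an x.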
Simplest solution:
grep "alpha" * | grep -v "beta"
Mind the spaces and the double quotes.