Using fgrep to find multiple words (korn shell) - regex

Say I have a text file with multiple lines, but I only want fgrep to list those lines which have certain words in the same line. So, for example, if I'm looking for the words "cat" and "dog", how would I supply that information to fgrep?
I understand for one argument it would simply be:
fgrep cat text.txt
but I want to look for lines that contain "dog" as well as "cat" in the same line. How would I go about doing this?

This will work:
fgrep cat text.txt | fgrep dog
You can also use a single regex with grep -E, something like:
grep -E "cat.*dog|dog.*cat" text.txt
(POSIX extended regexes have no lazy quantifiers, so there is no point writing .*? here.) But that is more brainpower than a task this simple deserves, so I would choose the first method instead.
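For instance, suppose text.txt (hypothetical contents) looks like this:
the cat sat alone
the dog barked
the cat chased the dog
Then fgrep cat text.txt | fgrep dog prints only the last line, the only one containing both words.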

Related

"partial grep" to accelerate grep speed?

This is what I am thinking: grep tries to match every occurrence of the pattern in a line, as in:
echo "abc abc abc" | grep abc --color
All three occurrences of abc come out colored red, so grep performed a full pattern match across the line.
But consider this scenario: I have many big files to process, and the words I am interested in are very likely to occur among the first few words of a line. My job is to find the lines without the words in them. So if grep could skip to the next line as soon as the words are found, without checking the rest of the line, it might be significantly faster.
Is there maybe a partial-match option in grep to do this?
like:
echo abc abc abc | grep --partial abc --color
with only the first abc colored red.
See this nice introduction to grep internals:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
In particular:
GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines
would slow grep down by a factor of several times, because to find the
newlines it would have to look at every byte!
So instead of using line-oriented input, GNU grep reads raw data into
a large buffer, searches the buffer using Boyer-Moore, and only when
it finds a match does it go and look for the bounding newlines.
(Certain command line options like -n disable this optimization.)
So the answer is: No. It is way faster for grep to look for the next occurrence of the search string, rather than to look for a new line.
Edit: Regarding the speculation in the comments that --color=never would do the trick: I had a quick glance at the source code. The variable color_option is not used anywhere near the actual search for the regex, or near the scan for the previous and upcoming newline once a match has been found.
It might be that one could save a few CPU cycles when searching for those line terminators. A real-world difference might show up with pathologically long lines and a very short search string.
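If you want to gauge the cost of line handling on your own data, a rough sketch (bigfile.log is a placeholder name):
time grep pattern bigfile.log > /dev/null
time grep -n pattern bigfile.log > /dev/null
Per the explanation above, the -n variant has to find every newline and may be measurably slower on large inputs.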
If your job is to find the lines without the words in them, you can give sed a try, deleting the lines that contain the specific word:
sed '/word/d' input_file
sed will probably continue to the next line once the first occurrence is found on the current line.
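A minimal demo:
printf 'keep this\ndrop word here\nkeep too\n' | sed '/word/d'
prints only the two keep lines.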
If you want to find lines without specific words, you can use grep for this as well. Try grep -v "abc", which inverts the match: it prints the lines that do not contain the string "abc".
If I have a file that looks like this:
line one abc
line two abc
line three def
Doing grep -v "abc" file.txt will return line three def.

Egrep command hangs when passed a file for Regex patterns

NB: I'm using Cygwin.
Passing a file of patterns to the egrep command runs incredibly slowly (to the point where, after the 4th matching word, more than 5 minutes passed before I gave up).
The command I'm trying to run is:
cat words.txt | egrep ^"[A-Z]" | egrep -f words9.txt
words.txt is a dictionary (390K words), and words9.txt is a file (36,148 words) I created that contains all lowercase 9-letter words from words.txt.
This command should find any 10+ letter words that contain a 9-letter word from words9.txt.
I am new to regex and shell commands, so it may simply be that this file-driven approach is incredibly inefficient (having to check 36,148 words against every word in words.txt). Is there a better way of tackling this?
If words9.txt doesn't contain regexes, try using a fixed-string search (fgrep or grep -F) instead of the extended regex search (egrep).
cat words.txt | egrep "^[A-Z]" | fgrep -f words9.txt
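As a side note, the cat is unnecessary; grep can read the file directly:
egrep "^[A-Z]" words.txt | fgrep -f words9.txt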
So you want to improve on egrep ^"[A-Z]" words.txt | egrep -f words9.txt.
Your words9.txt is not a file of regex patterns, it's only fixed strings, so treating it as such (grep -F) will generally be much faster, as @KurzedMetal said.
Mind you, if its contents had a lot of overlapping near-duplicates, you could merge them manually by constructing regexes. Here's how you'd do that:
Get a list of all 9-letter words (using the Unix built-in word dict):
awk 'length($0)==9' /usr/share/dict/words
Now say you wanted to merge all the 9-letter words starting with the 5 characters 'inter' into one regex. First, let's get them as a list by piping the above through grep "^inter" | paste -sd ',' -, which gives:
interalar,interally,interarch,interarmy,interaxal,interaxis,interbank,interbody,intercale,intercalm,intercede,intercept,intercity,interclub,intercome,intercrop,intercurl,interdash,interdict,interdine,interdome,interface,interfere,interflow,interflux,interfold,interfret,interfuse,intergilt,intergrow,interhyal,interject,interjoin,interknit,interknot,interknow,interlace,interlaid,interlake,interlard,interleaf,interline,interlink,interloan,interlock,interloop,interlope,interlude,intermaze,intermeet,intermelt,interment,intermesh,intermine,internals,internist,internode,interpage,interpave,interpeal,interplay,interplea,interpole,interpone,interpose,interpour,interpret,interrace,interroad,interroom,interrule,interrupt,intersale,intersect,intershop,intersole,intertalk,interteam,intertill,intertone,intertown,intertwin,intervale,intervary,intervein,intervene,intervert,interview,interweld,interwind,interwish,interword,interwork,interwove,interwrap,interzone
The regex would start with: inter(a(l(ar|ly)|r(ch|my)|x(al|is))|b(...)|c(...)|...). We're building a tree structure from left to right (there are other ways, but this is the obvious one).
Testing it: grep "^inter" words9.txt | egrep '^intera(l(ar|ly)|r(ch|my)|x(al|is))'
interalar
interally
interarch
interarmy
interaxal
interaxis
Yay! But it may still be faster to just keep a plain list of fixed strings. The regex will also be harder to maintain, brittle, and impossible to easily filter or remove specific strings from. Anyway, you get the point. PS: I'm sure there are automated tools out there that construct regexes for such wordlists.

How to use command grep with several lines?

From a shell script, I'm looking for a way to make the grep command do one of the following two things:
a) Use grep to display a match plus the 10 lines that follow it. Ordinarily, the command grep "pattern" file.txt prints all lines of the file that contain the pattern:
patternrestoftheline
patternrestofanotherline
patternrestofanotherline
...
So I'm looking for this:
patternrestoftheline
following line
following line
...
until the tenth
patternrestofanotherline
following line
following line
...
until the tenth
b) Use grep to display all the lines between two patterns, treating them as delimiters:
patternA restoftheline
anotherline
anotherline
...
patternB restoftheline
I do not know whether a command other than grep would be a better option.
I'm currently using a loop that solves the problem, but it works line by line, so it takes too long on extremely large files.
I need a solution that works on Solaris.
Any suggestions?
For case (a), what do you expect to happen if the pattern occurs again within the 10 lines?
Anyway, here are some awk scripts which should work (untested, though; I don't have Solaris):
# pattern plus the 10 following lines (a match inside the window restarts the count)
awk '/PATTERN/{n=11} n{--n;print}'
# between two patterns
awk '/PATTERN1/,/PATTERN2/{print}'
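For instance, to print every line containing error plus the ten lines after it (server.log is a hypothetical file name):
awk '/error/{n=11} n{--n;print}' server.log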
The second one can also be done similarly with sed.
For your first task, use the -A ("after") option of grep:
grep -A 10 'pattern' file.txt
(Note that -A is a GNU extension; the stock Solaris grep lacks it, so you may need GNU grep, often installed as ggrep.)
The second task is a typical sed problem:
sed -ne '/patternA/,/patternB/p' file.txt

Grep regular expression to find words in any order

Context: I want to find a class definition within a lot of source code files, but I do not know the exact name.
Question: I know a number of words which must appear on the line I want to find, but I do not know the order in which they will appear. Is there a quick way to look for a number of words in any order on the same line?
For situations where you need to search on a large number of words, you can use awk as follows:
awk "/word1/&&/word2/&&/word3/" *.c
(If you are a cygwin user, the command is gawk.)
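A quick sanity check on sample input:
printf 'word1 word3 word2\nword1 word2 only\n' | awk '/word1/&&/word2/&&/word3/'
prints just the first line, the only one containing all three words.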
If you're trying to find foo, bar, and baz, you can just do:
grep foo *.c | grep bar | grep baz
That will find anything that has all three, in any order. The catch is that it matches substrings; use word boundaries (or grep's -w option) if you need whole-word matches, as shown below.
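For example, with GNU grep (a sketch; foo, bar, baz stand in for your words, and -h suppresses filename prefixes so the later greps only see the line text):
grep -hw foo *.c | grep -w bar | grep -w baz
-w restricts each match to whole words; with egrep you can spell the boundaries out as \bfoo\b instead.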
While this is not an exact answer to your grep question, you should check out the ctags command for generating a tags file from source code. For locating source-code objects it should help you much more than a simple grep. See: http://ctags.sourceforge.net/ctags.html
Using standard basic regex, recursively match any .c file from the current directory containing any one of the indicated words (case-insensitive, bash flavour):
grep -r -i 'word1\|word2\|word3' ./*.c
Using standard extended regex:
grep -r -i -E 'word1|word2|word3' ./*.c
You can also use Perl-compatible regex:
grep -r -i -P 'word1|word2|word3' ./*.c
Note that alternation finds lines containing any of the words; it does not require all of them on the same line.
If you need to search with a single grep command (for example, you are searching for multiple pattern alternatives on stdin), you could use:
grep -e 'word1.*word2' -e 'word2.*word1' -e 'alternative-word'
This would find anything which has word1 and word2 in either order, or alternative-word.
(Note that this method blows up combinatorially: you need one -e pattern per ordering, so the count grows factorially with the number of words that must appear in arbitrary order.)

grep egrep multiple-strings

Suppose I have several strings: str1 and str2 and str3.
How to find lines that have all the strings?
How to find lines that can have any of them?
And how to find lines that have str1 and either of str2 and str3 [but not both?]?
This looks like three questions. The easiest way to put these sorts of expressions together is with multiple pipes. There's no shame in that, particularly because a regular expression (using egrep) would be ungainly since you seem to imply you want order independence.
So, in order,
grep str1 file | grep str2 | grep str3
egrep '(str1|str2|str3)' file
grep str1 file | egrep '(str2|str3)'
you can do the "and" form in an order independent way using egrep, but I think you'll find it easier to remember to do order independent ands using piped greps and order independent or's using regular expressions.
You can't reasonably do the "all" or "this plus either of those" cases because grep doesn't support lookahead. Use Perl. For the "any" case, it's egrep '(str1|str2|str3)' file.
The unreasonable way to do the "all" case is:
egrep '(str1.*str2.*str3|str1.*str3.*str2|str2.*str1.*str3|str2.*str3.*str1|str3.*str1.*str2|str3.*str2.*str1)' file
i.e. you build out all six permutations. This is, of course, a ridiculous thing to do.
For the "this plus either of those", similarly:
egrep '(str1.*(str2|str3)|(str2|str3).*str1)' file
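For completeness: if your grep was built with PCRE support, grep -P does provide lookahead after all, and the order-independent "all" case collapses to a single pattern. A sketch (not every grep build has -P):
grep -P '^(?=.*str1)(?=.*str2)(?=.*str3)' file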
grep -E --color "string1|string2|string3...."
For example, to find out whether the system has an AMD (svm) or Intel (vmx) processor, and whether it is 64-bit (lm stands for long mode, meaning 64-bit):
command example:
grep -E --color "lm|svm|vmx" /proc/cpuinfo
-E is required for the alternation that matches multiple strings.
Personally, I do this in Perl rather than trying to cobble something together with grep.
For instance, for the first one (lines that contain all the strings):
# reads the files named on the command line (or stdin)
while (<>)
{
    next if ! m/pattern1/;   # skip lines missing any one pattern
    next if ! m/pattern2/;
    next if ! m/pattern3/;
    print $_;                # this line contains all three
}
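The same thing as a one-liner, for the record:
perl -ne 'print if /pattern1/ && /pattern2/ && /pattern3/' file.txt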