"partial grep" to accelerate grep speed? - regex

This is what I am thinking: the grep program tries to match every occurrence of the pattern in the line, just like:
echo "abc abc abc" | grep abc --color
the result is that all three occurrences of abc are colored red, so grep did a full pattern match on the line.
But consider this scenario: I have many big files to process, and the words I am interested in are very likely to occur within the first few words of a line. My job is to find the lines without the words in them. So if the grep program could continue to the next line as soon as the words have been found, without having to check the rest of the line, it would maybe be significantly faster.
Is there perhaps a partial-match option in grep to do this?
like:
echo abc abc abc | grep --partial abc --color
with only the first abc colored red.

See this nice introduction to grep internals:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
In particular:
GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines
would slow grep down by a factor of several times, because to find the
newlines it would have to look at every byte!
So instead of using line-oriented input, GNU grep reads raw data into
a large buffer, searches the buffer using Boyer-Moore, and only when
it finds a match does it go and look for the bounding newlines.
(Certain command line options like -n disable this optimization.)
So the answer is: No. It is way faster for grep to look for the next occurrence of the search string than to look for a newline.
Edit: Regarding the speculation in the comments that --color=never would do the trick: I had a quick glance at the source code. The variable color_option is not used anywhere near the actual search for the regex, or near the search for the preceding and following newlines once a match has been found.
It might be that one could save a few CPU cycles when searching for those line terminators. Possibly a real-world difference shows up with pathologically long lines and a very short search string.
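If you want to see the effect of line accounting for yourself, here is a rough experiment (not a rigorous benchmark; the file name, pattern, and size are made up, and timings vary by system):
yes 'lorem ipsum dolor sit amet' | head -n 10000000 > big.txt
time grep zzzz big.txt        # no matches: buffer-oriented search only
time grep -n zzzz big.txt     # -n forces grep to count every newline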

If your job is to find the lines without the words in them, you can give sed a try and delete the lines containing the specific word:
sed '/word/d' input_file
sed will probably continue to the next line as soon as the first occurrence is found on the current line.
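A quick illustration with made-up input:
$ printf 'keep this\nthe word is here\nkeep this too\n' | sed '/word/d'
keep this
keep this too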

If you want to find lines without specific words, you can use grep to do this.
Try grep -v "abc", which inverts the match: in this case, it finds lines without the string "abc".
If I have a file that looks like this:
line one abc
line two abc
line three def
Doing grep -v "abc" file.txt will return line three def.

Related

Why do these two grep commands produce different results?

$ grep "^底线$" query_20220922 | wc -l
95701
$ grep -iF "底线" query_20220922 | wc -l
796591
Shouldn't the counts be exactly the same? I want to count exact matches of the string.
-F matches a fixed string anywhere in a line. ^xyz$ matches lines which contain "xyz" exactly (nothing else).
You are looking for -x/--line-regexp and not -F/--fixed-strings.
To match lines which contain your search text exactly, without anything else and without interpreting your search text as regular expression, combine the two flags: grep -xF 'findme' file.txt.
Also, case-insensitive matching (-i) can match more lines than case-sensitive matching (the default).
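For illustration, on a small made-up input:
$ printf 'findme\nfindme please\nFINDME\n' | grep -xF 'findme'
findme
$ printf 'findme\nfindme please\nFINDME\n' | grep -ixF 'findme'
findme
FINDME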
No, they do different things. The first uses a regular expression to search for "底线" alone on an input line (^ in a regular expression means beginning of line, and $ means end of line).
The second searches for the string anywhere on an input line. The -i flag does nothing at all here (it selects case-insensitive matching, but this is not well-defined for CJK character sets, so basically a no-op) and -F says to search literally (which makes the search faster for internal reasons, but doesn't change the semantics of a search string which doesn't contain any regex metacharacters).
It should be easy to see how they differ. For a large input file, it might be a bit challenging to find the differences if they are not conveniently mixed; but for a quick start, try
diff -u <(grep -m5 "^底线$" query_20220922) <(grep -m5 -Fi "底线" query_20220922)
where -m5 picks out the first five matches. (Try a different range, perhaps with tail, if the differences are all near the end of the file, for example.)
Tangentially, you usually want to replace the pipe to wc -l with grep -c; also, you might want to try grep -Fx "底线" as a faster alternative to the first search.
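Combining those flags, the first count can be obtained directly as:
grep -cxF "底线" query_20220922
where -c counts the matching lines, and -x/-F restrict the match to whole lines equal to the fixed string.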

Get list of strings between certain strings in bash

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within the braces of each such string ("alice" and "bob") in a new text file (say, .txt).
In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.
Attempts:
I thought about combining grep and cut.
From other questions and answers that I have seen on Stack Exchange, I think that (modulo reading up on cut a bit more) I could manage to get at least one such item per line, but I do not know how to get all occurrences from a single line if there are several such strings in it, and I have not seen any question or answer giving hints in this direction.
I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing-complete, so I am sure there is a way to do this with sed alone, but I do not see how).
What about:
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
-P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
-o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
The regexp matches curly-brace-free text preceded by \cite{ (the positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (the positive lookahead group (?=})).
sort -u sorts and removes duplicates
For more details about lookahead and lookbehind groups, see Regular-Expressions.info dedicated page.
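For example, given a hypothetical sample.tex:
$ cat sample.tex
We cite \cite{alice}, then \cite{bob} and \cite{alice} again.
$ grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u
alice
bob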
You can use grep -o and postprocess its output:
grep -o '\\cite{[^{}]*}' file.tex |
sed 's/\\cite{\([^{}]*\)}/\1/'
If there can only ever be a single \cite on an input line, just a sed script suffices.
sed -n 's/.*\\cite{\([^{}]*\)}.*/\1/p' file.tex
(It's by no means impossible to refactor this into a script which extracts multiple occurrences per line; but good luck understanding your code six weeks from now.)
As usual, add sort -u to remove any repetitions.
Here's a brief Awk attempt:
awk -v RS='\\' '/^cite\{/ {       # records are separated by backslashes
    split($0, g, /[{}]/)          # g[2] is the text between { and }
    cite[g[2]]++ }                # collect it; duplicates collapse
END { for (cit in cite) print cit }' file.tex
This conveniently does not print any duplicates, and trivially handles multiple citations per line. (Pipe the output through sort if you need a deterministic order; for (cit in cite) visits keys in an unspecified order.)

Ignoring strings without using the -v flag

I am trying to use egrep to find lines in a file that contain a certain word but don't start with that word.
I am currently doing it like so...
egrep '^word|word' file.txt
I tried putting it in brackets with the ^ negation symbol, but a bracket expression specifies each letter individually, not the word as a whole.
egrep '^[^word]|word' file.txt
How can I do this? For example, how do I ignore every "The" at the beginning of a line but spot the other ones, without using the -v flag?
All you need is:
grep '..*word' file
or:
grep -E '.+word' file
to find lines that contain word at a location other than the start of a line.
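A quick illustration:
$ printf 'word at the start\nfind the word here\n' | grep '..*word'
find the word here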

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret)". I think the ".+" means "one or more of anything", but I'm not sure if that's correct, and if it is, what distinguishes it from ".*".
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceded by "//".
Thank you.
The first line tells you that the problem is from bash (your shell). Bash finds the ! and attempts to inject into your command the last command you entered that begins with \/\/. To avoid this you need to escape the ! or use single quotes. For an example of !, try !cat: it will execute the last command beginning with cat that you entered.
You don't need to escape /; it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time, simple is better. And you also don't need to cat the file to grep. Just give grep the file name, e.g.:
grep -v "^//" bar.txt | grep "#include"
If you're really hung up on using regular expressions, then a simple one would look like this (match start of string ^, any amount of whitespace [[:space:]]*, exactly two slashes /{2}, any number of any characters .*, followed by #include):
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using a negative lookahead, which is a PCRE feature and requires the -P option.
Your negative lookahead also won't work without a start anchor.
This will of course require GNU grep.
You must use single quotes around the ! in your regex; otherwise history expansion is attempted with the text after the !, which is the reason for the !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h* matches 0 or more horizontal whitespace characters.
Without -P or non-gnu grep you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. The regex will also pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
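As a quick illustration on a made-up sample.c:
$ cat sample.c
#include <stdio.h>
  #  include "local.h"
// #include <commented.h>
#include MACRO_NAME
$ grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]' sample.c
#include <stdio.h>
  #  include "local.h"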
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (Perl) option, and anchor the lookahead to the start of the line for it to be useful:
grep -P '^(?!//)' bar.txt
For the lines not beginning with "//", you could use (^[^/]{2}.*$); note, though, that this also rejects lines that start with a single slash or are shorter than two characters, so something like ^(/?[^/]|/?$) with grep -E handles those cases too.
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than with grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a'
while with awk:
awk '/a/ && /b/'
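For example:
$ printf 'a then b\nb then a\nonly a\nonly b\n' | awk '/a/ && /b/'
a then b
b then a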

Suppress the match itself in grep

Suppose I have lots of files in the form of
First Line Name
Second Line Surname Adress
Third Line etc
etc
Now I'm using grep to match the first line. But I'm doing this actually to find the second line. The second line is not a pattern that can be matched (it just depends on the first line). My regex pattern works, and the command I'm using is
grep -rHIin pattern . -A 1 -m 1
Now the -A option prints the line after a match. The -m option stops after 1 match (there are other lines that match my pattern, but I'm interested only in the first match anyway...).
This actually works but the output is like that:
./example_file:1: First Line Name
./example_file-2- Second Line Surname Adress
I've read the manual but couldn't find any clue or info about that. Now here is the question.
How can I suppress the match itself ? The output should be in the form of:
./example_file-2- Second Line Surname Adress
sed to the rescue:
sed -n '2,${p;n;}'
The particular sed command here starts with line 2 of its input and prints every other line. Pipe the output of grep into that and you'll only get the even-numbered lines out of the grep output.
An explanation of the sed command itself:
2,$ - the range of lines from line 2 to the last line of the file
{p;n;} - print the current line, then ignore the next line (this then gets repeated)
(In this special case of all even lines, an alternative way of writing this would be sed -n 'n;p;' since we don't actually need to special-case any leading lines. If you wanted to skip the first 5 lines of the file, this wouldn't be possible, you'd have to use the 6,$ syntax.)
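For example, on input alternating between match lines and the lines you actually want:
$ printf 'match 1\nwanted 1\nmatch 2\nwanted 2\n' | sed -n '2,${p;n;}'
wanted 1
wanted 2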
You can use sed to print the line after each match:
sed -n '/<pattern>/{n;p}' <file>
To get recursion and the file names, you will need something like:
find . -type f -exec sed -n '/<pattern>/{n;s|^|{}:|;p}' {} \;
(This relies on GNU find substituting the file name for the {} inside the sed script as well; the s||| delimiter is used because the file names contain slashes.)
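For instance, on the question's example_file:
$ sed -n '/First Line/{n;p}' example_file
Second Line Surname Adress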
If you have already read a book on grep, you could also read a manual on awk, another common Unix tool.
In awk, your task can be solved with nice, simple code. (As for me, I always have to refresh my knowledge of awk's syntax by going to the manual (info awk) when I want to use it.)
Or, you could come up with a solution combining find (to iterate over your files), grep (to select the lines), and head/tail (to discard, for each individual file, the lines you don't want); see the sketch below. The complication with find is being able to work with each file individually, discarding a line per file.
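A rough sketch of that combination (the pattern is a placeholder; it prints, for each file, only the line following the first match, and prints nothing for a file whose match is its last line):
find . -type f | while IFS= read -r f; do
  grep -m1 -A1 'pattern' "$f" | tail -n +2
done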
You could pipe the results through grep -v pattern.