Why do these two grep commands produce different results? - regex

$ grep "^底线$" query_20220922 | wc -l
95701
$ grep -iF "底线" query_20220922 | wc -l
796591
Shouldn't the count be exactly the same? I want to count the exact match of the string.

-F matches a fixed string anywhere in a line. ^xyz$ matches lines which contain "xyz" exactly (nothing else).
You are looking for -x/--line-regexp and not -F/--fixed-strings.
To match lines which contain your search text exactly, without anything else and without interpreting your search text as regular expression, combine the two flags: grep -xF 'findme' file.txt.
Also, case-insensitive matching (-i) can match more lines too than case-sensitive matching (the default).

No, they do different things. The first uses a regular expression to search for "底线" alone on an input line (^ in a regular expression means beginning of line, and $ means end of line).
The second searches for the string anywhere on an input line. The -i flag does nothing at all here (it selects case-insensitive matching, but this is not well-defined for CJK character sets, so basically a no-op) and -F says to search literally (which makes the search faster for internal reasons, but doesn't change the semantics of a search string which doesn't contain any regex metacharacters).
It should be easy to see how they differ. For a large input file, it might be a bit challenging to find the differences if they are not conveniently mixed; but for a quick start, try
diff -u <(grep -m5 "^底线$" query_20220922) <(grep -m5Fi "底线" query_20220922)
where -m5 picks out the first five matches. (Try a different range, perhaps with tail, if the differences are all near the end of the file, for example.)
Tangentially, you usually want to replace the pipe to wc -l with grep -c; also,you might want to try grep -Fx "底线" as a faster alternative to the first search.

Related

grep and regex stored in string

my question is quite short:
a="'[0-9]*'"
grep -E '[0-9]*' #for example, line containing 000 will be recognized and printed
but
grep -E $a #line containing 000 WILL NOT be printed, why is that?
Does substitution for grep regex change the command's behaviour or have I missed something from a syntactic point of view? In other words, how do I make it so that grep accepts regex from a string stored in a variable.
Thank you in advance.
Quotes go around data, not in data. That means, when you store data (in this case, a regex expression) in a variable, don't embed quotes in the variable; instead, put double-quotes around the variable when you use it:
a="[0-9]*"
grep -E "$a"
You can sometimes get away with leaving the double-quotes off when using variables (as in Avinash Raj's comment), but it's not generally safe. In this case, it'll work fine provided there are no files or subdirectories in the current working directories with names that happen to start with a digit. You see, without double-quotes around $a, the shell will take its value, try to split it into multiple words (not a problem here), try to expand each word that contains shell wildcards into a list of matching files (potential problem here), and pass that to the command (grep) as its list of arguments. That means that if you happen to have files that start with digits in the current directory, grep thinks you ran a command like this:
grep -E 1file.txt 2file.jpg 3file.etc
... and it treats the first filename as the pattern to search for, and any other filenames as files to be searched. And you'll be scratching your head wondering why your script works or fails depending on which directory you happen to be in.
Note: the pattern [0-9]* is a valid regular expression, and a valid shell glob (wildcard) pattern, but it means very different things in the two contexts. As a regex, it means 0 or more digits in a row. As a shell glob, it means something that starts with a digit. Speaking of which, grep -E '[0-9]*' is not actually going to be very useful, since everything contains strings of 0 or more digits, so it'll match every line of every file you feed it.

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret). I think the ".+" means "one or more of anything", but I'm not sure if that's correct and if it is correct, what distinguishes it from ".*"
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceeded by "//".
Thank you.
The first line tells you that the problem is from bash (your shell). Bash finds the ! and attempts to inject into your command the last you entered that begins with \/\/. To avoid this you need to escape the ! or use single quotes. For an example of !, try !cat, it will execute the last command beginning with cat that you entered.
You don't need to escape /, it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name. eg.
grep -v "^//" bar.txt | grep "#include"
If you're really hungup on using regular expressions then a simple one would look like (match start of string ^, any number of white space [[:space:]]*, exactly two backslashes /{2}, any number of any characters .*, followed by #include):
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using negative lookahead which is PCRE feature and requires -P option
Your negative lookahead won't work without start anchor
This will of course require gnu-grep.
You must use single quotes to use ! in your regex otherwise history expansion is attempted with the text after ! in your regex, the reason of !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h matches 0 or more horizontal whitespace.
Without -P or non-gnu grep you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. It will pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (perl) option:
cat bar.txt | grep -P '(?!//)'
For the lines not beginning with "//", you could use (^[^/]{2}.*$).
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a`
while with awk:
awk '/a/ && /b/'

Does wildcard characters behave differently?

I want to understand if behavior of wildcard characters is same or not, for example:
1) On unix prompt, if i type ls *.xml, then it lists all files ending with .xml, for example: 1.xml, first.xml etc.
--> So as it appears, * matches any character.
2) Now, i was trying to find some text using grep in a file, and i executed following command:
grep -i "*.xml" first.txt
To my utter surprise there was no results returned, even though first.txt had contents like: first.xml, second.xml.
If i do grep -i "xml" first.xml, then I get the results.
This behavior is causing confusion, how does * matches any text in case (1) and in case (2) it is failing?
Does this behave different in different situations, and if so, where to find this info.
Shell uses globbing. https://en.wikipedia.org/wiki/Glob_(programming)
Grep uses regular expressions. https://en.wikipedia.org/wiki/Regular_expression
They are two different languages, and for grep, * is not a wild card. It means the previous character, repeated zero or more times
You want
grep '.*\.xml'
which means any character (.) repeated zero or more times (*) followed by a literal '.' (\.) xml
(For practical purposes you would just use fgrep .xml or grep '\.xml$' of course)

Grep Search Specific Character Trouble

I have searched extensively and cannot figure out what I am doing wrong here. I have a text file that may contain a string similar to the following:
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
I generally know what the sting will look like up until the disk percentage indicator (i.e. 11%), but in the final part of the string I need to figure out if it ends in the usr (or sub) directories.
I want to use grep to do this search but am having problems. For example, the following command gives me output, but once i replace any of the "." characters where the "G" or "%" would be, or if I try to add "/usr/.*" at the end it refuses to return anything.
$ egrep ^/dev/dir1/dir2\s*\d*.\s*\d*.\s*\d*.\s*\d*.\s*.*$ testfile
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
grep's extended regular expressions do not support using \d to match digits. Instead, use [0-9] or [:digit:]. You can use the following grep command:
egrep '^/dev/dir1/dir2\s*[0-9]*G\s*[0-9]*G\s*[0-9]*G\s*[0-9]*%\s*.*$'
You can also pass grep the -P option to enable Perl compatible regular expressions, which do support \d:
grep -P '^/dev/dir1/dir2\s*\d*G\s*\d*G\s*\d*G\s*\d*%\s*.*$'
Note the use of grep instead of egrep in the above command; -P is incompatible with egrep.
As a side note, I prefer to use + instead of * when I can, because it is stricter and can cause errors to become apparent sooner. For example, I assume there will always be at least one space and one digit in each place in the input, so you can use \s+ and [0-9]+ (or \d+). If your original pattern had used +, it would not have matched at all in the first place (whether it was quoted or not), and you would have known you had a problem even before adding the G or % to it. A working example is
egrep '^/dev/dir1/dir2\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+.+$'

How can I match zero or more instances of a pattern in bash?

I'm trying to loop through a bunch of file prefixes looking for a single line matching a given pattern from each file. I have extracted and generalized a couple examples and have used them below to illustrate my question.
I searched for a line that may have some spaces at the beginning, followed by the number 1234, with maybe some more spaces, and then the number 98765. I know the file of interest begins with l76.logsheet and I want to extract the line from the file that ends with one or more numbers. However, I want to make sure I exclude files ending with anything else (of which there are too many options to reasonably use the grep --exclude option). Here's how I did it from the tcsh shell:
tcsh% grep -E '^\s{0,}1234\s+98765' l76.logsheet[0-9]{0,}
l76.logsheet10:1234 98765 y 13:02:44 2
And here's another example where I was again searching for 98765, but with a different number out front and a different file prefix:
tcsh% grep -E '^\s{0,}4321\s+98765' k43.logsheet[0-9]{0,}
k43.logsheet1: 4321 98765 y 13:06:38 14
Works great and returns just what I need.
My problem is with the bash shell. Repeating the same command returns a rather interesting result. With the first line, there are no problems:
bash$ grep -E '^\s{0,}1234\s+98765' l76.logsheet[0-9]{0,}
which returns:
l76.logsheet10:1234 98765 y 13:02:44 2
But the result for the second example only has one digit at the end of the filename. This causes bash to throw an error before providing the correct result:
bash$ grep -E '^\s{0,}4321\s+98765' k43.logsheet[0-9]{0,}
grep: k43.logsheet[0-9]0: No such file or directory
k43.logsheet1: 4321 98765 y 13:06:38 14
My question is, how do I search for files ending in zero or more of the previous pattern from the bash shell? I have a work around, but I'm looking for an actual answer to this question, which may save me (and hopefully others) time in the future.
First, make sure that extglob is set:
shopt -s extglob
Now, we can match zero or more of any pattern with *(...). For example, let's create some files and match them:
$ touch logsheet logsheet2 logsheet23 logsheet234
$ echo logsheet*([0-9])
logsheet logsheet2 logsheet23 logsheet234
Documentation
According to man bash, bash offers the following features with extglob:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
#(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns