Grep Search Specific Character Trouble - regex

I have searched extensively and cannot figure out what I am doing wrong here. I have a text file that may contain a string similar to the following:
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
I generally know what the sting will look like up until the disk percentage indicator (i.e. 11%), but in the final part of the string I need to figure out if it ends in the usr (or sub) directories.
I want to use grep to do this search but am having problems. For example, the following command gives me output, but once i replace any of the "." characters where the "G" or "%" would be, or if I try to add "/usr/.*" at the end it refuses to return anything.
$ egrep ^/dev/dir1/dir2\s*\d*.\s*\d*.\s*\d*.\s*\d*.\s*.*$ testfile
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4

grep's extended regular expressions do not support using \d to match digits. Instead, use [0-9] or [:digit:]. You can use the following grep command:
egrep '^/dev/dir1/dir2\s*[0-9]*G\s*[0-9]*G\s*[0-9]*G\s*[0-9]*%\s*.*$'
You can also pass grep the -P option to enable Perl compatible regular expressions, which do support \d:
grep -P '^/dev/dir1/dir2\s*\d*G\s*\d*G\s*\d*G\s*\d*%\s*.*$'
Note the use of grep instead of egrep in the above command; -P is incompatible with egrep.
As a side note, I prefer to use + instead of * when I can, because it is stricter and can cause errors to become apparent sooner. For example, I assume there will always be at least one space and one digit in each place in the input, so you can use \s+ and [0-9]+ (or \d+). If your original pattern had used +, it would not have matched at all in the first place (whether it was quoted or not), and you would have known you had a problem even before adding the G or % to it. A working example is
egrep '^/dev/dir1/dir2\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+.+$'

Related

Why do these two grep commands produce different results?

$ grep "^底线$" query_20220922 | wc -l
95701
$ grep -iF "底线" query_20220922 | wc -l
796591
Shouldn't the count be exactly the same? I want to count the exact match of the string.
-F matches a fixed string anywhere in a line. ^xyz$ matches lines which contain "xyz" exactly (nothing else).
You are looking for -x/--line-regexp and not -F/--fixed-strings.
To match lines which contain your search text exactly, without anything else and without interpreting your search text as regular expression, combine the two flags: grep -xF 'findme' file.txt.
Also, case-insensitive matching (-i) can match more lines too than case-sensitive matching (the default).
No, they do different things. The first uses a regular expression to search for "底线" alone on an input line (^ in a regular expression means beginning of line, and $ means end of line).
The second searches for the string anywhere on an input line. The -i flag does nothing at all here (it selects case-insensitive matching, but this is not well-defined for CJK character sets, so basically a no-op) and -F says to search literally (which makes the search faster for internal reasons, but doesn't change the semantics of a search string which doesn't contain any regex metacharacters).
It should be easy to see how they differ. For a large input file, it might be a bit challenging to find the differences if they are not conveniently mixed; but for a quick start, try
diff -u <(grep -m5 "^底线$" query_20220922) <(grep -m5Fi "底线" query_20220922)
where -m5 picks out the first five matches. (Try a different range, perhaps with tail, if the differences are all near the end of the file, for example.)
Tangentially, you usually want to replace the pipe to wc -l with grep -c; also,you might want to try grep -Fx "底线" as a faster alternative to the first search.

Regex whitespace before character [duplicate]

I am attempting to grep for all instances of Ui\. not followed by Line or even just the letter L
What is the proper way to write a regex for finding all instances of a particular string NOT followed by another string?
Using lookaheads
grep "Ui\.(?!L)" *
bash: !L: event not found
grep "Ui\.(?!(Line))" *
nothing
Negative lookahead, which is what you're after, requires a more powerful tool than the standard grep. You need a PCRE-enabled grep.
If you have GNU grep, the current version supports options -P or --perl-regexp and you can then use the regex you wanted.
If you don't have (a sufficiently recent version of) GNU grep, then consider getting ack.
The answer to part of your problem is here, and ack would behave the same way:
Ack & negative lookahead giving errors
You are using double-quotes for grep, which permits bash to "interpret ! as history expand command."
You need to wrap your pattern in SINGLE-QUOTES:
grep 'Ui\.(?!L)' *
However, see #JonathanLeffler's answer to address the issues with negative lookaheads in standard grep!
You probably cant perform standard negative lookaheads using grep, but usually you should be able to get equivalent behaviour using the "inverse" switch '-v'. Using that you can construct a regex for the complement of what you want to match and then pipe it through 2 greps.
For the regex in question you might do something like
grep 'Ui\.' * | grep -v 'Ui\.L'
(Edit: this is not as strong as a true lookahead, but can often be used to work around the problem.)
If you need to use a regex implementation that doesn't support negative lookaheads and you don't mind matching extra character(s)*, then you can use negated character classes [^L], alternation |, and the end of string anchor $.
In your case grep 'Ui\.\([^L]\|$\)' * does the job.
Ui\. matches the string you're interested in
\([^L]\|$\) matches any single character other than L or it matches the end of the line: [^L] or $.
If you want to exclude more than just one character, then you just need to throw more alternation and negation at it. To find a not followed by bc:
grep 'a\(\([^b]\|$\)\|\(b\([^c]\|$\)\)\)' *
Which is either (a followed by not b or followed by the end of the line: a then [^b] or $) or (a followed by b which is either followed by not c or is followed by the end of the line: a then b, then [^c] or $.
This kind of expression gets to be pretty unwieldy and error prone with even a short string. You could write something to generate the expressions for you, but it'd probably be easier to just use a regex implementation that supports negative lookaheads.
*If your implementation supports non-capturing groups then you can avoid capturing extra characters.
If your grep doesn't support -P or --perl-regexp, and you can install PCRE-enabled grep, e.g. "pcregrep", than it won't need any command-line options like GNU grep to accept Perl-compatible regular expressions, you just run
pcregrep "Ui\.(?!Line)"
You don't need another nested group for "Line" as in your example "Ui.(?!(Line))" -- the outer group is sufficient, like I've shown above.
Let me give you another example of looking negative assertions: when you have list of lines, returned by "ipset", each line showing number of packets in a middle of the line, and you don't need lines with zero packets, you just run:
ipset list | pcregrep "packets(?! 0 )"
If you like perl-compatible regular expressions and have perl but don't have pcregrep or your grep doesn't support --perl-regexp, you can you one-line perl scripts that work the same way like grep:
perl -e "while (<>) {if (/Ui\.(?!Lines)/){print;};}"
Perl accepts stdin the same way like grep, e.g.
ipset list | perl -e "while (<>) {if (/packets(?! 0 )/){print;};}"
At least for the case of not wanting an 'L' character after the "Ui." you don't really need PCRE.
grep -E 'Ui\.($|[^L])' *
Here I've made sure to match the special case of the "Ui." at the end of the line.

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same than doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing these scape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why in the first case the escape characters are needed?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab* could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -O 'ab* "$filename" instead of grep -O 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context, as here and here). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E .+ myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\ myfile.txt
Note that you need to prefix the \ which is intended for grep by another \, because the backslash has, like a space, a special meaning for the shell, and if you would not write \\., grep would not see a backslash-period, but just a period.

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results looks wildly different to what I expect, and what I get from Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from less to many features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need it, but PRCEs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret). I think the ".+" means "one or more of anything", but I'm not sure if that's correct and if it is correct, what distinguishes it from ".*"
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceeded by "//".
Thank you.
The first line tells you that the problem is from bash (your shell). Bash finds the ! and attempts to inject into your command the last you entered that begins with \/\/. To avoid this you need to escape the ! or use single quotes. For an example of !, try !cat, it will execute the last command beginning with cat that you entered.
You don't need to escape /, it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name. eg.
grep -v "^//" bar.txt | grep "#include"
If you're really hungup on using regular expressions then a simple one would look like (match start of string ^, any number of white space [[:space:]]*, exactly two backslashes /{2}, any number of any characters .*, followed by #include):
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using negative lookahead which is PCRE feature and requires -P option
Your negative lookahead won't work without start anchor
This will of course require gnu-grep.
You must use single quotes to use ! in your regex otherwise history expansion is attempted with the text after ! in your regex, the reason of !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h matches 0 or more horizontal whitespace.
Without -P or non-gnu grep you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. It will pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (perl) option:
cat bar.txt | grep -P '(?!//)'
For the lines not beginning with "//", you could use (^[^/]{2}.*$).
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a`
while with awk:
awk '/a/ && /b/'