Difference between using grep regex pattern with or without quotes? - regex

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why are the escape characters needed in the first case?
Lastly, I've also seen other methods like grep -E and egrep. Which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!

You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
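The expansion described above can be observed directly. The following is a minimal sketch, assuming a POSIX shell and a scratch directory created with mktemp; the variable names are made up for illustration, and printf stands in for grep to show exactly which arguments the command would receive.

```shell
#!/bin/sh
# Work in a throwaway directory so the glob has something to match.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch abcd.txt abcdef.jpg
printf 'abbb\nxyz\n' > file.txt

# Unquoted: the shell replaces ab* with the matching filenames before the command runs.
unquoted_args=$(printf '<%s> ' ab* file.txt)
# Quoted: the pattern reaches the command unchanged.
quoted_args=$(printf '<%s> ' 'ab*' file.txt)

echo "unquoted: $unquoted_args"
echo "quoted:   $quoted_args"

# With quotes, grep sees the regex itself and finds the match in file.txt.
quoted_match=$(grep -o 'ab*' file.txt)
echo "match: $quoted_match"
```

In the unquoted case the regex never reaches grep at all; the first matching filename takes its place as the pattern.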
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
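To illustrate the word-splitting problem, here is a small sketch, assuming a POSIX sh and a hypothetical file name that contains a space. The unquoted reference is split into two words, neither of which names an existing file.

```shell
#!/bin/sh
demo_dir=$(mktemp -d)
filename="$demo_dir/my notes.txt"   # hypothetical file name containing a space
printf 'abb\n' > "$filename"

# Quoted: grep receives one file argument and finds the match.
quoted_result=$(grep -o 'ab*' "$filename")

# Unquoted: the name is split at the space into two nonexistent files,
# so grep reports errors (discarded here) and prints nothing.
unquoted_result=$(grep -o 'ab*' $filename 2>/dev/null)

echo "quoted:   [$quoted_result]"
echo "unquoted: [$unquoted_result]"
```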
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
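The BRE and ERE forms above can be compared side by side. This is a rough sketch with made-up sample input, assuming a grep that follows POSIX for interval expressions; it also shows that unescaped braces in BRE are treated as literal characters.

```shell
#!/bin/sh
# Two sample lines: only the first has 1 to 3 spaces between "a" and "b".
sample=$(printf 'a  b\na     b\n')

# BRE (default): braces need backslashes to act as an interval.
bre_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b')
# ERE (-E): braces are an interval without backslashes.
ere_count=$(printf '%s\n' "$sample" | grep -Ec 'a[[:space:]]{1,3}b')
# BRE with unescaped braces: they are literal characters, so nothing matches.
literal_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]{1,3}b')

echo "BRE escaped: $bre_count, ERE: $ere_count, BRE unescaped: $literal_count"
```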

Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E .+ myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\  myfile.txt
Note the two spaces before myfile.txt: the first is escaped and belongs to the pattern, the second separates the arguments. You also need to prefix the \ which is intended for grep with another \, because the backslash, like a space, has a special meaning for the shell; if you did not write \\., grep would not see a backslash-period, but just a period.
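One way to check what the shell actually hands to grep is to substitute printf for it. The sketch below assumes a POSIX shell; printf prints each argument it receives in square brackets, so the effect of quoting and backslashes becomes visible.

```shell
#!/bin/sh
# Quoted form: the spaces and the backslash are protected by single quotes.
quoted=$(printf '[%s]' ' \.+ ' myfile.txt)

# Unquoted form: "\ " is an escaped space, "\\" is one literal backslash.
unquoted=$(printf '[%s]' \ \\.+\  myfile.txt)

# With a single backslash, the shell consumes it and grep would see a bare period.
single_backslash=$(printf '[%s]' \ \.+\  myfile.txt)

echo "quoted:           $quoted"
echo "unquoted:         $unquoted"
echo "single backslash: $single_backslash"
```

The first two forms deliver identical arguments; the third silently loses the backslash.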

Related

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results look wildly different from what I expect, and from what I get in Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from fewest to most features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ?, bounds, and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
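The difference between the first three modes shows up with a pattern as small as a+b. A minimal sketch with invented sample lines, assuming a POSIX-conforming grep:

```shell
#!/bin/sh
sample=$(printf 'aab\na+b\n')

# -F: the pattern is the literal text "a+b".
fixed_hits=$(printf '%s\n' "$sample" | grep -F 'a+b')
# Default (BRE): + is an ordinary character, so this also matches the literal text.
basic_hits=$(printf '%s\n' "$sample" | grep 'a+b')
# -E (ERE): + means "one or more of the preceding item".
extended_hits=$(printf '%s\n' "$sample" | grep -E 'a+b')

echo "fixed:    $fixed_hits"
echo "basic:    $basic_hits"
echo "extended: $extended_hits"
```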
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need them, but PCREs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

How to use regex in grep shell command

I am not able to use the regex [-]+[\n][-]+ in the shell. Can somebody help me use it with escape characters in grep? I want to find lines like '----\n----' in the file.
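Plain old grep uses Basic Regular Expressions, so you don't have immediate access to the + quantifier. Many grep implementations (GNU grep, for instance) offer it as an extension, but it requires a backslash before the quantifier, like \+. Because the pattern starts with a hyphen, it also has to be introduced with -e so that grep does not read it as an option:
grep -e '-\+\\n-\+' file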
(notice also the single quotes to prevent the shell from doing its own backslash substitution; and pay attention to how n matches just itself and a literal backslash is matched either with a double \\ to escape it or in a character class [\]); or you can use grep -E (aka egrep) which supports Extended Regular Expression:
grep -E -e '-+\\n-+' file
For the record, [\n] matches a single character which can be either a literal backslash or a literal n. This construct is called a character class; inexplicably, newcomers seem to want to try to use square brackets for pretty much anything just because they "look like regex", apparently.
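Both points can be checked with a short sketch. It assumes the file contains a literal backslash followed by n (not a real newline), uses invented sample lines, and passes the pattern with -e because it begins with a hyphen.

```shell
#!/bin/sh
# Sample input: the first line holds a literal backslash followed by n, not a newline.
sample=$(printf '%s\n' '----\n----' '--------' 'plain text')

# -e marks the next argument as the pattern, so its leading "-" is not read as an option.
ere_hits=$(printf '%s\n' "$sample" | grep -E -e '-+\\n-+')

# [\n] is a bracket expression: it matches one backslash or one letter n.
class_count=$(printf '%s\n' 'n' '\' 'x' | grep -c '[\n]')

echo "ERE hits: $ere_hits"
echo "lines matching [\\n]: $class_count"
```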

Grep Search Specific Character Trouble

I have searched extensively and cannot figure out what I am doing wrong here. I have a text file that may contain a string similar to the following:
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
I generally know what the string will look like up until the disk percentage indicator (i.e. 11%), but in the final part of the string I need to figure out if it ends in the usr (or sub) directories.
I want to use grep to do this search but am having problems. For example, the following command gives me output, but once I replace any of the "." characters where the "G" or "%" would be, or if I try to add "/usr/.*" at the end, it refuses to return anything.
$ egrep ^/dev/dir1/dir2\s*\d*.\s*\d*.\s*\d*.\s*\d*.\s*.*$ testfile
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
grep's extended regular expressions do not support using \d to match digits. Instead, use [0-9] or [[:digit:]]. You can use the following grep command:
egrep '^/dev/dir1/dir2\s*[0-9]*G\s*[0-9]*G\s*[0-9]*G\s*[0-9]*%\s*.*$'
You can also pass grep the -P option to enable Perl compatible regular expressions, which do support \d:
grep -P '^/dev/dir1/dir2\s*\d*G\s*\d*G\s*\d*G\s*\d*%\s*.*$'
Note the use of grep instead of egrep in the above command; -P is incompatible with egrep.
As a side note, I prefer to use + instead of * when I can, because it is stricter and can cause errors to become apparent sooner. For example, I assume there will always be at least one space and one digit in each place in the input, so you can use \s+ and [0-9]+ (or \d+). If your original pattern had used +, it would not have matched at all in the first place (whether it was quoted or not), and you would have known you had a problem even before adding the G or % to it. A working example is
egrep '^/dev/dir1/dir2\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+.+$'
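As a variation on the commands above, the following sketch uses only POSIX character classes, which avoids relying on \s or \d being supported. The second sample line and the variable names are invented for contrast; the pattern is one possible way to require a mount point under /usr, not the only one.

```shell
#!/bin/sh
# Sample lines modelled on the question; the second mount point is made up for contrast.
usr_line='/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4'
other_line='/dev/dir1/dir2 200G 22G 179G 11% /opt/dir3/dir4'

# Three size columns, a percentage, then a mount point under /usr.
pattern='^/dev/dir1/dir2([[:space:]]+[0-9]+G){3}[[:space:]]+[0-9]+%[[:space:]]+/usr/'

usr_hits=$(printf '%s\n' "$usr_line" "$other_line" | grep -Ec "$pattern")
matched=$(printf '%s\n' "$usr_line" "$other_line" | grep -E "$pattern")

echo "matching lines: $usr_hits"
echo "$matched"
```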

Grep for multiple strings with escaped pipe in each

I'm using Gitbash within Windows. I want to grep for a set of strings, each of which ends with a |
I think I can do each one singly with a backslash to escape the pipe:
grep abcdef\| filename.tsv
But to do them all together I end up with:
grep 'abcdef\|\|uvwxyz\|' filename.tsv
which fails. Any ideas?
I could just do each string individually and then concatenate the resulting files, but it would take days.
In basic POSIX regexes, which grep uses by default, you must not escape a literal |. However, you need to escape the | if it is used as a regex syntax element to specify alternatives (an extension that GNU grep supports).
The following expression should work:
grep 'abcdef|\|uvwxyz|' filename.tsv
An ERE might be the way to go, for easier readability.
egrep '(abcdef|uvwxyz)[|]' filename.tsv
This lets you manage your string list a little more easily, and "escapes" the trailing vertical bar by putting it inside a bracket expression. (This works for dots, asterisks, etc., as well.)
If egrep isn't available on your system, you can check to see if your existing grep includes a -E option for extended regexes.
There are two competing effects here which you may be confusing. Firstly, the | must be escaped or quoted so that it is not interpreted by the shell. Secondly, depending on which regex mode you are using, escaping/unescaping the pipe changes whether it is a literal character or a metacharacter.
I would suggest that you change your pattern to this:
grep 'abcdef|\|uvwxyz|' file
In basic regex mode, an escaped pipe \| is a regex OR, so this matches either pattern followed by a literal pipe.
Alternatively, if all your patterns end in a pipe and you have more than just two, perhaps you could use this:
grep -E '(abc|def|ghi)\|' file
In extended mode, escaping the pipe has the opposite effect, so this pattern matches any of the sequences of letters followed by a literal pipe.
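The ERE form can be tried on a few invented lines. The sketch below also shows a fixed-string alternative, on the assumption that the search strings contain no regex operators that are meant to be active: with -F, each string is given with its own -e and nothing needs escaping.

```shell
#!/bin/sh
sample=$(printf '%s\n' 'abcdef|1' 'uvwxyz|2' 'abcdef 3' 'uvwxyz4')

# ERE: unescaped | is alternation, \| is a literal pipe.
ere_hits=$(printf '%s\n' "$sample" | grep -E '(abcdef|uvwxyz)\|')

# Fixed strings: no regex at all, one -e per search string.
fixed_hits=$(printf '%s\n' "$sample" | grep -F -e 'abcdef|' -e 'uvwxyz|')

echo "$ere_hits"
echo "$fixed_hits"
```

For a long list of strings, grep -F -f with a file of patterns (one per line) follows the same idea.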

Making regular expressions look nice in shell scripts

I often use grep and sed in my bash scripts.
For example, I use a script to remove comments from a template
In this example the comments look like:
/*# my comments contain text and ascii art:
*#
*# [box1] ------> [box2]o
*#
#*/
My sed chain to remove these lines looks like:
sed '/^\/\*#/d' | sed '/^\s*\*#/d' | sed '/^\s*#\*\//d'
In my scripts, I have to escape chars such as \ and /, which makes the code less readable. Therefore, my question is: How can I write nice-to-read regular expressions for sed in bash scripts?
One way I can think of is using another separator instead of /, as in vim, where you can natively use %s#search/text#replace/text#gc (using # as the separator) and therefore allow / as an unescaped character. Defining an alternative escape char would also help. I would be interested in how you solve this problem. I am also open to alternative tools in case you think it is only a sed problem.
You can specify different separators, as detailed here.
Note that Perl allows you to do this too, along with splitting your regexp across several lines for better readability.
I think trying to make regex (which a lot of times is a sequence of symbols) nice to read is pretty hard.
However there are a few things you can do:
1. Use -r (or -E in some systems) so that you don't have to escape regex operators (), {}, +, ?
2. Use alternative separators, e.g. for the s command
sed 's#regex#replacement#' file
For addresses (you'll need a leading '\')
sed '\#pattern# d' file
3. Leave spaces between address range and command (like d above).
4. Leave comments explaining what the regex matches (you can even include an example).
3 and 4 are more of an indirect approach but they should help.
Anyway what you are doing can be done in a single sed expression:
sed '\:^/\*#:,\:^#\*/: d' file
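To see the single-expression version at work, here is a minimal sketch that feeds a made-up template through it. It assumes the comment block always opens with a line starting with /*# and closes with a line starting with #*/, as in the question.

```shell
#!/bin/sh
template=$(printf '%s\n' 'first line' '/*# my comment' ' *# [box1] ------> [box2]' '#*/' 'last line')

# \: starts an address that uses ":" as its delimiter, so "/" needs no escaping.
# The range runs from the line starting with /*# to the line starting with #*/.
stripped=$(printf '%s\n' "$template" | sed '\:^/\*#:,\:^#\*/: d')

echo "$stripped"
```

Only the asterisks still need a backslash, because * is a regex operator regardless of the delimiter.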
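In addition to using alternative separators, you may use extended regular expressions where appropriate. They invert the escaping rules for grouping parentheses and interval braces: in ERE, ( ) and { } are special when unescaped and literal when written as \( \) and \{ \}. Square brackets are not affected; they introduce a bracket expression in both syntaxes.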