reg exp: "if" and single "=" - regex

I need a regular expression (grep -e "__") that matches all lines containing if and just one = (ignoring lines containing ==).
I tried this:
grep -e "if.*=[^=]"
but = is not a character class, so it doesn't work.

The problem is .* may contain an =.
I'd suggest
grep -e "if[^=]*=[^=]"
If your goal is to find lines of code with an if containing an erroneous assignment instead of a comparison, I'd suggest using a linter (which would be based on a robust parser instead of just regexes). The linter to use depends on the language of the code, of course (for example I use this one in Javascript).
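As a quick check of the suggested pattern, here is a small sketch that runs it over four made-up sample lines (the sample text is purely illustrative, only the pattern comes from the answer). Note that the != line is still reported, which is one reason a linter is the more reliable tool.

```shell
# Sample input is hypothetical; the pattern is the one suggested above.
matches=$(printf '%s\n' \
  'if (a = b) {' \
  'if (a == b) {' \
  'x = 1' \
  'if (a != b) {' |
  grep -e 'if[^=]*=[^=]')

# Prints the single-= line and, as a false positive, the != line.
printf '%s\n' "$matches"
```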

Related

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why in the first case the escape characters are needed?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest to always use them. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
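To see the expansion without involving grep at all, a minimal sketch like the following counts the arguments the shell produces. The scratch directory and the variable names are assumptions made for illustration; the file names are the ones from the example above.

```shell
# Hypothetical scratch directory reproducing the file names from the example.
start_dir=$PWD
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch abcd.txt abcdef.jpg file.txt

set -- ab*       # unquoted: the shell replaces it with the matching file names
unquoted_count=$#

set -- 'ab*'     # quoted: passed through as the single literal string ab*
quoted_count=$#
quoted_value=$1

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count argument ($quoted_value)"

# Clean up the scratch directory.
cd "$start_dir" && rm -rf "$tmp"
```

With the two ab* files present, the unquoted form turns into two arguments, so grep would never see the intended pattern.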
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
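The word-splitting part can be checked the same way. This is only a sketch: the file name is hypothetical and the default IFS is assumed.

```shell
filename='my notes.txt'   # hypothetical file name containing a space

set -- $filename          # unquoted: split on whitespace into two words
unquoted_count=$#

set -- "$filename"        # double-quoted: stays one argument
quoted_count=$#

echo "unquoted: $unquoted_count words, quoted: $quoted_count word"
```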
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
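One more thing worth knowing about the unquoted ^A{1,4} from the question: in bash, an unquoted {1,4} is subject to brace expansion before grep ever runs, so the test was probably not searching for what it appeared to. A minimal sketch, assuming bash (plain POSIX sh has no brace expansion):

```shell
#!/usr/bin/env bash
set -- ^A{1,4}      # unquoted: bash brace expansion rewrites it before grep runs
unquoted="$*"

set -- '^A\{1,4\}'  # quoted: grep receives the BRE interval unchanged
quoted="$*"

printf '%s\n' "unquoted: $unquoted" "quoted:   $quoted"
```

So grep ^A{1,4} file.txt ends up searching for the pattern ^A1 in the files ^A4 and file.txt.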
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context, as here and here). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
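The BRE and ERE forms can be compared directly with grep -c, which prints the number of matching lines. This is a sketch with a made-up input string; the PCRE form is left out because grep -P is not available everywhere.

```shell
sample='a  b'   # "a", two spaces, "b" (hypothetical input)

# || true keeps a zero-match result from being treated as a failure.
bre_escaped=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b' || true)
ere_plain=$(printf '%s\n' "$sample" | grep -Ec 'a[[:space:]]{1,3}b' || true)
bre_plain=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]{1,3}b' || true)

echo "BRE escaped: $bre_escaped, ERE unescaped: $ere_plain, BRE unescaped: $bre_plain"
```

The last case finds nothing because unescaped braces in BRE are ordinary characters.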
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E  .+  myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\  myfile.txt
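Note that you need to prefix the \ which is intended for grep by another \, because the backslash has, like a space, a special meaning for the shell, and if you did not write \\., grep would not see a backslash-period, but just a period. Also note the two spaces before the file name: the first one is escaped and belongs to the pattern, the second one separates the arguments.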
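A rough way to convince yourself that the three spellings reach grep as the same pattern is to count matching lines in a small sample file. The file contents and variable names here are made up for illustration.

```shell
f=$(mktemp)   # hypothetical sample file
printf '%s\n' 'a ... b' 'a...b' 'a x b' > "$f"

bracket=$(grep -cE ' [.]+ ' "$f")    # quoted, period in a bracket expression
escaped=$(grep -cE ' \.+ ' "$f")     # quoted, period escaped for grep
unquoted=$(grep -cE \ \\.+\  "$f")   # no quotes: spaces and backslash escaped for the shell
dot_any=$(grep -cE ' .+ ' "$f")      # unescaped period: matches any character

echo "$bracket $escaped $unquoted $dot_any"
rm -f "$f"
```

The first three agree, while the unescaped period also matches the 'a x b' line.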

Can OR expressions be used in ${var//OLD/NEW} replacements?

I was testing some string manipulation stuff in a bash script and I've quickly realized it doesn't understand regular expressions (at least not with the syntax I'm using for string operations), then I've tried some glob expressions and it seems to understand some of them, some not. To be specific:
FINAL_STRING=${FINAL_STRING//<title>/$(get_title)}
is the main operation I'm trying to use and the above line works, replacing all occurrences of <title> with $(get_title) on $FINAL_STRING... and
local field=${1/#*:::/}
works, assigning field the value of $1 with everything from the beginning to the first occurrence of ::: replaced by nothing (removed). However, # does what I'd expect ^ to do. Plus, when I tried to use the {,,} glob expression here:
FINAL_STRING=${FINAL_STRING//{<suffix>,<extension>}/${SUFFIX}}
to replace any occurrence of <suffix> OR <extension> by ${SUFFIX}, it does not work.
So I see it doesn't take regex and it also doesn't take glob patterns... so what does it take? Is there an exhaustive listing of what symbols/expressions are understood by plain bash string operations (particularly substring replacement)? Or are *, ?, #, ##, % and %% the only valid stuff?
(I'm trying to rely only on plain bash, without calling sed or grep to do what I want)
The gory details can be found in the bash manual, Shell Expansions section. The complete picture is surprisingly complex.
What you're doing is described in the Shell Parameter Expansion section. You'll see that the pattern in
${parameter/pattern/string}
uses the Filename Expansion rules, and those don't include Brace Expansion - that is done earlier when processing the command line arguments. Filename expansion "only" does ?, * and [...] matching (unless extglob is set).
But parameter expansion does a bit more than just filename expansion, notably the anchoring you noticed with # or %.
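Since extglob is mentioned: with it enabled, the @(a|b) pattern gives you the OR the question asks about. A minimal sketch, assuming bash and using made-up values for the two variables:

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables @(a|b) style alternation in patterns

SUFFIX='_v2'                                 # hypothetical values
FINAL_STRING='report<suffix>.<extension>'

# Replace every occurrence of <suffix> OR <extension> with $SUFFIX.
FINAL_STRING=${FINAL_STRING//@(<suffix>|<extension>)/${SUFFIX}}

echo "$FINAL_STRING"
```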
bash does in fact handle regex, specifically through the [[ =~ ]] operator. After a successful match, the matched text and the capture groups are available in the array variable BASH_REMATCH, which you can then assign to other variables. It's funky, but it works.
See: http://www.linuxjournal.com/content/bash-regular-expressions
Note this is a bash-only hack feature.
For code that works in shells besides bash as well, the old school way of doing something like this is indeed to use #/##/%/%% along with a loop around a case statement (which supports basic * glob matching).
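For the [[ =~ ]] route, a minimal sketch could look like this. It assumes bash; the input string and variable names are made up, with ::: borrowed from the question as the separator.

```shell
#!/usr/bin/env bash
record='title:::My Document'      # hypothetical input using the ::: separator
re='^([^:]+):::(.*)$'             # ERE kept in a variable to avoid quoting surprises

if [[ $record =~ $re ]]; then
  key=${BASH_REMATCH[1]}          # first capture group
  value=${BASH_REMATCH[2]}        # second capture group
  echo "key=$key value=$value"
fi
```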

Using GREP and regular expressions to search for multiple strings

I have spent the past few hours trying to get a regular expression string right and have had no luck. The string's function would be to search through a file list and pull the ones which have any of the following in them: (OL####, DE####, DEA####, OLA####). Thus far I have gotten the following to sort of work.
grep "\<[DE\b|DEA\b|OL\b|OLA\b]\+[0-9]"
However it still finds things such as "E1" and pulls those lines out. What am I missing? I am very new to regular expressions and am trying to learn as I go.
Try this:
grep -oE '\b(OL|DE|DEA|OLA)[0-9]+\b' file
You can't use alternation inside of a character class. A character class defines a set of characters; it says "match one character specified by the class". Use a grouping construct, (...), instead.
I would try the following to match the lines:
grep -E '\b(DEA?|OLA?)[0-9]+'
If you only want the substring, use the following:
grep -Eo '\b(DEA?|OLA?)[0-9]+'
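The difference between a bracket expression and a group shows up on a few sample strings. This is a sketch with made-up test strings; \b is assumed to be supported, as it is in GNU grep.

```shell
samples=$(printf '%s\n' 'OL1234' 'DEA42' 'E1' 'OLA7')   # hypothetical test strings

# Bracket expression: any ONE of the characters D, E, A, O, L or |, repeated
class_hits=$(printf '%s\n' "$samples" | grep -Ec '[DE|DEA|OL|OLA]+[0-9]')

# Group with alternation: one of the whole prefixes DE, DEA, OL, OLA
group_hits=$(printf '%s\n' "$samples" | grep -Ec '\b(DEA?|OLA?)[0-9]+')

echo "class: $class_hits lines, group: $group_hits lines"
```

The bracket version also counts the E1 line, which is exactly the symptom described in the question.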
You need to replace your square brackets with round ones and remove the +:
grep -P "<(DE|DEA|OL|OLA)[0-9]"
Also note that angle brackets don't need escaping. I'm assuming you intended to have the < there, since it's not in your example strings.

Making regular expressions look nice in shell scripts

I often use grep and sed in my bash scripts.
For example, I use a script to remove comments from a template
In this example the comments look like:
/*# my comments contain text and ascii art:
*#
*# [box1] ------> [box2]o
*#
#*/
My sed chain to remove these lines looks like:
sed '/^\/\*#/d' | sed '/^\s*\*#/d' | sed '/^\s*#\*\//d'
In my scripts, I have to escape chars such as \ and /, which makes the code less readable. Therefore, my question is: How can I write nice-to-read regular expressions for sed in bash scripts?
One way I can think of is using another separator instead of /, as in vim, where you can natively use %s#search/text#replace/text#gc (using # as the separator) and therefore allow / as an unescaped character. Defining an alternative escape char would also help. I would be interested in how you solve this problem. I am also open to alternative tools in case you think it is only a sed problem.
You can specify different separators, as detailed here.
Note that Perl allows you to do this too, along with splitting your regexp across several lines for better readability.
I think trying to make regex (which a lot of times is a sequence of symbols) nice to read is pretty hard.
However there are a few things you can do:
1. Use -r (or -E in some systems) so that you don't have to escape regex operators (), {}, +, ?
2. Use alternative separators, e.g. for the s command
sed 's#regex#replacement#' file
For addresses (you'll need a leading '\')
sed '\#pattern# d' file
3. Leave spaces between the address and the command (like d above).
4. Leave comments explaining what the regex matches (you can even include an example).
Items 3 and 4 are more of an indirect approach, but they should help.
Anyway what you are doing can be done in a single sed expression:
sed '\:^/\*#:,\:^#\*/: d' file
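Here is that single expression run against a small sample, as a sketch. The input lines are made up, modelled on the comment block from the question.

```shell
input=$(printf '%s\n' 'keep 1' '/*# my comment' '*# [box1] ------> [box2]' '#*/' 'keep 2')

# \: opens an address that uses : as its delimiter, so / needs no escaping.
# The range runs from the line starting with /*# to the line starting with #*/.
kept=$(printf '%s\n' "$input" | sed '\:^/\*#:,\:^#\*/: d')

printf '%s\n' "$kept"
```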
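In addition to using alternative separators, you may use extended regular expressions where appropriate (sed -E, or -r in GNU sed). They invert the escaping rules for parentheses, braces, + and ?: these have their special meaning when unescaped, and you write e.g. \( and \) to match them literally. Square brackets behave the same in both syntaxes.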

Replacing .at() with [] throughout my code

I'm a C++ user and have some code that uses .at() to get bounds checking on the STL vectors. Now I'd like to change them to standard []. Does anyone know of a script that could do this? It doesn't have to be a super general script (most of the cases are .at(i) or perhaps .at(a*i+j)), but there are too many of them to do by hand.
Use this Perl operator:
s/\.at\(([^)]+)\)/[$1]/g
The s/// operator in Perl is a "substitute" (find/replace). In the first set of //, you specify the regular expression to match. The second // is the text to replace or substitute that match with.
In this case, I'm finding any instance of ".at(anything-but-a-close-paren)" and replacing it with "[what-was-in-those-parens]".
As a one-liner,
perl -pe's/\.at\(([^)]+)\)/[$1]/g' in.cpp > out.cpp
If you use Visual Studio, do this in the Find/Replace prompt:
Find What: \.at\({[^)]+}\)
Replace with: \[\1\]
Enable Regular Expressions and you're good to go.
sed -i 's,\.at(\([^\)]*\)),[\1],g' *.h *.cpp
should work for most simple expressions. However, if you use parentheses inside the parameter to at(), this will not work.
grep 'at(.*).*)' *.h *.cpp
helps you to identify these cases and convert them before running said sed script.
P.S. Keep a backup around (e.g. via a VCS) if you let sed operate in-place like here.
EDIT: Should have tested that sed script before posting. Fixed now, and tested.
sed -e 's/\.at(\([^)]*\))/\[\1]/g'
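To get a feel for what the sed substitution does and where it breaks, here is a sketch on three made-up C++ lines. The third one has parentheses inside the argument, the case the answer warns about.

```shell
# Hypothetical C++ lines; the last one nests parentheses inside at().
src=$(printf '%s\n' 'x = v.at(i);' 'y = m.at(a*i+j);' 'z = v.at(f(i));')

out=$(printf '%s\n' "$src" | sed -e 's/\.at(\([^)]*\))/[\1]/g')

printf '%s\n' "$out"
```

The first two lines come out as v[i] and m[a*i+j], but the nested call is mangled into v[f(i]), so such lines need to be found and fixed by hand first.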