Special meaning of "$a$" in grep command - regex

I am trying to learn shell scripting, and I am trying different things. While I was practicing I came across grep "$a$" file1 and couldn't understand the output.
I know that single quotes take the literal meaning, while double quotes look for special meaning, so $a is treated as a variable.
I have file1 with content
#!/bin/sh
a=1
b=1
echo $a
echo $b
for I in 1 2 3 4 5 6 7 8
do
c=a
b=$a
b=$(($a+$c))
echo $b
done
grep "$a$" file1 gives me the whole file1 as output where
grep '$a$' file1 gives me output as it is supposed to give, like the line which ends in $a.
Please explain why it gives the whole file content as output when grep "$a$" is used.

To begin with, ^ and $ in a grep pattern anchor the match to the beginning and end of a line, respectively.
In
grep "$a$" file1
$a undergoes expansion because it is inside double quotes. In your case $a is presumably undefined, so "$a$" expands to $, which matches the end of every line, and you get the entire file as output. To verify this, assign some test value to a and then run grep.
a="TestValue" # Before running grep
grep "$a$" file1
I bet you'll get nothing as output.
Now if you want a literal '$' inside double quotes, you need to do
grep "\$a$" file1 # See the first $ is escaped
Above command will give you all the lines that end with $a
echo $a
b=$a
Now if you're tired of escaping you may very well use single quotes as below
grep '$a$' file1
# The first $ is a literal $, and the last one symbolizes the end of the line.
# Moreover, variable expansion doesn't take place inside single quotes.
Note : The a=1 inside the file1 has no influence on the grep result as the file1 is just an input to grep.
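The behaviour described above can be reproduced with a small sketch. It assumes a POSIX shell and grep; the temporary file and its four lines are illustrative stand-ins for file1, not the original file.

```shell
#!/bin/sh
# Hypothetical sample file standing in for file1.
tmp=$(mktemp)
printf '%s\n' 'a=1' 'echo $a' 'b=$a' 'echo $b' > "$tmp"

# With a unset, "$a$" expands to "$", which matches the end of every line.
unset a
all_lines=$(grep -c "$a$" "$tmp")

# Single quotes: grep receives $a$ and matches lines ending in a literal $a.
literal=$(grep '$a$' "$tmp")

# With a set, the pattern becomes TestValue$, which matches nothing here.
a="TestValue"
none=$(grep -c "$a$" "$tmp" || true)

echo "a unset: $all_lines lines matched; a=TestValue: $none lines matched"
printf '%s\n' "$literal"
rm -f "$tmp"
```

The assignment a=1 inside the sample file plays no role: only the value of a in the calling shell decides what pattern grep sees.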

There is no special meaning.
I don't see any grep in your code. I assume you are grepping from the command line and that the variable a is undefined. Consequently, grep "$a$" expands to grep "$" ($a expands to nothing) and grep $ matches every line since $ matches the end of line.
[update] By 'expand', I mean shell variable expansion. Because $a is between double quotes, the shell is replacing $a with the value of variable a (which is undefined). Your grep '$a$' yields the expected result because any string between single quotes is always left untouched by the shell. Try echo "$a$" vs echo '$a$'.

"$a$"
The string "$a$" is expanded to the value of $a shell variable followed by literal $ (why? see below). If the $a variable is unset, its value is interpreted as an empty string, and the result of expansion is "$" (only the dollar sign).
Why is the last dollar sign not expanded?
The basic form of parameter expansion is "${PARAMETER}". The braces are not required, if PARAMETER is not a positional parameter with more than one digit (10, 11, etc.), and if PARAMETER is not followed by a character that is not to be interpreted as part of its name. Since $ begins next parameter expansion, command substitution, or arithmetic expansion, it is not interpreted as a part of the a parameter name in our case.
Since nothing follows the last $, it doesn't conform to any kind of expansion, and is left intact.
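A minimal sketch of the name-boundary rule just described, assuming a POSIX shell. The variable names and values are only illustrative.

```shell
#!/bin/sh
a=foo
unset ab                 # make sure no variable named "ab" exists

with_braces="${a}b"      # braces end the name after "a"
without_braces="$ab"     # the shell reads the name as "ab", which is unset
trailing_dollar="$a$"    # the second $ starts no expansion, so it stays literal

printf '%s\n' "$with_braces" "[$without_braces]" "$trailing_dollar"
```

The brackets around the second value are there only to make the empty expansion visible.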
How Grep interprets the pattern
Since the match-end-of-line operator ($) matches the empty string either at the end of the string or before a newline character in the string, the pattern $ matches any line.
Therefore, grep "$a$" file will match and print all lines in file, if $a is empty.
'$a$'
Single quotes protect the string from any expansion (interpretation of special characters). That is, the string is passed to the command as is.
In the case of grep '$a$' file invocation, the pattern matches "$a" string at the end of the line. The following describes why.
The last $ means the end of the line, as we know. The rest of dollar signs are interpreted depending on the pattern. It can be interpreted either as literal dollar sign, or the end of the line.
In the following cases, $ represents the end-of-line operator. Otherwise, $ is ordinary.
If the $ is last in the pattern, as in foo$.
If the syntax bit RE_CONTEXT_INDEP_ANCHORS is set and the $ is outside a bracket expression.
If the $ precedes a close-group or alternation operator, e.g. '\(b$\)', or '\(b$\)\|a'.
Since
the first $ in our pattern is obviously not last,
you are using the basic RE_SYNTAX_GREP syntax, i.e. BRE,
and the open-group/alternation is not used,
the first dollar sign is ordinary in $a$.
When is the first $ not ordinary?
It is not ordinary, for example, in the case of extended regular expressions. That is, with the -E or -P options, or with the egrep or pcregrep commands, for instance (where the RE_CONTEXT_INDEP_ANCHORS syntax bit is set).
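The difference between the two syntaxes can be checked with a short sketch, assuming a grep that supports -E. The sample line is illustrative.

```shell
#!/bin/sh
line='echo $a'

# BRE: the first $ is ordinary, so the pattern matches a line ending in "$a".
bre=$(printf '%s\n' "$line" | grep -c '$a$' || true)

# ERE: both $ are anchors, and nothing can follow the end of a line,
# so the same pattern cannot match anything.
ere=$(printf '%s\n' "$line" | grep -E -c '$a$' || true)

echo "BRE matches: $bre, ERE matches: $ere"
```

The || true only keeps the script going when grep exits with status 1 for "no match".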
Escaping the dollar sign
Consider escaping the dollar sign even though it is interpreted as literal $ in the basic regular expressions. It will develop a good habit.
To escape the dollar sign within single quotes use a backslash: '\$a$'.
Within double quotes both the backslash and the dollar sign are special to the shell, so each must be escaped: "\\\$a\$" reaches grep as \$a$.
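To see that both quoting styles hand grep the same pattern, one can store each form in a variable and compare. This is a sketch assuming a POSIX shell; the variable names are made up for the illustration.

```shell
#!/bin/sh
single='\$a$'        # single quotes: passed through untouched
double="\\\$a\$"     # double quotes: \\ becomes \, and \$ becomes $

printf 'single quotes give: %s\n' "$single"
printf 'double quotes give: %s\n' "$double"

# Either form matches only the line that ends in a literal $a.
hits=$(printf '%s\n' 'echo $a' 'echo $b' | grep "$double")
printf '%s\n' "$hits"
```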

Related

what is '/[A-Z]/ s| |/|gp' meaning?

I am reading a sed tutorial at https://riptutorial.com/sed/example/13753/lines-matching-regular-expression-pattern.
Looks like
$ sed -n '/[A-Z]/ s| |/|gp' ip.txt
is filtering 'Add Sub Mul Div' out of the file, and convert it to 'Add/Sub/Mul/Div'
I really don't understand the regex considering I just read https://www.tldp.org/LDP/abs/html/x23170.html.
It does not even match the print syntax which is:
[address-range]/p
and is the pipe sign '|' here alternation?
Could anyone explain:
'/[A-Z]/ s| |/|gp'
in English?
Edit
I also found that the extra empty space before 's' and after '/' is allowed and does not do anything. the correct syntax should be:
[address-range]/s/pattern1/pattern2/
the syntax check of sed pattern is not strict, and confusing
-n option turns off automatic printing
sed allows to qualify commands with an address filtering, which could be regex or line addresses
for example, /foo/ d will delete lines containing foo
and /foo/ s/baz/123/ will change baz to 123 only if the line also contains foo
/[A-Z]/ matches only lines containing at least one uppercase letter
if such a line is matched:
s| |/|gp perform this substitution and print
s command allows delimiter other than / too (see Using different delimiters in sed commands and range addresses)
in this case, using | allows you to use / as a normal character instead of having to escape it
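A small sketch of the points above, assuming a POSIX sed. The two input lines are illustrative, and LC_ALL=C is set so that [A-Z] means ASCII uppercase letters only.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL

input=$(printf '%s\n' 'add sub mul div' 'Add Sub Mul Div')

# The address /[A-Z]/ selects lines with an uppercase letter; s| |/|gp then
# replaces every space with / and prints the line if a substitution happened.
with_pipe=$(printf '%s\n' "$input" | sed -n '/[A-Z]/ s| |/|gp')

# Same command with the default delimiter: the / must now be escaped.
with_slash=$(printf '%s\n' "$input" | sed -n '/[A-Z]/ s/ /\//gp')

printf '%s\n' "$with_pipe" "$with_slash"
```

The first input line is not printed at all: it fails the address, and -n suppresses the automatic print.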

How to replace all dollar signs before all variables inside a double-quoted string with sed?

I have problems replacing the variables that are inside strings in bash. For example, I want to replace
"test$FOO1=$FOO2" $BAR
with:
"test" .. FOO1 .. "=" .. FOO2 .. "" $BAR
I tried:
sed 's/\$\([A-Z0-9_]\+\)\b/" .. \1 .. "/g'
But I don't want to replace variables the same way outside of double-quoted strings, e.g. like:
if [ $VARIABLE = 1 ]; then
Has to be replaced by just
if VARIABLE then
Is there a way to replace only inside of double-quotes?
Background:
I want to convert a bash script into Lua script.
I am aware, that it will not be easily possible to convert all possible shell scripts this way, but what I want to achieve is to replace all basic language constructs with Lua commands and replace all variables and conditionals. An automation here will save much work when translating bash into Lua by hand
This with GNU awk for multi-char RS, RT, and gensub() shows one way to separate and then manipulate quoted (in RT) and unquoted (in $0) strings as a starting point:
$ cat tst.awk
BEGIN { RS="\"[^\"]*\""; ORS="" }
{
$0 = gensub(/\[\s+[$]([[:alnum:]_]+)\s+=\s+\S+\s+];/,"\\1","g",$0)
RT = gensub(/[$]([[:alnum:]_]+)"/,"\" .. \\1","g",RT)
RT = gensub(/[$]([[:alnum:]_]+)/,"\" .. \\1 .. \"","g",RT)
print $0 RT
}
$ awk -f tst.awk file
"count: " .. FOO .. " times " .. BAR
if VARIABLE then
The above was run on this input file:
$ cat file
"count: $FOO times $BAR"
if [ $VARIABLE = 1 ]; then
NOTE: this approach of matching strings with regexps will always just be a best effort based on the samples provided, you'd need a shell language parser to do the job robustly.
bash lexer for shell!?
I'm so sorry: I just post this answer to warn you about a wrong way!
Reading a language is a job for a consistent lexer, not for sed or any regex-based tool!!!
See GNU Bison, Berkeley Yacc (byacc).
You could have a look at bash's sources in order to see how scripts are read!
Persisting in this way will quickly lead to a big script, and then to unsolvable problems.
using group and recursive
sed -e ':a' -e 's/^\(\([^"]*\("[^"]*"\)*\)*\)\("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\)/\1\4 .. \5 .. /;t a'
isolate in string from previous part with
^\(\([^"]*\("[^"]*"\)*\)*\) in group 1
select the var content in the isolated string with \("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\) in group 4 (prefix) and 5 (var name)
change it as you want with \1\4 .. \5 ..
repeat this operation while a change is occurring, using :a and t a
with a gnu sed you can reduce the command to (no -e needed to target the label a):
sed ':a;s/^\(\([^"]*\("[^"]*"\)*\)*\)\("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\)/\1\4 .. \5 .. /;t a'
This assumes there is no escaped quote inside the strings. If there is, a first pass is needed to replace them, and they must be put back after the main modification.
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([^" ]*) /\1" .. \3 .. " /;ta;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([^"]*)"/\1" .. \3/;ta' file
When changing things within double quotes, first we must sail past any double quoted strings that do not need changing. This means anchoring the regexp to the start of the line using the ^ metacharacter and iterating the regexp until all cases cease to exist.
First, eliminate zero or more characters which are not double quotes from the start of the line.
Second, eliminate double quoted strings which do not contain the character of interest (TCOI) i.e. $, followed by zero or more characters which are not double quotes, zero or more times.
Third, eliminate double quotes followed by zero or more characters which are not double quotes or TCOI i.e. $.
The following character (if it exists) must be TCOI. Group the entire collection of strings before in a back reference \1.
Following TCOI, one or more conditions may be grouped. In the above example the first condition is when a variable (beginning with TCOI) is followed by a space. The second condition is when the variable is followed directly by ". This entails two substitution commands; the ta command branches to the label a when a substitution was successful.
N.B. The if [ $VARIABLE = 1 ]; then situation can be treated in the same vein; here the [ is the opening double quote and the ] is the closing double quote.
P.S. TCOI was $ and this is also a metacharacter in regexp that represents the end of a line, so it must be quoted, e.g. \$
P.P.S. Don't forget to quote the ['s and ]'s too. If quoting's not your thing, then enclose the character in [x] where x is the character to be quoted.
EDIT:
sed -E ':a;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([[:alnum:]]*)/\1" .. \3 .. "/;ta' file
Since the original example has been replaced by the OP, here is a solution based on the new example.
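As a sketch of how the loop in the EDIT behaves, the command can be run on the two sample lines used earlier in this thread. It is split into separate -e expressions here only as a portability assumption (some non-GNU seds do not accept a label followed by a semicolon); the regex itself is unchanged.

```shell
#!/bin/sh
input=$(printf '%s\n' '"count: $FOO times $BAR"' 'if [ $VARIABLE = 1 ]; then')

# :a marks the loop start; ta jumps back to it after each successful
# substitution, so one $NAME inside double quotes is rewritten per pass.
result=$(printf '%s\n' "$input" |
  sed -E -e ':a' \
      -e 's/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([[:alnum:]]*)/\1" .. \3 .. "/' \
      -e 'ta')

printf '%s\n' "$result"
```

The second line contains no double quote, so the anchored pattern never matches it and it passes through unchanged. Note that [[:alnum:]] does not include the underscore, so a name such as FOO_1 would be cut at the underscore.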

Grep with reg ex

Trying to use regex with grep in the command line to give me lines that start with either a whitespace or lowercase int followed by a space. From there, they must end with either a semi colon or a o.
I tried
grep ^[\s\|int]\s+[\;\o]$ fileName
but I don't get what I'm looking for. I also tried
grep ^\s*int\s+([a-z][a-zA-Z]*,\s*)*[a-z]A-Z]*\s*;
but nothing.
Let's consider this test file:
$ cat file
 keep marco
polo
int keep;
int x
If I understand your rules correctly, two of the lines in the above should be kept and the other two discarded.
Let's try grep:
$ grep -E '^(\s|int\s).*[;o]$' file
 keep marco
int keep;
The above uses \s to mean space. \s is supported by GNU grep. For other greps, we can use a POSIX character class instead. After reorganizing the code slightly to reduce typing:
grep -E '^(|int)[[:blank:]].*[;o]$' file
How it works
In a Unix shell, the single quotes in the command are critical: they stop the shell from interpreting or expanding any character inside the single quotes.
-E tells grep to use extended regular expressions. This reduces the need for backslashes.
Let's examine the regular expression, one piece at a time:
^ matches at the beginning of a line.
(\s|int\s) This matches either a space or int followed by a space.
.* matches zero or more of any character.
[;o] matches any character in the square brackets which means that it matches either ; or o.
$ matches at the end of a line.
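The pieces listed above can be tried together in a short sketch. It uses the POSIX class [[:blank:]] in place of \s so that it does not depend on GNU grep, and it assumes the first test line begins with a space, as the rule "starts with a whitespace" requires.

```shell
#!/bin/sh
# Four illustrative lines; only the first and third satisfy both rules.
input=$(printf '%s\n' ' keep marco' 'polo' 'int keep;' 'int x')

# ^(blank | "int" blank) ... then anything ... then ; or o at the end.
kept=$(printf '%s\n' "$input" |
  grep -E '^([[:blank:]]|int[[:blank:]]).*[;o]$')

printf '%s\n' "$kept"
```

"polo" ends in o but starts with neither a blank nor "int ", and "int x" starts correctly but ends in x, so both are dropped.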

Egrep expression: how to unescape single quotes when reading from file?

I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys comes usually in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?
set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] is a bracket expression matching one of the characters listed inside it, and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME_VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
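A quick way to check both claims is to count matches against a single index-style line. This is a sketch assuming a grep with -E; the line number 12 in the sample is made up.

```shell
#!/bin/sh
# One illustrative index line; inside double quotes \$ yields a literal $.
line="\$ERROR['SOME_VAR'] --> 12"

# Unescaped: ^ followed by the end-of-line anchor $, then ERROR. Never matches.
unescaped=$(printf '%s\n' "$line" | grep -E -c '^$ERROR' || true)

# Escaped: grep receives ^\$ERROR\['SOME_VAR'] and matches the literal text.
escaped=$(printf '%s\n' "$line" | grep -E -c "^\\\$ERROR\['SOME_VAR']" || true)

echo "unescaped pattern: $unescaped match(es); escaped pattern: $escaped match(es)"
```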
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools for legacy reasons are very sloppy.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
It passes the string in through the environment (because -v is sloppy: it interprets backslash escape sequences in the value) and then matches literally against the string from the start of the input. Note that awk string positions start at 1.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
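A self-contained variant of the same idea is sketched below. It is not the script above: it swaps in plain awk with index($0, k) == 1 as the "starts with" test, and the temporary directory, file contents, and the variable name key are all illustrative assumptions.

```shell
#!/bin/sh
dir=$(mktemp -d)

# Keys to look up, written with %s so that backslashes stay literal.
printf '%s\n' "\$ERRORS['SOME_VAR']" '\E and \Q' > "$dir/list"

# Illustrative index file.
printf '%s\n' 'hello world' \
  "\$ERRORS['SOME_VAR'] = 'foo';" \
  '\E and \Q are valid strings' > "$dir/index"

matches=$(
  while IFS= read -r key; do
    # The key travels through the environment, so awk never parses it as
    # a regex or as an escape sequence. index() returns 1 only when the
    # line begins with the key.
    key="$key" awk 'BEGIN { k = ENVIRON["key"] } index($0, k) == 1' "$dir/index"
  done < "$dir/list"
)

printf '%s\n' "$matches"
rm -rf "$dir"
```

An empty key matches nothing here, because index() returns 0 for an empty search string.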
You can use printf "%q":
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
Update: You can also do:
while read -r var_name; do
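grep -P "^\Q$var_name\E" index
done < list
Here \Q and \E make the string between them literal, removing all special meaning of regex symbols. They are Perl-compatible regex syntax, so this needs a grep with -P support (e.g. GNU grep); egrep does not understand them.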

Why this single and double quote make so much difference in output

Why outputs of these two commands differ?
cat config.xml|perl -ne 'print $1,"\n" if /([0-9\.]+):161/'
cat config.xml|perl -ne "print $1,"\n" if /([0-9\.]+):161/"
The first works as expected, printing the matched group, while the second prints the whole line.
I see two main things wrong with your command.
First off, double quotes allow shell interpolation, and $1 will be taken for a shell variable and replaced. Since it is unlikely to be set, it will be replaced with an empty string. So instead of print $1, you get print, which is shorthand for print $_, and is probably why the entire line prints.
Second, you have unescaped double quotes inside your command, so your program is in fact assembled from three pieces:
print ,
\n
if /(....)/
As for why or how this works with your shell, I don't know, since I do not have access to your OS, nor know which one it is. In Windows, I get a Perl bareword warning for n (Unquoted string "n" may clash with future reserved word at -e line 1.) which means that the \n is interpreted as a string. Now, here's the tricky part. What we get is this:
print , \n if /.../
Which means that \n is no longer an argument to print, it is a statement that comes after print and it is in void context, so it gets ignored. We can see this by this warning (which I had to fake in my shell):
Useless use of single ref constructor in void context at -e line 1.
(Note that you do not get these warnings as you do not use warnings -- the -w switch)
So what we are left with is
print if /.../
Which is exactly the code for the behaviour you described: It prints the whole line when a match is found.
What you can do to visualize the problem in your shell is add the -MO=Deparse switch to your one-liner, as shown here:
C:\perl>perl -MO=Deparse -ne"print ,"\n" if /a/"
LINE: while (defined($_ = <ARGV>)) {
print($_), \'n' if /a/;
}
-e syntax OK
Now we can clearly see that the print statement is separated from the newline, and that the newline is a reference to a string.
Solution:
However, your code has other problems, and if done right you can avoid all the shell difficulties. First, you have a UUOC (Useless Use of Cat). A file argument can be given to perl when using the -n switch on the command line. Secondly, you do not need to use variables for this, you can simply print the return value of your regex:
perl -nlwe 'print for /(...)/' config.xml
The -l switch will handle newlines for you, and in this case add newline to the print. The for is necessary to avoid printing empty matches.
Inside double quotes, some things are substituted ($variable, `command`, ..), while inside single quotes they are left as is.
$ echo "$HOME"
/home/falsetru
$ echo '$HOME'
$HOME
$ echo "`echo 1`"
1
$ echo '`echo 1`'
`echo 1`
Nested quotes:
$ echo ""hello""
hello
$ echo '"hello"'
"hello"
$ echo "\"hello\""
"hello"
Escape double quotes, $ to get same result:
cat config.xml | perl -ne "print \$1,\"\n\" if /([0-9\.]+):161/"
Two things:
Nested quotes.
Variables expand differently.
The first command has one string that happens to contain some double quotes. The variable is not expanded.
The second command has two double-quoted strings with an unquoted \n in between, all joined into one argument. The variable is expanded, and a POSIX shell reduces the unquoted \n to a plain n.
Let's say $1 contains "blah"
The first passes this string to perl:
print $1,"\n" if /([0-9\.]+):161/
the second, this:
print blah,n if /([0-9\.]+):161/
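What the shell actually hands to perl can be inspected without running perl at all, by storing each quoted form in a variable. This is a sketch assuming a POSIX shell with no positional parameters set; the regex is shortened to /x/ for readability.

```shell
#!/bin/sh
set --      # clear positional parameters so $1 is empty, as on a fresh command line

single='print $1,"\n" if /x/'         # passed through untouched
double="print $1,"\n" if /x/"         # $1 expands; the unquoted \n becomes n
escaped="print \$1,\"\n\" if /x/"     # escaped $ and " survive; \n stays \n

printf 'single : %s\n' "$single"
printf 'double : %s\n' "$double"
printf 'escaped: %s\n' "$escaped"
```

The escaped form prints the same text as the single-quoted one, which is why it behaves identically when given to perl -ne.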