Why do single and double quotes make so much difference in the output? - regex

Why do the outputs of these two commands differ?
cat config.xml|perl -ne 'print $1,"\n" if /([0-9\.]+):161/'
cat config.xml|perl -ne "print $1,"\n" if /([0-9\.]+):161/"
The first works as expected, printing out the matched group, while the second prints the whole line.

I see two main things wrong with your command.
First off, double quotes allow shell interpolation, so $1 will be taken for a shell variable and replaced. Since it most likely does not exist, it is replaced with an empty string. So instead of print $1, you get a bare print, which is shorthand for print $_, and that is probably why the entire line prints.
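For instance, in bash you can see the difference directly (set -- 42 just gives the shell a positional parameter to stand in for $1 in this throwaway demo; in your actual command $1 is almost certainly unset and expands to nothing):
$ set -- 42
$ echo "print $1"
print 42
$ echo 'print $1'
print $1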
Second, you have unescaped double quotes inside your command, so you are in fact passing three strings to Perl:
print ,
\n
if /(....)/
As for why or how this works with your shell, I don't know, since I do not have access to your OS, nor know which one it is. In Windows, I get a Perl bareword warning for n (Unquoted string "n" may clash with future reserved word at -e line 1.) which means that the \n is interpreted as a string. Now, here's the tricky part. What we get is this:
print , \n if /.../
Which means that \n is no longer an argument to print; it is a statement that comes after print, in void context, so it gets ignored. We can see this from the following warning (which I had to fake in my shell):
Useless use of single ref constructor in void context at -e line 1.
(Note that you do not get these warnings as you do not use warnings -- the -w switch)
So what we are left with is
print if /.../
Which is exactly the code for the behaviour you described: It prints the whole line when a match is found.
What you can do to visualize the problem in your shell is add the -MO=Deparse switch to your one-liner, as shown here:
C:\perl>perl -MO=Deparse -ne"print ,"\n" if /a/"
LINE: while (defined($_ = <ARGV>)) {
print($_), \'n' if /a/;
}
-e syntax OK
Now we can clearly see that the print statement is separated from the newline, and that the newline is a reference to a string.
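For comparison, deparsing the properly single-quoted one-liner in a POSIX shell shows the program you intended (the exact spelling of the while header varies a little between Perl versions):
$ perl -MO=Deparse -ne 'print $1, "\n" if /a/'
LINE: while (defined($_ = <ARGV>)) {
print $1, "\n" if /a/;
}
-e syntax OK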
Solution:
However, your code has other problems, and if done right you can avoid all the shell difficulties. First, you have a UUOC (Useless Use of Cat). A file argument can be given to perl when using the -n switch on the command line. Secondly, you do not need to use variables for this, you can simply print the return value of your regex:
perl -nlwe 'print for /(...)/' config.xml
The -l switch will handle newlines for you, and in this case adds a newline to the print. The for is necessary to avoid printing empty matches.
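For example, with a made-up line of the kind config.xml presumably contains:
$ printf '<agent>10.1.2.3:161</agent>\n' | perl -nlwe 'print for /([0-9.]+):161/'
10.1.2.3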

Inside double quotes, some things are substituted ($variable, `command`, ...), while inside single quotes they are left as-is.
$ echo "$HOME"
/home/falsetru
$ echo '$HOME'
$HOME
$ echo "`echo 1`"
1
$ echo '`echo 1`'
`echo 1`
Nested quotes:
$ echo ""hello""
hello
$ echo '"hello"'
"hello"
$ echo "\"hello\""
"hello"
Escape the double quotes and the $ to get the same result:
cat config.xml | perl -ne "print \$1,\"\n\" if /([0-9\.]+):161/"

Two things:
Nested quotes.
Variables expand differently.
The first command has one string that happens to contain some double quotes. The variable is not expanded.
The second command has two strings with an unquoted \n in between. The variable is expanded.
Let's say $1 contains "blah"
The first passes this string to perl:
print $1,"\n" if /([0-9\.]+):161/
the second, this:
print blah,\n if /([0-9\.]+):161/
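If your shell is bash, you can see exactly what reaches perl with a tiny throwaway helper (showargs is just a name made up for this demo). Note that bash glues everything into a single word, since there is no unquoted whitespace between the pieces, and the unquoted \n collapses to a plain n -- the details differ from the Windows shell discussed above:
$ showargs() { printf '<%s>\n' "$@"; }
$ showargs 'print $1,"\n" if /x/'
<print $1,"\n" if /x/>
$ showargs "print $1,"\n" if /x/"
<print ,n if /x/>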

Related

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo



bar
test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo



bar
test
I am trying to replace the multiple new lines (i.e. empty lines) to a single empty new line but the regex I use for that does not work as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo

bar
test
The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.
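A quick check with made-up input (the output also ends with an extra blank line, because every output record gets "\n\n" appended; see the note about trailing newlines further down):
$ printf 'foo\n\n\n\nbar\ntest\n' | perl -00 -lpe 1
foo

bar
test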
You are executing the following program:
LINE: while (<>) {
s/\n+/\n/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at a time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, but just those found in sequence of two or more, and you want to replace the sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
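For example, with leading, embedded and trailing runs of blank lines in some made-up input:
$ printf '\n\nfoo\n\n\n\nbar\ntest\n\n\n' | perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//'
foo

bar
test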
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 }' some_random_text.txt
This solution also removes blank lines from the end of the file (a run of blank lines at the very start is reduced to a single blank line).
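The same made-up input (minus the leading blank lines) run through the flag-based one-liner:
$ printf 'foo\n\n\n\nbar\ntest\n\n\n' | perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 }'
foo

bar
test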
Given:
$ echo "$txt"
foo


bar
test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo

bar
test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In perl the idiomatic 1 liner is to use -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then set ORS= to what you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing newlines with these solutions. That may or may not be an issue.

Special meaning of "$a$" in grep command

I am trying to learn shell scripting, and I am trying different things. While I was practicing I came across grep "$a$" file1 and couldn't understand the output.
I know the difference: single quotes take everything literally, while double quotes allow the special meanings, so $a is treated as a variable.
I have file1 with content
#!/bin/sh
a=1
b=1
echo $a
echo $b
for I in 1 2 3 4 5 6 7 8
do
c=a
b=$a
b=$(($a+$c))
echo $b
done
grep "$a$" file1 gives me the whole file1 as output where
grep '$a$' file1 gives me output as it is supposed to give, like the line which ends in $a.
Please explain why it gives the whole file content as output when grep "$a$" is used.
To begin with, ^ and $ in a grep pattern symbolize the beginning and the end of a line, respectively.
In
grep "$a$" file1
$a undergoes expansion because it is inside double quotes. In your case $a should be undefined. So the result of "$a$" is $, which matches the end of every line, so you will get the entire file as output. To verify this, assign some test value to a and then run grep.
a="TestValue" # Before running grep
grep "$a$" file1
I bet you'll get nothing as output.
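You can also just let echo show you what the shell actually hands to grep (a throwaway check):
$ unset a
$ echo grep "$a$" file1
grep $ file1
$ a="TestValue"
$ echo grep "$a$" file1
grep TestValue$ file1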
Now if you want to have a literal '$' inside the double quotes, then you need to do
grep "\$a$" file1 # See the first $ is escaped
The above command will give you all the lines that end with $a:
echo $a
b=$a
Now if you're tired of escaping you may very well use single quotes as below
grep '$a$' file1
# The first $ is a literal $, and the last one symbolizes the end of the line.
# More over variable-expansion doesn't take place inside single quotes.
Note: The a=1 inside file1 has no influence on the grep result, as file1 is just input to grep.
There is no special meaning.
I don't see any grep in your code. I assume you are grepping from the command line and that the variable a is undefined. Consequently, grep "$a$" expands to grep "$" ($a expands to nothing), and grep $ matches every line since $ matches the end of line.
[update] By 'expand', I mean shell variable expansion. Because $a is between double quotes, the shell is replacing $a with the value of variable a (which is undefined). Your grep '$a$' yields the expected result because any string between single quotes is always left untouched by the shell. Try echo "$a$" vs echo '$a$'.
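With a unset, that comparison looks like this:
$ echo "$a$"
$
$ echo '$a$'
$a$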
"$a$"
The string "$a$" is expanded to the value of $a shell variable followed by literal $ (why? see below). If the $a variable is unset, its value is interpreted as an empty string, and the result of expansion is "$" (only the dollar sign).
Why is the last dollar sign not expanded?
The basic form of parameter expansion is "${PARAMETER}". The braces are not required if PARAMETER is not a positional parameter with more than one digit (10, 11, etc.), and if PARAMETER is not followed by a character that is not to be interpreted as part of its name. Since $ begins the next parameter expansion, command substitution, or arithmetic expansion, it is not interpreted as part of the parameter name a in our case.
Since nothing follows the last $, it doesn't match any kind of expansion and is left intact.
How Grep interprets the pattern
Since the match-end-of-line operator ($) matches the empty string either at the end of the string or before a newline character in the string, the pattern $ matches any line.
Therefore, grep "$a$" file will match and print all lines in file, if $a is empty.
'$a$'
Single quotes protect the string from any expansion (interpretation of special characters). That is, the string is passed to the command as is.
In the case of the grep '$a$' file invocation, the pattern matches the string "$a" at the end of a line. The following describes why.
The last $ means the end of the line, as we know. The remaining dollar signs are interpreted depending on context: each can be read either as a literal dollar sign or as the end-of-line anchor.
In the following cases, $ represents the end-of-line operator. Otherwise, $ is ordinary.
If the $ is last in the pattern, as in foo$.
The syntax bit RE_CONTEXT_INDEP_ANCHORS is set, and the $ is outside a bracket expression.
It ends an open-group or alternation operator expression, e.g. '\(b$\)', or '\(b$\)\|a'.
Since
the first $ in our pattern is obviously not last,
you are using the basic RE_SYNTAX_GREP syntax, i.e. BRE,
and the open-group/alternation is not used,
the first dollar sign is ordinary in $a$.
When the first $ is not ordinary?
It is not ordinary, for example, in the case of extended regular expressions. That is, with the -E or -P options, or with the egrep command, for instance (where the RE_CONTEXT_INDEP_ANCHORS syntax bit is set).
Escaping the dollar sign
Consider escaping the dollar sign even though it is interpreted as a literal $ in basic regular expressions. It is a good habit to develop.
To escape the dollar sign within single quotes use a backslash: '\$a$'.
Within double quotes the backslash has special meaning, so you need to escape the backslash itself: "\\\$a\$".
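Side by side, assuming file1 has the contents shown in the question, both escaped forms pass the same pattern \$a$ to grep:
$ grep '\$a$' file1
echo $a
b=$a
$ grep "\\\$a\$" file1
echo $a
b=$a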

Bash how to replace comment slashes with nothing

I have a string in bash with the following format:
// comment.
I want to obtain a new variable with the comment alone (no slashes), and I don't want to depend on the // being the first two characters in the string. How can I do this?
I have tried this:
nline=${line/%/////}
echo $nline
to use string substitution, but it doesn't work.
Perhaps you want the # substitution?
$ a='// this is a comment'
$ printf "%s\n" "${a#// }"
this is a comment
$ a='not a comment'
$ printf "%s\n" "${a#// }"
not a comment
And as SergA pointed out, a slightly better pattern for the variable expansion saves us the need for the sed solution below:
$ a="first //a comment"
$ printf "%s" "${a##*//}"
If you just want to get the comment part of a line anywhere it is you could use sed like so:
$ a="first //a comment"
$ printf "%s\n" "$a" | sed -e 's,^.*// \?,,'
a comment
which of course you could store in another variable:
nline=$(printf "%s" "$a" | sed -e 's,^.*// \?,,')
(note also that I removed the \n from the printf)
Remove first two characters:
echo ${nline:2}
% matches the end of the string. # matches the beginning of the string.
Since you said you don't want to anchor to either end, you don't want either % or # in there.
Also you need to escape / in a /-delimited pattern.
nline=${line/\/\/}
echo "$nline"
This will remove the first // from the string no matter where it is or what comes before it. So foo // comment will become foo comment, etc.
If you want to also remove any surrounding spaces from the // string then you need to do a bit more work and can't so easily use string substitution for it.
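For example (note the two spaces left behind where the // was removed):
$ line='foo // comment'
$ nline=${line/\/\/}
$ echo "$nline"
foo  comment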

bash substitutions don't appear to work when matching newline characters

I have an executable file called test.script containing this simple bash script:
#!/bin/bash
temp=${1//$'\n'/}
output=${temp//$'\r'/}
printf "$output" > output.txt
When I run
sudo ./test.script "^\r\r\n\n\r\n\r\n\r\n\r\n\n\r\n\rHello World\n\n\r\r\r\n\n\r\r\r$"
in the same directory as test.script, I expect to end up with an output.txt looking like this:
^Hello World$
but when I take a look I instead see this:
^
Hello World
$
Clearly I have a misunderstanding about regex in bash.
Please explain to me what I am missing, then show me how to write the bash so that all newline characters are removed from the string before said string is written to a file. Thanks in advance.
You can "fix" your script like this (although I must say this isn't typical usage of printf):
#!/bin/bash
temp=${1//'\n'/}
output=${temp//'\r'/}
printf "$output"
The argument to your script $1 doesn't contain real newlines or carriage returns, which is what $'\n' and $'\r' are for. Instead, it looks like you just want to remove the literal strings '\n' and '\r'.
To elaborate on my point about printf, normally two (or more) arguments are passed: the format specifier and the variables that are to be inserted. For example, to print a single string you would use something like printf '%s' "$output". In your script, the variable $output is being treated as the format specifier; you're relying on printf to expand your \n and \r into newlines and carriage returns.
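You can see the difference by passing text once as the format string and once as an argument to %s:
$ printf 'one\ntwo\n'
one
two
$ printf '%s\n' 'one\ntwo'
one\ntwo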
You're not actually using regular expressions here by the way; the syntax ${var//match/replace} is a substring replacement, where // means that all occurrences of the substring match in $var are replaced. As you haven't specified anything to replace the substring with, the substring is replaced with nothing (i.e. removed).
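If the argument really did contain newline characters (which you can create in bash with $'...'), your original substitutions would work as intended:
$ s=$'Hello\n\nWorld'
$ echo "$s"
Hello

World
$ echo "${s//$'\n'/}"
HelloWorld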

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate on why I can't just use multiple regexes: in the end I will have a regex with multiple of these groups (pictured below), so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (as they are all variable length).
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash afterwards to read them. (And this way it only works if the groups themselves do not contain any spaces - otherwise we would have to use another delimiter character and adjust read by setting $IFS to it.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively, a \b (word boundary) would have worked here, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Alternatively, you can add < grep-result-test.txt after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedily, so we have to insert a space into our regexp to avoid it grabbing into the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
The p flag at the end of the command says to print out the pattern space after the replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
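Just to illustrate the read -a step on its own, with one made-up line of the kind sed produces:
$ echo '17875 11 /tmp/FlashXXu8pvMg' | { read -a array; echo "/proc/${array[0]}/fd/${array[1]} -> ${array[2]}"; }
/proc/17875/fd/11 -> /tmp/FlashXXu8pvMg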
This isn't possible using grep or another tool called from a shell prompt/script, because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the match group.
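A minimal illustration with the string from the top of the question (keeping the regex in a variable sidesteps most quoting issues):
$ var="text more text and yet more text"
$ re='yet more (text)'
$ [[ $var =~ $re ]] && echo "${BASH_REMATCH[1]}"
text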
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want), and matches the left operand against it. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop; a simple if read; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"