Bash - use variable in perl regex together with matching groups - regex

this is my first post on stackoverflow, please forgive me if I missed something important.
I am currently stuck with the follwing issue. The goal is, to replace port numbers dynamically based on a filelist I prepared with find. All of the ports in those files, start with the number "4" and have 5 digits.
Now the tricky part, I am replacing only digit #2 and #3, and keep positions 1, 4 and 5. Examples:
old port in file: 40380, 40381
new port in file: 41580, 40381
I am working on Sun Solaris 5.10 therefore I prefer perl for inline replacements
Finally the key question: how can I combine $1 (group 1) + $PIN_PINNO + $3 (group 3) so that the result would be: 41580
NEW_PINNO=15
LOGI=$HOME/filelist.txt
# port replacement
for file in `cat $LOGI`
do
perl -pe 's/[\:\>\=]\s*(4)(\d{2})(\d{2})\b/$1${NEW_PINNO}$3/g' $file
done
many thanks in advance

perl -pse 's/ [:>=]\s* \K (\d)\d\d(\d\d) \b/$1$pin$2/gx' -- -pin="$new_pinno" file
Your regex will match the [colon, greater than, equal sign] and the spaces, but you don't include them in the substitution. I'm using the \K directive to match those characters but then forget about them (ref: http://perldoc.perl.org/perlre.html#Lookaround-Assertions)
I'm using the -s option to enable "rudimentary switch parsing" to pass the shell variable into perl without playing quoting games. (ref: http://perldoc.perl.org/perlrun.html)
Testing
new_pinno=15
perl -pse 's/ [:>=]\s* \K (\d)\d\d(\d\d) \b/$1$pin$2/gx' -- -pin="$new_pinno" <<END
var1=40380
var2=40381
END
var1=41580
var2=41581
Notes
you should not use ALL_CAPS_VARNAMES in the shell, leave those to be reserved by the shell. One day, you'll use PATH=something and then wonder why your script is broken.
and #123's comment is valid. This is the safe way to read lines from a file:
while read -r file; do
perl ... "$file"
done < "$LOGI"
ref: http://mywiki.wooledge.org/DontReadLinesWithFor

perl -pe 's/[\:\>\=]\s*(4)(\d{2})(\d{2})\b/$1'${NEW_PINNO}'$3/g' $file
is ok.

The difference between single and double quotes is how bash treats its variables. In single quotes it won't expand them, but when enclosed in double quotes it will. You can open & close quotes as much as you want in a command line argument. It's only if there is a space that is outside of a pair of quotes (single or double) that determines the start of a new argument.
So you close the single quote, have the bash variable which will be expanded and then re-open the single quote. Enclosing the variable in double quotes ensures that if there are any spaces in the variable they'll not split the argument.
perl -pe 's/[\:\>\=]\s*(4)(\d{2})(\d{2})\b/$1'"${NEW_PINNO}"'$3/g' $file

Related

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in perl in command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but its just returning me:
![0-9]+\.[0-9]+\.: event not found
i've tried it in regex101 (PCRE) and it works fine, so im not sure why it doesn't work here
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilĂ :
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
Please see the demo here.
If you want to "increment" a number then you can't hard-code the new value but need to capture what is there and increment that
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here /e modifier makes it so that the replacement side is evaluated as code, and we can +1 the captured number, what is then substituted. The \K drops previous matches so we don't need to put them back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.
As an alternate to lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

BASH escaping double quotes within single quotes

I'm trying to write a bash function that would escape all double quotes within single quotes, eg:
'I need to escape "these" quotes with backslashes'
would become
'I need to escape \"these\" quotes with backslashes'
My take on it was:
Find pairs of single quotes in the input and extract them with grep
Pipe into sed, escape double quotes
Sed again the whole input and replace grep match with sedded match
I managed to get it working to the part of having correctly escaped quotes section, but replacing it in the whole input fails.
The script code copypaste:
# $1 - Full name, $2 - minified name
adjust_quotes ()
{
SINGLE_QUOTES=`grep -Eo "'.*'" $2`
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
sed -r "s|'.*'|$ESCAPED_QUOTES|g" "$2" > "$2.escaped"
mv "$2.escaped" $2
echo "Quotes escaped within single quotes on $2"
}
Random additional questions:
In the console, escaping the quote with only two backslashes works, but when code is put in the script - I need four. I'd love to know
Could I modify this code into a loop to escape all pairs of single quotes, one after another until EOF?
Thanks!
P.S. I know this would probably be easier to do in eg. python, but I really need to keep it in bash.
Using BASH string replacement:
s='I need to escape "these" quotes with backslashes'
r="${s//\"/\\\"}"
echo "$r"
I need to escape \"these\" quotes with backslashes
Here's a pure bash solution, which does the transformation on stdin, printing to stdout. It reads the entire input into memory, so it won't work with really enormous files.
escape_enclosed_quotes() (
IFS=\'
read -d '' -r -a fields
for ((i=1; i<${#fields[#]}; i+=2)); do
fields[i]=${fields[i]//\"/\\\"}
done
printf %s "${fields[*]}"
)
I deliberately enclosed the body of the function in parentheses rather than braces, in order to force the body to run in a subshell. That limits the modification of IFS to the body, as well as implicitly making the variables used local.
The function uses the read builtin to read the entire input (since the line delimiter is set to NUL with -d '') into an array (-a) using a single quote as the field separator (IFS=\'). The result is that the parts of the input surrounded with single quotes are in the odd positions of the array, so the function loops over the odd indices to do the substitution only for those fields. I use bash's find-and-replace syntax instead of deferring to an external utility like sed.
This being bash, there are a couple of gotchas:
If the file contains a NUL, the rest of the file will be ignored.
If the last line of the file does not end with a newline, and the last character of that line is a single quote, it will not be output.
Both of the above conditions are impossible in a portable text file, so it's probably OK. All the same, worth taking note.
The supplementary question: why are the extra backslashes needed in
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
Answer: It has nothing to do with that line being in a script. It has to do with your use of backticks (...) for command substitution, and the idiosyncratic and often unpredictable handling of backslashes inside backticks. This syntax is deprecated. Do not use it. (Not even if you see someone else using it in some random example on the internet.) If you had used the recommended $(...) syntax for command substitution, it would have worked as expected:
ESCAPED_QUOTES=$(echo $SINGLE_QUOTES | sed 's|"|\\"|g')
(More information is in the Bash FAQ linked above.)

Edit within multi-line sed match

I have a very large file, containing the following blocks of lines throughout:
start :234
modify 123 directory1/directory2/file.txt
delete directory3/file2.txt
modify 899 directory4/file3.txt
Each block starts with the pattern "start : #" and ends with a blank line. Within the block, every line starts with "modify # " or "delete ".
I need to modify the path in each line, specifically appending a directory to the front. I would just use a general regex to cover the entire file for "modify #" or "delete ", but due to the enormous amount of other data in that file, there will likely be other matches to this somewhat vague pattern. So I need to use multi-line matching to find the entire block, and then perform edits within that block. This will likely result in >10,000 modifications in a single pass, so I'm also trying to keep the execution down to less than 30 minutes.
My current attempt is a sed one-liner:
sed '/^start :[0-9]\+$/ { :a /^[modify|delete] .*$/ { N; ba }; s/modify [0-9]\+ /&Appended_DIR\//g; s/delete /&Appended_DIR\//g }' file_to_edit
Which is intended to find the "start" line, loop while the lines either start with a "modify" or a "delete," and then apply the sed replacements.
However, when I execute this command, no changes are made, and the output is the same as the original file.
Is there an issue with the command I have formed? Would this be easier/more efficient to do in perl? Any help would be greatly appreciated, and I will clarify where I can.
I think you would be better off with perl
Specifically because you can work 'per record' by setting $/ - if you're records are delimited by blank lines, setting it to \n\n.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
while (<>) {
#multi-lines of text one at a time here.
if (m/^start :\d+/) {
s/(modify \d+)/$1 Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.
It'll take data from STDIN via a pipe, or myscript.pl somefile.
Output is to STDOUT and you can redirect that in the normal way.
Your limiting factor on processing files in this way are typically:
Data transfer from disk
pattern complexity
The more complex a pattern, and especially if it has variable matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so packaging them doesn't make very much difference, and your limiting factor will be likely disk IO.
(If you want to do an in place edit, you can with this approach)
If - as noted - you can't rely on a record separator, then what you can use instead is perls range operator (other answers already do this, I'm just expanding it out a bit:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
if ( /^start :/ .. /^$/)
s/(modify \d+)/$1 Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
We don't change $/ any more, and so it remains on it's default of 'each line'. What we add though is a range operator that tests "am I currently within these two regular expressions" that's toggled true when you hit a "start" and false when you hit a blank line (assuming that's where you would want to stop?).
It applies the pattern transformation if this condition is true, and it ... ignores and carries on printing if it is not.
sed's pattern ranges are your friend here:
sed -r '/^start :[0-9]+$/,/^$/ s/^(delete |modify [0-9]+ )/&prepended_dir\//' filename
The core of this trick is /^start :[0-9]+$/,/^$/, which is to be read as a condition under which the s command that follows it is executed. The condition is true if sed currently finds itself in a range of lines of which the first matches the opening pattern ^start:[0-9]+$ and the last matches the closing pattern ^$ (an empty line). -r is for extended regex syntax (-E for old BSD seds), which makes the regex more pleasant to write.
I would also suggest using perl. Although I would try to keep it in one-liner form:
perl -i -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit
Or you can use redirection of stdout:
perl -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit > new_file
with gnu sed (with BRE syntax):
sed '/^start :[0-9][0-9]*$/{:a;n;/./{s/^\(modify [0-9][0-9]* \|delete \)/\1NewDir\//;ba}}' file.txt
The approach here is not to store the whole block and to proceed to the replacements. Here, when the start of the block is found the next line is loaded in pattern space, if the line is not empty, replacements are performed and the next line is loaded, etc. until the end of the block.
Note: gnu sed has the alternation feature | available, it may not be the case for some other sed versions.
a way with awk:
awk '/^start :[0-9]+$/,/^$/{if ($1=="modify"){$3="newdirMod/"$3;} else if ($1=="delete"){$2="newdirDel/"$2};}{print}' file.txt
This is very simple in Perl, and probably much faster than the sed equivalent
This one-line program inserts Appended_DIR/ after any occurrence of modify 999 or delete at the start of a line. It uses the range operator to restrict those changes to blocks of text starting with start :999 and ending with a line containing no printable characters
perl -pe"s<^(?:modify\s+\d+|delete)\s+\K><Appended_DIR/> if /^start\s+:\d+$/ .. not /\S/" file_to_edit
Good grief. sed is for simple substitutions on individual lines, that is all. Once you start using constructs other than s, g, and p (with -n) you are using the wrong tool. Just use awk:
awk '
/^start :[0-9]+$/ { inBlock=1 }
inBlock { sub(/^(modify [0-9]+|delete) /,"&Appended_DIR/") }
/^$/ { inBlock=0 }
{ print }
' file
start :234
modify 123 Appended_DIR/directory1/directory2/file.txt
delete Appended_DIR/directory3/file2.txt
modify 899 Appended_DIR/directory4/file3.txt
There's various ways you can do the above in awk but I wrote it in the above style for clarity over brevity since I assume you aren't familiar with awk but should have no trouble following that since it reuses your own sed scripts regexps and replacement text.

Find all text within square brackets using regex

I have a problem that because of PHP version, I need to change my code from $array[stringindex] to $array['stringindex'];
So I want to find all the text using regex, and replace them all. How to find all strings that look like this? $array[stringindex].
Here's a solution in PHP:
$re = "/(\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\])/";
$str = "here is \$array[stringindex] but not \$array['stringindex'] nor \$3array[stringindex] nor \$array[4stringindex]";
$subst = "$1'$2'$3";
$result = preg_replace($re, $subst, $str);
You can try it out interactively here. I search for variables beginning with a letter, otherwise things like $foo[42] would be converted to $foo['42'], which might not be desirable.
Note that all the solutions here will not handle every case correctly.
Looking at the Sublime Text regex help, it would seem you could just paste (\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\]) into the Search box and $1'$2'$3 into the Replace field.
It depends of the tool you want to use to do the replacement.
with sed for exemple, it would be something like that:
sed "s/\(\$array\)\[\([^]]*\)\]/\1['\2']/g"
If sed is allowed you could simply do:
sed -i "s/(\$[^[]*[)([^]]*)]/\1'\2']/g" file
Explanation:
sed "s/pattern/replace/g" is a sed command which searches for pattern and replaces it with replace. The g options means replace multiple times per line.
(\$[^[]*[)([^]]*)] this pattern consists of two groups (in between brackets). The first is a dollar followed by a series of non [ chars. Then an opening square bracket follows, followed by a series of non closing brackets which is then followed by a closing square bracket.
\1'\2'] the replacement string: \1 means insert the first captured group (analogous for \2. Basically we wrap \2 in quotes (which is what you wanted).
the -i options means that the changes should be applied to the original file, which is supplied at the end.
For more information, see man sed.
This can be combined with the find command, as follows:
find . -name '*.php' -exec sed -i "s/(\$[^[]*[)([^]]*)]/\1'\2']/g" '{}' \;
This will apply the sed command to all php files found.

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"