defaulting unmatched backreferences in perl or sed - regex

Is there a way to default the backreference variables $1, $2 and $3 here?
start="a" hi="1" bye="2"
start="b" bye="3"
start="c" hi="4"
I am using this command to filter out:
perl -ne 'print if s/.*start="([^"]+).*?hi="([^"]+).*?bye="([^"]+).*/$1 $2 $3/g'
a 1 2
Is there a way to generate below result :
a 1 2
b null 3
c 4 null
I also searched for ways to default a backreference variable but found no working solution. E.g., in bash we use ${var:-null} to default var to the string null.

The special number variables ($1 etc) get introduced for capture groups even if their subpatterns fail to match, when the capture groups are optional (otherwise the whole match fails if any one subpattern fails). Those without a match stay undef.
For example, if a pattern has three optional capture groups, like (...)?, then after the regex (or after the matching part in a substitution operator) there will exist all $1,$2,$3 variables, some possibly being undef if their subpattern didn't match (that ? still made those formally match, by there being zero occurrences of that pattern).
Then test each $N and, if it is undef, replace it with the desired phrase ('null' here):
perl -wnE'/
(?: start \s*=\s* "([^"]+)"\s* )?
(?: hi \s*=\s* "([^"]+)"\s* )?
(?: bye \s*=\s* "([^"]+)"\s* )? /x;
say join " ", map { $_//"null" } ($1,$2,$3)
' file
(broken over lines and spaced-out for readability)   Since each term has the same structure the pattern can be prepared far more flexibly from a list of expected words.†
For the given sample file this prints
a 1 2
b null 3
c 4 null
† This is an overkill for a specific case and in a one-liner but is useful in a more rounded script which may be used with different keyword sets, since all hard-coded input is in the definition of the input array (@w)
perl -wnE'
BEGIN {
@w = qw(start hi bye); # keywords to form a pattern with
$re = join " ",
map { q{(?:} . $_ . q{\s*=\s*"([^"]+)"\s*)?} } @w;
};
@m = /$re/x;
say join " ", map { $_//"null" } @m
' file
This prints the same for the given input file. In bash shell it can simply be copy-pasted as it stands; in other shells you may need to make it back into one line, and remove comments. (Given as a command-line program, "one"-liner, for easy testing.)
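For readers who do not use Perl, the same idea can be sketched in Python, which is a swap-in here and not part of the answer above. Optional groups that do not take part in the match come back as None, and each None is replaced with a default. The names KEYWORDS, PATTERN and extract are hypothetical, and the sketch assumes the keys appear in the listed order.

```python
import re

KEYWORDS = ["start", "hi", "bye"]  # assumed keyword set, in the order the keys appear

# One optional non-capturing group per keyword, each wrapping one capture group.
PATTERN = re.compile(
    "".join(rf'(?:{re.escape(k)}\s*=\s*"([^"]+)"\s*)?' for k in KEYWORDS)
)

def extract(line, default="null"):
    # Every group is optional, so match() always succeeds; groups whose
    # subpattern did not participate in the match are returned as None.
    groups = PATTERN.match(line).groups()
    return " ".join(default if g is None else g for g in groups)

for line in ['start="a" hi="1" bye="2"', 'start="b" bye="3"', 'start="c" hi="4"']:
    print(extract(line))
```

As in the Perl version, all hard-coded input lives in the keyword list, so a different keyword set only needs a different list.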

Something like:
$ perl -nE 'my %vals=();
while (m/(\w+)="([^"]+)"/g) { $vals{$1} = $2 }
printf "%s %s %s\n", $vals{start}, $vals{hi}//"null", $vals{bye}//"null"
' input.txt
a 1 2
b null 3
c 4 null
Splits up the input into individual key/value pairs, saves them in a hash table, and then prints out the values using the // operator, which returns the left hand argument if it's defined, otherwise the right hand argument.
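The hash-table approach carries over to other languages almost unchanged. Below is a minimal Python sketch of it (Python is used only for illustration; the function name summarize and its parameters are hypothetical). A dictionary lookup with a default plays the role of the // operator.

```python
import re

def summarize(line, keys=("start", "hi", "bye"), default="null"):
    # Collect every key="value" pair on the line into a dict.
    vals = dict(re.findall(r'(\w+)="([^"]+)"', line))
    # dict.get supplies the default for keys that never appeared on the line.
    return " ".join(vals.get(k, default) for k in keys)

for line in ['start="a" hi="1" bye="2"', 'start="b" bye="3"', 'start="c" hi="4"']:
    print(summarize(line))
```

Because the pairs are collected first and looked up afterwards, this sketch does not depend on the order of the keys within a line.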
Variation if start, hi and bye are the only keys you can have and they always appear in that order:
$ perl -ne 'm/start="([^"]+)"(?:\s+hi="([^"]+)")?(?:\s+bye="([^"]+)")?/;
printf "%s %s %s\n", $1, $2//"null", $3//"null"' input.txt
a 1 2
b null 3
c 4 null
This uses an uglier regular expression that makes the hi and bye parts optional.

This might work for you (GNU sed):
sed -E 's/^/start=null hi=null bye=null\n/ # insert a template
:a # loop name
s/(\S+=)\S+(.*\n.*)\1"(\S+)"/\3\2/ # replace lookup with value
ta # repeat till failure
s/\S+=//g # remove any template
P # print
d' file # delete debris
Insert a template and loop replacing matches with original values.
When no more matches, remove any unmatched template keys and debris from the original line.

Related

Perl regex - print only modified line (like sed -n 's///p')

I have a command that outputs text in the following format:
misc1=poiuyt
var1=qwerty
var2=asdfgh
var3=zxcvbn
misc2=lkjhgf
etc. I need to get the values for var1, var2, and var3 into variables in a perl script.
If I were writing a shell script, I'd do this:
OUTPUT=$(command | grep '^var')
VAR1=$(echo "${OUTPUT}" | sed -ne 's/^var1=\(.*\)$/\1/p')
VAR2=$(echo "${OUTPUT}" | sed -ne 's/^var2=\(.*\)$/\1/p')
VAR3=$(echo "${OUTPUT}" | sed -ne 's/^var3=\(.*\)$/\1/p')
That populates OUTPUT with the basic content that I want (so I don't have to run the original command multiple times), and then I can pull out each value using sed: VAR1 = 'qwerty', etc.
I've worked with perl in the past, but I'm pretty rusty. Here's the best I've been able to come up with:
my $output = `command | grep '^var'`;
(my $var1 = $output) =~ s/\bvar1=(.*)\b/$1/m;
print $var1
This correctly matches and references the value for var1, but it also returns the unmatched lines, so $var1 equals this:
qwerty
var2=asdfgh
var3=zxcvbn
With sed I'm able to tell it to print only the modified lines. Is there a way to do something similar in perl? I can't find the equivalent of sed's p modifier in perl.
Conversely, is there a better way to extract those substrings from each line? I'm sure I could match each line and split the contents or something like that, but was trying to stick with regex since that's how I'd typically solve this outside of perl.
Appreciate any guidance. I'm sure I'm missing something relatively simple.
One way
my @values = map { /\bvar(?:1|2|3)\s*=\s*(.*)/ ? $1 : () } qx(command);
The qx operator ("backticks") returns a list of all lines of output when used in list context, here imposed by map. (In a scalar context it returns all output in a string, possibly multiline.) Then map extracts wanted values: the ternary operator in it returns the capture, or an empty list when there is no match (so filtering out such lines). Please adjust the regex as suitable.
Or one can break this up, taking all output, then filtering needed lines, then parsing them. That allows for more nuanced, staged processing. And then there are libraries for managing external commands that make more involved work much nicer.
A comment on the Perl attempt shown in the question
Since the result of the backticks is assigned to a scalar, the operator is in scalar context and thus returns all output in a string, here multiline. Then the following regex, which replaces var1=(.*) with $1, leaves the next two lines since . does not match a newline, so .* stops at the first newline character.
So you'd need to amend that regex to match all the rest, so as to replace it all with the capture $1. But then for other variables the pattern would have to be different. Or, you could replace the input string with all three var-values, but then you'd have a string with those three values in it.
So altogether: using the substitution here (s///) isn't suitable -- just use matching, m//.
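The difference between substituting and matching can be seen in a small Python sketch (Python is a stand-in for Perl here; the sample string is the grep-filtered output from the question, and the variable names are hypothetical). In both languages "." does not match a newline by default.

```python
import re

# Stand-in for the multiline output of: command | grep '^var'
output = "var1=qwerty\nvar2=asdfgh\nvar3=zxcvbn\n"

# Substitution rewrites only the matched part; the other lines stay in the string.
substituted = re.sub(r"\bvar1=(.*)", r"\1", output)

# Matching returns just the capture; ".*" stops at the first newline.
m = re.search(r"\bvar1=(.*)", output)
value = m.group(1) if m else None

print(repr(substituted))
print(repr(value))
```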
Since in list context the match operator also returns all matches another way is
my @values = qx(command) =~ /\bvar(?:1|2|3)\s*=\s*(.*)/g;
Now being bound to a regex, qx is in scalar context and so it returns a (here multiline) string, which is then matched by regex. With /g modifier the pattern keeps being matched through that string, capturing all wanted values (and nothing else). The fact that . doesn't match a newline so .* stops at the first newline character is now useful.
Again, please adjust the regex as suitable to your real problem.
Another need came up, to capture both the actual names of variables and their values. Then add capturing parens around names, and assign to a hash
my %val = map { /\b(var(?:1|2|3))\s*=\s*(.*)/ ? ($1, $2) : () } qx(command);
or
my %val = qx(command) =~ /\b(var(?:1|2|3))\s*=\s*(.*)/g;
Now the map for each line of output from command returns a pair of var-name + value, and a list of such pairs can be assigned to a hash. The same goes for subsequent matches (under /g) in the second case.
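A rough Python counterpart of the name/value variant, given only as an illustrative sketch: re.findall with two capture groups returns a list of pairs, which dict() turns into a mapping. The sample output string is the one from the question; the variable names are hypothetical.

```python
import re

# Stand-in for the full multiline output of the command.
output = "misc1=poiuyt\nvar1=qwerty\nvar2=asdfgh\nvar3=zxcvbn\nmisc2=lkjhgf\n"

# With two capture groups, findall yields (name, value) tuples, one per match.
pairs = re.findall(r"\b(var[123])\s*=\s*(.*)", output)
val = dict(pairs)

print(val)
```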
In scalar context, s/// and s///g return a true value if a substitution was made and a false value otherwise. So you can use
print $s if $s =~ s///;
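The same "print only if something was replaced" pattern exists in Python, shown here as a hedged sketch and not as part of the answer: re.subn returns the new string together with the number of substitutions made. The helper name substitute_or_none and the sample lines are assumptions for illustration.

```python
import re

def substitute_or_none(line, pattern, replacement):
    # re.subn returns (new_string, number_of_substitutions_made).
    new_line, count = re.subn(pattern, replacement, line)
    return new_line if count else None

lines = ["misc1=poiuyt", "var1=qwerty", "var2=asdfgh"]
changed = []
for line in lines:
    result = substitute_or_none(line, r"^var1=(.*)$", r"\1")
    if result is not None:  # keep only the lines that were actually modified
        changed.append(result)

print(changed)
```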

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, i want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated (not "recurring" as in Vijayini where the character "i" is recurring in the string but not clustered together.)
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one duplicates every character in $1 and adds them to a variable which is then matched against the original string, i.e. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (notice that it does not test whether the entry is already in re; you might want to consider adding some checking. More checking considerations are in the comments):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate duplicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
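The regex-building trick can be sketched in Python as well. This is only an illustration with a hypothetical function name; unlike the awk script it removes repeated characters before building the alternation and escapes regex metacharacters, which are two of the checks the answer suggests considering.

```python
import re

def has_doubled_char(word):
    # An empty word has no characters to double.
    if not word:
        return False
    # dict.fromkeys keeps the first occurrence of each character, in order,
    # so "Viijay" produces the alternation "VV|ii|jj|aa|yy".
    alternation = "|".join(re.escape(c) * 2 for c in dict.fromkeys(word))
    return re.search(alternation, word) is not None

for row in ["Vijay 1", "Viijay 2", "Vijayini 3"]:
    if has_doubled_char(row.split()[0]):
        print(row)
```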
Ironically or exemplarily, it fails on Busybox awk, in which backreferences can be used:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
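Backreferences in basic regular expressions are specified by POSIX, so this should not be GNU-only, but check your grep and sed documentation if in doubt.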
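In a language whose regex engine supports backreferences, the test reduces to one pattern. A minimal Python sketch follows, using the sample rows from the question and assuming the name is the first whitespace-separated field; the function name is hypothetical.

```python
import re

def rows_with_doubled_letter(rows):
    # ([a-z])\1 matches a lowercase letter immediately followed by itself.
    doubled = re.compile(r"([a-z])\1")
    return [r for r in rows if r.split() and doubled.search(r.split()[0])]

rows = ["Vijay 1", "Viijay 2", "Vijayini 3"]
print(rows_with_doubled_letter(rows))
```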
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters, as that's the only character that has a meaning other than literal when it's the first character in a bracket expression (i.e. negation), so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ differently because [\] means the same as [], which is an unterminated bracket expression: [ is the start, but ] is just the first character inside the bracket expression, not the ] needed to terminate it.

How to replace all dollar signs before all variables inside a double-quoted string with sed?

I have problems replacing the variables that are inside strings in bash. For example, I want to replace
"test$FOO1=$FOO2" $BAR
with:
"test" .. FOO1 .. "=" .. FOO2 .. "" $BAR
I tried:
sed 's/\$\([A-Z0-9_]\+\)\b/" .. \1 .. "/g'
But I don't want to replace variables the same way outside of double-quoted strings, e.g. like:
if [ $VARIABLE = 1 ]; then
Has to be replaced by just
if VARIABLE then
Is there a way to replace only inside of double-quotes?
Background:
I want to convert a bash script into Lua script.
I am aware that it will not be easy to convert all possible shell scripts this way, but what I want to achieve is to replace all basic language constructs with Lua commands and replace all variables and conditionals. Automating this will save much work compared with translating bash into Lua by hand.
This with GNU awk for multi-char RS, RT, and gensub() shows one way to separate and then manipulate quoted (in RT) and unquoted (in $0) strings as a starting point:
$ cat tst.awk
BEGIN { RS="\"[^\"]*\""; ORS="" }
{
$0 = gensub(/\[\s+[$]([[:alnum:]_]+)\s+=\s+\S+\s+];/,"\\1","g",$0)
RT = gensub(/[$]([[:alnum:]_]+)"/,"\" .. \\1","g",RT)
RT = gensub(/[$]([[:alnum:]_]+)/,"\" .. \\1 .. \"","g",RT)
print $0 RT
}
$ awk -f tst.awk file
"count: " .. FOO .. " times " .. BAR
if VARIABLE then
The above was run on this input file:
$ cat file
"count: $FOO times $BAR"
if [ $VARIABLE = 1 ]; then
NOTE: this approach of matching strings with regexps will always just be a best effort based on the samples provided, you'd need a shell language parser to do the job robustly.
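The idea of separating quoted from unquoted text before rewriting can also be sketched in Python. This is an illustration of the splitting technique only, under the same best-effort caveat as above; the function name convert is hypothetical, and the three substitutions mirror the ones in tst.awk.

```python
import re

def convert(line):
    # Splitting on a capturing group keeps the quoted strings in the result:
    # even indexes hold unquoted text, odd indexes hold double-quoted strings.
    parts = re.split(r'("[^"]*")', line)
    for i, part in enumerate(parts):
        if i % 2:
            # Inside quotes: a variable right before the closing quote first,
            # then any remaining variables.
            part = re.sub(r'\$(\w+)"$', r'" .. \1', part)
            part = re.sub(r'\$(\w+)', r'" .. \1 .. "', part)
        else:
            # Outside quotes: reduce "[ $VAR = value ];" to the bare name.
            part = re.sub(r'\[\s+\$(\w+)\s+=\s+\S+\s+\];', r'\1', part)
        parts[i] = part
    return "".join(parts)

print(convert('"count: $FOO times $BAR"'))
print(convert('if [ $VARIABLE = 1 ]; then'))
```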
bash lexer for shell!?
I'm sorry: I am posting this answer only to warn you that this is the wrong approach!
Reading a language is a job for a consistent lexer, not for sed or any other regex-based tool!
See GNU Bison, Berkeley Yacc (byacc).
You could have a look at bash's sources in order to see how scripts are read!
Persisting in this way will quickly lead to a big script, and then to unsolvable problems.
using group and recursive
sed -e ':a' -e 's/^\(\([^"]*\("[^"]*"\)*\)*\)\("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\)/\1\4 .. \5 .. /;t a'
isolate the string from the preceding part with
^\(\([^"]*\("[^"]*"\)*\)*\) in group 1
select the variable in the isolated string with \("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\) in group 4 (prefix) and 5 (var name)
change it as you want with \1\4 .. \5 ..
repeat this operation while a change is occurring, using :a and t a
With GNU sed you can reduce the command to (no -e needed to target the label a):
sed ':a;s/^\(\([^"]*\("[^"]*"\)*\)*\)\("[^$"]*\)[$]\([A-Z0-9_]\{1,\}\)/\1\4 .. \5 .. /;t a'
This assumes there are no escaped quotes in the string. If there are, a first pass is needed to replace them, and they must be put back after the main modification.
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([^" ]*) /\1" .. \3 .. " /;ta;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([^"]*)"/\1" .. \3/;ta' file
When changing things within double quotes, first we must sail past any double quoted strings that do not need changing. This means anchoring the regexp to the start of the line using the ^ metacharacter and iterating the regexp until all cases cease to exist.
First, eliminate zero or more characters which are not double quotes from the start of the line.
Second, eliminate double quoted strings which do not contain the character of interest (TCOI) i.e. $, followed by zero or more characters which are not double quotes, zero or more times.
Third, eliminate double quotes followed by zero or more characters which are not double quotes or TCOI i.e. $.
The following character (if it exists) must be TCOI. Group the entire collection of strings before in a back reference \1.
Following TCOI, one or more conditions may be grouped. In the above example the first condition is when a variable (beginning with TCOI) is followed by a space. The second condition is when the variable is followed directly by ". Hence this entails two substitution commands; the ta command branches to the loop label a when the substitution was successful.
N.B. The if [ $VARIABLE = 1 ]; then situation can be treated in the same vein; here the [ is the opening double quote and the ] is the closing double quote.
P.S. TCOI was $ and this is also a metacharacter in regexp that represents the end of a line; it therefore must be quoted, e.g. \$
P.P.S. Don't forget to quote the ['s and ]'s too. If quoting's not your thing, then enclose the character in [x] where x is the character to be quoted.
EDIT:
sed -E ':a;s/^([^"]*("[^"$]*"[^"]*)*"[^"$]*)\$([[:alnum:]]*)/\1" .. \3 .. "/;ta' file
Since the original example has been replaced by the OP, here is a solution based on the new example.

Regular Expression: Capture character pattern zero or one positions from start of string

I have a series of entries, which can be represented by this string:
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"
For each entry, I need to return whether it starts with 'R' or 'D'. In order to do this, I need to ignore any character that comes before it. So, I wrote this regular expression:
for i in $my_string; do echo $i | grep -E -o "^*?[RD]"; done
However, this is only returning R or D for entries which are not preceded by a character.
How do I get this regex to return the R or D value in every case, whether there is a character in front of it or not? Keep in mind that the only thing which can be 'hard-coded' into the expression is the pattern to be matched.
It will be easy if you use sed:
sed -r 's/^.?([RD]).*$/\1/'
i.e.
for i in $my_string; do echo $i | sed -r 's/^.?([RD]).*$/\1/'; done
Update:
Here is what each part of the command means:
-r : extended regular expression, although I think -e should work but
turns out that during my testing, in order to use capturing group
in regex, I need -r. Anyway, not the main point
The script can be read as:
s/XXXX/YYYY/ : substitute from XXXX to YYYY
The "from" pattern (XXXX) means:
^ : start with
.? : zero or one occurrence of any character
( : start of group
[RD] : either R or D
) : end of group (which means the group will contain either R or D)
.* : any number of any character
$ : till the end
the "to" pattern (YYYY):
\1 : content of capture group 1 in the "from" pattern (which is the "R or D")
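The same pattern can be tried outside sed. The following Python sketch is an illustration only, with a hypothetical function name: it applies ^.?([RD]) to each entry of the string from the question and returns the captured letter (re.match already anchors at the start of the string).

```python
import re

my_string = (
    "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz "
    "D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"
)

def leading_letter(entry):
    # At most one arbitrary character, then R or D captured in group 1.
    m = re.match(r".?([RD])", entry)
    return m.group(1) if m else None

letters = [leading_letter(entry) for entry in my_string.split()]
print(letters)
```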
Use a parameter expansion to remove the prefix before using grep:
for i in $my_string; do echo ${i#[^RD]} | grep -o "^[RD]" ; done
or use a simple test without grep (since you already know that each item starts with a R or a D):
for i in $my_string; do
if [[ $i =~ ^[^D]?R ]] ; then
echo 'R'
else
echo 'D'
fi
done
This regex worked in my local tests. Please have a try:
^.?[RD]
I can't think of a way to ONLY return the letter you want. I'd have a command after to detect whether the returned string is greater than 1 character long, and if so, I'd return only the second character.
I'm not 100% sure what you are asking (I understood that you want to match only R and D at the beginning of a filename, whatever the character before it, if there is one), but I think you should use a lookbehind. In PHP you would do
$re = "/(?<=^\S|\s\S|\s)[RD]/";
$str = "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz";
preg_match_all($re, $str, $matches);
You can see the output here.
To use Perl syntax in bash you must enable it. https://unix.stackexchange.com/questions/84477/forcing-bash-to-use-perl-regex-engine
You can test your regexp here if you need https://regex101.com/r/vV3nS3/1
This does it when using the modifier 'g' for global: (^| ).?(R|D)
See the regex101 here

Multiple replacement in sed

Is there a way in sed to replace multiple captured groups with the values taken from a key-value format (delimited by =)?
Sorry, that question is confusing so here is an example
What I have:
aaa="src is $src$ user is $user$!" src="over there" user="jason"
What I want in the end:
aaa="src is over there user is jason!"
I don't want to hardcode the position of the $var$ because they could change.
sed ':again
s/\$\([[:alnum:]]\{1,\}\)\$\(.*\) \1="\([^"]*\)"/\3\2/g
t again
' YourFile
As you can see, sed is not the most pleasant tool for this kind of task ... but it can work, even with elements spread over several lines (with a few modifications), and it does not need a quick-and-dirty six-line script in a more powerful language.
Principle:
create a label (for a future goto)
search for an occurrence of a pattern between $ $, take its associated content, and replace the pattern with it, keeping the following string but dropping the definition of the pattern content
if a substitution occurred, try again by restarting at the label of the script. If not, print and process the next line
This is a quick & dirty way to solve it using perl. It could fail in some ways (spaces, escaped double quotes, ...), but it would get the job done for most simple cases:
perl -ne '
## Get each key and value.
@captures = m/(\S+)=("[^"]+")/g;
## Extract first two elements as in the original string.
$output = join q|=|, splice @captures, 0, 2;
## Use a hash for a better look-up, and remove double quotes
## from values.
%replacements = @captures;
%replacements = map { $_ => substr $replacements{$_}, 1, -1 } keys %replacements;
## Use a regex to look-up into the hash for the replacements strings.
$output =~ s/\$([^\$]+)\$/$replacements{$1}/g;
printf qq|%s\n|, $output;
' infile
It yields:
aaa="src is over there user is jason!"
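The hash look-up idea is not specific to Perl. Here is a rough Python sketch of the same steps, with the same limitations (no escaped double quotes). The function name expand is hypothetical, and leaving unknown $name$ placeholders untouched is an assumption of this sketch, not something the answers specify.

```python
import re

def expand(line):
    # Every key="value" pair on the line, in order of appearance.
    pairs = re.findall(r'(\S+)="([^"]*)"', line)
    first_key, template = pairs[0]   # the pair that holds the $name$ placeholders
    replacements = dict(pairs[1:])   # the remaining pairs supply the values
    # Replace each $name$ with its value; unknown names are left as they are.
    filled = re.sub(
        r'\$([^$]+)\$',
        lambda m: replacements.get(m.group(1), m.group(0)),
        template,
    )
    return f'{first_key}="{filled}"'

print(expand('aaa="src is $src$ user is $user$!" src="over there" user="jason"'))
```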