Recursive replacement on a variable - regex

Given this associative array:
declare -A variables=(
[prefix]='/usr'
[exec_prefix]='#prefix#'
[libdir]='#exec_prefix#/lib'
)
I would like to replace any occurrence of the pattern #([^#/]+)# (e.g. #prefix#, with prefix being the capture) with the value that's associated to the capture (e.g. /usr for prefix) in all its values, such that the substitution be performed recursively until there are no more occurrences. The steps for each key in the array would be:
1. Retrieve the value associated to it and perform (2) on it.
2. Check if there is a match of the pattern in the given string.
   - If there isn't any, return the given string.
   - If there's a match:
     1. Perform (1) on the capture and keep the result.
     2. Replace the match by the result.
     3. Perform (2) on the resulting string.
3. Drop the previous value associated to the key and associate to it the last string returned.
Whatever the approach, the desired result is:
prefix=/usr
exec_prefix=/usr
libdir=/usr/lib
Additional requirements:
Self references (e.g. prefix=#prefix#) will not occur.
If possible, use only Bash builtins.
Example in Lua:
local variables={
prefix="/usr",
exec_prefix="#prefix#",
includedir="#prefix#/include",
libdir="#exec_prefix#/lib",
random_one_to_show_off_fancy_recursion="#prefix##libdir##includedir#"
}
function replacer( variable )
return compute_value(variables[variable])
end
function compute_value( s )
return s:gsub('#([^#/]+)#',replacer)
end
local variable, value = next(variables)
while variable do
variables[variable] = compute_value(value)
print( string.format('%-39s\t%s', variable, variables[variable]) )
variable, value = next(variables,variable)
end

The (pure Bash) code below assumes that '##' is left unchanged and '#xyz#' is left unchanged when 'xyz' is not a variable. It also attempts to detect recursive variable definitions, including indirect ones (e.g. [a]=#b# [b]=#c# [c]=#a#).
# Regular expression for a string with an embedded expansion
# For a string of the form 'u#v#w', where 'u' and 'v' do not contain '#':
# u -> BASH_REMATCH[1]
# v -> BASH_REMATCH[2]
# w -> BASH_REMATCH[3]
readonly EXPANSION_RX='^([^#]*)#([^#]*)#(.*)$'
# First pass tries to expand all variables
vars_to_expand=( "${!variables[@]}" )
while (( ${#vars_to_expand[*]} > 0 )) ; do
old_vars_to_expand=( "${vars_to_expand[@]}" )
vars_to_expand=()
for var in "${old_vars_to_expand[@]}" ; do
val=${variables[$var]}
unexpanded=$val
newval=
while [[ $unexpanded =~ $EXPANSION_RX ]] ; do
newval+=${BASH_REMATCH[1]}
v=${BASH_REMATCH[2]}
unexpanded=${BASH_REMATCH[3]}
if [[ $v == "$var" ]] ; then
echo "ERROR - Expanding '#$var#' in '$var'" >&2
exit 1
elif [[ -z $v ]] ; then
# The empty string can not be a hash key (Duh!)
newval+=#$v#
else
newval+=${variables[$v]-#$v#}
fi
done
newval+=$unexpanded
if [[ $newval != "$val" ]] ; then
# An expansion has occurred.
# Update the variable value
variables[$var]=$newval
# Further expansions may be possible, so add the variable to the
# list of variables to be expanded again
vars_to_expand+=( "$var" )
fi
done
done
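For comparison, the recursive formulation from the question (resolve a name, and resolve every placeholder found in its value first) can be sketched in Python. This is a minimal sketch, not a port of the Bash loop above: `expand_all`, `resolve` and `PLACEHOLDER` are names made up for this illustration, unknown names are left untouched as in the Bash version, and circular definitions raise an error instead of exiting.

```python
import re

# Same pattern as in the question: a name between two '#', without '#' or '/'.
PLACEHOLDER = re.compile(r"#([^#/]+)#")


def expand_all(variables):
    """Return a new dict in which every #name# placeholder is expanded recursively."""
    resolved = {}

    def resolve(name, in_progress):
        if name in resolved:
            return resolved[name]
        if name in in_progress:
            # Covers direct (a -> a) and indirect (a -> b -> c -> a) cycles.
            raise ValueError(f"circular reference involving {name!r}")

        def substitute(match):
            ref = match.group(1)
            if ref not in variables:
                return match.group(0)  # unknown name: keep the placeholder as it is
            return resolve(ref, in_progress | {name})

        resolved[name] = PLACEHOLDER.sub(substitute, variables[name])
        return resolved[name]

    for key in variables:
        resolve(key, frozenset())
    return resolved


variables = {
    "prefix": "/usr",
    "exec_prefix": "#prefix#",
    "libdir": "#exec_prefix#/lib",
}
for key, value in expand_all(variables).items():
    print(f"{key}={value}")
```

Each value is computed once and cached in `resolved`, so a name referenced from several places is not expanded repeatedly.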

Related

Jenkinsfile/Groovy: how to use variables in regex pattern find-counts?

In the following declarative syntax pipeline:
pipeline {
agent any
stages {
stage( "1" ) {
steps {
script {
orig = "/path/to/file"
two_lev_down = (orig =~ /^(?:\/[^\/]*){2}(.*)/)[0][1]
echo "${two_lev_down}"
depth = 2
two_lev_down = (orig =~ /^(?:\/[^\/]*){depth}(.*)/)[0][1]
echo "${two_lev_down}"
}
}
}
}
}
...the regex is meant to match everything after the third instance of "/".
The first, i.e. (orig =~ /^(?:\/[^\/]*){2}(.*)/)[0][1] works.
But the second, (orig =~ /^(?:\/[^\/]*){depth}(.*)/)[0][1] does not. It generates this error:
java.util.regex.PatternSyntaxException: Illegal repetition near index 10
^(?:/[^/]*){depth}(.*)
I assume the problem is the use of the variable depth instead of a hardcoded integer, since that's the only difference between the working code and error-generating code.
How can I use a Groovy variable in a regex pattern find-count? Or what is the Groovy-language idiomatic way to write a regex that returns everything after the nth occurrence of a pattern?
You are missing the $ in front of your variable. It should be:
orig = "/path/to/file"
depth = 2
two_lev_down = (orig =~ /^(?:\/[^\/]*){$depth}(.*)/)[0][1]
assert '/file' == two_lev_down
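The underlying idea, building the repetition count into the pattern text before it reaches the regex engine, is not specific to Groovy. As a rough illustration in Python (the helper name `after_levels` is invented for this sketch), an f-string needs doubled braces for the literal `{` and `}` of the quantifier:

```python
import re


def after_levels(path, depth):
    """Return everything after the first `depth` '/'-separated components."""
    # {{ and }} are literal braces in an f-string, so depth=2 yields the quantifier {2}.
    pattern = rf"^(?:/[^/]*){{{depth}}}(.*)"
    match = re.match(pattern, path)
    return match.group(1) if match else None


print(after_levels("/path/to/file", 2))
```

If the interpolated value could ever come from untrusted input, converting it with `int(depth)` first would keep arbitrary text out of the pattern.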
Why?
In Groovy the String-interpolation (over GString) works for 3 String literals:
usual double quotes: "Hello $world, my name is ${name.toUpperCase()}"
Slashy-strings used usually as regexp-literals: /.{$depth}/
Multi-line double-quoted Strings:
def email = """
Dear ${user}.
Thank you for blablah.
"""

Comparing filenames and determine their incremental digits

Imagine I have a sequence of files, e.g.:
...
segment8_400_av.ts
segment9_400_av.ts
segment10_400_av.ts
segment11_400_av.ts
segment12_400_av.ts
...
When the filenames are known, I can match against the filenames with a regular expression like:
/segment(\d+)_400_av\.ts/
Because I know the incremental pattern.
But what would be a generic approach to this? I mean, how can I take two file names out of the list, compare them and find out where in the file name the counting part is, taking into account any other digits that can occur in the filename (the 400 in this case)?
Goal: What I want to do is to run the script against various file sequences to check, for example, for missing files, so this should be the first step to find out the numbering scheme. File sequences can occur in many different fashions, e.g.:
test_1.jpg (simple counting suffix)
test_2.jpg
...
or
segment9_400_av.ts (counting part in between, with other static digits)
segment10_400_av.ts
...
or
01_trees_00008.dpx (padded with zeros)
01_trees_00009.dpx
01_trees_00010.dpx
Edit 2: Probably my problem can be described more simply: With a given set of files, I want to:
1. Find out if they are a numbered sequence of files, with the rules below
2. Get the first file number, get the last file number and file count
3. Detect missing files (gaps in the sequence)
Rules:
As melpomene summarized in his answer, the file names only differ in one substring, which consists only of digits
The counting digits can occur anywhere in the filename
The digits can be padded with 0's (see example above)
I can do #2 and #3; what I am struggling with is #1 as a starting point.
You tagged this question regex, so here's a regex-based solution:
use strict;
use warnings;
my $name1 = 'segment12_400_av.ts';
my $name2 = 'segment10_400_av.ts';
if (
"$name1\0$name2" =~ m{
\A
( \D*+ (?: \d++ \D++ )* ) # prefix
( \d++ ) # numeric segment 1
( [^\0]* ) # suffix
\0 # separator
\1 # prefix
( \d++ ) # numeric segment 2
\3 # suffix
\z
}xa
) {
print <<_EOT_;
Result of comparing "$name1" and "$name2"
Common prefix: $1
Common suffix: $3
Varying numeric parts: $2 / $4
Position of varying numeric part: $-[2]
_EOT_
}
Output:
Result of comparing "segment12_400_av.ts" and "segment10_400_av.ts"
Common prefix: segment
Common suffix: _400_av.ts
Varying numeric parts: 12 / 10
Position of varying numeric part: 7
It assumes that
the strings are different (guard the condition with $name1 ne $name2 && ... if that's not guaranteed)
there's only one substring that's different between the input strings (otherwise it won't find any match)
the differing substring consists of digits only
all digits surrounding the first point of difference are part of the varying increment (e.g. the example above recognizes segment as the common prefix, not segment1)
The idea is to combine the two names into a single string (separated by NUL, which is unambiguous because filenames can't contain \0), then let the regex engine do the hard work of finding the longest common prefix (using greediness and backtracking).
Because we're in a regex, we can get a bit more fancy than just finding the longest common prefix: We can make sure that the prefix doesn't end with a digit (see the segment1 vs. segment case above) and we can verify that the suffix is also the same.
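The same join-with-NUL trick can be tried in Python, whose `re` module also supports backreferences. The sketch below is an approximation rather than a translation: it uses ordinary greedy quantifiers instead of Perl's possessive ones, `re.ASCII` stands in for the `/a` flag, and the names `PAIR` and `compare` are made up for the example.

```python
import re

PAIR = re.compile(
    r"""
    (\D*(?:\d+\D+)*)   # 1: common prefix, never ending in a digit
    (\d+)              # 2: numeric segment of the first name
    ([^\x00]*)         # 3: common suffix
    \x00               # separator between the two names
    \1                 # the same prefix again
    (\d+)              # 4: numeric segment of the second name
    \3                 # the same suffix again
    """,
    re.VERBOSE | re.ASCII,
)


def compare(name1, name2):
    """Return (prefix, number1, number2, suffix, position) or None if no single numeric field differs."""
    match = PAIR.fullmatch(f"{name1}\x00{name2}")
    if match is None:
        return None
    prefix, num1, suffix, num2 = match.groups()
    return prefix, num1, num2, suffix, match.start(2)


print(compare("segment12_400_av.ts", "segment10_400_av.ts"))
```

As in the Perl version, identical names still match, so a caller that needs to rule that out would compare the two strings first.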
See if this works for you:
use strict;
use warnings;
sub compare {
my ( $f1, $f2 ) = @_;
my @f1 = split /(\d+)/sxm, $f1;
my @f2 = split /(\d+)/sxm, $f2;
my $i = 0;
my $out1 = q{};
my $out2 = q{};
foreach my $p (@f1) {
if ( $p eq $f2[$i] ) {
$out1 .= $p;
$out2 .= $p;
}
else {
$out1 .= sprintf ' ((%s)) ', $p;
$out2 .= sprintf ' ((%s)) ', $f2[$i];
}
$i++;
}
print $out1 . "\n";
print $out2 . "\n";
return;
}
print "Test1:\n";
compare( 'segment8_400_av.ts', 'segment9_400_av.ts' );
print "\n\nTest2:\n";
compare( 'segment999_8_400_av.ts', 'segment999_9_400_av.ts' );
You basically split the strings at each run of digits, then loop through the items and compare each of the 'pieces'. If they are equal, you accumulate. If not, then you highlight the differences and accumulate.
Output (I'm using ((number)) for the highlight)
Test1:
segment ((8)) _400_av.ts
segment ((9)) _400_av.ts
Test2:
segment999_ ((8)) _400_av.ts
segment999_ ((9)) _400_av.ts
I assume that only the counter differs across the strings
use warnings;
use strict;
use feature 'say';
my ($fn1, $fn2) = ('segment8_400_av.ts', 'segment12_400_av.ts');
# Collect all numbers from all strings
my @nums = map { [ /([0-9]+)/g ] } ($fn1, $fn2);
my ($n, $pos); # which number in the string, at what position
# Find which differ
NUMS:
for my $j (1..$#nums) { # strings
for my $i (0..$#{$nums[0]}) { # numbers in a string
if ($nums[$j]->[$i] != $nums[0]->[$i]) { # it is i-th number
$n = $i;
$fn1 =~ /($nums[0]->[$i])/g; # to find position
$pos = $-[$i];
say "It is $i-th number in a string. Position: $pos";
last NUMS;
}
}
}
We loop over the array with arrayrefs of numbers found in each string, and over elements of each arrayref (eg [8, 400]). Each number in a string (0th or 1st or ...) is compared to its counterpart in the 0-th string (array element); all other numbers are the same.
The number of interest is the one that differs and we record which number in a string it is ($n-th).
Then its position in the string is found by matching it again and using the @- regex variable with (the just established) index $n, so the offset of the start of the n-th match. This part may be unneeded; while question edits helped, I am still unsure whether the position may or may not be useful.
Prints, with position counting from 0
It is 0-th number in a string. Position: 7
Note that, once it is found that it is the $i-th number, we can't use index to find its position; a number earlier in the string may happen to be the same as the $i-th one.
To test, modify input strings by adding the same number to each, before the one of interest.
Per question update, to examine the sequence (for missing files for instance), with the above findings you can collect counters for all strings in an array with hashrefs (num => filename)
use Data::Dump qw(dd);
my @seq = map { { $nums[$_]->[$n] => $fnames[$_] } } 0..$#fnames;
dd \@seq;
where @fnames contains filenames (like the two picked for the example above, $fn1 and $fn2) and @nums was built from them. This assumes that the file list was sorted to begin with; add the sort if it wasn't
my @seq =
sort { (keys %$a)[0] <=> (keys %$b)[0] }
map { { $nums[$_]->[$n] => $fnames[$_] } }
0..$#fnames;
The order is maintained by array.
Adding this to the above example (with two strings) adds to the print
[
{ 8 => "segment8_400_av.ts" },
{ 12 => "segment12_400_av.ts" },
]
With this, all goals in "Edit 2" should be straightforward.
I suggest that you build a regex pattern by changing all digit sequences to (\d+) and then see which captured values have changed
For instance, with segment8_400_av.ts and
segment9_400_av.ts you would generate a pattern /segment(\d+)_(\d+)_av\.ts/. Note that s/\d+/(\d+)/g will return the number of numeric fields, which you will need for the subsequent check
The first would capture 8 and 400 while the second would capture 9 and 400. 8 is different from 9, so it is in that region of the string where the number varies
I can't really write much code as you don't say what sort of result you want from this process
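Putting the pieces of this thread together, the three goals from "Edit 2" could look roughly like this in Python. It is a sketch under the question's stated rules (all names share one layout and exactly one digit field varies); `analyse_sequence` is a hypothetical helper name, and the pattern is generated from the first file name by turning every digit run into `(\d+)`, as suggested above.

```python
import re


def analyse_sequence(names):
    """Return (field_index, first, last, count, missing) for a numbered file sequence."""
    # Split the first name into alternating non-digit / digit pieces;
    # with a capturing split, the odd indexes are always the digit runs.
    pieces = re.split(r"(\d+)", names[0])
    pattern = "".join(
        r"(\d+)" if i % 2 else re.escape(piece) for i, piece in enumerate(pieces)
    )
    rows = []
    for name in names:
        match = re.fullmatch(pattern, name)
        if match is None:
            raise ValueError(f"{name!r} does not fit the pattern {pattern!r}")
        rows.append(match.groups())
    # A field varies if its captured text is not identical across all names.
    varying = [i for i in range(len(rows[0])) if len({row[i] for row in rows}) > 1]
    if len(varying) != 1:
        raise ValueError(f"expected exactly one varying field, found {len(varying)}")
    field = varying[0]
    numbers = sorted(int(row[field]) for row in rows)
    present = set(numbers)
    missing = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    return field, numbers[0], numbers[-1], len(numbers), missing


files = ["segment8_400_av.ts", "segment9_400_av.ts",
         "segment10_400_av.ts", "segment12_400_av.ts"]
print(analyse_sequence(files))
```

Converting the captured text with `int` is what makes zero-padded counters such as `00008` comparable with unpadded ones.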

Keep track of matches and check against condition

I have $entire_line = "if varC > 0: varB = varC + 2"
I would like my regex to find the following: varC, varB, varC in the $entire_line
These matches then need to be checked to see whether they exist in a HashMap. If so, a $ should be prepended to the match.
Hence the output should be:
"if $varC > 0: $varB = $varC + 2"
NOTE: 0 and 2 don't appear in the HashMap.
Currently, I have:
$entire_line =~ s/(\w+)/\$$1/g if (exists($variable_hash{$1}));
However, this does not work as intended as the $1 in exists($variable_hash{$1}) does not refer to the previous regex: $entire_line =~ s/(\w+)/\$$1/g
Is there a proper way to go about this?
Thanks for your help.
Use the /e modifier and put the code into the replacement part:
$entire_line =~ s/(\w+)/exists $variable_hash{$1} ? $variable_hash{$1} : $1/ge;
If I got your question correctly and you don't need to perform variable value substitution (as in @choroba's answer), but only prepend a $ character to known variables, and if %variable_hash is not very long, how about concatenating all the keys of %variable_hash with a | character to get a regex matching all known variables?
my %variable_hash = (
varA => 1,
# varB => 1, # commented out to check that it will not be replaced
varC => 1,
);
my $entire_line = "if varC > 0: varB = varC + 2;";
my $key_regex = join('|', map { quotemeta $_; } keys %variable_hash);
# $key_regex will contain "varA|varC"
$entire_line =~ s/\b($key_regex)\b/\$$1/g;
# prefix all matching substrings with $ character
print "$entire_line\n";
Also check my comment to @choroba's answer.
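Both answers rely on the same mechanism: the lookup has to happen per match, inside the substitution, not in a condition evaluated once around it. A small Python sketch of that mechanism, with a replacement callback playing the role of Perl's /e (the function name `mark_known` and the set `known` are invented for this example):

```python
import re


def mark_known(line, known):
    """Prefix every word that appears in `known` with a '$'."""
    def replace(match):
        word = match.group(0)
        # The membership test runs once per match, with that match's text.
        return "$" + word if word in known else word

    return re.sub(r"\w+", replace, line)


known = {"varA", "varB", "varC"}
print(mark_known("if varC > 0: varB = varC + 2", known))
```

Words such as `if`, `0` and `2` are matched by `\w+` as well, but they are returned unchanged because they are not in the set.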

In Perl, how many groups are in the matched regex?

I would like to tell the difference between a number 1 and string '1'.
The reason that I want to do this is because I want to determine the number of capturing parentheses in a regular expression after a successful match. According to the perlop doc, a list (1) is returned when there are no capturing groups in the pattern. So if I get a successful match and a list (1) then I cannot tell if the pattern has no parens or it has one paren and it matched a '1'. I can resolve that ambiguity if there is a difference between number 1 and string '1'.
You can tell how many capturing groups are in the last successful match by using the special @+ array. $#+ is the number of capturing groups. If that's 0, then there were no capturing parentheses.
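The point of this answer, asking the regex machinery for the group count instead of inspecting the returned values, carries over to other engines. As a comparison only (this is Python, not Perl, and `group_count` is a name made up here), Python's compiled pattern objects expose the count through the `groups` attribute:

```python
import re


def group_count(pattern):
    """Number of capturing groups in `pattern`, independent of any match result."""
    return re.compile(pattern).groups


print(group_count(r"foo"))
print(group_count(r"foo(1)?"))

# The count stays 1 even when the optional group did not take part in the match.
match = re.search(r"foo(1)?", "foo")
print(match.groups())
```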
For example, bitwise operators behave differently for strings and integers:
~1 = 18446744073709551614
~'1' = Î ('1' = 0x31, ~'1' = ~0x31 = 0xce = 'Î')
#!/usr/bin/perl
($b) = ('1' =~ /(1)/);
print isstring($b) ? "string\n" : "int\n";
($b) = ('1' =~ /1/);
print isstring($b) ? "string\n" : "int\n";
sub isstring() {
return ($_[0] & ~$_[0]);
}
isstring returns either 0 (as a result of numeric bitwise op) which is false, or "\0" (as a result of bitwise string ops, see perldoc perlop) which is true as it is a non-empty string.
If you want to know the number of capture groups a regex matched, just count them. Don't look at the values they return, which appears to be your problem:
You can get the count by looking at the result of the list assignment, which returns the number of items on the right hand side of the list assignment:
my $count = my @array = $string =~ m/.../g;
If you don't need to keep the capture buffers, assign to an empty list:
my $count = () = $string =~ m/.../g;
Or do it in two steps:
my @array = $string =~ m/.../g;
my $count = @array;
You can also use the @+ or @- variables, using some of the tricks I show in the first pages of Mastering Perl. These arrays have the starting and ending positions of each of the capture buffers. The values in index 0 apply to the entire pattern, the values in index 1 are for $1, and so on. The last index, then, is the total number of capture buffers. See perlvar.
Perl converts between strings and numbers automatically as needed. Internally, it tracks the values separately. You can use Devel::Peek to see this in action:
use Devel::Peek;
$x = 1;
$y = '1';
Dump($x);
Dump($y);
The output is:
SV = IV(0x3073f40) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PV(0x30698cc) at 0x3073484
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3079bb4 "1"\0
CUR = 1
LEN = 4
Note that the dump of $x has a value for the IV slot, while the dump of $y doesn't but does have a value in the PV slot. Also note that simply using the values in a different context can trigger stringification or numification and populate the other slots. For example, if you did $x . '' or $y + 0 before peeking at the value, you'd get this:
SV = PVIV(0x2b30b74) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x3079c5c "1"\0
CUR = 1
LEN = 4
At which point 1 and '1' are no longer distinguishable at all.
Check for the definedness of $1 after a successful match. The logic goes like this:
If the list is empty then the pattern match failed
Else if $1 is defined then the list contains all the captured substrings
Else the match was successful, but there were no captures
Your question doesn't make a lot of sense, but it appears you want to know the difference between:
$a = "foo";
@f = $a =~ /foo/;
and
$a = "foo1";
@f = $a =~ /foo(1)?/;
Since they both return the same thing regardless if a capture was made.
The answer is: Don't try and use the returned array. Check to see if $1 is not equal to ""

How to get multiple occurrence in a regular expression when a pattern contains the same pattern within itself?

I would like to get first occurrence in a regular expression, but not embedded ones.
For example, Regular Expression is:
\bTest\b\s*\(\s*\".*\"\s*,\s*\".*\"\s*\)
Sample text is
x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Result:
Test("123" , "ABC") || x == Test ("123" , "DEF")
Using any regular expression tool (Expresso, for example), I am getting the whole text as the result, as it satisfies the regular expression. Is there a way to get the result in two parts as shown below.
Test("123" , "ABC")
and
Test ("123" , "DEF")
Are you trying to parse code with regex? This is always going to be a fairly brittle solution, and you should consider using an actual parser.
That said, to solve your immediate problem, you want to use non-greedy matching - the *? quantifier instead of just the *.
Like so:
\bTest\b\s*\(\s*\".*?\"\s*,\s*\".*?\"\s*\)
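The difference is easy to see by running both patterns over the sample text. A quick check in Python (used here only as a convenient regex tool; the constant names are made up for this sketch):

```python
import re

TEXT = 'x == Test("123" , "ABC") || x == Test ("123" , "DEF")'

# Greedy: each .* grabs as much as it can, so one match swallows both calls.
GREEDY = r'\bTest\b\s*\(\s*".*"\s*,\s*".*"\s*\)'
# Lazy: each .*? stops at the first closing quote that lets the rest match.
LAZY = r'\bTest\b\s*\(\s*".*?"\s*,\s*".*?"\s*\)'

greedy_matches = re.findall(GREEDY, TEXT)
lazy_matches = re.findall(LAZY, TEXT)

print(greedy_matches)
print(lazy_matches)
```

The lazy pattern still has no notion of nesting, so a call that contains another call as an argument would not be split correctly.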
A poor man's C function parser, in Perl.
## ===============================================
## C_FunctionParser_v3.pl # 3/21/09
## -------------------------------
## C/C++ Style Function Parser
## Idea - To parse out C/C++ style functions
## that have parenthetical closures (some don't).
## - sln
## ===============================================
my $VERSION = 3.0;
$|=1;
use strict;
use warnings;
# Prototypes
sub Find_Function(\$\@);
# File-scoped variables
my ($FxParse, $FName, $Preamble);
# Set function name, () gets all functions
SetFunctionName('Test'); # Test case, function 'Test'
## --------
# Source file
my $Source = join '', <DATA>;
# Extended, possibly non-compliant,
# function name - pattern examples:
# (no capture groups in function names strings or regex!)
# - - -
# SetFunctionName( qr/_T/ );
# SetFunctionName( qr/\(\s*void\s*\)\s*function/ );
# SetFunctionName( "\\(\\s*void\\s*\\)\\s*function" );
# Parse some functions
my @Funct = ();
Find_Function( $Source, @Funct );
# Print functions found
# (segments can be modified and/or collated)
if ( !@Funct ) {
print "Function name pattern: '$FName' not found!\n";
} else {
print "\nFound ".@Funct." matches.\nFunction pattern: '$FName' \n";
}
for my $ref (@Funct) {
# Format; #: Line number - function
printf "\n\#: %6d - %s\n", $$ref[3], substr($Source, $$ref[0], $$ref[2] - $$ref[0]);
}
exit;
## End
# ---------
# Set the parser's function regex pattern
#
sub SetFunctionName
{
if (!@_) {
$FName = "_*[a-zA-Z][\\w]*"; # Matches all compliant function names (default)
} else {
$FName = shift; # No capture groups in function names please
}
$Preamble = "\\s*\\(";
# Compile function parser regular expression
# Regex condensed:
# $FxParse = qr!//(?:[^\\]|\\\n?)*?\n|/\*.*?\*/|\\.|'["()]'|(")|($FName$Preamble)|(\()|(\))!s;
# | | | |1 1|2 2|3 3|4 4
# Note - Non-Captured, matching items, are meant to consume!
# -----------------------------------------------------------
# Regex /xpanded (with commentary):
$FxParse = # Regex Precedence (items MUST be in this order):
qr! # -----------------------------------------------
// # comment - //
(?: # grouping
[^\\] # any non-continuation character ^\
| # or
\\\n? # any continuation character followed by 0-1 newline \n
)*? # to be done 0-many times, stopping at the first end of comment
\n # end of comment - //
| /\*.*?\*/ # or, comment - /* + anything + */
| \\. # or, escaped char - backslash + ANY character
| '["()]' # or, single quote char - quote then one of ", (, or ), then quote
| (") # or, capture $1 - double quote as a flag
| ($FName$Preamble) # or, capture $2 - $FName + $Preamble
| (\() # or, capture $3 - ( as a flag
| (\)) # or, capture $4 - ) as a flag
!xs;
}
# Procedure that finds C/C++ style functions
# (the engine)
# Notes:
# - This is not a syntax checker !!!
# - Nested functions index and closure are cached. The search is single pass.
# - Parenthetical closures are determined via cached counter.
# - This precedence avoids all ambiguous parenthetical open/close conditions:
# 1. Dual comment styles.
# 2. Escapes.
# 3. Single quoted characters.
# 4. Double quotes, flip-flopped to determine closure.
# - Improper closures are reported, with the last one reliably being the likely culprit
# (this would be a syntax error, ie: the code won't compile, but it is reported as a closure error).
#
sub Find_Function(\$\@)
{
my ($src, $Funct) = @_;
my @Ndx = ();
my @Closure = ();
my ($Lines, $offset, $closure, $dquotes) = (1,0,0,0);
while ($$src =~ /$FxParse/xg)
{
if (defined $1) # double quote "
{
$dquotes = !$dquotes;
}
next if ($dquotes);
if (defined $2) # 'function name'
{
# ------------------------------------
# Placeholder for exclusions......
# ------------------------------------
# Cache the current function index and current closure
push @Ndx, scalar(@$Funct);
push @Closure, $closure;
my ($funcpos, $parampos) = ( $-[0], pos($$src) );
# Get newlines since last function
$Lines += substr ($$src, $offset, $funcpos - $offset) =~ tr/\n//;
# print $Lines,"\n";
# Save positions: function( parms )
push @$Funct , [$funcpos, $parampos, 0, $Lines];
# Assign new offset
$offset = $funcpos;
# Closure is now 1 because of preamble '('
$closure = 1;
}
elsif (defined $3) # '('
{
++$closure;
}
elsif (defined $4) # ')'
{
--$closure;
if ($closure <= 0)
{
$closure = 0;
if (@Ndx)
{
# Pop index and closure, store position
$$Funct[pop @Ndx][2] = pos($$src);
$closure = pop @Closure;
}
}
}
}
# To test an error, either take off the closure of a function in its source,
# or force it this way (pseudo error, make sure you have data in @$Funct):
# push @Ndx, 1;
# It's an error if index stack has elements.
# The last one reported is the likely culprit.
if (@Ndx)
{
## BAD, RETURN ...
## All elements in stack have to be fixed up
while ( @Ndx ) {
my $func_index = shift @Ndx;
my $ref = $$Funct[$func_index];
$$ref[2] = $$ref[1];
print STDERR "** Bad return, index = $func_index\n";
print "** Error! Unclosed function [$func_index], line ".
$$ref[3].": '".substr ($$src, $$ref[0], $$ref[2] - $$ref[0] )."'\n";
}
return 0;
}
return 1
}
__DATA__
x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Test("123" , Test ("123" , "GHI"))?
Test("123" , "ABC(JKL)") || x == Test ("123" , "MNO")
Output (line # - function):
Found 6 matches.
Function pattern: 'Test'
#: 1 - Test("123" , "ABC")
#: 1 - Test ("123" , "DEF")
#: 2 - Test("123" , Test ("123" , "GHI"))
#: 2 - Test ("123" , "GHI")
#: 3 - Test("123" , "ABC(JKL)")
#: 3 - Test ("123" , "MNO")
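The core of the parser above, finding the function name and then counting parentheses until the call closes while ignoring anything inside double quotes, can be condensed into a much smaller sketch. The Python version below is a deliberately reduced illustration, not a port: `find_calls` is a made-up name, and it handles neither comments nor single-quoted characters nor names that appear inside string literals.

```python
import re


def find_calls(source, name="Test"):
    """Return the text of every call to `name`, including nested calls."""
    calls = []
    for start in re.finditer(rf"\b{re.escape(name)}\s*\(", source):
        depth, in_string, i = 1, False, start.end()
        while i < len(source) and depth:
            ch = source[i]
            if ch == "\\":
                i += 1                      # skip the escaped character
            elif ch == '"':
                in_string = not in_string   # flip-flop on double quotes
            elif not in_string:
                if ch == "(":
                    depth += 1
                elif ch == ")":
                    depth -= 1
            i += 1
        if depth == 0:                      # unclosed calls are silently dropped
            calls.append(source[start.start():i])
    return calls


SOURCE = '''x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Test("123" , Test ("123" , "GHI"))?
Test("123" , "ABC(JKL)") || x == Test ("123" , "MNO")
'''
for call in find_calls(SOURCE):
    print(call)
```

Unlike the single-pass Perl engine, this sketch rescans from each function name, which is simpler to read but does more work on deeply nested input.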