Perl regex for variable substitution

I want to substitute variables marked by a "#" and terminated by a dot or a non-alphanumeric character.
Example: the variable #name should be replaced by "Peter":
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
So if the variable is terminated by a dot, it is replaced and the dot is removed. If it is terminated by any other non-alphanumeric character, it is replaced and that character is kept.
I use the following:
s/#name\./Peter/i
s/#name(\W)/Peter$1/i
This works but is it possible to merge it into one expression?

There are several possible approaches.
s/#name(\W)/"Peter" . ($1 eq "." ? "" : $1)/e
Here we use /e to turn the replacement part into an expression, so we can inspect $1 and choose the replacement string dynamically.
s/#name(?|\.()|([^.\w]))/Peter$1/
Here we use (?| ) to reset the numbering of capture groups between branches, so both \.() and ([^.\w]) set $1. If a . is matched, $1 becomes the empty string; otherwise it contains the matched character.
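As a quick check, here is a minimal sketch that runs both variants against the three sample strings from the question. The sub names subst_eval and subst_branch_reset are made up for illustration, and the /i flag is carried over from the question's own substitutions.

```perl
use strict;
use warnings;

# Variant 1: /e evaluates the replacement as code, so the dot can be dropped.
sub subst_eval {
    my ($s) = @_;
    $s =~ s/#name(\W)/"Peter" . ($1 eq "." ? "" : $1)/ie;
    return $s;
}

# Variant 2: branch reset, so both alternatives fill $1.
sub subst_branch_reset {
    my ($s) = @_;
    $s =~ s/#name(?|\.()|([^.\w]))/Peter$1/i;
    return $s;
}

for my $s ('abc#name.def', 'abc#namedef', 'abc#name-def') {
    printf "%-12s => %-12s / %s\n", $s, subst_eval($s), subst_branch_reset($s);
}
```

Both subs should give the same result for every input, since they differ only in where the "drop the dot" decision is made.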

You may use
s/#name(?|\.()|(\W))/Peter$1/i
Details
#name - matches the literal substring
(?|\.()|(\W)) - a branch reset group matching either of the two alternatives:
\.() - a dot and then captures an empty string into $1
| - or
(\W) - any non-word char captured into $1.
So, upon a match, $1 placeholder is either empty or contains any non-word char other than a dot.

You can do this by using either a literal dot or a word boundary for the terminator
Like this
s/#name(?:\.|\b)/Peter/i
Here's a complete program that reproduces the required output shown in your question
use strict;
use warnings 'all';
for my $s ( 'abc#name.def', 'abc#namedef', 'abc#name-def' ) {
( my $s2 = $s ) =~ s/#name(?:\.|\b)/Peter/i;
printf "%-12s => %-s\n", $s, $s2;
}
output
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
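One difference worth knowing about: \b also matches at the very end of the string, whereas the (\W) variants need an actual terminating character. A small sketch of that edge case follows; the sub names and the input 'abc#name' are made up for illustration and are not part of the question.

```perl
use strict;
use warnings;

# Word-boundary variant from this answer.
sub with_boundary {
    my ($s) = @_;
    $s =~ s/#name(?:\.|\b)/Peter/i;
    return $s;
}

# Branch-reset variant, which requires a terminating character.
sub with_branch_reset {
    my ($s) = @_;
    $s =~ s/#name(?|\.()|(\W))/Peter$1/i;
    return $s;
}

for my $s ('abc#name-def', 'abc#name') {
    printf "%-12s boundary: %-12s branch reset: %s\n",
        $s, with_boundary($s), with_branch_reset($s);
}
```

Which behaviour is right depends on whether a variable at the end of the input should count as terminated.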

Related

How can I regexp capture the string between 2 specific sets of double underscores?

I want to regexp capture the string between 2 specific sets of double underscores. The string that get captured may itself have single underscore occurrences in it. Here's the test Perl script I've been working with:
#!/usr/bin/env perl
use strict;
my $str = "DFD_20220913_121409_strix1a0__z1_erx_adm__CL1695331__RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx__220913_150718";
(my $grp) = $str =~ /CL\d+\_\_(\w+)\_\_/;
print "grp = $grp\n";
exit;
This returns...
grp = RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx
I want...
grp = RTL_Dfdsg4
As you can see, I know something about where the first set of double underscores exists (after the CL\d+). But for some reason, the regexp reads past the next occurrence of the double underscores until it hits the last set.
You need to use the non-greedy quantifier, ?.
(my $grp) = $str =~ /CL\d+__(\w+?)__/;
I removed the unnecessary backslashes from before the underscores.
Note that using the non-greedy modifier is fragile and can easily work differently than intended. This is the robust alternative:
my ( $grp ) = $str =~ /
CL \d+
__
( [^\W_]+ (?: _ [^\W_]+ )* ) # `[^\W_]` is `\w` minus `_`
__
/x;
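To illustrate the fragility, here is a sketch that compares the two forms. The "_ext" patterns are a hypothetical extension (requiring the trailing digits_digits stamp right after the closing double underscore) and are not part of the question; they are only there to show how the non-greedy form can silently stretch across further double underscores when something after it has to match.

```perl
use strict;
use warnings;

my $str = "DFD_20220913_121409_strix1a0__z1_erx_adm__CL1695331__RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx__220913_150718";

# With nothing required after the closing "__", both forms capture the same field.
my ($lazy)   = $str =~ /CL\d+__(\w+?)__/;
my ($robust) = $str =~ /CL\d+__([^\W_]+(?:_[^\W_]+)*)__/;

# Hypothetical extension: the closing "__" must be followed by digits_digits.
my ($lazy_ext)   = $str =~ /CL\d+__(\w+?)__\d+_\d+$/;
my ($robust_ext) = $str =~ /CL\d+__([^\W_]+(?:_[^\W_]+)*)__\d+_\d+$/;

print "lazy:       $lazy\n";
print "robust:     $robust\n";
print "lazy_ext:   $lazy_ext\n";
print "robust_ext: ", (defined $robust_ext ? $robust_ext : "(no match)"), "\n";
```

In the extended case the non-greedy capture grows until the rest of the pattern fits, while the explicit form refuses to match because the field it describes cannot contain a double underscore.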

Perl - Problem with "]" in a regular expression

I have a string :
my $string = "name_of_my_function(arg1,arg2,[arg3,arg4])";
and I want to extract the name of the function "name_of_my_function" and the parameters :
$arg1 = "arg1"
$arg2 = "arg2"
@arg_list = ("arg3", "arg4")
the code I use to extract the function is :
$row =~ m/^([^\(]*)\(([^\)]*)\)/;
$function = $1;
However, it works when the string doesn't have any "]", for example :
my $string = "name_of_my_function(arg1,arg2,arg3)";
but it doesn't return anything when there is a "]"
Any idea?
The regex you show captures the function name and all the arguments as a single string, which is a very reasonable first step. Then parse the arguments out of that second string. I expanded your $string to have multiple bracketed lists of arguments, interleaved with non-bracketed ones:
perl -wE'
    $s = "name_of_my_function(arg1,arg2,[arg3,arg4],arg5,[arg6,arg7])";
    @m = $s =~ /^([^\(]*)\(([^\)]*)\)/;
    @p = grep { $_ } split /\s*,\s*|\[(.*?)\]/, $m[1];
    for (@p) {
        if (/,/) { push @arg_list, $_ }
        else     { push @args, $_ }
    }
    say $m[0];
    say for @args;
    say for @arg_list
'
This prints
name_of_my_function
arg1
arg2
arg5
arg3,arg4
arg6,arg7
The split is where the individual arguments are extracted, as well as the bracketed argument list(s), each as a string. It may return empty elements, hence the grep { $_ } to filter them out.
Then you can proceed to extract the individual arguments from the lists that were in brackets, by splitting each string in @arg_list on , again.
The main part of the above can, as the problem stands, go in one statement
@p = grep { $_ } split /\( | \) | \[(.*?)\] |,/x, $s;
where I added the /x modifier to be able to space it out for readability. This delivers to @p the function name, the individual arguments, and a string with the (comma-separated) argument list from each [].
However, I think that it is far more sensible to break this up into several steps.
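A minimal sketch of what "several steps" might look like, under some assumptions: there is a single level of brackets, no quoting, and no nested parentheses. The helper name parse_call and its return shape are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the function name, an array ref of plain
# arguments, and an array ref holding one array ref per bracketed group.
sub parse_call {
    my ($s) = @_;

    # Step 1: split the call into the name and everything inside ( ... ).
    my ($name, $inner) = $s =~ /^([^(]*)\((.*)\)\s*$/ or return;

    # Step 2: walk the argument string token by token.
    my (@args, @lists);
    while ($inner =~ /\G\s*(?:\[([^\]]*)\]|([^,\[\]]+))\s*(?:,|\z)/g) {
        my ($list, $plain) = ($1, $2);
        if (defined $list) { push @lists, [ split /\s*,\s*/, $list ] }  # Step 3: split each group
        else               { push @args, $plain }
    }
    return ($name, \@args, \@lists);
}

my ($name, $args, $lists) =
    parse_call("name_of_my_function(arg1,arg2,[arg3,arg4],arg5,[arg6,arg7])");
print "$name\n";
print "arg: $_\n" for @$args;
print "list: @$_\n" for @$lists;
```

Keeping each bracketed group as its own array ref avoids the later "split on comma again" pass and keeps the groups distinct from one another.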
Well, if the number of arguments is variable, it is not that simple to do with a regex only (the arguments are matched by a group under a + quantifier, so the capturing group keeps only its last repetition rather than every argument, which would otherwise be easy to extract). With that in mind, you could use this pattern: (\w+)\(((\w+|\[(\w+,?)+\]),?)+\)
Explanation:
(\w+) - match one or more word characters (the name of the function) and store it in the first capturing group,
(\w+|\[(\w+,?)+\]) - alternation: match \w+ (same as above) or \[(\w+,?)+\]: \[ - match [ literally, (\w+,?)+ - match one or more times the pattern \w+,? which is one or more word characters followed by zero or one comma, \] - match ] literally,
((\w+|\[(\w+,?)+\]),?)+ - match the whole pattern above, optionally followed by a comma (,?), one or more times. This matches the argument list.
\(, \) - match ( and ) literally.
Further processing: extract what is between the parentheses () in order to extract the argument list programmatically. That would be easier than doing it with a complex regular expression.
UPDATE:
Try pattern: https://regex101.com/r/wBcJZ0/3
I omitted the explanation, as it is very similar to the previous pattern.

Perl Regular Expression extracting sub-string?

I have a string variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion, i.e. the portion from the beginning up to the first appearance of ".". The problem is that there can be almost any characters (only alphanumeric) of any length before the ".". So I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However it is giving me blank string. The logic is that :
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?
You can instead use negative character class like this
^[^.]+
[^.] would match any character except .
[^.]+ would match 1 to many characters(except .)
^ depicts the start of string
OR
^.+?(?=\.|$)
(?=) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad with the regex a(?=b), only the first a (the one followed by b) would match.
$ matches at the end of each line if the /m (multiline) modifier is used; otherwise it matches at the end of the string (or just before a final newline).
\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is ".":
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ".".
If the first character of the string isn't ".":
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
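To see the difference concretely, here is a small sketch that applies the question's pattern and the three alternatives above to the sample string. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $arg = "ABCD.asd.qwe.com:/dir1";

# The question's pattern: both parts are happy to match zero characters.
my ($original) = $arg =~ /(.*?\.?)/;

# Anchored alternatives from this answer.
my ($lazy)     = $arg =~ /^(.*?)\./;
my ($negated)  = $arg =~ /^([^.]*)\./;
my ($no_dot)   = $arg =~ /^([^.]*)/;

print "original: '$original'\n";
print "lazy:     '$lazy'\n";
print "negated:  '$negated'\n";
print "no_dot:   '$no_dot'\n";
```

The last form also succeeds on input that contains no "." at all, in which case it captures the whole string.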
use strict;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print"$capture\n";
OR you can also use Split function like,
my $sub_string = ( split /\./, $string )[0];
print"$sub_string\n";
A general note: for an explanation of a regex (to help understand a complex one), take a look at the YAPE::Regex::Explain module.
This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first ".".
You could change the .+ to .* if your input may end after the first ".".
You could change the first .*? to .+? if you are sure that there is always at least one character before the first ".".

help with perl regex rules

I would need some help with a regex issue in perl. I need to match non_letter characters "nucleated" around letter characters string (of size one).
That is to say... I have a string like
CDF((E)TR)FT
and I want to match ALL the following:
C, D, F((, ((E), )T, R), )F, T.
I was trying with something like
/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/
but I'm obtaining:
C, D, F((, E), T, R), F, T.
It is as if, once a non-letter character has been matched, it can NOT be matched again in another match.
How can I do this?
A little late on this. Somebody has probably proposed this already.
I would consume the capture in the assertion to the left (via backref) and not consume the capture in the assertion to the right. All the captures can be seen, but the last one is not consumed, so the next pass continues right after the last atomic letter was found.
Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3
Do globally, in a while loop, combined groups $2$3 (in that order) are the answer.
Test:
$samp = 'CDF((E)TR)FT';
while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
print "$2$3, ";
}
output:
C, D, F((, ((E), )T, R), )F, T,
The problem is that you consume the letters and non-letter characters the first time you encounter them, so they cannot take part in a later match. One solution is to use a different regex for each pattern and combine the results at the end:
This matches one or more non-letters followed by a single letter that is NOT followed by a non-letter:
[^A-Z]+[A-Z](?![^A-Z])
This matches a letter enclosed by non-letters, including overlapping results:
(?=([^A-Z]+[A-Z][^A-Z]+))
This matches a letter followed by one or more non-letters, only if it is not preceded by a non-letter:
(?<![^A-Z])[A-Z][^A-Z]+
And this matches single letters that have no non-letters on either side:
(?<![^A-Z])[A-Z](?![^A-Z])
Combining the results gives the desired result:
C,D,T, )T, )F, ((E), F((, R)
Also if you understand the small parts you could join this into one Regex :
#!/usr/local/bin/perl
use strict;
my $subject = "0C0CC(R)CC(L)C0";
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
# matched text = $1, $2, $3, $4
print $1, " " if defined $1;
print $2, " " if defined $2;
print $3, " " if defined $3;
print $4, " " if defined $4;
}
Output :
0C0 0C C( (R) )C C( (L) )C0
You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex
(?<=(\P{L}*))\p{L}(?=(\P{L}*))
where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.
In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:
\P{L}*\p{L}(?=(\P{L}*))
Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
# matched text = $& . $1
}
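Filled in for the string from the question, the loop might look like the sketch below. It uses an extra capture group instead of $& (an adjustment for illustration, not part of the pattern above), so the consumed part is in $1 and the looked-ahead part is in $2.

```perl
use strict;
use warnings;

my $subject = 'CDF((E)TR)FT';
my @found;

# $1: preceding non-letters plus the letter (consumed);
# $2: following non-letters (only looked at, so still available to the next match).
while ($subject =~ /(\P{L}*\p{L})(?=(\P{L}*))/g) {
    push @found, $1 . $2;
}

print join(', ', @found), "\n";
```

This should print C, D, F((, ((E), )T, R), )F, T, which is the list asked for in the question.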
Or, you could do it the hard way and tokenize first, then process the tokens:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'CDF((E)TR)FT';
my @nucleated = nucleat($str);
print "$_\n" for @nucleated;
sub nucleat {
    my($s) = @_;
    my @parts; # return list stored here
    my @tokens = grep length, split /([a-z])/i, $s;
    # bracket the tokens with empty strings to avoid warnings
    unshift @tokens, '';
    push @tokens, '';
    foreach my $i (0..$#tokens) {
        next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
        my $str = '';
        if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
            $str .= $tokens[$i-1];
        }
        $str .= $tokens[$i]; # the letter
        if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
            $str .= $tokens[$i+1];
        }
        push @parts, $str;
    }
    return @parts;
}

How can I extract substrings from a string in Perl?

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on in the above format - the parts in bold are changes across the strings.
==> Imagine I've many strings of format Shown above.
I want to pick 3 substrings (As shown in BOLD below) from the each of the above strings.
1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
2nd substring containing the word (in eg above it's "High priority")
3rd substring containing * (IF * is present at the end of the string ELSE leave it )
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by optional whitespace and maybe a *, captured as $3:
\s*(\*)?
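Putting the pieces together, a sketch of a small helper that returns the three fields and substitutes an empty string when the star is absent, so nothing undefined gets printed. The name parse_scheme is made up for illustration, and the // operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, description, star) or an empty list.
sub parse_scheme {
    my ($line) = @_;
    my ($id, $word, $star) =
        $line =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/
        or return;
    return ($id, $word, $star // '');   # the optional group is undef when absent
}

for my $line (
    '1) Scheme ID: abc-456-hu5t10 (High priority) *',
    '2) Scheme ID: frt-78f-hj542w (Balanced)',
) {
    my ($id, $word, $star) = parse_scheme($line) or next;
    print "$id | $word | $star\n";
}
```

Lines that do not match the format are skipped, mirroring the "|| next" in the loop above.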
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, dont print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, i like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;