Perl - Problem with "]" in a regular expression - regex

I have a string :
my $string = "name_of_my_function(arg1,arg2,[arg3,arg4])";
and I want to extract the name of the function "name_of_my_function" and the parameters :
$arg1 = "arg1"
$arg2 = "arg2"
#arg_list = ("arg3", "arg4")
the code I use to extract the function is :
$row =~ m/^([^\(]*)\(([^\)]*)\)/;
$function = $1;
However, it works when the string doesn't have any "]", for example :
my $string = "name_of_my_function(arg1,arg2,arg3)";
but it doesn't return anything when there is a "]"
Any idea?
Thanks,
SLP

The regex you show captures the function name, and all other arguments in a string, which is a very reasonble first step. Then parse the arguments out of that second string. I expand your $string so to have multiple bracketed lists of arguments, interleaved with non-bracketed ones
perl -wE'
$s = "name_of_my_function(arg1,arg2,[arg3,arg4],arg5,[arg6,arg7])";
#m = $s =~ /^([^\(]*)\(([^\)]*)\)/;
#p = grep { $_ } split /\s*,\s*|\[(.*?)\]/, $m[1];
for (#p) {
if (/,/) { push #arg_list, $_ }
else { push #args, $_ }
}
say $m[0];
say for #args;
say for #arg_list
'
This prints
name_of_my_function
arg1
arg2
arg5
arg3,arg4
arg6,arg7
The split is where individual arguments are extracted, as well as bracketed argument list(s), each as a string. That may return empty elements thus grep { $_ } to filter them out.
Then you can proceed to extract individual arguments from lists that were in brackets, by splitting each string in #arg_list by , again.
The main part of the above can, as the problem stands, go in one statement
#p = grep { $_ } split /\( | \) | \[(.*?)\] |,/x, $s;
where I added /x modifier so to be able to space it out for readability. This delivers to #p the function name, individual arguments, and a string with (comma separated) argument list from each [].
However, I think that it is far more sensible to break this up into several steps.

Well, if the number of arguments is variable, that is not that simple to do it with rgex only (arguments will be matched with + quantifier, so they won't be stored in capturing group, which would be easy to extract). Having in mind the above, you could use this pattern (\w+)\(((\w+|\[(\w+,?)+\]),?)+\)
Explanation:
(\w+) - match one or more word characters (name of a function) and store it in first capturing group,
(\w+|\[(\w+,?)+\]) - alternation: match \w+ (same as above) or \[(\w+,?)+\]: \[ - match [ literally, (\w+,?)+ - match on or more times \w+, pattern which is one or more word characters followed by one or zero commas (,?), \] - match ] literally,
((\w+|\[(\w+,?)+\]),?)+ - match whole above pattern, optionally followed by comma (,?) one or more times. This would match argument list.
\(, \) 0 match (, ) literally
Further processing - extract whats between brackets () in order to extract arguments list programatically - it would be easier that doing it with complex regular expression
Demo
UPDATE:
Try pattern: https://regex101.com/r/wBcJZ0/3
I omitted explanation, as it is very similair to previous pattern.
Updted demo

Related

Perl Regular expression | how to exclude words from a file

i searching to find some Perl Regular Expression Syntax about some requirements i have in a project.
First i want to exclude strings from a txt file (dictionary).
For example if my file have this strings:
path.../Document.txt |
tree
car
ship
i using Regular Expression
a1testtre -- match
orangesh1 -- match
apleship3 -- not match [contains word from file ]
Also i have one more requirement that i couldnt solve. I have to create a Regex that not allow a String to have over 3 times a char repeat (two chars).
For example :
adminnisstrator21 -- match (have 2 times a repetition of chars)
kkeeykloakk -- not match have over 3 times repetition
stack22ooverflow -- match (have 2 times a repetition of chars)
for this i have try
\b(?:([a-z])(?!\1))+\b
but it works only for the first char-reppeat
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match any char except a newline
$ Assert end of string
Regex demo
To not allow a string to have over 3 times a char repeat you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: NOn capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
Regex demo
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
^
Pattern demo
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my #words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } #words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
if ($string !~ /$exclude/) {
say "OK: $string";
}
}
I use Path::Tiny to read the file into a a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any happen in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one in regards to the order of of patterns with common parts in alternation.†
The check that successive duplicate characters do not occur more than three times
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
my #chars_that_repeat = $string =~ /(.)\1+/g;
if (#chars_that_repeat < 3) {
say "OK: $string";
}
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in regex; if you'd rather count all pairs remove the + and four as will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
†  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } #words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start capture group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Match capture group 0 or 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match numbers, since the examples you gave seem to also include numbers. Perl regex also has the \w shorthand, which is equivalent to [A-Za-z0-9_], which could be handy if you want to match any character in a word.
My problem is that i have 2 regex that working:
Not allow over 3 pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Not allow over 4 times a char to repeat:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now i want to combine them into one row like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but its working only the regex that is first and not both of them
As mentioned in comment to #zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you. If the words in the file are not very carefully ordered to start, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
return #_ unless #_ > 1;
my (#alts, #ordered_alts, %alts_seen);
#alts = map { $alts_seen{$_}++ ? () : $_ } #_;
TEST: {
my $alt = shift #alts;
if (grep m#$alt#, #alts) {
push #alts => $alt;
} else {
push #ordered_alts => $alt;
}
redo TEST if #alts;
}
#ordered_alts
}
So following #zdim's answer:
...
my #words = split ' ', path($file)->slurp;
#words = tight_match_order(#words); # add this line
my $exclude = join '|', map { quotemeta } #words;
...
HTH

Empty $1 and $2 values Regex Perl

I have the following code:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/)
{
print "$1\:$2\.\n";
}
else
{
print "$1\:$2\_em\.\n";
}
but I am getting empty $1 and $2. The output is:
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
Use of uninitialized value in concatenation (.) or string at new_mscn_iden_parse.pl line 187.
:_em.
This code will do what you want
my $sDatabase = "abc_def:xyz_comp.";
$sDatabase =~ m/^(\w+):(\w+?)(_em)?\.$/ or die "Invalid data";
if ($3) {
print "$1:$2.\n";
}
else {
print "$1:$2_em.\n";
}
What do you expect $1 and $2 to contain when you fail to match?!
It contains whatever it contains before you attempted the match.
Possible solution:
$sDatabase =~ s/(?<!_em)(?=\.\z)/_em/;
You have:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ m/^(\w)*\:(\w*)\_em\.$/);
Let's see if this matches:
You're regular expression says:
Anchor at the start of a line.
You are looking for zero or more word characters . Word characters (in the ASCII alphabet) includes lowercase letters, uppercase letters numbers and underscores.
Thus /\w*/ will match all the following:
Computer
computer
computer23
computer_32
an empty string
You're next looking for a colon
Then, more word characters
Followed by a _em string
Followed by a period
And that should be the end of the string (if there's no NL and you're not doing multi-line string searches. Looks like you're safe there).
Now, let's look at your string: abc_def:xyz_comp.
\w* will match up to abc_def. Regular expressions are greedy and will try to match the biggest portion of the string as possible.
The : will match the colon. So far, you're matching abc_def:.
That \w* will match on xyz_comp.
Now, you're trying to match a _em. Oops! No good. There is no _em in your string. Your regular expression match will fail.
Since your regular expression match fails, the $1 and $2 variables simply are not set and have no value.
That's why you're getting Use of uninitialized value. What you can do is make the later half of your expression optional:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ /^(\w)+:(\w*)(_em)?\.$/) {
if ( $3 ) {
print "$1:${2}${3}.\n";
else {
print "$1:${2}_em.";
}
}
else {
die qq(String doesn't match regular expression at all\n);
}
}
First of all, I think you want to match at least one character (I could be wrong), so I switched the asterisk which matches zero or more to a + which matches one or more.
Note I have a third set of parentheses followed by a ?. This means match this zero or one times. Thus, you will have a match, and $1 and $2 will be set as long as your string starts with one or more word characters, followed by a colon, followed by one or more word characters.
What won't necessarily happen is that $3 will be set. This will only be set if your string also ends with _em.. If your string doesn't include the _em, but ends with a period, $1 and $2 will still match.
In your case, we could simplify it by doing this:
my $sDatabase = "abc_def:xyz_comp.";
if ($sDatabase =~ /^(\w)+:(\w*)(?:_em)?\.$/) {
print "$1:${2}_em.";
else {
die qq(String doesn't match regular expression at all\n);
}
The (?:...) means don't set a match, just group. Thus, $3 will never be set. That's okay, either $3 is _em. or we add _em. to the end of the match anyway.

help with perl regex rules

I would need some help with a regex issue in perl. I need to match non_letter characters "nucleated" around letter characters string (of size one).
That is to say... I have a string like
CDF((E)TR)FT
and I want to match ALL the following:
C, D, F((, ((E), )T, R), )F, T.
I was trying with something like
/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/
but I'm obtaining:
C, D, F((, E), T, R), F, T.
Is like if once a non-letter characters has been matched it can NOT be matched again in another matching.
How can I do this?
A little late on this. Somebody has probably proposed this already.
I would consume the capture in the assertion to the left (via backref) and not consume the capture in the assertion to the right. All the captures can be seen, but the last one is not consumed, so the next pass continues right after the last atomic letter was found.
Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3
Do globally, in a while loop, combined groups $2$3 (in that order) are the answer.
Test:
$samp = 'CDF((E)TR)FT';
while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
print "$2$3, ";
}
output:
C, D, F((, ((E), )T, R), )F, T,
The problem is that you are consuming your characters or non letter characters the first time you encounter them, therefore you can't match all that you want. A solution would be to use different regexes for different patterns and combine the results at the end so that you could have your desired result :
This will match all character starting with a non character followed by a single character but NOT followed by a non character
[^A-Z]+[A-Z](?![^A-Z])
This will match a character enclosed by non characters, containing overlapping results :
(?=([^A-Z]+[A-Z][^A-Z]+))
This will match a character followed by one or more non characters only if it is not preceded by a non character :
(?<![^A-Z])[A-Z][^A-Z]+
And this will match single characters which are not enclosed to non characters
(?<![^A-Z])[A-Z](?![^A-Z])
By combining the results you will have the correct desired result:
C,D,T, )T, )F, ((E), F((, R)
Also if you understand the small parts you could join this into one Regex :
#!/usr/local/bin/perl
use strict;
my $subject = "0C0CC(R)CC(L)C0";
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
# matched text = $1, $2, $3, $4
print $1, " " if defined $1;
print $2, " " if defined $2;
print $3, " " if defined $3;
print $4, " " if defined $4;
}
Output :
0C0 0C C( (R) )C C( (L) )C0
You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex
(?<=(\P{L}*))\p{L}(?=(\P{L}*))
where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.
In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:
\P{L}*\p{L}(?=(\P{L}*))
Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
# matched text = $& . $1
}
Or, you could do it the hard way and tokenize first, then process the tokens:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'CDF((E)TR)FT';
my #nucleated = nucleat($str);
print "$_\n" for #nucleated;
sub nucleat {
my($s) = #_;
my #parts; # return list stored here
my #tokens = grep length, split /([a-z])/i, $s;
# bracket the tokens with empty strings to avoid warnings
unshift #tokens, '';
push #tokens, '';
foreach my $i (0..$#tokens) {
next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
my $str = '';
if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
$str .= $tokens[$i-1];
}
$str .= $tokens[$i]; # the letter
if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
$str .= $tokens[$i+1];
}
push #parts, $str;
}
return #parts;
}

How can I extract substrings from a string in Perl?

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on in the above format - the parts in bold are changes across the strings.
==> Imagine I've many strings of format Shown above.
I want to pick 3 substrings (As shown in BOLD below) from the each of the above strings.
1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
2nd substring containing the word (in eg above it's "High priority")
3rd substring containing * (IF * is present at the end of the string ELSE leave it )
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by some spaces any maybe a *, captured as $3:
\s*(\*)?
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurances of literal *
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, dont print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, i like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than rely ing on the commandline switches for the looping.
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /'^\S+'/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;

what does this line do in Perl? ($rowcol =~ m/([A-Z]?)([0-9]+)/);

What does this line do in Perl?
my #parsedarray = ($rowcol =~ m/([A-Z]?)([0-9]+)/);
$rowcol is something like A1, D8 etc... and I know that the script somehow splits them up because the next two lines are these:
my $row = $parsedarray[0];
my $col = $parsedarray[1];
I just can't see what this line does ($rowcol =~ m/([A-Z]?)([0-9]+)/); and how it works.
([A-Z]?) means capture at most one uppercase letter. ([0-9]+) means match and capture at least one digit.
Next time, you can install YAPE::Regex::Explain to tell you what's going on. eg
use YAPE::Regex::Explain;
my $regex = "([A-Z]?)([0-9]+)";
my $exp = YAPE::Regex::Explain->new($regex)->explain;
print $exp."\n";
Note that m// in list context returns all the captured sub-strings.
Broken into pieces, this is what's going on:
my #parsedarray = # declare an array, and assign it the results of:
(
$rowcol =~ # the results of $rowcol matched against
m/ # the pattern:
([A-Z]?) # 0 or 1 upper-case-alpha characters (1st submatch),
# followed by
([0-9]+) # 1 or more numeric characters (2nd submatch)
/x # this flag added to allow this verbose formatting
); # ...which in list context is all the submatches
So if $rowcal = 'D3':
my #parsedarray = ('D3' =~ m/([A-Z]?)([0-9]+)/); # which reduces to:
my #parsedarray = ('D', '3');
You can read about regular expressions in depth at perldoc perlrequick (a quick summary), perldoc perlretut (the tutorial), and perldoc perlre (all the details), and the various regular expression operations (matching, substitution, translation) at perldoc perlop.
The operator m// is a pattern match, basically a synonym of //. This matches an optional first letter and then 1 or more digits in row column. An array is returned as the result of the match with each element containing one of the matched groups (in brackets). Therefore $parsedarray[0] contains the letter (or nothing) and $parsedarray[1] contains the digits.
It:
matches against the regular expression (zero or more capitalised letters, followed by one or more numbers.)
Captures two groups:
zero or more letters
one or more numbers
Assigns those captured groups to the #parsedarray array
Example code to test:
use Data::Dumper;
my $rowcol = "A1";
my #parsedarray = ($rowcol =~ m/([A-Z]?)([0-9]+)/);
print Dumper(\#parsedarray);
yields:
$VAR1 = [
'A',
'1'
];
Note that if the string had no leading capitalised letter (e.g. "a1") then it would return an empty string for $parsedarray[0].
My Perl is a little rusty but if I understood your question, the answer is that it matches the regular expression of :
optional any capital letter between a through z followed by one or more number digits, and assigns it to rowcol