How can I extract substrings from a string in Perl? - regex

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on, in the format above. The ID, the text in parentheses and the trailing * change from string to string.
Imagine I have many strings in this format.
I want to pick 3 substrings from each of them:
1st substring: the alphanumeric value (in the example above, "abc-456-hu5t10")
2nd substring: the text in parentheses (in the example above, "High priority")
3rd substring: the * (if a * is present at the end of the string; otherwise leave it out)
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl. Can you help with this?

You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape), followed by one or more characters which aren't a closing bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by optional whitespace and maybe a *, captured as $3:
\s*(\*)?
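One detail worth spelling out: because the third group is written as (\*)?, $3 is undefined on lines without a trailing star. The following is a minimal sketch, not the answerer's code, that uses the same regex in list context and defaults the star to an empty string. The @records array and its layout are assumptions made for illustration, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my @lines = (
    '1) Scheme ID: abc-456-hu5t10 (High priority) *',
    '2) Scheme ID: frt-78f-hj542w (Balanced)',
    '3) Scheme ID: 23f-f974-nm54w (super formula run) *',
);

my @records;    # hypothetical container: one [id, word, star] triple per line
for my $line (@lines) {
    # In list context the match returns the captures, or an empty list on failure.
    my ($id, $word, $star) =
        $line =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/
        or next;
    # Group 3 is undef when there is no trailing star, so default it to ''.
    push @records, [ $id, $word, $star // '' ];
}

print join('|', @$_), "\n" for @records;
```

Lines that do not match are skipped, so the loop can be fed a whole file without special-casing headers or blank lines.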

You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*

(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
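As a rough illustration of the breakdown above, the sketch below applies that pattern to two of the sample lines. Because the star is written as (\*?) rather than (\*)?, the third group always takes part in the match, so it comes back as an empty string instead of undef when no star is present. The @results array is only an assumption for the demo.

```perl
use strict;
use warnings;

# The pattern from the breakdown above, precompiled.
my $re = qr/(\S*)\s*\((.*?)\)\s*(\*?)/;

my @results;    # hypothetical: one "id|word|star" string per matching line
for my $line (
    '1) Scheme ID: abc-456-hu5t10 (High priority) *',
    '2) Scheme ID: frt-78f-hj542w (Balanced)',
) {
    my ($id, $word, $star) = $line =~ $re or next;
    push @results, "$id|$word|$star";
}

print "$_\n" for @results;
```

The pattern is not anchored to "Scheme ID:", so it relies on the ID being the last run of non-whitespace before the opening parenthesis.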

Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, don't print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, I like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the commandline switches for the looping.

Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}

This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;

String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;

Related

Perl - Problem with "]" in a regular expression

I have a string :
my $string = "name_of_my_function(arg1,arg2,[arg3,arg4])";
and I want to extract the name of the function "name_of_my_function" and the parameters :
$arg1 = "arg1"
$arg2 = "arg2"
@arg_list = ("arg3", "arg4")
the code I use to extract the function is :
$row =~ m/^([^\(]*)\(([^\)]*)\)/;
$function = $1;
However, it works when the string doesn't have any "]", for example :
my $string = "name_of_my_function(arg1,arg2,arg3)";
but it doesn't return anything when there is a "]"
Any idea?
Thanks,
SLP
The regex you show captures the function name, and all the other arguments in one string, which is a very reasonable first step. Then parse the arguments out of that second string. I expand your $string to have multiple bracketed lists of arguments, interleaved with non-bracketed ones:
perl -wE'
$s = "name_of_my_function(arg1,arg2,[arg3,arg4],arg5,[arg6,arg7])";
@m = $s =~ /^([^\(]*)\(([^\)]*)\)/;
@p = grep { $_ } split /\s*,\s*|\[(.*?)\]/, $m[1];
for (@p) {
if (/,/) { push @arg_list, $_ }
else { push @args, $_ }
}
say $m[0];
say for @args;
say for @arg_list
'
This prints
name_of_my_function
arg1
arg2
arg5
arg3,arg4
arg6,arg7
The split is where individual arguments are extracted, as well as bracketed argument list(s), each as a string. That may return empty elements thus grep { $_ } to filter them out.
Then you can proceed to extract individual arguments from lists that were in brackets, by splitting each string in @arg_list by , again.
The main part of the above can, as the problem stands, go in one statement
@p = grep { $_ } split /\( | \) | \[(.*?)\] |,/x, $s;
where I added the /x modifier to be able to space it out for readability. This delivers to @p the function name, individual arguments, and a string with the (comma separated) argument list from each [].
However, I think that it is far more sensible to break this up into several steps.
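One way those separate steps might look, as a sketch under stated assumptions rather than the answerer's code: the first match peels off the function name and the raw argument string, and a second global match walks the arguments in order, treating anything in [ ] as a list. The names $raw, @args and @arg_lists are made up for the example, and nested brackets are assumed not to occur.

```perl
use strict;
use warnings;

my $string = "name_of_my_function(arg1,arg2,[arg3,arg4])";

# Step 1: function name and the raw argument string between the outer parentheses.
my ($function, $raw) = $string =~ /^([^(]*)\((.*)\)$/
    or die "unexpected format: $string";

# Step 2: take either a bracketed list or a plain argument, left to right.
my (@args, @arg_lists);
while ($raw =~ /\[([^\]]*)\]|([^,\[\]\s]+)/g) {
    if (defined $1) { push @arg_lists, [ split /\s*,\s*/, $1 ] }    # "[a,b]" becomes an array ref
    else            { push @args, $2 }                              # plain argument
}

print "$function\n";
print "arg: $_\n" for @args;
print "list: @$_\n" for @arg_lists;
```

Keeping each bracketed list as its own array reference avoids the second split pass described above.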
Well, if the number of arguments is variable, it is not that simple to do with a regex only (the arguments are matched with a + quantifier, so they are not each stored in their own capturing group, which would have been easy to extract). With that in mind, you could use this pattern: (\w+)\(((\w+|\[(\w+,?)+\]),?)+\)
Explanation:
(\w+) - match one or more word characters (name of a function) and store it in first capturing group,
(\w+|\[(\w+,?)+\]) - alternation: match \w+ (same as above) or \[(\w+,?)+\]: \[ - match [ literally, (\w+,?)+ - match one or more times the pattern \w+,? which is one or more word characters followed by zero or one comma (,?), \] - match ] literally,
((\w+|\[(\w+,?)+\]),?)+ - match whole above pattern, optionally followed by comma (,?) one or more times. This would match argument list.
\(, \) - match (, ) literally
Further processing: extract what's between the parentheses () and then pull the argument list out of it programmatically. That would be easier than doing it with a complex regular expression.
UPDATE:
Try pattern: https://regex101.com/r/wBcJZ0/3
I omitted the explanation, as it is very similar to the previous pattern.

perl regex for variable substitution

I want to substitute variables marked by a "#" and terminated by a dot or a non-alphanumeric character.
Example: Variable #name should be substituted by "Peter"
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
So if the variable is terminated with a dot, it is replaced and the dot removed. If it is terminated by any other non-alphanumeric character, it is replaced as well.
I use the following:
s/#name\./Peter/i
s/#name(\W)/Peter$1/i
This works but is it possible to merge it into one expression?
There are several possible approaches.
s/#name(\W)/"Peter" . ($1 eq "." ? "" : $1)/e
Here we use /e to turn the replacement part into an expression, so we can inspect $1 and choose the replacement string dynamically.
s/#name(?|\.()|([^.\w]))/Peter$1/
Here we use (?| ) to reset the numbering of capture groups between branches, so both \.() and ([^.\w]) set $1. If a . is matched, $1 becomes the empty string; otherwise it contains the matched character.
You may use
s/#name(?|\.()|(\W))/Peter$1/i
Details
#name - matches the literal substring
(?|\.()|(\W)) - a branch reset group matching either of the two alternatives:
\.() - a dot and then captures an empty string into $1
| - or
(\W) - any non-word char captured into $1.
So, upon a match, $1 placeholder is either empty or contains any non-word char other than a dot.
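To see the branch reset group at work, here is a small sketch that runs the substitution over the three inputs from the question. The @out array exists only so the results can be inspected, and the (?| ) construct needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my @out;    # hypothetical: collected "input => output" strings
for my $s ('abc#name.def', 'abc#namedef', 'abc#name-def') {
    # Both branches fill $1: empty after a dot, the matched character otherwise.
    (my $replaced = $s) =~ s/#name(?|\.()|(\W))/Peter$1/i;
    push @out, "$s => $replaced";
}

print "$_\n" for @out;
```

The middle input is left alone because "#name" is followed by a word character there, so neither branch can match.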
You can do this by using either a literal dot or a word boundary for the terminator
Like this
s/#name(?:\.|\b)/Peter/i
Here's a complete program that reproduces the required output shown in your question
use strict;
use warnings 'all';
for my $s ( 'abc#name.def', 'abc#namedef', 'abc#name-def' ) {
( my $s2 = $s ) =~ s/#name(?:\.|\b)/Peter/i;
printf "%-12s => %-s\n", $s, $s2;
}
output
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def

Replace character between two string in PCRE (Perl) syntax

How can I replace a special character between two special strings.
I have something like this:
"start 1
2-
G
23
end"
I want to have the following:
"start 1 2- G 23 end"
Only replace \n with space between "start and end"
Test1;Hello;"Text with more words";123
Test2;want;"start
1-
76 end";123
Test3;Test;"It's a test";123
Test4;Hellp;"start
1234
good-
the end";1234
Test5;Test;"It's a test";123
Is it possible in notepad++?
You can use this pattern:
(?:\G(?!\A)|\bstart\b)(?:(?!\bend\b).)*\K\R
details:
(?:
\G(?!\A) # contiguous to a previous match
|
\bstart\b # this is the first branch that matches
)
(?:(?!\bend\b).)* # zero or more chars that are not a newline nor the start of the word "end"
\K # remove all on the left from the match result
\R # any newline sequence (\n or \r\n or \r)
Note: (?:(?!\bend\b).)* isn't very efficient, feel free to replace it by something better for your particular case.
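The question asks about Notepad++, but since Perl supports \G, \K and \R as well (from 5.10 on), the same pattern can be tried there. This is a minimal sketch assuming the data has been read into a single string; $text and $joined are names invented for the example.

```perl
use strict;
use warnings;

my $text = <<'END';
Test1;Hello;"Text with more words";123
Test2;want;"start
1-
76 end";123
Test3;Test;"It's a test";123
END

# Replace each newline between "start" and "end" with a space.
# \K discards everything matched so far, so only the newline itself is replaced.
(my $joined = $text) =~ s/(?:\G(?!\A)|\bstart\b)(?:(?!\bend\b).)*\K\R/ /g;

print $joined;
```

Each match either begins at the word "start" or continues from where the previous match ended, which is what keeps the replacement confined to the start...end block.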
The magic words are lazy quantifiers, lookaheads and single-line mode.
A solution for PHP (uses PCRE) would be:
<?php
$string = __your_string_here__;
$regex = '~(?s)(?:start)(?<content>.*?)(?=end)(?s-)~';
# ~ delimiter
# (?s) starts single line mode - aka dot matches everything
# (?:start) matches start literally (non-capturing group)
# .*? matches everything lazily
# (?=end) positive lookahead
# (?s-) turn single line mode off
# ~ delimiter
preg_match_all($regex, $string, $matches);
$content = str_replace("\n", '', $matches["content"][1]);
echo $content; // 1234good-the
?>

extract a part of string using regex

I have a text file with pattern as below.
"s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}"
I am trying to extract the values cap1, cap2, CAP3.
I am trying with regex "([a-z]|[|])cap1(\{(.*?)\})([a-z]|[|]|[0-9])" but with no luck any help is appreciated.
As I understand it, you want to extract the values of cap1, cap2 and CAP3 one by one. So there are 3 regexes:
For cap1
cap1\{([^\}]*)\}
Explanation
cap1\{ match text cap1{,
([^\}]*) capture any characters except } to group $1,
\} match text }.
For cap2
cap2\[([^\]]*)\]
Explanation
cap2\[ match text cap2[,
([^\]]*) capture any characters except ] to group $1,
\] match text ].
For CAP3
CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}
Explanation
CAP3\{ match text CAP3{,
\[([^\]]*)\]\|\[([^\]]*)\] capture any characters except ] to groups $1, $2 respectively,
\} match text }.
Additional: Thanks to @Borodin for the comment. You don't need lookaround for this task, but if you want to do search and replace, lookaround may be necessary.
For cap1: (?<=cap1\{)([^\}]*)(?=\})
For cap2: (?<=cap2\[)([^\]]*)(?=\])
For CAP3: (?<=CAP3\{)\[([^\]]*)\]\|\[([^\]]*)\](?=\})
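A short sketch of the search-and-replace case mentioned above, using the cap1 lookaround pattern without its capture group. Since the lookbehind and lookahead consume nothing, only the text between the braces is replaced. The shortened input string and the replacement text "REDACTED" are placeholders chosen for the example.

```perl
use strict;
use warnings;

my $s = 'a|b|cap1{capture|these|values}|c|cap2[capture|these|values]';

# Only the text between "cap1{" and "}" is part of the match,
# so the surrounding "cap1{" and "}" survive the substitution.
(my $edited = $s) =~ s/(?<=cap1\{)[^}]*(?=\})/REDACTED/;

print "$edited\n";
```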
Using a pattern such as this should work:
[{\[]+([^}{\]\[]+)[\]}]+
Code:
$searchText =~ m/[{\[]+([^}{\]\[]+)[\]}]+/
Example:
https://regex101.com/r/qI3fI6/1
Update
I apologise -- I initially mistook your question for something more trivial
Essentially you want to perform a split on pipe | characters, excluding those found inside pairs of brackets or braces [ ... ] or { ... }. As long as you don't need to take account of nesting inside brackets of the same type (i.e. braces will only ever contain brackets, and brackets will only ever contain braces) then it is simply done like this
my @matches = $s =~ m{ \w+ ( \{ [^{}]* \} | \[ [^\[\]]* \] ) }gx;
print "$_\n" for @matches;
output
{capture|these|values}
[capture|these|values]
{[capture|these|values]|[capture|these|values]}
The data you show has no instances of braces containing braces, or brackets containing brackets, but I suspect that there is no theoretical limit to the nesting of your data, in which case some recursion is necessary
The regex pattern in the program below defines the text that can appear inside a pair of matching brackets as a pipe-delimited sequence of
another pair of matching brackets and their content [ ... ]
another pair of matching braces and their content { ... }
a sequence of word characters like capture and values
A pattern matching that is inside the second pair of capturing parentheses. It is a recursive pattern that calls itself using relative numbering (?-1). That could also be absolute numbering (?2) but it would have to be changed if the number of preceding captures was changed
The complete pattern looks for and captures a series of word characters immediately before the recursive pattern to account for the cap1, cap2 etc. This allows the result of a global search to be assigned directly to a hash, with the result shown below
use strict;
use warnings;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my %captures = $s =~ m{
( (?> \w+ ) )
(
\{ (?-1) (?> \| (?-1) )* \} |
\[ (?-1) (?> \| (?-1) )* \] |
\w+
)
}gx;
use Data::Dump;
dd \%captures;
output
{
cap1 => "{capture|these|values}",
cap2 => "[capture|these|values]",
CAP3 => "{[capture|these|values]|[capture|these|values]}",
}
Original answer
It looks like you want all identifiers that are preceded by a pipe | character and followed by either a square or curly opening bracket [ or {
This program will do that for you
use strict;
use warnings;
use v5.10;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
for ( $s ) {
my @captures = /\|(\w+)[\[\{]/g;
say for @captures;
}
output
cap1
cap2
CAP3

help with perl regex rules

I would need some help with a regex issue in perl. I need to match non_letter characters "nucleated" around letter characters string (of size one).
That is to say... I have a string like
CDF((E)TR)FT
and I want to match ALL the following:
C, D, F((, ((E), )T, R), )F, T.
I was trying with something like
/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/
but I'm obtaining:
C, D, F((, E), T, R), F, T.
It is as if, once a non-letter character has been matched, it can NOT be matched again in another match.
How can I do this?
A little late on this. Somebody has probably proposed this already.
I would consume the capture in the assertion to the left (via backref) and not consume the capture in the assertion to the right. All the captures can be seen, but the last one is not consumed, so the next pass continues right after the last atomic letter was found.
Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3
Do globally, in a while loop, combined groups $2$3 (in that order) are the answer.
Test:
$samp = 'CDF((E)TR)FT';
while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
print "$2$3, ";
}
output:
C, D, F((, ((E), )T, R), )F, T,
The problem is that you consume the letters or non-letter characters the first time you encounter them, so you can't match everything you want. A solution would be to use different regexes for the different patterns and combine the results at the end:
This will match a letter preceded by one or more non-letters but NOT followed by a non-letter:
[^A-Z]+[A-Z](?![^A-Z])
This will match a letter enclosed by non-letters, including overlapping results:
(?=([^A-Z]+[A-Z][^A-Z]+))
This will match a letter followed by one or more non-letters, only if it is not preceded by a non-letter:
(?<![^A-Z])[A-Z][^A-Z]+
And this will match single letters that are not adjacent to any non-letter:
(?<![^A-Z])[A-Z](?![^A-Z])
By combining the results you will have the desired result:
C,D,T, )T, )F, ((E), F((, R)
Also, if you understand the small parts, you could join them into one regex:
#!/usr/local/bin/perl
use strict;
my $subject = "0C0CC(R)CC(L)C0";
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
# matched text = $1, $2, $3, $4
print $1, " " if defined $1;
print $2, " " if defined $2;
print $3, " " if defined $3;
print $4, " " if defined $4;
}
Output :
0C0 0C C( (R) )C C( (L) )C0
You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex
(?<=(\P{L}*))\p{L}(?=(\P{L}*))
where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.
In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:
\P{L}*\p{L}(?=(\P{L}*))
Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
# matched text = $& . $1
}
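The loop body above is left empty, so here is a minimal filled-in sketch run against the string from the question. The @nucleated array is a name chosen for the example.

```perl
use strict;
use warnings;

my $subject = 'CDF((E)TR)FT';

my @nucleated;    # hypothetical: one entry per letter, with its surrounding non-letters
while ($subject =~ /\P{L}*\p{L}(?=(\P{L}*))/g) {
    # $& holds the leading non-letters plus the letter (consumed);
    # $1 holds the trailing non-letters (looked at, but not consumed).
    push @nucleated, $& . $1;
}

print join(', ', @nucleated), "\n";
```

Because the trailing non-letters are only seen through the lookahead, the next iteration can pick them up again as leading characters of the following letter.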
Or, you could do it the hard way and tokenize first, then process the tokens:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'CDF((E)TR)FT';
my @nucleated = nucleat($str);
print "$_\n" for @nucleated;
sub nucleat {
my($s) = @_;
my @parts; # return list stored here
my @tokens = grep length, split /([a-z])/i, $s;
# bracket the tokens with empty strings to avoid warnings
unshift @tokens, '';
push @tokens, '';
foreach my $i (0..$#tokens) {
next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
my $str = '';
if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
$str .= $tokens[$i-1];
}
$str .= $tokens[$i]; # the letter
if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
$str .= $tokens[$i+1];
}
push @parts, $str;
}
return @parts;
}