How to cut out a specific substring? - regex

I wonder whether a Perl regex can cut a substring out of a string when that substring may occur zero or more times.
For example:
"foo bar //baz" and "foo bar" should both result in "foo bar", cutting out everything from a double slash onward, if it's there.
I know this could easily be achieved with other methods, but I'm interested in whether a regex one-liner is possible.
I tried ($new_string) = ($string =~ /(.*?)(\/\/)*.*/)
But that does not work.

(.*?)(\/\/)*.*
(.*?)      Matches 0 chars at position 0 ("").
(\/\/)*    Matches 0 chars at position 0 ("").
.*         Matches 13 chars at position 0 ("foo bar //baz").
What you want:
( my $new_string = $string ) =~ s{//.*}{};
my $new_string = $string =~ s{//.*}{}r; # 5.14+
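As a quick check, the substitution above can be wrapped in a small helper and run on both sample strings. The sub names are hypothetical and not part of the answer. Note that s{//.*}{} leaves the space before the // in place; the second helper adds \s* to remove it as well, which is an assumption about the desired result based on the "foo bar" shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the string, then delete everything from the first "//" on.
sub strip_comment {
    my ($string) = @_;
    ( my $new_string = $string ) =~ s{//.*}{};
    return $new_string;
}

# Assumed variant: also remove whitespace directly before the "//".
sub strip_comment_trim {
    my ($string) = @_;
    ( my $new_string = $string ) =~ s{\s*//.*}{};
    return $new_string;
}

for my $string ( 'foo bar //baz', 'foo bar' ) {
    printf "[%s] -> [%s] / [%s]\n",
        $string, strip_comment($string), strip_comment_trim($string);
}
```

Both helpers leave a string without // unchanged, because a failed substitution does not modify its target.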

Related

How to match same length digits in a string using regex

I want to find a text (for example: stack) in a String that contains digits and chars (for example: s123t123a123c123k). The only rule is that between every character of the search key there should be the same amount of digits, so all of this should match:
search key: stack
Strings:
stack //0 digits between chars of stack
s7t3a9c0k //1 digit between chars of stack
s27t33a49c50k //2 digits between chars of stack
s127t312a229c330k //3 digits between chars of stack and so on for 4,5,6 digits...
If I could match same length digits then I can write something like: s[]*t[]*a[]*c[]*k if the regex for same length digit is [].
How to match same length digits in a string using regex?
In case you want to do that with Perl, you may use
m/s(\d+)t(??{ "\\d{".length($^N)."}" })a(??{ "\\d{".length($^N)."}" })c(??{ "\\d{".length($^N)."}" })k/
Or, if you want to match digits with [0-9]:
m/s([0-9]+)t(??{ "[0-9]{".length($^N)."}" })a(??{ "[0-9]{".length($^N)."}" })c(??{ "[0-9]{".length($^N)."}" })k/
In the Perl code, you will most likely want to build the pattern dynamically. See the full Perl code demo:
#!/usr/bin/perl
use warnings;
use strict;
use re 'eval'; # stackoverflow.com/a/16320570/3832970
my @input = split /\n/, <<"END";
s7t3a9c0k
s27t33a49c50k
s127t312a229c330k
s1t312a22c3300k
END
my $keyword = "stack";
my $pattern = substr($keyword, 0, 1) . '([0-9]+)' . join( '(??{ "[0-9]{".length($^N)."}" })', split("", substr($keyword, 1)) );
#my $pattern = substr($keyword, 0, 1) . '(\d+)' . join( '(??{ "\\\\d{".length($^N)."}" })', split("", substr($keyword, 1)) );
for my $input ( @input ) {
    if ($input =~ m/$pattern/) {
        print $input . ": PASS!\n";
    } else {
        print $input . ": FAIL!\n";
    }
}
Output:
s7t3a9c0k: PASS!
s27t33a49c50k: PASS!
s127t312a229c330k: PASS!
s1t312a22c3300k: FAIL!
The $pattern is built dynamically: substr($keyword, 0, 1) gets the first char, then ([0-9]+) is added, then join( '(??{ "[0-9]{".length($^N)."}" })', split("", substr($keyword, 1)) ) inserts (??{ "[0-9]{".length($^N)."}" }) between each pair of chars of the $keyword substring starting from the second char. The (??{ "[0-9]{".length($^N)."}" }) part acts as a [0-9]{X} pattern where X is the length of the most recently captured substring (here, the one matched by ([0-9]+)).
The use re 'eval'; is necessary to build the pattern dynamically. As per this answer, it will only affect the regular expressions in the file or in the curlies where it is used.
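If (??{ ... }) feels too heavy, a different technique (not the one used in the answer above) is to try each possible gap width in turn and build an ordinary pattern for it. The following is only a sketch under that assumption; the sub name matches_with_equal_gaps is hypothetical. Unlike the ([0-9]+) version, it also accepts zero digits between the characters, as in the plain string stack.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the answer: true if $input contains the
# characters of $keyword separated by digit runs of one common length
# (a length of 0 is allowed).
sub matches_with_equal_gaps {
    my ( $keyword, $input ) = @_;
    my @chars = map { quotemeta($_) } split //, $keyword;
    for my $n ( 0 .. length($input) ) {
        # e.g. for $n == 2: s[0-9]{2}t[0-9]{2}a[0-9]{2}c[0-9]{2}k
        my $pattern = join "[0-9]{$n}", @chars;
        return 1 if $input =~ /$pattern/;
    }
    return 0;
}

for my $input (qw( stack s7t3a9c0k s27t33a49c50k s127t312a229c330k s1t312a22c3300k )) {
    print "$input: ", ( matches_with_equal_gaps( 'stack', $input ) ? "PASS!" : "FAIL!" ), "\n";
}
```

This trades one clever pattern for up to length($input) + 1 simple ones, which is fine for short strings and needs no use re 'eval'.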

exactly once from a set of characters perl using regex

How can I check, using a Perl regexp, that exactly one character from a group of characters is present? Suppose that, of the five characters (abcde), I want to check that only one occurs, though it may occur multiple times. I have tried quantifiers, but they do not work for a set of characters.
You could use the following regex match:
/
   ^
   [^a-e]*+
   (?: a [^bcde]*+
   |   b [^acde]*+
   |   c [^abde]*+
   |   d [^abce]*+
   |   e [^abcd]*+
   )
   \z
/x
The following is a simpler pattern that might be less efficient:
/ ^ [^a-e]*+ ([a-e]) (?: \1|[^a-e] )*+ \z /x
A non-regex solution might be simpler.
# Count the number of instances of each letter.
my %chars;
++$chars{$_} for split //;
# Count how many of [a-e] are found.
my $count = 0;
++$count for grep $chars{$_}, qw( a b c d e );
$count == 1
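For a quick sanity check, the simpler backreference pattern above can be wrapped in a sub and tried on a few strings. The sub name and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if exactly one distinct letter from a-e
# occurs in the string (any number of times).
sub has_exactly_one_of_a_to_e {
    my ($string) = @_;
    return $string =~ / ^ [^a-e]*+ ([a-e]) (?: \1 | [^a-e] )*+ \z /x ? 1 : 0;
}

for my $string (qw( xaay aaa xaby xyz )) {
    printf "%-5s %s\n", $string, has_exactly_one_of_a_to_e($string) ? 'yes' : 'no';
}
```

Here xaay and aaa pass, xaby fails because both a and b occur, and xyz fails because none of a-e occurs at all.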
You can use a regex to return a list of matches, then store the result in an array.
my @arr = "abcdeaa" =~ /a/g; print scalar(@arr) . "\n";
prints 3
my @arr = "bcde" =~ /a/g; print scalar(@arr) . "\n";
prints 0
If you use scalar(@arr), it returns the length of the array.

How to match string that contain exact 3 time occurrence of special character in perl

I have tried a few methods to match a word that contains exactly 3 slashes, but none of them work. Below is an example:
my @array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t );
foreach my $string (@array) {
    if ( $string =~ /^\/{3}/ ) {
        print " yes, word with 3 / found !\n";
        print "$string\n";
    }
    else {
        print " no word contain 3 / found\n";
    }
}
A few patterns I tried, none of which work:
$string =~ /^\/{3}/;
$string =~ /^(\w+\/\w+\/\w+\/\w+)/;
$string =~ /^(.*\/.*\/.*\/.*)/;
Any other way i can match this type of string and print the string?
Match a / globally and compare the number of matches with 3
if ( ( () = m{/}g ) == 3 ) { say "Matched 3 times" }
where the =()= operator is a play on context, forcing list context on its right side but returning the number of elements of that list when scalar context is provided on its left side.
If you are uncomfortable with such a syntax stretch then assign to an array
if ( ( my @m = m{/}g ) == 3 ) { say "Matched 3 times" }
where the subsequent comparison evaluates it in the scalar context.
You are trying to match three consecutive / and your string doesn't have that.
The pattern you need (with whitespace added) is
^ [^/]* / [^/]* / [^/]* / [^/]* \z
or
^ [^/]* (?: / [^/]* ){3} \z
Your second attempt was close, but using ^ without \z made it so you checked for string starting with your pattern.
Solutions:
say for grep { m{^ [^/]* (?: / [^/]* ){3} \z}x } @array;
or
say for grep { ( () = m{/}g ) == 3 } @array;
or
say for grep { tr{/}{} == 3 } @array;
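To convince yourself that the three tests agree, they can be run side by side on the sample array from the question. The three sub names below are hypothetical wrappers, added only for this comparison.

```perl
use strict;
use warnings;

my @array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t );

# Hypothetical wrappers around the three tests shown above.
sub three_by_pattern {
    my ($s) = @_;
    return $s =~ m{^ [^/]* (?: / [^/]* ){3} \z}x ? 1 : 0;
}

sub three_by_count {
    my ($s) = @_;
    my $n = () = $s =~ m{/}g;    # list assignment in scalar context yields the match count
    return $n == 3 ? 1 : 0;
}

sub three_by_tr {
    my ($s) = @_;
    return ( $s =~ tr{/}{} ) == 3 ? 1 : 0;    # tr with an empty replacement only counts
}

for my $s (@array) {
    printf "%-16s %d %d %d\n", $s, three_by_pattern($s), three_by_count($s), three_by_tr($s);
}
```

Only abc/ab1/abc/abc has exactly three slashes, so it is the only string for which all three columns show 1.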
You need to match
a slash
surrounded by some non-slashes ([^\/]*)
repeating the match exactly three times
and enclosing the whole triple in start-of-line and end-of-line anchors:
$string =~ /^(?:[^\/]*\/[^\/]*){3}$/;
if ( $string =~ /\/.*\/.*\// and $string !~ /\/.*\/.*\/.*\// )

Codegolf regex match

On Code Golf I found this answer: https://codegolf.stackexchange.com/a/34345/29143 , which contains this Perl one-liner:
perl -e '(q x x x 10) =~ /(?{ print "hello\n" })(?!)/;'
Running it with -MO=Deparse gives:
'          ' =~ /(?{ print "hello\n" })(?!)/;
^^^^^^^^^^^^
  10 spaces
The explanation says that (?!) never matches, so the regex tries to match at each character. OK, but why does it print hello 11 times and not 10?
Regular expressions start matching at positions, which include the position before each character and also the one after the last character.
The following zero-width regular expression matches before each of the 5 characters of the string, and also after the last one, which demonstrates why you got 11 prints instead of just 10.
use strict;
use warnings;
my $string = 'ABCDE';
# Zero width Regular expression
$string =~ s//x/g;
print $string;
Outputs:
xAxBxCxDxEx
^ ^ ^ ^ ^ ^
1 2 3 4 5 6
It's because when you have a string of n characters there are n+1 positions in the string where the pattern is tested.
example with "abc":
a b c
^ ^ ^ ^
| | | |
| | | +--- end of the string
| | +----- position of c
| +------- position of b
+--------- position of a
The position of the end of the string can be a little counter-intuitive, but this position exists. To illustrate this fact, consider the pattern /c$/ that will succeed with the example string. (think of the position in the string when the end anchor is tested). Or this other one /(?<=c)/ that succeeds in the last position.
Take a look at the following:
$x = "abc"; $x =~ s/.{0}/x/; print("$x\n"); # xabc
$x = "abc"; $x =~ s/.{1}/x/; print("$x\n"); # xbc
$x = "abc"; $x =~ s/.{2}/x/; print("$x\n"); # xc
$x = "abc"; $x =~ s/.{3}/x/; print("$x\n"); # x
Nothing surprising. You can match anywhere between 0 and 3 of the three characters, and place an x at the position where you left off. That's four positions for three characters.
Also consider 'abc' =~ /^abc\z/.
Starting at position 0, ^ matches zero chars.
Starting at position 0, a matches one char.
Starting at position 1, b matches one char.
Starting at position 2, c matches one char.
Starting at position 3, \z matches zero chars.
Again, that's a total of four positions needed for a three character string.
Only zero-width assertions can match at the last position, but there are plenty of those (^, \z, \b, (?=...), (?!...), (?<=...), (?:...)?, etc).
You can think of the positions as the edges of the characters, if that helps.
|a|b|c|
0 1 2 3
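The positions can also be listed directly. The sketch below collects pos() after each match of an empty group, which matches once at every position; the sub name match_positions is hypothetical, and using (?:) instead of the (?!) trick from the question is a choice made here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list every position at which a zero-width
# pattern matches in the given string.
sub match_positions {
    my ($string) = @_;
    my @positions;
    push @positions, pos($string) while $string =~ /(?:)/g;
    return @positions;
}

print join( ' ', match_positions('abc') ), "\n";    # 0 1 2 3

my @spaces = match_positions( ' ' x 10 );
print scalar(@spaces), "\n";                        # 11
```

A string of 10 spaces has 11 positions, which lines up with the 11 lines printed by the one-liner.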

Regular expressions to match protected separated values

I'd like a regular expression that matches separated values, where some protected values can contain the separator character.
For instance:
"A,B,{C,D,E},F"
would give:
"A"
"B"
"{C,D,E}"
"F"
Please note the protected values can be nested, as follows:
"A,B,{C,D,{E,F}},G"
would give:
"A"
"B"
"{C,D,{E,F}}"
"G"
I already coded that feature with a character iteration as follows:
sub Parse
{
    my @item;
    my $curly;
    my $string;
    foreach (split //)
    {
        $_ eq "{" and ++$curly;
        $_ eq "}" and --$curly;
        if (!$curly && /[,:]/)
        {
            push @item, $string;
            undef $string;
            next;
        }
        $string .= $_;
    }
    push @item, $string;
    return @item;
}
But it would definitively be so much nicer with a regexp.
A regex that supports nesting would look as follows:
my @items;
push @items, $1 while
   /
      (?: ^ | \G , )
      (
         (?: [^,{}]+
         |   (
                \{
                (?: [^{}]
                |   (?2)
                )*
                \}
             )
         |   # Empty
         )
      )
   /xg;
$ perl -E'$_ = shift; ... say for @items;' 'A,B,{C,D,{E,F}},G'
A
B
{C,D,{E,F}}
G
Assumes valid input since it can't extract and validate at the same time. (Well, not without making things really messy.)
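For reuse, the same pattern can be put into a sub that works on an argument instead of $_. This is only a sketch: the name split_protected is hypothetical, and like the original it assumes the braces are balanced. The recursive (?2) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above; assumes balanced braces.
sub split_protected {
    my ($string) = @_;
    my @items;
    push @items, $1 while $string =~ /
        (?: ^ | \G , )                          # start of string, or the next comma
        (
            (?: [^,{}]+                         # a plain value
            |   ( \{ (?: [^{}] | (?2) )* \} )   # a braced value, recursing for nesting
            |   # Empty
            )
        )
    /xg;
    return @items;
}

print "$_\n" for split_protected('A,B,{C,D,{E,F}},G');
```

Called in list context it returns the values in order, so the loop prints A, B, {C,D,{E,F}} and G on separate lines.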
Improved from nhahtdh's answer.
$_ = "A,B,{C,D,E},F";
while ( m/(\{.*?\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
    print "[$&]\n";
}
Improved it again. Please look at this one!
$_ = "A,B,{C,D,{E,F}},G";
while ( m/(\{.*\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
    print "$&\n";
}
It will get:
A
B
{C,D,{E,F}}
G
$a = "A,B,{C,D,E},F";
while ($a =~ s/(\{[\{\}\w,]+\}|\w)//) {
    push (@res, $1);
}
print "\@res: @res\n";
Result:
@res: A B {C,D,E} F
Explanation: we try to match either the protected block \{[\{\}\w,]+\} or just a single character \w successively in a loop, deleting it from the original string if there is a match. Every time there is a match, we store it (meaning the $1) in the array, et voilà!
Here is a regex in bash:
chronos@localhost / $ echo "A,B,{C,D,E},F" | grep -oE "(\{[^\}]*\}|[A-Z])"
A
B
{C,D,E}
F
Try this regex. Use the regex to match and extract the token.
/(\{.*?\}|(?<=,|^).*?(?=,|$))/
I have not tested this code in Perl.
This makes an assumption about how the regex engine works (I assume that it will try to match the first part \{.*?\} before the second part). I also assume that there are no nested curly brackets and no badly paired curly brackets.
$s = "A,B,{C,D,E},F";
@t = split /,(?=.*\{)|,(?!.*\})/, $s;