Exactly one character from a set of characters in Perl using regex

How can I check, using a Perl regex, that exactly one character from a group of characters occurs? Suppose that, from (abcde), I want to check that only one of these 5 characters has occurred; that one character may occur multiple times. I have tried quantifiers, but they do not work for a set of characters.

You could use the following regex match:
/
^
[^a-e]*+
(?: a [^bcde]*+
| b [^acde]*+
| c [^abde]*+
| d [^abce]*+
| e [^abcd]*+
)
\z
/x
The following is a simpler pattern that might be less efficient:
/ ^ [^a-e]*+ ([a-e]) (?: \1|[^a-e] )*+ \z /x
A non-regex solution might be simpler.
# Count the number of instances of each letter.
my %chars;
++$chars{$_} for split //;
# Count how many of [a-e] are found.
my $count = 0;
++$count for grep $chars{$_}, qw( a b c d e );
$count == 1
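As a rough check, the simpler pattern above can be wrapped in a helper and run against a few strings. The sub name only_one_of and the sample inputs are made up for illustration; this is a sketch, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true if exactly one distinct letter from a-e
# occurs in the string (that letter may repeat any number of times).
sub only_one_of {
    my ($str) = @_;
    return $str =~ / ^ [^a-e]*+ ([a-e]) (?: \1 | [^a-e] )*+ \z /x ? 1 : 0;
}

# Illustrative inputs, not taken from the question.
for my $str (qw( xyz axa abx bbb )) {
    printf "%-4s %s\n", $str, only_one_of($str) ? 'match' : 'no match';
}
```

With these inputs, only axa and bbb should be reported as matches: xyz contains none of the letters, and abx contains two different ones.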

You can use a regex to return a list of matches and store the result in an array.
my @arr = "abcdeaa" =~ /a/g; print scalar(@arr) . "\n";
prints 3
@arr = "bcde" =~ /a/g; print scalar(@arr) . "\n";
prints 0
Using scalar(@arr) returns the length of the array.

Related

How to match a string that contains exactly 3 occurrences of a special character in Perl

I have tried a few methods to match a word that contains exactly 3 slashes, but none of them work. Below is the example:
my @array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t );
foreach my $string (@array) {
    if ( $string =~ /^\/{3}/ ) {
        print " yes, word with 3 / found !\n";
        print "$string\n";
    }
    else {
        print " no word contain 3 / found\n";
    }
}
These are the matches I tried; none of them work:
$string =~ /^\/{3}/;
$string =~ /^(\w+\/\w+\/\w+\/\w+)/;
$string =~ /^(.*\/.*\/.*\/.*)/;
Is there any other way I can match this type of string and print it?
Match a / globally and compare the number of matches with 3
if ( ( () = m{/}g ) == 3 ) { say "Matched 3 times" }
where the =()= operator is a play on context, forcing list context on its right side but returning the number of elements of that list when scalar context is provided on its left side.
If you are uncomfortable with such a syntax stretch then assign to an array
if ( ( my @m = m{/}g ) == 3 ) { say "Matched 3 times" }
where the subsequent comparison evaluates it in the scalar context.
You are trying to match three consecutive / and your string doesn't have that.
The pattern you need (with whitespace added) is
^ [^/]* / [^/]* / [^/]* / [^/]* \z
or
^ [^/]* (?: / [^/]* ){3} \z
Your second attempt was close, but using ^ without \z made it so you checked for string starting with your pattern.
Solutions:
say for grep { m{^ [^/]* (?: / [^/]* ){3} \z}x } @array;
or
say for grep { ( () = m{/}g ) == 3 } @array;
or
say for grep { tr{/}{} == 3 } @array;
You need to match
a slash
surrounded by some non-slashes ([^\/]*),
repeating the match exactly three times,
and enclosing the whole triple in start-of-line and end-of-line anchors:
$string =~ /^(?:[^\/]*\/[^\/]*){3}$/;
if ( $string =~ /\/.*\/.*\// and $string !~ /\/.*\/.*\/.*\// )
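To see that the approaches in the answers above agree, the following sketch runs the anchored regex, the match-count idiom, and tr/// side by side on the question's sample words. The sub names by_regex, by_count and by_tr are illustrative only.

```perl
use strict;
use warnings;

# Sample words from the question.
my @array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t );

# Three ways to test "exactly three slashes"; the sub names are illustrative.
sub by_regex { my ($s) = @_; return $s =~ m{^ [^/]* (?: / [^/]* ){3} \z}x ? 1 : 0 }
sub by_count { my ($s) = @_; return ( () = $s =~ m{/}g ) == 3 ? 1 : 0 }
sub by_tr    { my ($s) = @_; return ( $s =~ tr{/}{} ) == 3 ? 1 : 0 }

# Print one line per word: the word, then the three results (1 or 0).
for my $s (@array) {
    my @results = ( by_regex($s), by_count($s), by_tr($s) );
    print "$s: @results\n";
}
```

Only abc/ab1/abc/abc has exactly three slashes, so it is the only word for which all three checks should return 1.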

Perl regex to "Count number of whitespace" inside parens: () or double quotes ""

I want to count the whitespace characters inside () or "".
Can that be done using a Perl regex?
Example:
Let the string be:
abcd(efg h i)jkl -> count = 2
abc def(dfdsff fd)dfdsf -> count = 1
( )(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)|( )(?=[^(]*\))
You can try this. Count the number of groups. See the demo:
https://regex101.com/r/zM7yV5/6
Here we find a space that is inside "" or (). Once we find one, we capture it; the number of groups captured is the answer.
( )(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$) ==> here the lookahead ensures that there is a " ahead of the space, followed by pairs of "". In short, there is an odd number of " ahead of it. That lets us pick out a space between "", since it will have an odd number of " ahead of it.
( )(?=[^(]*\)) ==> here the lookahead requires a ) ahead of the space with no ( before it. This lets us capture a space between (), though it will not work for nested ().
You can use a regex to find and capture the bracketed string, and tr/// to count the number of whitespace characters in the captured string.
This program demonstrates the principle. It reads the strings from the DATA file handle (I have used the sample data from your question, but duplicated it to provide samples that contain double quotes) and assumes that there is only one pair of parentheses or quotation marks in each string.
I have used the branch reset construct (?| ... ) so that the bracketed contents are captured in $1 regardless of whether it was the parentheses or the double quotes that matched.
It is a simple matter to modify the code if the assumptions are untrue, but you have been asked about it in comments and haven't provided an answer.
use strict;
use warnings;
use List::Util 'max';
my $re = qr/ (?|
    \( ( [^()]+ ) \)
    |
    " ( [^"]+ ) "
) /x;
my @data = <DATA>;
chomp @data;
my $width = max map length, @data;
for (@data) {
chomp;
if ( $_ =~ $re ) {
my $count = $1 =~ tr/\x20\t//;
printf "%-*s -> count = %d\n", $width, $_, $count;
}
}
__DATA__
abcd(efg h i)jkl
abc def(dfdsff fd)dfdsf
abcd"efg h i"jkl
abc def"dfdsff fd"dfdsf
output
abcd(efg h i)jkl -> count = 2
abc def(dfdsff fd)dfdsf -> count = 1
abcd"efg h i"jkl -> count = 2
abc def"dfdsff fd"dfdsf -> count = 1
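If the one-pair-per-string assumption does not hold, one possible modification is to loop over every bracketed group with /g and add up the counts. The sketch below is only an illustration: the sub name count_inner_whitespace and the second sample string are invented, and nested parentheses are still not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: total spaces and tabs found inside every
# (...) or "..." group of the string. Nested parentheses are not handled.
sub count_inner_whitespace {
    my ($str) = @_;
    my $total = 0;
    # Branch reset (?| ... ) puts the group contents in $1 for both branches.
    while ( $str =~ / (?| \( ( [^()]* ) \) | " ( [^"]* ) " ) /xg ) {
        my $inside = $1;
        $total += ( $inside =~ tr/\x20\t// );   # tr counts without modifying
    }
    return $total;
}

print count_inner_whitespace('abcd(efg h i)jkl'), "\n";          # 2
print count_inner_whitespace('a b (c d) e "f g h" (i)'), "\n";   # 3
```

In the second call the spaces outside the groups are ignored; only the one inside (c d) and the two inside "f g h" are counted.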

Codegolf regex match

On Code Golf I found this answer: https://codegolf.stackexchange.com/a/34345/29143 , which contains this Perl one-liner:
perl -e '(q x x x 10) =~ /(?{ print "hello\n" })(?!)/;'
Running it with -MO=Deparse gives:
'          ' =~ /(?{ print "hello\n" })(?!)/;
^^^^^^^^^^^^
10 spaces
The explanation said that (?!) never matches, so the regex tries to match at each character. OK, but why does it print hello 11 times and not 10 times?
Regular expressions start matching at positions, which include not only the position before each character but also the one after the last character.
The following zero-width regular expression will match before each of the 5 characters of the string, but also after the last one, which demonstrates why you got 11 prints instead of just 10.
use strict;
use warnings;
my $string = 'ABCDE';
# Zero width Regular expression
$string =~ s//x/g;
print $string;
Outputs:
xAxBxCxDxEx
^ ^ ^ ^ ^ ^
1 2 3 4 5 6
It's because when you have a string of n characters there are n+1 positions in the string where the pattern is tested.
example with "abc":
a b c
^ ^ ^ ^
| | | |
| | | +--- end of the string
| | +----- position of c
| +------- position of b
+--------- position of a
The position of the end of the string can be a little counter-intuitive, but this position exists. To illustrate this fact, consider the pattern /c$/ that will succeed with the example string. (think of the position in the string when the end anchor is tested). Or this other one /(?<=c)/ that succeeds in the last position.
Take a look at the following:
$x = "abc"; $x =~ s/.{0}/x/; print("$x\n"); # xabc
$x = "abc"; $x =~ s/.{1}/x/; print("$x\n"); # xbc
$x = "abc"; $x =~ s/.{2}/x/; print("$x\n"); # xc
$x = "abc"; $x =~ s/.{3}/x/; print("$x\n"); # x
Nothing surprising. You can match anywhere between 0 and 3 of the three characters, and place an x at the position where you left off. That's four positions for three characters.
Also consider 'abc' =~ /^abc\z/.
Starting at position 0, ^ matches zero chars.
Starting at position 0, a matches one char.
Starting at position 1, b matches one char.
Starting at position 2, c matches one char.
Starting at position 3, \z matches zero char.
Again, that's a total of four positions needed for a three character string.
Only zero-width assertions can match at the last position, but there are plenty of those (^, \z, \b, (?=...), (?!...), (?<=...), (?:...)?, etc).
You can think of the positions as the edges of the characters, if that helps.
|a|b|c|
0 1 2 3
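The n+1 positions can also be listed directly. The following sketch records pos() after each zero-width match on a three-character string; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = 'abc';

# (?:) is an always-matching, zero-width pattern. A bare // is avoided
# because Perl treats an empty pattern as "reuse the last successful pattern".
my @positions;
push @positions, pos($string) while $string =~ /(?:)/g;

print "length:    ", length($string), "\n";   # 3
print "positions: @positions\n";             # 0 1 2 3
```

Four positions are reported for three characters, the last one being the end of the string.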

Regular expressions to match protected separated values

I'd like to have a regular expression that matches separated values, where some protected values can contain the separator character.
For instance:
"A,B,{C,D,E},F"
would give:
"A"
"B"
"{C,D,E}"
"F"
Please note the protected values can be nested, as follows:
"A,B,{C,D,{E,F}},G"
would give:
"A"
"B"
"{C,D,{E,F}}"
"G"
I already coded that feature with character iteration as follows:
sub Parse
{
    my @item;
    my $curly;
    my $string;
    foreach (split //)
    {
        $_ eq "{" and ++$curly;
        $_ eq "}" and --$curly;
        if (!$curly && /[,:]/)
        {
            push @item, $string;
            undef $string;
            next;
        }
        $string .= $_;
    }
    push @item, $string;
    return @item;
}
But it would definitely be so much nicer with a regexp.
A regex that supports nesting would look as follows:
my @items;
push @items, $1 while
/
(?: ^ | \G , )
(
(?: [^,{}]+
| (
\{
(?: [^{}]
| (?2)
)*
\}
)
| # Empty
)
)
/xg;
$ perl -E'$_ = shift; ... say for @items;' 'A,B,{C,D,{E,F}},G'
A
B
{C,D,{E,F}}
G
Assumes valid input since it can't extract and validate at the same time. (Well, not without making things really messy.)
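For experimenting with the recursive pattern, it can be wrapped in a sub as in the sketch below. The name split_protected is hypothetical, and, like the answer above, the sketch assumes the braces in the input are balanced.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the recursive pattern; assumes balanced braces.
sub split_protected {
    my ($str) = @_;
    my @items;
    push @items, $1 while $str =~
        /
            (?: ^ | \G , )          # start of string, or a comma right after the previous item
            (
                (?: [^,{}]+         # plain item
                |   (               # group 2: a braced item
                        \{
                        (?: [^{}] | (?2) )*   # non-brace chars or a nested braced item
                        \}
                    )
                |                   # empty item
                )
            )
        /xg;
    return @items;
}

print "$_\n" for split_protected('A,B,{C,D,{E,F}},G');
```

For the nested sample from the question this should print A, B, {C,D,{E,F}} and G on separate lines.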
Improved from nhahtdh's answer.
$_ = "A,B,{C,D,E},F";
while ( m/(\{.*?\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "[$&]\n";
}
Improved it again. Please look at this one!
$_ = "A,B,{C,D,{E,F}},G";
while ( m/(\{.*\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "$&\n";
}
It will get:
A
B
{C,D,{E,F}}
G
$a = "A,B,{C,D,E},F";
while ($a =~ s/(\{[\{\}\w,]+\}|\w)//) {
    push (@res, $1);
}
print "\@res: @res\n"
Result:
@res: A B {C,D,E} F
Explanation: we try to match either the protected block \{[\{\}\w,]+\} or just a single character \w successively in a loop, deleting it from the original string if there is a match. Every time there is a match, we store it (meaning the $1) in the array, et voilà!
Here is a regex in bash:
chronos@localhost / $ echo "A,B,{C,D,E},F" | grep -oE "(\{[^\}]*\}|[A-Z])"
A
B
{C,D,E}
F
Try this regex. Use the regex to match and extract the token.
/(\{.*?\}|(?<=,|^).*?(?=,|$))/
I have not tested this code in Perl.
There is an assumption here about how the regex engine works (I assume that it will try to match the first part \{.*?\} before the second part). I also assume that there are no nested curly brackets and no badly paired curly brackets.
$s = "A,B,{C,D,E},F";
@t = split /,(?=.*{)|,(?!.*})/, $s;

Regular expression that matches the longest repeating sequence

I want to match the longest sequence that is repeating at least once
Having:
T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced
the result should be: pending-cancel-replace_replaced
Try this
(.+)(?=.*\1)
See it here on Regexr
This will match any character sequence with at least one character, that is repeated later on in the string.
You would need to store your matches and decide which one is the longest afterwards.
This solution requires your regex flavour to support backreferences and lookaheads.
It will match any character sequence with at least one character (.+) and store it in group 1 because of the brackets around it. The next step is the positive lookahead (?=.*\1), which is true if the captured sequence occurs again at a later point in the string.
Here is a Perl script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $s = q/T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced/;
my $max = 0;
my $seq = '';
while($s =~ /(.+)(?=.*\1)/g) {
if(length$1 > $max) {
$max = length $1;
$seq = $1;
}
}
say "longest sequence : $seq, length = $max"
output:
longest sequence : _pending-cancel-replace_replaced, length = 32
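One caveat with a consuming loop like this: each match advances past the text it matched, so a repeat that begins inside an earlier match can be missed. A possible variation, sketched below under that reading, puts the capture inside a lookahead so that every start position is tried. The sub name longest_repeat and the short sample string are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: longest substring that occurs again later in the string.
# The capture sits inside a lookahead, so nothing is consumed and every
# start position is tried.
sub longest_repeat {
    my ($str) = @_;
    my $best = '';
    while ( $str =~ /(?=(.+).*\1)/g ) {
        $best = $1 if length($1) > length($best);
    }
    return $best;
}

print longest_repeat('abcd_ab_bcd'), "\n";   # bcd
```

In abcd_ab_bcd the consuming pattern matches ab first and then continues from c, so it never considers bcd; the lookahead version does. On the string from the question both approaches arrive at _pending-cancel-replace_replaced.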
I have to admit that this one got me thinking. It was obvious that positive lookahead is absolutely necessary to solve this with regex. Anyhow here is how it would work in Java:
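// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Returns the longest captured sequence that occurs again later in the input.
public static String biggestOccurance(String input){
    Pattern p = Pattern.compile("(.+)(?=.*\\1)");
    Matcher m = p.matcher(input);
    String longestOccurence = "";
    while(m.find()){
        if(longestOccurence.length() < m.group(1).length()) longestOccurence = m.group(1);
    }
    return longestOccurence;
}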
The thing that got me stuck was the \\1. I knew that you could refer to a backreference in Java with $1, but if you replace $1 with \\1 it will not work.
Will have to dig into that.
Cheers,Eugene.
Using Perl you can do:
s='T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced'
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g'
Which gives:
T_
send_
ac
k-
n
e
w_
a
mend
_pending-cancel-replace_replaced
_
cancel
_
p
e
n
d
in
g-
c
a
nce
l
-replace
_re
placed
Then you need to post process it in any language or script to get longest text.
One Possible Post Processing of repeated string can be using awk:
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g' | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }'
Which prints:
_pending-cancel-replace_replaced
PS: Note longest string here is _pending-cancel-replace_replaced