How to match same length digits in a string using regex - regex

I want to find a text (for example: stack) in a String that contains digits and chars (for example: s123t123a123c123k). The only rule is that between every character of the search key there should be the same amount of digits, so all of this should match:
search key: stack
Strings:
stack //0 digit between chars of stack
s7t3a9c0k //1 digit between chars of stack
s27t33a49c50k //2 digit between chars of stack
s127t312a229c330k //3 digit between chars of stack and so on for 4,5,6 digits...
If I could match same length digits then I can write something like: s[]*t[]*a[]*c[]*k if the regex for same length digit is [].
How to match same length digits in a string using regex?

In case you want to do that with Perl, you may use
m/s(\d+)t(??{ "\\d{".length($^N)."}" })a(??{ "\\d{".length($^N)."}" })c(??{ "\\d{".length($^N)."}" })k/
Or, if you want to match digits with [0-9]:
m/s([0-9]+)t(??{ "[0-9]{".length($^N)."}" })a(??{ "[0-9]{".length($^N)."}" })c(??{ "[0-9]{".length($^N)."}" })k/
In the Perl code, you will most likely want to build the pattern dynamically. See the full Perl code demo:
#!/usr/bin/perl
use warnings;
use strict;
use re 'eval'; # stackoverflow.com/a/16320570/3832970
my #input = split /\n/, <<"END";
s7t3a9c0k
s27t33a49c50k
s127t312a229c330k
s1t312a22c3300k
END
my $keyword = "stack";
my $pattern = substr($keyword, 0, 1) . '([0-9]+)' . join( '(??{ "[0-9]{".length($^N)."}" })', split("", substr($keyword, 1)) );
#my $pattern = substr($keyword, 0, 1) . '(\d+)' . join( '(??{ "\\\\d{".length($^N)."}" })', split("", substr($keyword, 1)) );
for my $input ( #input ) {
if ($input =~ m/$pattern/) {
print $input . ": PASS!\n";
} else {
print $input . ": FAIL!\n"
}
}
Output:
s7t3a9c0k: PASS!
s27t33a49c50k: PASS!
s127t312a229c330k: PASS!
s1t312a22c3300k: FAIL!
The $pattern is built dynamically: substr($keyword, 0, 1) gets the first char, then ([0-9]+) is added, then join( '(??{ "[0-9]{".length($^N)."}" })', split("", substr($keyword, 1)) adds the following: it inserts (??{ "[0-9]{".length($^N)."}" }) in between each char of the $keyword substring from the second char. The (??{ "[0-9]{".length($^N)."}" }) part acts as a \d{X} pattern where X is the length of the most recent captured substring (it was ([0-9]+)).
The use re 'eval'; is necessary to build the pattern dynamically. As per this answer, it will only affect the regular expressions in the file or in the curlies where it is used.

Related

exactly once from a set of characters perl using regex

how to check exactly one character from a group of characters in perl using regexp.Suppose from (abcde) i want to check if out of all these 5 characters only one has occured which can occur multiple times.I have tried quantifiers but it does not work for a set of characters.
You could use the following regex match:
/
^
[^a-e]*+
(?: a [^bcde]*+
| b [^acde]*+
| c [^abde]*+
| d [^abce]*+
| e [^abcd]*+
)
\z
/x
The following is a simpler pattern that might be less efficient:
/ ^ [^a-e]*+ ([a-e]) (?: \1|[^a-e] )*+ \z /x
A non-regex solution might be simpler.
# Count the number of instances of each letter.
my %chars;
++$chars{$_} for split //;
# Count how many of [a-e] are found.
my $count = 0;
++$count for grep $chars{$_}, qw( a b c d e );
$count == 1
you can use regex to return a list of matches. then you can store the result in an array.
my #arr = "abcdeaa" =~ /a/g; print scalar #arr ."\n";
prints 3
my #arr = "bcde" =~ /a/g; print scalar #arr ."\n";
prints 0
if you use scalar #arr. it will return the length of the array.

How to match string that contain exact 3 time occurrence of special character in perl

I have try few method to match a word that contain exact 3 times slash but cannot work. Below are the example
#array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t )
foreach my $string (#array){
if ( $string =~ /^\/{3}/ ){
print " yes, word with 3 / found !\n";
print "$string\n";
}
else {
print " no word contain 3 / found\n";
}
Few macthing i try but none of them work
$string =~ /^\/{3}/;
$string =~ /^(\w+\/\w+\/\w+\/\w+)/;
$string =~ /^(.*\/.*\/.*\/.*)/;
Any other way i can match this type of string and print the string?
Match a / globally and compare the number of matches with 3
if ( ( () = m{/}g ) == 3 ) { say "Matched 3 times" }
where the =()= operator is a play on context, forcing list context on its right side but returning the number of elements of that list when scalar context is provided on its left side.
If you are uncomfortable with such a syntax stretch then assign to an array
if ( ( my #m = m{/}g ) == 3 ) { say "Matched 3 times" }
where the subsequent comparison evaluates it in the scalar context.
You are trying to match three consecutive / and your string doesn't have that.
The pattern you need (with whitespace added) is
^ [^/]* / [^/]* / [^/]* / [^/]* \z
or
^ [^/]* (?: / [^/]* ){3} \z
Your second attempt was close, but using ^ without \z made it so you checked for string starting with your pattern.
Solutions:
say for grep { m{^ [^/]* (?: / [^/]* ){3} \z}x } #array;
or
say for grep { ( () = m{/}g ) == 3 } #array;
or
say for grep { tr{/}{} == 3 } #array;
You need to match
a slash
surrounded by some non-slashes (^(?:[^\/]*)
repeating the match exactly three times
and enclosing the whole triple in start of line and and of line anchors:
$string =~ /^(?:[^\/]*\/[^\/]*){3}$/;
if ( $string =~ /\/.*\/.*\// and $string !~ /\/.*\/.*\/.*\// )

Perl regex: Substitution of everything but the pattern

In perl, I would like to substitute a negated class character set (everything but the pattern) by nothing, to keep only the expected string. Normally, this approach should work, but in my case it isn't :
$var =~ s/[^PATTERN]//g;
the original string:
$string = '<iframe src="https://foo.bar/embed/b74ed855-63c9-4795-b5d5-c79dd413d613?autoplay=1&context=cGF0aD0yMSwx</iframe>';
wished pattern to get: b74ed855-63c9-4795-b5d5-c79dd413d613
(5 hex number groups split with 4 dashes)
my code:
$pattern2keep = "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}";
(should match only : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (5 hex number groups split with 4 dashes) , char length : 8-4-4-4-12 )
The following should substitute everything but the pattern by nothing, but in fact it does not.
$string =~ s/[^$pattern2keep]//g;
What am I doing wrong please? Thanks.
A character class matches a single character equal to any one of the characters in the class. If the class begins with a caret then the class is negated, so it matches any one character that isn't any of the characters in the class
If $pattern2keep is [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} then [^$pattern2keep] will match any character other than -, 0, 1, 2, 4, 8, 9, [, ], a, f, {, or }
You need to capture the substring, like this
use strict;
use warnings 'all';
use feature 'say';
my $string = '<iframe src="https://foo.bar/embed/b74ed855-63c9-4795-b5d5-c79dd413d613?autoplay=1&context=cGF0aD0yMSwx</iframe>';
my $pattern_to_keep = qr/ \p{hex}{8} (?: - \p{hex}{4} ){3} - \p{hex}{12} /x;
my $kept;
$kept = $1 if $string =~ /($pattern_to_keep)/;
say $kept // 'undef';
output
b74ed855-63c9-4795-b5d5-c79dd413d613

Pattern matching in perl (Lookahead and Condition on word Index)

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.
Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions
You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}
Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}
First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

How do I write a Perl regular expression that will match a string with only these characters?

I am pretty new to regular expressions. I want to write a regular expression which validates whether the given string has only certain characters. If the string has any other characters than these it should not be matched.
The characters I want are:
& ' : , / - ( ) . # " ; A-Z a-z 0-9
Try this:
$val =~ m/^[&':,\/\-().#";A-Za-z0-9]+$/;
$val will match if it has at least one character and consists entirely of characters in that character set. An empty string will not be matched (if you want an empty string to match, change the last + to a *).
You can test it out yourself:
# Here's the file contents. $ARGV[0] is the first command-line parameter.
# We print out the matched text if we have a match, or nothing if we don't.
[/tmp]> cat regex.pl
$val = $ARGV[0];
print ($val =~ m/^[&':,\/\-().#";A-Za-z0-9]+$/g);
print "\n";
Some examples:
# Have to escape ( and & in the shell, since they have meaning.
[/tmp]> perl regex.pl a\(bc\&
a(bc&
[/tmp]> perl regex.pl abbb%c
[/tmp]> perl regex.pl abcx
abcx
[/tmp]> perl regex.pl 52
52
[/tmp]> perl regex.pl 5%2
/\A[A-Za-z0-9&':,\().#";-]+\z/
Those so called special characters are not special in a character class.
There are two main approaches to construct a regular expression for this purpose. First is to make sure that all symbols are allowed. Another is to make sure that no symbols are not allowed. And you can also use the transliteration operator instead. Here's a benchmark:
use Benchmark 'cmpthese';
my #chars = ('0' .. '9', 'A' .. 'Z', 'a' .. 'z');
my $randstr = map $chars[rand #chars], 1 .. 16;
sub nextstr() { return $randstr++ }
cmpthese 1000000, {
regex1 => sub { nextstr =~ /\A["#&'(),\-.\/0-9:;A-Za-z]*\z/ },
regex2 => sub { nextstr !~ /[^"#&'(),\-.\/0-9:;A-Za-z]/ },
tr => sub { (my $dummy = nextstr) !~ y/"#&'(),\-.\/0-9:;A-Za-z/"#&'(),\-.\/0-9:;A-Za-z/c },
};
Results:
Rate regex1 regex2 tr
regex1 137552/s -- -41% -60%
regex2 231481/s 68% -- -32%
tr 341297/s 148% 47% --
/^[&':,/-().#";A-Za-z0-9]*$/