Generic Regular expression in Perl for the following text - regex

I need to match the following text with a regular expression in Perl.
PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00
The expression that I have written is:
(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)
But I want something more generic. By generic I mean that the pattern should allow any number of hyphens (-).
Maybe there is something like if - then in regex, i.e. if some character is present then look for some other thing. Can anyone please help me out?
More about my problem:
AB-ab
abc-mno-xyz
lmi-jlk-mno-xyz
......... and so on...!
I wish to match all such patterns. To be more precise, my string can be considered a set of any number of alphanumeric substrings with a hyphen ('-') as the delimiter. Feel free to use \w, since the substrings can contain uppercase letters, lowercase letters, digits and the underscore ('_').

You are looking for a regex with quantifiers (see perldoc perlre, section Quantifiers).
You have several possibilities:
/(\w+(?:-\w+)+)/ will match two or more groups of \w characters linked by hyphens (-). For example, AB-CD will match. Note that \w matches both upper- and lower-case letters, so an ordinary word like pre-owned will also match as a key.
/(\w+(?:-\w+){5})/ will match keys with exactly 6 groups. It's equivalent to the one you have.
/(\w+(?:-\w+){5,})/ will match keys with 6 groups or more.
If there is more than one key in the document, you can do an implicit loop in the regex with the /g option.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw{say};
use Data::Dumper;
my $text = "some text here PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 some text there";
my @matches = $text =~ /(\w+(?:-\w+)+)/g;
print Dumper(\@matches);
Result:
$VAR1 = [
'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00'
];
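To see how the three quantifier variants listed above differ, here is a minimal sketch that tests each one against a few strings. The patterns are anchored with ^ and $ (an addition not in the answer) so that the whole string has to fit, and the sub name matching_patterns and all sample strings except the first are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample keys; only the first comes from the question, the rest are invented.
my @samples = (
    'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00',    # 6 groups
    'abc-mno-xyz',                                      # 3 groups
    'AB-ab',                                            # 2 groups
    'plainword',                                        # 1 group, no hyphen
);

# Anchored versions of the three patterns (the anchors are an assumption).
my %patterns = (
    'two or more' => qr/^\w+(?:-\w+)+$/,
    'exactly six' => qr/^\w+(?:-\w+){5}$/,
    'six or more' => qr/^\w+(?:-\w+){5,}$/,
);

# Return the names (in sorted order) of the patterns that a string satisfies.
sub matching_patterns {
    my ($string) = @_;
    my @names = grep { $string =~ $patterns{$_} } sort keys %patterns;
    return @names;
}

for my $sample (@samples) {
    my @names = matching_patterns($sample);
    printf "%-45s %s\n", $sample, @names ? join(', ', @names) : 'no match';
}
```

Without the anchors, the {5} variant would also match the first six groups of a longer key, so whether to anchor depends on whether the key is the whole string or is embedded in other text.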

How about using split:
my $str = 'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00';
my @elem = split(/-/, $str);
Edit according to comments:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $str = 'Text before PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 text after';
my ($str2) = $str =~ /(\w+(?:-\w+)+)/;
my @elem = split(/-/, $str2);
say Dumper\@elem;
Output:
$VAR1 = [
'PS3XAY3N5SZ4K',
'XX_5C9F',
'S801',
'F04BN01K',
'00000',
'00'
];

Related

Perl regex exclude optional word from match

I have some strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract the data below from the example above.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    # Print only when the match succeeds, so a stale $1 is never reused.
    print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative is to get only the values as the match, using \K to discard everything matched before it:
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the right, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
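As a rough sketch, the pattern broken down above can be written with the /x modifier so that each piece carries its own comment. The sub name parse_line is made up for this example; the input lines are the ones from the question.

```perl
use strict;
use warnings;

# Input lines copied from the question.
my @lines = (
    'icnnumber:9876AB54321_IN',
    'number:987654321FR',
    'icnnumber:987654321YQ',
);

# With x, whitespace and comments inside the pattern are ignored.
my $re = qr/
    \b                    # word boundary
    ( (?:icn)? number )   # group 1: optional "icn", then "number"
    :                     # a literal colon
    ( [^\W_]+ )           # group 2: letters or digits, no underscore
/x;

# Hypothetical helper: returns [label, value], or undef when nothing matches.
sub parse_line {
    my ($line) = @_;
    return $line =~ $re ? [ $1, $2 ] : undef;
}

for my $line (@lines) {
    my $pair = parse_line($line) or next;
    print "$pair->[0] => $pair->[1]\n";
}
```

Because group 2 excludes the underscore, the _IN suffix on the first line is simply left unmatched rather than having to be listed as an optional group.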
Just another suggestion: if your strings are always valid, you may consider simply splitting on a character class and taking the second element (index 1) of the resulting list:
my $string = "number:987654321FR";
my $part = (split /[:_]/, $string)[1];
print "$part\n";
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to solve it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
You can capture everything up to an _ or a word boundary \b with the line below:
my @arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
You can exclude non-word characters \W and _ from the first group, and collect the matches in the array:
my @arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Perl regular expression to split string by word

I have a string which consists of several words (each starting with a capital letter).
For example:
$string1="TestWater"; # to be split into an array @string1=("Test","Water")
$string2="TodayIsNiceDay"; # as @string2=("Today","Is","Nice","Day")
$string3="EODIsAlwaysGood"; # as @string3=("EOD","Is","Always","Good")
I know that Perl can easily split on a fixed character with the split function, and that a regex match can capture $1, $2 and so on for a fixed number of groups. But how can this be done dynamically? Thanks in advance!
That post Spliting CamelCase doesn't answer my question, my question is more related to regex in Perl, that one was in Java (differences apply here).
Use split to split a string on a regex. What you want is an upper case character not followed by an upper case character as the boundary, which can be expressed by two look-ahead assertions (perlre for details):
#!/usr/bin/perl
use warnings;
use strict;
use Test::More;
sub split_on_capital {
my ($string) = @_;
return [ split /(?=[[:upper:]](?![[:upper:]]))/, $string ]
}
is_deeply split_on_capital('TestWater'), [ 'Test', 'Water' ];
is_deeply split_on_capital('TodayIsNiceDay'), [ 'Today', 'Is', 'Nice', 'Day' ];
is_deeply split_on_capital('EODIsAlwaysGood'), [ 'EOD', 'Is', 'Always', 'Good' ];
done_testing();
You can do this by using m//g in list context, which returns a list of all matches found. (Rule of thumb: Use m//g if you know what you want to extract; use split if you know what you want to throw away.)
Your case is a bit more complicated because you want to split "EODIs" into ("EOD", "Is").
The following code handles this case:
my @words = $string =~ /\p{Lu}(?:\p{Lu}+(?!\p{Ll})|\p{Ll}*)/g;
I.e. every word starts with an uppercase letter (\p{Lu}) and is followed by either
1 or more uppercase letters (but the last one is not followed by a lowercase letter), or
0 or more lowercase letters (\p{Ll})
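A small runnable sketch of the m//g approach described above, applied to the three strings from the question. The sub name camel_words is made up for this example.

```perl
use strict;
use warnings;

# Return the words of a CamelCase string, keeping acronyms such as "EOD" whole.
# In list context, m//g without capture groups returns every complete match.
sub camel_words {
    my ($string) = @_;
    my @words = $string =~ /\p{Lu}(?:\p{Lu}+(?!\p{Ll})|\p{Ll}*)/g;
    return @words;
}

for my $string (qw(TestWater TodayIsNiceDay EODIsAlwaysGood)) {
    print join(' ', camel_words($string)), "\n";
}
```

Note that every match has to start with an uppercase letter, so a leading lowercase word (as in camelCase) or any digits would be skipped by this pattern.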

validating initialization variables in perl constructor using ref

I'm trying to apply a split function on a string only where a single colon (:) exists, using regular expressions. The problem is that a colon can occur multiple times consecutively, and I'm only interested in instances where a colon is neither preceded nor followed by another colon. Any other character could precede or follow the colon.
Example string:
my $example_string = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
Expected result:
my @result_array = ("_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}");
What I've tried so far is a combination of negation and group regular expressions...one example that got me close:
This cuts off one character before and after each colon:
my @result_array = split(/[^:][:][^:]/, $example_string);
@result_array = [
'_targetfund|tes',
'rowcountmax|10',
'test|YE',
'fruit|appl',
'date|\'12/31/2016\''
];
I was playing around with https://regex101.com/, thought maybe there was a way to return $1 within the same regex or something which could be done recursively.
Any help would be appreciated
Maybe overkill, but I would use look-behind and look-ahead assertions:
split /(?<!:):(?!:)/, $str;
demo
use 5.014;
use warnings;
use Test::More;
my $str = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
my @wanted = ("_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}");
my @arr = split /(?<!:):(?!:)/, $str;
is_deeply(\@arr, \@wanted);
done_testing(1);
#ok 1
#1..1
You can use look-around assertions, i.e. split on a colon that is neither preceded nor followed by a colon:
#!/usr/bin/perl
use warnings;
use strict;
use Test::Deep;
my $example_string = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
my $result_array = ["_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}"];
cmp_deeply( [ split /(?<!:):(?!:)/, $example_string ], $result_array );
This one should do the job : :(?=~)
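The two splitting strategies suggested in these answers agree on the example string but not in general. The following sketch compares them; the sub names and the second input string are made up for illustration.

```perl
use strict;
use warnings;

# Split on a colon that has no colon directly before or after it.
sub split_single_colon {
    my ($str) = @_;
    my @fields = split /(?<!:):(?!:)/, $str;
    return @fields;
}

# Split on a colon only when a tilde follows it.
sub split_colon_before_tilde {
    my ($str) = @_;
    my @fields = split /:(?=~)/, $str;
    return @fields;
}

# The example string from the question: both approaches give four fields.
my $sample = '_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}';
print "$_\n" for split_single_colon($sample);

# Hypothetical input whose second field does not start with "~".
my $other = 'key|A:key|B::C';
print scalar(split_single_colon($other)), " field(s) with the look-around split\n";
print scalar(split_colon_before_tilde($other)), " field(s) with the tilde look-ahead\n";
```

The :(?=~) pattern relies on every field after the first starting with ~, which holds for the sample data but is not stated as a rule in the question; the look-around version depends only on the colons themselves.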

How to split string and capture sentences ending using regex?

I want to split a string and capture sentence-ending characters like ., ?, ! as well.
In other words, my regex should split a string on whitespace and on the characters that English sentences end with, such as ., ? and !, but it should keep those characters.
I know it is kind of confusing, so look at the array below. In the case of a
sentence like this
why you are eating too much?
The array that stores these words should be like this
@word = ( "why", "you", "are", "eating", "too", "much", "?" );
but my code outputs an array like this instead
@word=("why"," ","you","are","eating","too"," ","much","?","?");
code :
my $s = "why you are eating too much?";
my @word = split /(\s+|([\s+.?!]))/, $s;
for ( @word ){
print "$_\n";
}
If you know what you want to throw away, use split.
If you know what you want to keep, use m//g in list context.
This looks like a case of the latter:
my $str = "why are you eating too much?";
my @words = $str =~ m/[^\s.!?]+|[.!?]/g;
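To make the rule of thumb concrete, here is a minimal sketch that produces the same tokens both ways. The sub names are made up, and the split variant is an assumption about how the question's attempt could be repaired, not something taken from the answers.

```perl
use strict;
use warnings;

# Keep what you want: runs of characters that are neither whitespace nor
# sentence-ending marks, or a single sentence-ending mark.
sub tokens_by_match {
    my ($str) = @_;
    my @tokens = $str =~ m/[^\s.!?]+|[.!?]/g;
    return @tokens;
}

# The split equivalent: only the punctuation is captured, so whitespace is
# thrown away, and undefined or empty fields are filtered out afterwards.
sub tokens_by_split {
    my ($str) = @_;
    my @tokens = grep { defined($_) && length($_) } split /\s+|([.!?])/, $str;
    return @tokens;
}

my $s = 'why you are eating too much?';
print join('|', tokens_by_match($s)), "\n";
print join('|', tokens_by_split($s)), "\n";
```

split returns the text of every capture group along with the fields, which is why a pattern that captures the whitespace as well hands the separators back in the result list.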
You could use the following regular expression instead of using split():
(\w+|[\.!?])
Here is a sample code in Perl and a live example:
use Data::Dumper;
my $str = "why you are eating too much?";
my @matches = $str =~ /(\w+|[\.!?])/g;
print Dumper \@matches;

Regular Expression to find $0.00

Need to count the number of "$0.00" in a string. I'm using:
my $zeroDollarCount = ("\Q$menu\E" =~ tr/\$0\.00//);
but it doesn't work. The issue is the $ sign is throwing the regex off. It works if I just want to count the number of $, but fails to find $0.00.
How is this a duplicate? Your solution does not address dollar sign which is an issue for me.
You are using the transliteration operator tr///. That doesn't have anything to do with a pattern. You need the match operator m// instead. And because you want it to find all occurrences of the pattern, use the /g modifier.
my $count = () = $menu =~ m/\$0\.00/g;
If we run this program, the output is 2.
use strict;
use warnings;
my $menu = '$0.00 and $0.00';
my $count = () = $menu =~ m/\$0\.00/g;
print $count;
Now let's take a look at what is going on. First, the pattern of the match.
/\$0\.00/
This is fairly straightforward. There is a literal $, which we need to escape with a backslash \. The zero is followed by a literal dot ., which again we need to escape, because like the $ it has a special meaning in regular expressions.
my $count = () = $menu =~ m/\$0\.00/g;
This whole line looks weird. We can break it up into a few lines to make it more readable.
my @matches = ( $menu =~ m/\$0\.00/g );
my $count = scalar @matches;
We need the /g switch on the regular expression match to make it match all occurrences. In list context, the match operation returns all matches (which will be the string "$0.00" a number of times). Because we want the count, we then force that into scalar context, which gives us the number of elements. That can be shortened to one line by the idiom shown above.
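For contrast, this sketch puts the counting idiom next to what tr/// actually counts. The sub names and the menu string are invented for the example.

```perl
use strict;
use warnings;

# Count non-overlapping occurrences of the literal string "$0.00".
sub count_zero_dollar {
    my ($menu) = @_;
    my $count = () = $menu =~ m/\$0\.00/g;
    return $count;
}

# tr/// treats its first part as a set of single characters, not a pattern.
# With an empty replacement it counts every "$", "0" and "." in the string.
sub count_with_tr {
    my ($menu) = @_;
    return ($menu =~ tr/\$0\.//);
}

# Hypothetical menu text with two free items.
my $menu = 'Soup $0.00, Tea $0.00, Cake $3.50';
print count_zero_dollar($menu), "\n";    # 2 occurrences of "$0.00"
print count_with_tr($menu), "\n";        # 13 individual characters
```

The second number is larger because the "$", "." and "0" in "$3.50" are counted too, which shows that tr/// works on individual characters rather than on the sequence "$0.00".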