How to split a string and capture sentence-ending characters using regex?

I want to split a string and capture sentence-ending characters like ., ?, ! as well.
In other words, my regex splits a string on whitespace and on the special characters that English sentences usually end with (., ?, !), but it should keep those characters.
I know this is kind of confusing, so look at the arrays below. For a sentence like this
why you are eating too much?
the array that stores these words should be like this
@word = ( "why", "you", "are", "eating", "too", "much", "?" );
but my code outputs an array like this instead
@word=("why"," ","you","are","eating","too"," ","much","?","?");
code :
my $s = "why you are eating too much?";
my @word = split /(\s+|([\s+.?!]))/, $s;
for ( @word ){
print "$_\n";
}

If you know what you want to throw away, use split.
If you know what you want to keep, use m//g in list context.
This looks like a case of the latter:
my $str = "why are you eating too much?";
my @words = $str =~ m/[^\s.!?]+|[.!?]/g;
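
To see what m//g returns in list context, here is a minimal, self-contained sketch. The sub name tokenize is made up for illustration; the pattern is the one from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: return every word and every sentence-ending
# character as a separate list element.
sub tokenize {
    my ($s) = @_;
    # With /g in list context and no capture groups, the match
    # operator returns all complete matches.
    return $s =~ m/[^\s.!?]+|[.!?]/g;
}

my @words = tokenize("why you are eating too much?");
print join("|", @words), "\n";   # prints: why|you|are|eating|too|much|?
```

Because whitespace is never part of a match, no empty or blank elements end up in the list, which is the problem the split-based attempt ran into.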

You could use the following regular expression instead of using split():
(\w+|[\.!?])
Here is some sample code in Perl:
use Data::Dumper;
my $str = "why you are eating too much?";
my @matches = $str =~ /(\w+|[\.!?])/g;
print Dumper \@matches;

Related

Perl regex exclude optional word from match

I have some strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
# print the capture only when the line actually matched
print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative is to get only the values as the match, using \K to discard everything matched up to that point:
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the left, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
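
As a quick check of the pattern details above, here is a small sketch that runs the refactored pattern over the three sample lines. The sub name extract_id is hypothetical and only used for illustration.

```perl
use strict;
use warnings;

# Group 1 is the label (icnnumber or number), Group 2 the value.
my $re = qr/\b((?:icn)?number):([^\W_]+)/;

# Hypothetical helper: return the value part, or undef if the line
# does not match.
sub extract_id {
    my ($line) = @_;
    return $line =~ $re ? $2 : undef;
}

for my $line ('icnnumber:9876AB54321_IN', 'number:987654321FR', 'icnnumber:987654321YQ') {
    print extract_id($line), "\n";
}
# prints:
# 9876AB54321
# 987654321FR
# 987654321YQ
```

Since [^\W_] excludes the underscore, the match stops before _IN without having to mention that suffix in the pattern.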
Just another suggestion maybe, but if your strings are always valid, you may consider just to split on a character class and pull the second index from the resulting array:
my $string= "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to solve it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
The first approach captures everything up to _ or a word boundary (\b):
my @arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
Alternatively, capture a run of characters that are neither non-word characters (\W) nor _, and collect the matches in the array:
my @arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Generic Regular expression in Perl for the following text

I need to match the following text with a regular expression in Perl.
PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00
The expression that I have written is:
(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)
But I want something more generic. By generic what I mean is I want to have any number of hyphens (-) in it.
Maybe there is something like if - then in regex, i.e. if some character is present then look for some other thing. Can anyone please help me out?
More about my problem:
AB-ab
abc-mno-xyz
lmi-jlk-mno-xyz
......... and so on...!
I wish to match all such patterns. To be more precise, my string can be considered a set of any number of alphanumeric substrings with the hyphen ('-') as a delimiter. Feel free to use \w, since the substrings can contain uppercase letters, lowercase letters, digits, and the underscore ('_').
You are looking for a regex with quantifiers (see perldoc perlre - Section Quantifiers).
You have several possibilities:
/(\w+(?:-\w+)+)/ will match two or more groups of \w characters linked by hyphens (-). For example, AB-CD will match. Note that \w matches both upper- and lower-case letters, so a word like pre-owned will also be matched as a key.
/(\w+(?:-\w+){5})/ will match keys with exactly 6 groups. It's equivalent to the one you have.
/(\w+(?:-\w+){5,})/ will match keys with 6 groups or more.
If there is more than one key in the document, you can do an implicit loop in the regex with the /g option.
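
To compare the three quantifier variants side by side, here is a small sketch. The ^ and $ anchors, the labels, and the sub name matching_patterns are additions for illustration only; the anchors make each pattern test the whole string rather than find a key inside longer text.

```perl
use strict;
use warnings;

# The three quantifier variants, anchored so the whole key must match.
my %pattern = (
    'two or more' => qr/^\w+(?:-\w+)+$/,
    'exactly six' => qr/^\w+(?:-\w+){5}$/,
    'six or more' => qr/^\w+(?:-\w+){5,}$/,
);

# Hypothetical helper: return the labels of all patterns that match $key.
sub matching_patterns {
    my ($key) = @_;
    my @hits = grep { $key =~ $pattern{$_} } sort keys %pattern;
    return @hits;
}

for my $key ('AB-ab', 'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00', 'a-b-c-d-e-f-g') {
    print "$key: ", join(', ', matching_patterns($key)), "\n";
}
# prints:
# AB-ab: two or more
# PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00: exactly six, six or more, two or more
# a-b-c-d-e-f-g: six or more, two or more
```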
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw{say};
use Data::Dumper;
my $text = "some text here PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 some text there";
my @matches = $text =~ /(\w+(?:-\w+)+)/g;
print Dumper(\@matches);
Result:
$VAR1 = [
'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00'
];
How about using split:
my $str = 'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00';
my @elem = split(/-/, $str);
Edit according to comments:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $str = 'Text before PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 text after';
my ($str2) = $str =~ /(\w+(?:-\w+)+)/;
my @elem = split(/-/, $str2);
say Dumper\@elem;
Output:
$VAR1 = [
'PS3XAY3N5SZ4K',
'XX_5C9F',
'S801',
'F04BN01K',
'00000',
'00'
];

Perl split function - use repeating characters as delimiter

I want to split a string using repeating letters as delimiter, for example,
"123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:
@s = split /([[:alpha:]])\1+/, '123aaaa23a3';
But this returns '123', 'a', '23a3', which is not what I wanted. Now I know that this is because the first 'a' in 'aaaa' is captured by the parentheses and thus preserved by split(). But anyway, I can't add something like ?: since [[:alpha:]] must be captured for the back reference.
How can I resolve this situation?
Hmm, it's an interesting one. My first thought would be: the captured delimiters will always land at odd indices (1, 3, ...), so you can just discard the odd-indexed array elements.
Something like this perhaps?:
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;
This'll give you:
$VAR1 = {
'23a3' => '',
'123' => 'a'
};
So you can extract your pattern via keys.
Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the regex stuff).
But something like this:
my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;
By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.
This is the closest I got:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $str = '123aaaa23a3';
#build a regex out of '2-or-more' characters.
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";
#split on the regex
my @s = split m/$regex/, $str;
print Dumper \@s;
We first process the string to extract "2-or-more" character patterns, to set as our delimiters. Then we assemble a regex out of them, using a non-capturing group, so we can split.
One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:
use List::Util 1.29 qw( pairkeys );
my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';
Gives
Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]
That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:
my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;
Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:
use List::Util 1.29 qw( pairvalues );
my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
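
Wrapped up as a reusable sub, the pairvalues variant might look like the sketch below. The sub name split_on_repeats is made up for illustration, and the sketch has only been checked against the two inputs from the question, not against strings that begin or end with a repeated run.

```perl
use strict;
use warnings;
use List::Util 1.29 qw( pairvalues );

# Hypothetical helper: split on runs of two or more identical letters.
sub split_on_repeats {
    my ($s) = @_;
    # split returns: field, captured letter, field, captured letter, ..., field.
    # Prepending undef pairs each field with the value before it, so
    # pairvalues keeps the fields and drops the captured letters.
    return pairvalues undef, split /([[:alpha:]])\1+/, $s;
}

print join(',', split_on_repeats('123aaaa23a3')), "\n";   # prints: 123,23a3
print join(',', split_on_repeats('123abc4')), "\n";       # prints: 123abc4
```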
The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:
@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';
(??{ code }) is documented on the 'perlre' manual page.
Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches. I've never found this to be a problem, but YMMV.

Perl regex - can I say 'if character/string matches, delete it and all to right of it'?

I have an array of strings, some of which contain the character '-'. I want to be able to search for it and for those strings that contain it I wish to delete all characters to the right of it.
So for example if I have:
$string1 = 'home - London';
$string2 = 'office';
$string3 = 'friend-Manchester';
or something as such, then the affected strings would become:
$string1 = 'home';
$string3 = 'friend';
I don't know if the white-space before the '-' would be included in the string afterwards (I don't want it as I will be comparing strings at a later point, although if it doesn't affect string comparisons then it doesn't matter).
I do know that I can search and replace specific strings/characters using something like:
$string1 =~ s/-//
or
$string1 =~ tr/-//
but I'm not very familiar with regular expressions in Perl so I'm not 100% sure of these. I've looked around and couldn't see anything to do with 'to the right of' in regex. Help appreciated!
You can delete anything after a hyphen - with this substitution:
s/-.*$//s
However, you will want to remove the whitespace prior to the hyphen and thus do
s/\s* - .* $//xs
The $ anchors the regex at the end of the string and the /s flag allows the dot to match newlines as well. While the $ is superfluous, it might add clarity.
Your substitution would just have removed the first -. Your transliteration, as written, would not have changed the string at all: tr/-// without the /d flag only counts the hyphens, whereas tr/-//d would have removed all hyphens from the string.
Your regular expressions are just searching for the dash, so that's all they replace. You want to search for the dash, and anything after it.
$string =~ s/-.*//;
. represents any character, * means search for that character 0 or more times, and match as many as possible (i.e. to the end of the string if possible)
You can also search for an optional space before it.
$string =~ s/\s?-.*//;
(\s matches any whitespace character, not only a space; use \s* instead of \s? to allow any amount of whitespace before the hyphen)
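
Putting the pieces together, here is a short sketch that applies the substitution to the three strings from the question. The sub name strip_after_hyphen is hypothetical; it works on a copy, so the caller's string is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the first hyphen, any whitespace directly
# before it, and everything after it.
sub strip_after_hyphen {
    my ($s) = @_;
    $s =~ s/\s*-.*//s;   # /s lets . match newlines too
    return $s;
}

for my $s ('home - London', 'office', 'friend-Manchester') {
    print strip_after_hyphen($s), "\n";
}
# prints:
# home
# office
# friend
```

Strings without a hyphen simply fail to match and come back unchanged, so no separate check for '-' is needed.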
Using plain substr() and index() is possible as well.
my @strings = ("we are - so cool",
"lonely",
"friend-Manchester",
"home - london",
"home-new york",
"home with-childeren-first episode");
local $/ = " ";
foreach (@strings) {
$_ = substr($_,0,index($_,'-')) if (index($_,'-') != -1);
chomp;
}
The other answers are good. However, in light of what you said:
...if it doesn't affect string comparisons then it doesn't matter
You don't need a separate step for this at all. Suppose you want to compare $string with another variable, $search_string. The following expression will check for an exact match, except that it ignores anything $string has after a dash:
if ($string =~ /^$search_string(\s*-|$)/) { print "Strings matched"; }
#Using Regex:
my @strings =
("we are - so cool",
"lonely",
"friend - Manchester",
"home - london",
"home - new york",
"home with-childeren-first episode"
);
foreach (@strings) {
$_ =~ s/-\s*[a-zA-Z ]+\s*//g;
print "NEW: ".$_."\n";
}

Regex to match vowels in order as they appear once

I am trying to figure out a correct regex to match words that contain the five vowels a, e, i, o, u in that order, each appearing once.
There may be any number of consonants in the word; however, there should be no other vowels other than the 5 listed above. For example, the word “sacrilegious” should not match because, although it contains the five vowels in alphabetical order, there is an extra ‘i’ vowel located between the ‘a’ and the ‘e’. Hyphenated words are not allowed. In fact, your regular expression should not match any ‘word’ that contains any character other than the upper and lower case letters.
Here are some words that it should match
abstemious
facetious
arsenious
acheilous
anemious
caesious
This is what I have come up with so far, but when I run the program it doesn't seem to be doing what it should do.
#!/usr/bin/perl -w
use strict;
my $test = "abstemious";
if( $test =~ /a[^eiou]*e[^aiou]*i[^aeou]*o[^aeiu]u/ )
{
print "yes";
}
You have a small typo there, try this:
/a[^eiou]*e[^aiou]*i[^aeou]*o[^aeio]*u/
You're not too far off. Try this regex:
/\b[b-df-hj-np-tv-z]*a[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*i[b-df-hj-np-tv-z]*o[b-df-hj-np-tv-z]*u[b-df-hj-np-tv-z]*\b/i
[b-df-hj-np-tv-z]* matches any number of consonants, then it's just slotting in the vowels and the word boundary markers (\b).
Try this regexp:
\b(?=[a-zA-Z]+)[^aA]*?[aA][^aeAE]*?[eE][^aeiAEI]*?[iI][^aeioAEIO]*?[oO][^aeiouAEIOU]*?[uU][^aeiouAEIOU]*?\b
Is it okay for a word to contain duplicate vowels, as long as they're in the correct order? For example, would faaceetiioouus (if there were such a word) be acceptable? I ask because your current regex does indeed match it.
If you want to match only words that contain exactly one of each vowel, try this:
/^
(?=[a-z0-9]+$)
[^aeiou]*a
[^aeiou]*e
[^aeiou]*i
[^aeiou]*o
[^aeiou]*u
[^aeiou]*
$
/ix
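
Here is a runnable sketch in the spirit of the pattern above. Two details are changed deliberately and are assumptions rather than part of that answer: the lookahead allows letters only (the question rules out every other character), and \A and \z are used as anchors. The names $vowels_in_order and has_vowels_in_order are made up for illustration.

```perl
use strict;
use warnings;

# Exactly one of each vowel, in alphabetical order, letters only.
my $vowels_in_order = qr{
    \A
    (?=[a-z]+\z)     # the whole word consists of letters
    [^aeiou]*a
    [^aeiou]*e
    [^aeiou]*i
    [^aeiou]*o
    [^aeiou]*u
    [^aeiou]*
    \z
}xi;

# Hypothetical helper: true if the word qualifies.
sub has_vowels_in_order {
    my ($word) = @_;
    return $word =~ $vowels_in_order ? 1 : 0;
}

for my $word (qw( abstemious facetious sacrilegious faaceetiioouus )) {
    printf "%-15s %s\n", $word, has_vowels_in_order($word) ? 'match' : 'no match';
}
```

Each [^aeiou]* can only consume consonants, so a repeated or out-of-order vowel leaves the next required vowel unmatched and the whole pattern fails.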
Here is another approach, if you don't mind making a temporary copy of the string:
use warnings;
use strict;
while (<DATA>) {
chomp;
my $test = $_;
my $cp = $test; # leave original string intact
$cp =~ tr/aeiou//cd;
print "$test\n" if $cp eq 'aeiou';
}
=for output
abstemious
facetious
arsenious
acheilous
anemious
caesious
=cut
__DATA__
abstemious
facetious
arsenious
acheilous
anemious
caesious
unabstemious
sacrilegious
intravenous
faaceetiioouus