How to split a string with a numeric suffix?

How to split a string with a numeric suffix? - regex

I have an input string and I need to split it according to the requirement below.
Input String :
1. "string"
2. "String 12343534"
3. "String_12343534"
4. "Stringone Stringtwo 12343534"
5. "Stringone Stringtwo_12343534"
6. "string 23string 12343534"
7. "string 23string_12343534"
8. "string_23string 12343534"
9. "string_23string_12343534"
10. "string 23string 4545stringthird 12343534"
11. "string 23string 4545stringthird_12343534"
12. "string_23string_stringthird_12343534"
13. "string-23string-stringthird_12343534"
14. "string_23string-stringthird_12343534"
Like this going on. And I have to split string separately and numerical separately.
The output should like this.
1. $str = "string" ; $num = ;
2. $str = "String" $num = "12343534";
3. $str = "String" $num = "_12343534";
4. $str = "Stringone Stringtwo" $num = "12343534";
5. $str = "Stringone Stringtwo" $num = "_12343534";
6. $str = "string 23string" $num = "12343534";
7. $str = "string 23string" $num = "_12343534";
8. $str = "string_23string" $num = "12343534";
9. $str = "string_23string" $num = "_12343534";
10. $str = "string 23string 4545stringthird" $num = "12343534";
11. $str = "string 23string 4545stringthird" $num = "_12343534";
12. $str = "string_23string_stringthird" $num = "_12343534";
13. $str = "string-23string-stringthird" $num = "_12343534";
14. $str = "string_23string-stringthird" $num = "_12343534";
Anyone can help me on this? How to split the given string to get above mentioned output?

Since you want to keep everything, you have to split on an anchor point. You can use a lookahead for this. Split on the following pattern:
(?=_\d)|\s+(?=\d)
So:
my ($string, $numerical) = split /(?=_\d)|\s+(?=\d)/, $input;
If an underscore is present before the digits, it will split just before it, otherwise it will split on any whitespace followed by a digit. This is the translation of the regex.
You could also use the following:
(?=_\d+$)|\s+(?=\d+$)
This will ensure there's nothing after the digits by forcing the match to go to the end of the string. If there's a non-digit character at the end, the split won't happen.
But it's easier to just match what you need instead of splitting IMO:
my ($string, $numerical) = $input =~ /^(.*?)\s*(_?\d+)$/;
This is more readable and better conveys your intent.

Personally I find the solutions using split a little overcomplex, and none of them seem to cope with a string like:
my $input = "code 4 you 12345678";
... where I'd expect the numeric suffix to be 12345678, not "4" or "4 you".
I'd prefer something like:
my ($string, $numerical) = $input =~ /^ (.+?) \s* (_?\d+) $/x;
Update: I think my solution above already covers most of your updated examples: all but the first example where the numeric suffix is empty. To cover the first example, you also need to set $string to the entire input string when the regexp fails to match at all. Something like this:
my ($string, $numerical) = ($input =~ /^ (.+?) \s* (_?\d+) $/x) ? ($1, $2) : ($input);

You could try the below code ,
my ($string, $numerical) = split / (?=\d+)|(?=_\d+)/, $str;
(?=_\d+) called positive lookahead which asserts what follows is an underscore followed by one or more numbers. If this condition is true then the regex engine sets the matching marker just before to the _\d+. Splitting according to this zero width match will give you the desired results.

Since you want to split on the boundary between numerical and alpha characters, you need to use positive lookahead and lookbehind assertions.
The additional spec for deciding where to include underscores is not entirely clear, but this is my best interpretation of what your intent may be:
use strict;
use warnings;
while (<DATA>) {
chomp;
my #fields = split m{(?<=[a-z])\s*(?=_*\d)|(?<=\d)\s*(?=_*[a-z])}i, $_;
use Data::Dump;
dd #fields;
}
__DATA__
string 123456
string_45645645
stringone stringtwo 23435345345
string one string two_2335345345
Outputs:
("string", 123456)
("string", "_45645645")
("stringone stringtwo", 23435345345)
("string one string two", "_2335345345")

([a-zA-Z\s]*)(.*)$
This will work.
See demo.
http://regex101.com/r/rX0dM7/8

Related

How to verify if a variable value contains a character and ends with a number using Perl

I am trying to check if a variable contains a character "C" and ends with a number, in minor version. I have :
my $str1 = "1.0.99.10C9";
my $str2 = "1.0.99.10C10";
my $str3 = "1.0.999.101C9";
my $str4 = "1.0.995.511";
my $str5 = "1.0.995.AC";
I would like to put a regex to print some message if the variable has C in 4th place and ends with number. so, for str1,str2,str3 -> it should print "matches". I am trying below regexes, but none of them working, can you help correcting it.
my $str1 = "1.0.99.10C9";
if ( $str1 =~ /\D+\d+$/ ) {
print "Candy match1\n";
}
if ( $str1 =~ /\D+C\d+$/ ) {
print "Candy match2\n";
}
if ($str1 =~ /\D+"C"+\d+$/) {
print "candy match3";
}
if ($str1 =~ /\D+[Cc]+\d+$/) {
print "candy match4";
}
if ($str1 =~ /\D+\\C\d+$/) {
print "candy match5";
}

if ($str1 =~ /C[^.]*\d$/)
C matches the letter C.
[^.]* matches any number of characters that aren't .. This ensures that the match won't go across multiple fields of the version number, it will only match the last field.
\d matches a digit.
$ matches the end of the string. So the digit has to be at the end.

I found it really helpful to use https://www.regextester.com/109925 to test and analyse my regex strings.
Let me know if this regex works for you:
((.*\.){3}(.*C\d{1}))
Following your format, this regex assums 3 . with characters between, and then after the third . it checks if the rest of the string contains a C.
EDIT:
If you want to make sure the string ends in a digit, and don't want to use it to check longer strings containing the formula, use:
^((.*\.){3}(.*C\d{1}))$

Lets look what regex should look like:
start{digit}.{digit}.{2-3 digits}.{2-3 digits}C{1-2 digits}end
very very strict qr/^1\.0\.9{2,3}\.101?C\d+\z/ - must start with 1.0.99[9]?.
very strict qr/^1\.\0.\d{2,3}\.\d{2,3}C\d{1,2}\z/ - must start with 1.0.
strict qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d{1,2}\z/
relaxed qr/^\d\.\d\.\d+\.\d+C\d+\z/
very relaxed qr/\.\d+C\d+\z/
use strict;
use warnings;
use feature 'say';
my #data = qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/;
#my $re = qr/^\d\.\d\.\d+\.\d+C\d+\z/;
my $re = qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d+\z/;
say '--- Input Data ---';
say for #data;
say '--- Matching -----';
for( #data ) {
say 'match ' . $_ if /$re/;
}
Output
--- Input Data ---
1.0.99.10C9
1.0.99.10C10
1.0.999.101C9
1.0.995.511
1.0.995.AC
--- Matching -----
match 1.0.99.10C9
match 1.0.99.10C10
match 1.0.999.101C9

Loop through global substitution

$num = 6;
$str = "1 2 3 4";
while ($str =~ s/\d/$num/g)
{
print $str, "\n";
$num++;
}
Is it possible to do something like the above in perl? I would like the loop to run only 4 times and to finish with $str being 6 7 8 9.

You don't need the loop: the /g modifier causes the substitution to be repeated as many times as it matches. What you want is the /e modifier to compute the substitution. Assuming the effect you were after is to add 5 to each number, the example code is as follows.
$str = "1 2 3 4";
$str =~ s/(\d)/$1+5/eg;
print "$str\n";
If you really wanted to substitute numbers starting with 6, then this works.
$num = 6;
$str = "1 2 3 4";
$str =~ s/\d/$num++/eg;
print "$str\n";

You could with the match operator.
my $new_str = '';
while ($str =~ /\G(.*?)\d/gc) {
$new_str .= $1 . $num++;
}
$new_str .= substr($str, pos($str));
However, #TFBW's solution is probably what you want unless you're writing a tokenizer for a big parser. In a tokenizer, it would look more like:
# So we don't have to say «$str =~ all» over the place.
# Also, allows us to use redo.
for ($str) {
/\G \s+ /xgc; # Skip whitespace.
if (/\G (\d+) /xgc) {
# Do something with $1.
redo;
}
...
last if /\G \z /xgc;
die "Unrecognized token";
}

Counting occurrences of a word in a string in Perl

I am trying to find out the number of occurrences of "The/the". Below is the code I tried"
print ("Enter the String.\n");
$inputline = <STDIN>;
chop($inputline);
$regex="\[Tt\]he";
if($inputline ne "")
{
#splitarr= split(/$regex/,$inputline);
}
$scalar=#splitarr;
print $scalar;
The string is :
Hello the how are you the wanna work on the project but i the u the
The
The output that it gives is 7. However with the string :
Hello the how are you the wanna work on the project but i the u the
the output is 5. I suspect my regex. Can anyone help in pointing out what's wrong.

I get the correct number - 6 - for the first string
However your method is wrong, because if you count the number of pieces you get by splitting on the regex pattern it will give you different values depending on whether the word appears at the beginning of the string. You should also put word boundaries \b into your regular expression to prevent the regex from matching something like theory
Also, it is unnecessary to escape the square brackets, and you can use the /i modifier to do a case-independent match
Try something like this instead
use strict;
use warnings;
print 'Enter the String: ';
my $inputline = <>;
chomp $inputline;
my $regex = 'the';
if ( $inputline ne '' ) {
my #matches = $inputline =~ /\b$regex\b/gi;
print scalar #matches, " occurrences\n";
}

With split, you're counting the substrings between the the's. Use match instead:
#!/usr/bin/perl
use warnings;
use strict;
my $regex = qr/[Tt]he/;
for my $string ('Hello the how are you the wanna work on the project but i the u the The',
'Hello the how are you the wanna work on the project but i the u the',
'the theological cathedral'
) {
my $count = () = $string =~ /$regex/g;
print $count, "\n";
my #between = split /$regex/, $string;
print 0 + #between, "\n";
print join '|', #between;
print "\n";
}
Note that both methods return the same number for the two inputs you mentioned (and the first one returns 6, not 7).

The following snippet uses a code side-effect to increment a counter, followed by an always-failing match to keep searching. It produces the correct answer for matches that overlap (e.g. "aaaa" contains "aa" 3 times, not 2). The split-based answers don't get that right.
my $i;
my $string;
$i = 0;
$string = "aaaa";
$string =~ /aa(?{$i++})(?!)/;
print "'$string' contains /aa/ x $i (should be 3)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the The";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 6)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 5)\n";

What you need is 'countof' operator to count the number of matches:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/[Tt]he/g;
print $count;
If you want to select only the word the or The, add word boundary:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/\b[Tt]he\b/g;
print $count;

Pattern matching in perl (Lookahead and Condition on word Index)

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.

Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions

You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}

Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}

First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?

Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}

Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.

Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.

If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to split a string with a numeric suffix? - regex

([a-zA-Z\s])(.)$ This will work. See demo. http://regex101.com/r/rX0dM7/8

Related

How to verify if a variable value contains a character and ends with a number using Perl

Loop through global substitution

Counting occurrences of a word in a string in Perl

Pattern matching in perl (Lookahead and Condition on word Index)

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to split a string with a numeric suffix? - regex

([a-zA-Z\s]*)(.*)$ This will work. See demo. http://regex101.com/r/rX0dM7/8

Related

How to verify if a variable value contains a character and ends with a number using Perl

Loop through global substitution

Counting occurrences of a word in a string in Perl

Pattern matching in perl (Lookahead and Condition on word Index)

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

Categories

Resources

([a-zA-Z\s])(.)$ This will work. See demo. http://regex101.com/r/rX0dM7/8