Subtract pattern matching from array element perl - regex

I have written some code that extracts a specific value (e.g. FP=0.021) from one element of an array if it matches a specific pattern. Here is the code:
if ($info =~ /FP=/) {
    my @array1 = split(';', $info);
    if ($array1[$#array1] =~ /=([^.]*)/){
        my $name1= $-[1];
        $FPvalue = substr($array1[$#array1], $name1);
        if ($FPvalue < 0.0001){
            push(@FPvalues,$FPvalue);
        }
    }
}
Here $info is a string containing fields separated by semicolons (;).
I am lucky that the "FP=0.021" element is the last element of my array, but I would like to know a way to extract it without using the expression $array1[$#array1].
I would appreciate your help, thanks!

It is hard to tell without sample input data, but I think you want
push @FPvalues, $1 if $info =~ /FP=([\d.]+)/;
It works by searching the string in $info for the sequence FP= followed by a number of dots and decimal digits. If that pattern is found, then the dots and digits part is put into $1 and pushed onto the array.
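A minimal sketch of that one-liner in context. The sample records, the helper name extract_fp, and the loop around it are assumptions made for illustration; only the pattern and the 0.0001 threshold come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number following "FP=" anywhere in a
# semicolon-separated record, or undef if the field is absent.
sub extract_fp {
    my ($info) = @_;
    return $info =~ /FP=([\d.]+)/ ? $1 : undef;
}

# Made-up sample records; the FP field need not be the last one.
my @records = (
    'ID=7;FP=0.021;DP=14',
    'ID=8;DP=3;FP=0.00005',
    'ID=9;DP=20',
);

my @FPvalues;
for my $info (@records) {
    my $fp = extract_fp($info);
    # Keep only values below the threshold used in the question.
    push @FPvalues, $fp if defined $fp && $fp < 0.0001;
}

print "@FPvalues\n";
```

Because the pattern searches the whole string, the position of the FP field no longer matters and no split is needed.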

Here is how you can parse the decimal number from the string, given that it sits at the end of the string:
my $str = "asdsa;adsasd;adsasd;FP=0.021";
# \d+(?:\.\d+)? also accepts single-digit and integer values
if ($str =~ /=(\d+(?:\.\d+)?)$/) {
    print $1;
}

Related

Matching nth occurrence number in a string that contain hex number

I have strings that contain words, hex numbers, and decimal numbers. I want to match the decimal number in the string, but when the hex number contains digits they get mixed up with the number I want.
my $string = " aabb is = 35"; #where aabb is hex number
my $string1 = " abc0 is = 75" ; #where abc0 is hex number
my (@val1) = $string =~ /(\d+)/g;
print "$val1[0]\n";
@val1 = $string1 =~ /(\d+)/g;
print " $val1[1]\n";
The scripts above get the number after = that I want, but I have to hard-code $val1[0] or $val1[1] to pick it out. Is there a way to match the digits while ignoring the hex number, in case I am not sure at which index the number I want will end up? Then I could just print $val1 to get it.
8/30 update, thanks to Toto for pointing this out:
The first hex number may consist entirely of digits (e.g. 1234), and the number I want to match will not necessarily be the last word in the string. It will not necessarily be the second number either, but it will always come after the =.
Since the hex may contain only digits (1234) the two numbers can't be distinguished by format.
The strings shown can be matched positionally (end of string) or based on the = preceding the number:
my ($num) = $string =~ /([0-9]+)$/;
my ($num) = $string =~ /=\s*([0-9]+)\b/;
or make use of some other "landmark" in your strings, if different from shown samples.
Given the clarification in the question's edit the second example above is suitable.
Original post, before comments (edited)
A number won't have letters (unless it involves exponents), so use a word boundary
my ($num) = $string =~ /\b([0-9]+)\b/;
for an (unsigned) integer.
To allow +/- and/or a floating point format
my ($num) = $string =~ /( [+-]? [0-9]+\.?[0-9]* )/x;
but note that this leaves out some formats used for numbers. The looks_like_number from core Scalar::Util is more reliable, and one can first match more broadly and then filter the list with it.
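A rough sketch of the "match broadly, then filter" idea, assuming whitespace-separated tokens and a made-up sample string. Note that a hex value made only of digits (such as 1234) would still pass this filter, so it does not replace the = landmark.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Made-up sample string, chosen for illustration only.
my $string = 'abc0 is = -7.5e3 and 1.2.3 is not a number';

# Match broadly: split into whitespace-separated tokens.
my @tokens = split ' ', $string;

# Keep only the tokens Perl itself would accept as numbers.
my @numbers = grep { looks_like_number($_) } @tokens;

print "@numbers\n";
```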
You can use \d instead of 0-9 but this matches many extra characters unless it is used with /a modifier(s), available since 5.14. But note that the /a has a broader effect than restricting to ASCII only numbers (\d). See perlre (search for /a).
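To illustrate the difference, here is a small sketch (requires Perl 5.14 or later for /a). The test character is an arbitrary choice: U+0663, a Unicode decimal digit outside ASCII.

```perl
use strict;
use warnings;

# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
my $s = "\x{0663}";

my $plain = $s =~ /^\d$/  ? 1 : 0;   # Unicode rules: \d matches
my $ascii = $s =~ /^\d$/a ? 1 : 0;   # /a restricts \d to 0-9

print "plain=$plain ascii=$ascii\n";
```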
You can use (\d+$) to match numbers only if they are at the end of string.
How about?
This is matching the second number that exists in the string:
while(<DATA>) {
my ($dec) = $_ =~ /.+\b(\d+)\b/a;
say $dec;
}
__END__
hex abc123 dec 456 blah
hex 123123 dec 789456 blah
Output:
456
789456

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple space, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
    push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
    pos($text) = $pos;
    push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree on what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
    push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
    my $word = $1;
    push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and the character found, if any, is displayed.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the string and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i

Perl Regex to match strings and numbers inside a file

Hi, I am new to Perl and trying to write a regex that matches a specific number range and strings in a line of a file. I need to find lines such as "Document has 15 rows and 2 columns".
I know I am missing something, but the code I have tried so far is :
if(/^[a-zA-Z\d]+(has\s[1-9][0-9]$)\srows.*columns/)
{
print "$_\n";
}
It would be really helpful if anyone let me know what is wrong here!
The other answers here are good, but to explain what was wrong with the regex you used:
if(/^[a-zA-Z\d]+(has\s[1-9][0-9]$)\srows.*columns/)
First problem: the expression does not specify any whitespace between the beginning of the string and the word has, so there is no way for this pattern to match the space in Document has...
Second problem: the $ character in a regular expression means "match if the line ends here." It's almost always a mistake to use the $ anchor in the middle of a regex; the only way this would match would be in a multiline string like
Documenthas 15
rows and 7 columns
Making those two changes to your expression makes it work:
if(/^[a-zA-Z\d]+\s(has\s[1-9][0-9])\srows.*columns/)
{
print "$_\n";
}
Easy regex to use:
/Document has [0-9]+ row(s?) and [0-9]+ column(s?)/
If the s is only used when there is more than one row/column
I'm assuming you want to capture the numbers.
if ( /^Document has (\d+) rows and (\d+) columns/ ) {
    my $rows = $1;
    my $cols = $2;
}
my $line = "Document has 15 rows and 2 columns";
if ($line =~ /^Document has (\d+) rows? and (\d+) columns?/)
{
print "rows = $1\n";
print "cols = $2\n";
}
If you just want the number of rows, use this:
if (/(\d+)\s+rows/) {
print "$1\n";
}
If you want rows and columns (and they are always in that order), use:
if (/(\d+)\s+rows\s+and\s+(\d+)\s+columns/) {
print "$1 rows and $2 columns\n";
}
If necessary, you can make the pattern more restrictive: limiting the number of digits, disallowing leading zeros, etc.
Also, I assume you are either using "-n" on the command line or have a loop around this.

Getting equal number of digits on both sides of a character in a string

I have a string
$test = 'xyz45sd2-32d34-sd23-456562.abc.com'
The objective is to obtain $1 = 23 and $2 = 45 i.e equal number of digits on both sides of the last -. Note that the number of digits is variable, and is not necessarily 2.
I have tried the following:
$test1 =~ s/.*(\d+)-(\d+).*//;
But
$1 contains 3
$2 contains 456562
You can try this regex
if($test1 =~ m/(\S+)-(\S+)-([a-z]*)(\d+)-(\d\d)(\d+).*/)
{
print $4,"|",$5;
}
I assume that you need only the first 2 digits from 456562
perl -e '"xyz45sd2-32d34-sd23-456562.abc.com" =~ /(\d{2})-(\d{2})\d*(?=\.)/; print "$1\n$2\n"'
This other entry confirms that regex does not count:
How to match word where count of characters same
Building upon GreatBigBore's idea, if there's an upper bound to the count, then you could try the or operator |. This only matches your requirement to find a match; depending on the matched count the match will be in different bins. Only one case correctly places them in $1 and $2.
(\d{3})-(\d{3})|(\d{2})-(\d{2})|(\d{1})-(\d{1})
However if you concatenate the result captures as $1$3$5 and $2$4$6, you will effectively get the 2 stings you were looking for.
Another idea is to operate iteratively, you could repeat your search on the string by increasing the number until the match fails. (\d{1})-(\d{1}) , (\d{2})-(\d{2}) ...
A binary search comes to mind, making it O(log N), with N being the upper limit for the capture length.
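The iterative idea can be sketched as follows. This is the simple linear version, not the binary search; the helper name equal_digits and the lookahead that pins the match to the last - are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: try (\d{n})-(\d{n}) for n = 1, 2, ... around the
# last '-' and keep the captures from the largest n that still matches.
sub equal_digits {
    my ($str) = @_;
    my ($lhs, $rhs);
    for (my $n = 1; ; $n++) {
        # The negative lookahead (?!.*-) rejects any '-' that is not the last.
        last unless $str =~ /(\d{$n})-(?!.*-)(\d{$n})/;
        ($lhs, $rhs) = ($1, $2);
    }
    return ($lhs, $rhs);
}

my ($lhs, $rhs) = equal_digits('xyz45sd2-32d34-sd23-456562.abc.com');
print "$lhs $rhs\n";
```

The loop stops at the first n that fails, so it runs at most one iteration more than the length of the shorter digit run.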
Theoretical answer
Short answer:
What you're looking for is not possible using regular expressions.
Long Answer:
Regular expressions (as their name suggests) are a compact representation of regular languages (Type-3 grammars in the Chomsky hierarchy).
What you're looking for is not possible using regular expressions because you're trying to write an expression that maintains some kind of count (contextual information other than beginning and end). This kind of behavior cannot be modelled by a DFA (or, in fact, by any finite automaton). Informally, a language is regular exactly when there exists a DFA that accepts it. Since this kind of contextual information cannot be modelled in a DFA, you cannot write a regular expression for your problem.
Practical Solution
my ($lhs,$rhs) = $test =~ /^[^-]+-[^-]+-([^-]+)-([^-.]+)\S+/;
# Alternatively and faster
my (undef,undef,$lhs,$rhs) = split /-/, $test;
# Rest is common, no matter how $lhs and $rhs are extracted.
my @left = reverse split //, $lhs;
my @right = split //, $rhs;
my $i;
for($i=0; exists($left[$i]) and exists($right[$i]) and $left[$i] =~ /\d/ and $right[$i] =~ /\d/ ; ++$i){}
--$i;
$lhs= join "", reverse @left[0..$i];
$rhs= join "", @right[0..$i];
print $lhs, "\t", $rhs, "\n";
Edit: It's possible to improve my solution by using regular expressions to extract the required numeric portions of $lhs and $rhs instead of split, reverse and for.
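One way that improvement might look, as a sketch only: a single pattern grabs the digit run on each side of the last -, and substr trims both to the shorter length. The helper name digits_around_last_dash and the use of min from List::Util are choices made for this illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: equal-length digit runs on either side of the last '-'.
sub digits_around_last_dash {
    my ($str) = @_;
    # The first (\d+) is the digit run ending at the '-', the second the
    # run starting right after it; [^-]*$ guarantees no later '-' follows.
    return unless $str =~ /(\d+)-(\d+)[^-]*$/;
    my ($lhs, $rhs) = ($1, $2);
    my $n = min(length($lhs), length($rhs));
    # Last $n digits of the left run, first $n digits of the right run.
    return (substr($lhs, -$n), substr($rhs, 0, $n));
}

my ($lhs, $rhs) = digits_around_last_dash('xyz45sd2-32d34-sd23-456562.abc.com');
print "$lhs\t$rhs\n";
```

The regex only locates the two digit runs; the length comparison happens in ordinary Perl code outside the pattern.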
As @Samveen said, it's technically not possible to do in pure regex.
Like @Samveen's solution, here's another version:
#get left and right
my (undef,undef,$left,$right) = split /-/, $test;
#get left numbers
$left =~ s/.*?(\d+)$/$1/;
##get right numbers
$right =~ s/^(\d+).*/$1/;
##get length of both
my $right_length = length $right;
my $left_length = length $left;
if ($right_length > $left_length){
#make right length as same as left length
$right =~ s/(\d{$left_length}).*/$1/;
} else {
#make left length as same as right length
$left =~ s/.*(\d{$right_length})/$1/;
}
print $left, "\t", $right, "\n";

Search for value from command output and just print that found value

I am calling my program from Perl and capturing the output with:
$output = `$calling 2>>bla.txt`;
Now I need just a specific value that will be presented in the output which I can check with Regex.
The needed output is:
Distance from Segment XY to its Centroid is: 3.455564713591596
Where XY is any number, and I just match for the "to its Centroid is: " the following:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d)*$/)
But how do I get only the value that is presented near to the end?
I just want it to be printed on the screen.
Any advice?
Instead of \d* ("zero or more digits"), you probably need to match \d+([.]\d+)? ("one or more digits, optionally followed by a decimal point and one or more additional digits"). That would give you:
if( $output =~ m/\sto\sits\sCentroid\sis:\s\d+([.]\d+)?$/)
(hat-tip to Jonathan Leffler for pointing that out).
That done — you want to capture the \d+([.]\d+)?, so, wrap it in parentheses to create a capture-group:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d+([.]\d+)?)$/)
and then the special variable $1 will be whatever it captured:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d+([.]\d+)?)$/)
{ print $1; }
See the "Extracting matches" section of the perlretut ("Perl regular expressions tutorial") manual-page.
By the way, \s matches a single white-space character. Usually you'd want either to match only an actual space — write e.g. to its rather than to\sits — or to match one or more white-space characters — e.g. to\s+its.
You print the number you captured in the regex with the parentheses:
print "$1\n" if ($output =~ m/\sto\sits\sCentroid\sis:\s([-+]?\d*\.?\d+)$/);
You also make sure that the regex can pick up a number with a decimal point, and I've allowed an optional sign, too. If you need to worry about optional exponents, add (?:[eE][-+]?\d+)? after the \d+ in my regex.
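As a sketch, here is the pattern with that optional exponent group appended. The helper name centroid_distance and the two sample lines are made up for illustration, following the output format shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number at the end of a
# "... to its Centroid is: N" line, or undef if the line does not match.
sub centroid_distance {
    my ($output) = @_;
    return $output =~ m/\sto\sits\sCentroid\sis:\s([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)$/
        ? $1
        : undef;
}

# Made-up sample lines in the format shown in the question.
for my $line (
    'Distance from Segment 12 to its Centroid is: 3.455564713591596',
    'Distance from Segment 7 to its Centroid is: 1.5e-3',
) {
    my $value = centroid_distance($line);
    print "$value\n" if defined $value;
}
```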
If you have other things to do with the value, then convert into a regular if statement:
if ($output =~ m/\sto\sits\sCentroid\sis:\s([-+]?\d*\.?\d+)$/)
{
print "$1\n";
process_centroid($1);
}