Regex in Perl messed up by bracket

I am new to Perl and have recently run into the following problem.
I have a string of the format " $num1 $num2 $num3 $num4", where $num1, $num2, $num3, and $num4 are real numbers that may be in either scientific or regular notation.
Now I want to extract the 4 numbers from the string using regular expression.
my $real_num = '\s*([+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?)';
while (<FP>) {
    if (/$real_num$real_num$real_num$real_num/) {
        print $1; print $2; print $3; print $4;
    }
}
How can I get $num1, $num2, $num3, and $num4 from $1, $2, $3, and $4? Because of the necessary brackets inside the $real_num regular expression, $1 through $4 are not what I expect.
Thanks for all the warm replies; a non-capturing group is the answer I needed!

Just use non-capturing groups inside your $real_num regex and make the regex itself a capturing group:
$real_num = '\s*([+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?)'
Now, the problem is that /$real_num$real_num$real_num$real_num/ can easily go wrong if there are more than 4 numbers on the line. Maybe that is not the case now, but you should take care of it as well; a split would be a better option.
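For reference, a minimal sketch of the regex route with the non-capturing version (assuming the FP handle from the question is already open; anchoring the match is an extra guard against stray trailing data):
my $real_num = '\s*([+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?)';
while (<FP>) {
    if (/^$real_num$real_num$real_num$real_num\s*$/) {
        # with the inner groups non-capturing, $1..$4 are exactly the four numbers
        print "$1 $2 $3 $4\n";
    }
}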

If you are sure that your lines contain only numbers, you can avoid that regexp and use the split function:
while (<FP>) {
    my @numbers = split /\s+/;   # <-- an array with the parsed numbers
}
If you need to check whether the extracted strings are really numbers, use looks_like_number from Scalar::Util. Example:
use strict;
use warnings;
use feature 'say';
use Scalar::Util qw/looks_like_number/;

while (<DATA>) {
    my @numbers = split /\s+/;
    @numbers = map { looks_like_number($_) ? $_ : undef } @numbers;
    say "@numbers";
}
__DATA__
1 2 NaN 4 -1.23
5 6 f 8 1.32e12
Prints:
1 2 NaN 4 -1.23
5 6 8 1.32e12

The answers to two important questions will affect whether you even need to use a regular expression to match the various number formats, or if you can do something much simpler:
Are you certain that your lines contain numbers only or do they also contain other data (or possibly some lines have no numbers at all and only other data)?
Are you certain that all numbers are separated from each other and/or from other data by at least one space? If not, how are they separated? (For example, output from portsnap fetch generates lots of numbers like 3690....3700...., with decimal points and no spaces at all used to separate them.)
If your lines contain only numbers and no other data, and the numbers are separated by spaces, then you do not even need to check that the results are numbers; you only need to split the line apart:
my @numbers = split /\s+/;
If you are not sure that your lines contain numbers, but you are sure that there is at least one space between each number and any other numbers or other data, then the next line of code is a quite good way of extracting the numbers, with a clever trick that lets Perl itself recognize all the many different legal number formats. (This assumes that you do not want to convert other data values to NaN.) The result in @numbers will be every number properly recognized within the current line of input.
my @numbers = grep { 1*$_ eq $_ } m/(\S*\d\S*)/g;
# we could do simply a split, but this is more efficient because when
# non-numeric data is present, it will only perform the number
# validation on data pieces that actually do contain at least one digit
You can determine whether at least one number was present by checking the truth value of the expression @numbers >= 1, and whether exactly four were present by using the condition @numbers == 4, etc.
If your numbers are bumped up against each other, for instance, 5.17e+7-4.0e-1 then you will have a more difficult time. That is the only time you will need complicated regular expressions.
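As a rough illustration of that harder case, a single global match with a fuller number pattern can still pull run-together values apart (a sketch only, not tuned for every possible format):
my $str  = '5.17e+7-4.0e-1';
my @nums = $str =~ /([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)/g;
# @nums now holds ('5.17e+7', '-4.0e-1')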
Note: Updated code to be even faster/better.
Note 2: There is a problem with the most up-voted answer due to a subtlety of how map works when storing the value of undef. This can be illustrated by the output from that program when using it to extract numbers from the first line of data such as an HTTP log file. The output looks correct, but the array actually has many empty elements and one would not find the first number stored in $numbers[0] as expected. In fact, this is the full output:
$ head -1 http | perl prog1.pl
Use of uninitialized value $numbers[0] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[1] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[2] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[3] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[4] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[5] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[6] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[7] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[10] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[11] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[12] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[13] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[14] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[15] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[16] in join or string at prog1.pl line 8, <> line 1.
200 2206
(Note that the indentation of these numbers shows how many empty array elements are present in @numbers and have been joined together by spaces before the actual numbers when the array has been converted to a string.)
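For a small, self-contained illustration of the difference (hypothetical fields, not the HTTP data above): map preserves the length of the list, so non-numbers become undef "holes", while grep drops them entirely.
use Scalar::Util qw/looks_like_number/;

my @fields     = ('GET', '200', '/index.html', '2206');        # hypothetical log fields
my @with_holes = map  { looks_like_number($_) ? $_ : undef } @fields;
# @with_holes is (undef, 200, undef, 2206) -- $with_holes[0] is NOT the first number
my @packed     = grep { looks_like_number($_) } @fields;
# @packed is (200, 2206) -- $packed[0] IS the first number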
However, my solution produces the proper results both visually and in the actual array contents, i.e., $numbers[0], $numbers[1], etc., really are the first and second numbers contained in the line of the data file.
while (<>) {
    my @numbers = m/(\S*\d\S*)/g;
    @numbers = grep { $_ eq 1*$_ } @numbers;
    print "@numbers\n";
}
$ head -1 http | perl prog2.pl
200 2206
Also, the call to the slow looks_like_number library function makes the other solution run about 50% slower. Output was otherwise identical when running the two programs on 10,000 lines of data.

My previous answer did not address the issue of non-space separated numbers. This requires a separate answer in my opinion, since the output can be drastically different from the same data.
my $number = '([-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?)';
my $type = shift;
if ($type eq 'all') {
    while (<>) {
        my @all_numbers = m/$number/g;
        # finds legal numbers whether space separated or not
        # this can be great, but it also means the string
        # 120.120.120.120 (an IP address) will return
        # 120.120, .120, and .120
        print "@all_numbers\n";
    }
} else {
    while (<>) {
        my @ss_numbers = grep { m/^$number$/ } split /\s+/;
        # finds only space separated numbers
        print "@ss_numbers\n";
    }
}
Usage:
$ prog-jkm2.pl all < input # prints all numbers
$ prog-jkm2.pl < input # prints just space-separated numbers
The only code that the OP probably needs:
my $number = '(-?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?)';
my @numbers = grep { m/^$number$/ } split /\s+/;
At this point, $numbers[0] will be the first number, $numbers[1] is the second number, etc.
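A minimal sketch of how those two lines might slot into the OP's original loop (assuming FP is the open handle from the question and that exactly four space-separated numbers are wanted per line):
my $number = '(-?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?)';
while (<FP>) {
    my @numbers = grep { m/^$number$/ } split /\s+/;
    next unless @numbers == 4;   # skip lines that do not contain exactly four numbers
    my ($num1, $num2, $num3, $num4) = @numbers;
    print "$num1 $num2 $num3 $num4\n";
}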
Examples of output:
$ head -1 http | perl prog-jkm2.pl
200 2206
$ head -1 http | perl prog-jkm2.pl all
67.195 .114 .38 19 2011 01 20 31 -0400 1 1 1.0 200 2206 5.0

Related

awk : set start and end of match

I have a LaTeX-like table like this (columns are delimited by &):
foobar99 & 68
foobar4 & 43
foobar2 & 73
I want to get the index of the numbers at column 2 by using match.
In Vim, we can use \zs and \ze to set start and end of matching.
Thus, to match accurately number at colum 2, we can use ^.*&\s*\zs[[:digit:]]\+\ze\s*$.
How about awk? Is there an equivalent?
EDIT:
Matching for the first line:
foobar99 & 68
^^
123456789012345678
Expected output : 18.
EDIT2:
I am writing an awk script to deal with blocks delimited by blank lines (hence FS="\n" and RS=""). The MWE above is just one of these blocks.
A possible way to get the index of the number at column 2 is to do something like this
split(line, cases, "&");
idx = match(cases[2], /[[:digit:]]+/);
but I am looking for a beautiful way to do this.
Apologies for the XY problem. But I'm still interested in start/end matching.
Too little context, so a simple guess: have you tried splitting the table into columns? With something like awk -F '\\s*&\\s*' you have your second column in $2.
In fact, you can use split() to retrieve the exact column of a string:
split(s, a[, fs ])
Split the string s into array elements a[1], a[2], ..., a[n], and
return n. All elements of the array shall be deleted before the split is
performed. The separation shall be done with the ERE fs or with the field
separator FS if fs is not given. Each array element shall have a
string value when created and, if appropriate, the array element
shall be considered a numeric string (see Expressions in awk). The
effect of a null string as the value of fs is unspecified.
So your second column is something like
split(s, a, /\s*&\s*/)
secondColumn = a[2]
By default, awk sees three columns in your data, and column 2 contains & only (and column 3 contains numbers). If you change the field delimiter to &, then you have two columns with trailing spaces in column 1 and leading spaces in column 2 (and some trailing spaces, as it happens; try copying the data from the question).
In awk, you could convert column 2 with leading spaces into a number by adding 0: $2 + 0 would force it to be treated as a number. If you use $2 in a numeric context, it'll be treated as a number. Conversely, you can force awk to treat a field as a string by concatenating with the empty string: $2 "" will be a string.
So there's no need for the complexity of regexes to get at the number — if the data is as simple as shown.
You say you want to use match; it is not clear what you need that for.
awk -F'&' '{ printf "F1 [%s], F2 [%10s] = [%d] = [%-6d] = [%06d]\n", $1, $2, $2, $2, $2 }' data
For your data, which has a single blank at the end of the first two lines and a double blank at the end of the third, the output is:
F1 [foobar99 ], F2 [ 68 ] = [68] = [68 ] = [000068]
F1 [foobar4 ], F2 [ 43 ] = [43] = [43 ] = [000043]
F1 [foobar2 ], F2 [ 73 ] = [73] = [73 ] = [000073]
Note that I didn't need to explicitly convert $2 to a number. The printf formats treated it as a string or a number depending on whether I used %s or %d.
If you need to, you can strip trailing blanks of $1 (or, indeed, $2), but without knowing what else you need to do, it's hard to demonstrate alternatives usefully.
So, I think awk does what you need without needing you to jump through much in the way of hoops. For a better explanation, you'd need to provide a better question, describing or showing what you want to do.
You can try this way
awk '{print index($0,$3)}' infile

BASH: Split strings without any delimiter and keep only first sub-string

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In fact, in some of the rows a row name appears n times in a concatenated way without any space. I need a script that can identify where the duplication starts and remove all duplications.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (e.g. the number of _ and . characters), but the repeats are always concatenated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+)\1+: First capturing group. .+ matches any character (except for line terminators), + is the quantifier — matches between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
From what I understand, you need each line to contain only one EXAMPLE1.ABC_DEF.panel4. In that case you can do the following:
First remove duplicates in one line:
sed -i "s/EXAMPLE1.ABC_DEF.panel4.*/EXAMPLE1.ABC_DEF.panel4/g"
Then remove duplicated lines:
awk '!a[$0]++'
If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz) then this should work-
awk -F"EXAMPLE" '{print FS $2}' file
This takes "EXAMPLE" as the field delimiter and asks it to print only the first 'column'. It prepends "EXAMPLE" to this first column (by calling the inbuilt awk variable FS). Thanks, #andlrc.
Not an ideal solution but may be good enough for this purpose.
This script, whose first argument is the string to test, retrieves the longest repeated substring (i.e. for "totototo" it gives "toto", not "to").
#!/usr/bin/env bash
row_name="$1"

# test candidate repeats from the longest to the smallest:
# i is the number of equal pieces we try to split the string into
for (( i=2; i<${#row_name}; i++ ))
do
    match="True"
    # only continue if the string length divides evenly into i pieces
    if (( ${#row_name} % i )); then
        continue
    fi
    # length of the potential repeated substring
    len_sub=$(( ${#row_name} / i ))
    # test whether the first chunk is equal to each of the others
    for (( s=1; s<i; s++ ))
    do
        if ! [ "${row_name:0:${len_sub}}" = "${row_name:$((len_sub * s)):${len_sub}}" ]; then
            match="False"
            break
        fi
    done
    # all chunks are equal, so keep only the string without the duplicates
    if [ $match = "True" ]; then
        row_name="${row_name:0:${len_sub}}"
        break
    fi
done
echo "$row_name"

Perl regex from file.txt, match columns greater than x

I have a file containing several rows of data, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";

# Print out.
while (<IN>) {
    print "Number: $1\n"
        while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split that line and then pick and choose:
while (my $line = <IN>) {
    my @nums = split ',\s*', $line;
    print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex, then you may want to use the /e modifier so that the replacement part of the substitution can run code. There you can test the particular match and print it.
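A rough sketch of that /e idea (assuming the OP's IN handle and comma layout; the empty string at the end simply becomes the replacement once the number has been printed):
while ( my $line = <IN> ) {
    $line =~ s{[^,]+,[^,]+,\s*([^,]+),}{ print "Number: $1\n" if $1 >= 59; '' }e;
}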
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group, which captures the number 59 and its decimals, or two-digit numbers from 60-99 and their decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, communicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool to do numeric comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59
Assuming the line ends with a comma:
print foreach map { s/.+?(\d+\.\d+),$/$1/; $_ } <IN>;
In case there might be something after the rightmost comma:
print foreach map { s/.+?(\d+\.\d+),[^,]*$/$1/; $_ } <IN>;
But I wouldn't use a regexp in that case:
print foreach map { (split ',')[-2] } <IN>;
I would suggest not using a regex when split is a better tool for the job. Likewise, regex is very bad at handling numeric values; it works on text-based patterns.
But how about:
while ( <> ) {
    print( (split /,\s*/)[2], "\n" );
}
If you want to test a conditional:
while ( <> ) {
    my @fields = split /,\s*/;
    print $fields[2], "\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However, it's probably worth noting that a three-argument open with a lexical filehandle is recommended these days.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.
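For instance, a hedged sketch combining the lexical handle with the split approach above (file.txt as in the question):
open( my $input, '<', 'file.txt' ) or die "Can't open file.txt: $!";
while ( my $line = <$input> ) {
    my @fields = split /,\s*/, $line;
    print "$fields[2]\n" if $fields[2] >= 59;
}
close $input;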

Get part of a string based on conditions using regex

For the life of me, I can't figure out the combination of regular expression characters to parse the part of the string I want. The string is one line out of 400 thousand (out of order) processed in a for loop; I locate it by matching against a unique ID number passed in from an array for loop.
For every string I'm trying to get a date number (such as 20151212 below).
Given the following examples of the strings (pulled from a CSV file with 400k+ lines):
String1:
314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,
String2:
365422,johnd@blankity.com,John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,
String3:
6231,,,,31248,U51523144,,,CB,,,,,,,
There are several complications here...
Some names have a "," in them, so it makes it more than 15 commas.
We don't know the value of the date, just that it is a date format such as (get-date).tostring("yyyyMMdd")
For those who can think of a better way...
We are given two CSV files to match. Algorithmic steps:
Look in the CSV file 1 for the ID Number (found on the 2nd column)
** No ID Numbers will be blank for CSV file 1
Look in CSV file 2 and match the ID number from CSV file 1. On this same line, get the date. Once we have the date, append it in the 5th column of CSV file 1 on the same row as the ID number
** Note: CSV file 2 will have $null for some of the values in the ID number column
I'm open to suggestions (including using the Import-Csv cmdlet, though I am not yet familiar with its parameters or with the syntax of for loops over its output).
You could try something like this:
,(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),
This will match all dates in the given format from 1900 - 2099. It is also specific enough to rule out most other random numbers, although without a larger sample of data, it's impossible to say.
Then in PowerShell:
gc data.csv | where { $_ -match ",((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
In the PowerShell match we added capturing parentheses around what we want, and reference the group via its group number in the $matches index.
If you are only interested in matching one line based on a preceding id you could use a lookbehind. For example,
$id=314513; # Or maybe U23481
gc c:\temp\reg.txt | where { $_ -match "(?<=$id.*),((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }

Generate substrings from a string in Perl

I have a string of characters that I want to break down into its substrings on the spaces between words, but a substring should not span more than 4 spaces.
E.g.: String:
"Baicalein, a specific lipoxygenase (LOX) inhibitor, has anti-inflammatory and antioxidant effects."
The resulting substrings should look like
1. Baicalein,
2. Baicalein, a
3. Baicalein, a specific
4. Baicalein, a specific lipoxygenase
5. Baicalein, a specific lipoxygenase (LOX)
6. a
7. a specific...
I feel there must be some way with Regex, but I'm not sure
EDIT
Code that I have used:
my @arr = split('\s', $line);
for (my $i=0; $i<$#arr; $i++)
{
    my $str1 = $arr[$i];
    my $str2 = $arr[$i].' '.$arr[$i+1];
    my $str3 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2];
    my $str4 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2].' '.$arr[$i+3];
}
I have very long strings, and this approach takes a lot of time.
Thanks in advance
You could create an inner loop to avoid the repeated code. Also, repeatedly gluing stuff with the dot operator is less efficient.
my @substrings;
for (my $i=0; $i<=$#arr; ++$i)
{
    for (my $j=0; $j<5 && $i+$j<=$#arr; ++$j)
    {
        push @substrings, join(' ', @arr[$i..$i+$j]);
    }
}
You'll notice the additional boundary condition to prevent the inner loop from going past the end of the input array, and the use of a new array @substrings to contain the results. Finally, see how indentation helps you see what goes where.
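For completeness, a hedged, self-contained sketch wiring that up to the sample sentence from the question:
use strict;
use warnings;

my $line = 'Baicalein, a specific lipoxygenase (LOX) inhibitor, has anti-inflammatory and antioxidant effects.';
my @arr  = split ' ', $line;   # split on runs of whitespace

my @substrings;
for (my $i = 0; $i <= $#arr; ++$i) {
    for (my $j = 0; $j < 5 && $i + $j <= $#arr; ++$j) {
        push @substrings, join ' ', @arr[$i .. $i + $j];
    }
}
print "$_\n" for @substrings;   # 1- to 5-word substrings from each start position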