Perl Regex (\d*\.\d{2}) - regex

I have run into a regex in Perl that seems to be giving me problems. I'm fairly new to Perl - but I don't think that's my problem.
Here is the code:
if ($line =~ m/<amount>(\d*\.\d{2})<\//) { $amount = $1; }
I'm essentially parsing an XML formatted file for a single tag. Here is the specific value that I'm trying to parse.
<amount>23.00000</amount>
Can someone please explain why my regex won't work?
EDIT: I should mention I'm trying to import the amount as a currency value. The trailing 3 decimals are useless.

You shouldn't use a regex to parse XML, but regardless, this will fix it:
if ($line =~ m|<amount>(\d*\.\d{2})\d*</|) { $amount = $1; }

The \d*\.\d{2} regex fragment only recognizes a number with exactly two decimal places. Your sample has five decimal places, and thus does not match this fragment.
You want to use \d*\.\d+ if you need at least one decimal place, or \d*\.\d{2,5} if you can have between 2 and 5 decimal places.
And you should not use back-tick characters in your regex, as they have no special meaning in a regex and are therefore treated as literal characters.
So you want to use:
if ($line =~ m/<amount>(\d*\.\d{2,5})<\/amount>/) { $amount = $1; }

In a regex pattern, the sequence "{2}" means match exactly two instances of the preceding pattern.
So \d{2} will only match two digits, whereas your input text had five digits at that point.
If you don't want the trailing digits, then you can discard them using \d* outside the capture-parentheses.
Also, if your pattern contains slashes, consider using a different delimiter to avoid having to escape the slashes, e.g.
if ($line =~ m{<amount>(\d*\.\d{2})\d*</}) { $amount = $1; }
Also, if you want to parse XML, then you may want to consider using an XML library such as XML::LibXML.
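As a minimal sketch of the approach described above (capture two decimals, discard the rest with \d* outside the parentheses), the following can be run on its own. The sample line is the one from the question; the closing tag is spelled out in full here, which is an assumption about how strict the match should be.

```perl
use strict;
use warnings;

# Sample input taken from the question.
my $line = '<amount>23.00000</amount>';

my $amount;
# m{...} avoids having to escape the slash in the closing tag.
# The trailing \d* sits outside the capture group, so any extra
# decimals are matched but not kept.
if ($line =~ m{<amount>(\d*\.\d{2})\d*</amount>}) {
    $amount = $1;
}

print defined $amount ? "$amount\n" : "no match\n";
```

Note that this truncates rather than rounds; if rounding matters for the currency value, that has to be handled separately.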

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, with or without a trailing ;. The captured value in $1 will be 0 or 0.39.
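To check that pattern against the variations mentioned in the question, here is a small sketch. The three sample strings are shortened, made-up versions of the INFO field (value in the middle, at the end, and a bare 0 at the start); they are assumptions for illustration, not real data.

```perl
use strict;
use warnings;

# Hypothetical, shortened INFO strings covering the three cases.
my @samples = (
    'AF=0.07;EUR_AF=0.39;ASN_AF=0.05',   # followed by ';'
    'AF=0.07;EUR_AF=0.39',               # last field, no ';'
    'EUR_AF=0;AF=0.07',                  # first field, value is 0
);

my @values;
for my $info (@samples) {
    # (?:^|;) ensures we match the EUR_AF key itself, not a longer key
    # that merely ends in EUR_AF. [^;]* stops at the next ';' or at the
    # end of the string, whichever comes first.
    my ($eur_af) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    push @values, $eur_af;
    print "$eur_af\n";
}
```

Matching in list context returns the capture directly, so there is no need to read $1 afterwards.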

Pre-compiled regex with special characters matching

I'm trying to match if a word such as *FOO (* as a normal character) is in a line. My input is a C++ source code. I need to use a pre-compiled regex for this due to program flow requirements, so I tried the following:
$pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
And I use it like this:
if ($line =~ m/$pattern/) { ... }
It works and catches lines containing *FOO such as hey *FOO.BAR but also matches lines such as:
//FOO programming using stuff and things
which I want to ignore. What am I missing? Is \* not the right way to escape * in a pre-compiled regex in perl? If *FOO is stored in $word and the pattern looks like this:
$pattern = qr/[^a-zA-Z](\\$word)[^a-zA-Z]|^\s*(\\$word)[^a-zA-Z]/;
Is that different from the previous pattern? Because I tried both and the result seems to be the same.
I found a way to bypass this problem by removing the first char of $word and escaping * in the pattern, but if $word = "**.?FOO" for example, how do I create a qr// with $word so that all the meta-characters are escaped?
You do need to escape the *. One way to do it is by the quotemeta \Q operator:
use warnings;
use strict;
my $qr = qr/\Q*FOO/;
while (<DATA>) { print if /$qr/ }
__DATA__
//FOO programming using stuff and things
hey *FOO.BAR
Note that this escapes all ASCII non-"word" characters through the rest of the pattern. If you need to limit its action to only a part of the pattern then stop it using \E. Please see linked docs.
The above determines whether *FOO is in the line, regardless of whether it is a word or a part of one. It is not clear to me which is needed. Once that is specified the pattern can be adjusted.
Note that /\*FOO/ works, too. What you tried probably failed because of all the rest that you are trying to match, whose purpose I do not understand. If you only need to detect whether the pattern is present, the above does it. If there is a more specific requirement, please clarify.
As for the examples: for me that string //FOO... is not matched by the main (first) $pattern you show. The second one won't interpolate $word -- but is firstly much too convoluted. Regexes can really tie one in nasty knots when pushed; I suggest keeping it as simple as possible.
Question 1:
my $word = '*FOO';
my $pattern = qr/\\$word/;
is equivalent to
my $pattern = qr/\\*FOO/; # zero or more '\' followed by 'FOO'
The $word is simply interpolated as is.
To get something equivalent to
my $pattern = qr/\*FOO/;
you should use
my $word = '*FOO';
my $pattern = qr/\Q$word\E/;
By default, an interpolated variable is treated as a mini regular expression: metacharacters in the variable such as *, +, ? are still interpreted as metacharacters. \Q...\E adds a backslash before any character not matching /[A-Za-z_0-9]/, so any metacharacters in the interpolated variable are interpreted as literal characters. Refer to perldoc.
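A short sketch of \Q...\E with the trickier word from the question, "**.?FOO". The two test lines are made up for illustration; only the comment line is taken from the question.

```perl
use strict;
use warnings;

my $word    = '**.?FOO';
# \Q...\E escapes every non-word character in $word, so the compiled
# pattern matches the literal text **.?FOO.
my $pattern = qr/\Q$word\E/;

# Hypothetical input lines.
my @lines = (
    'x = **.?FOO;',
    '//FOO programming using stuff and things',
);

my @hits = grep { $_ =~ $pattern } @lines;
print "$_\n" for @hits;
```

Without \Q, the leading * in $word would be a quantifier with nothing to quantify, so it could not stand for a literal asterisk.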
Question 2
I tried
my $pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
my $line = '//FOO programming using stuff and things';
if ($line =~ m/$pattern/) {
    print "$&\n";
}
else {
    print "No match!";
}
and it printed "No match!". I can't explain how you get it matched.

Regex to match hours and time

I'm still learning Perl regular expressions and I need to match a string that represents the time.
However, there are instances where multiple times get entered. Instead of '9AM' I will sometimes get '9AM5PM' or '09AM05PM' and so on... Fortunately, it always starts with one or two numbers and ends with 'AM' or 'PM' (upper or lowercase).
Here's what I have so far:
$string =~ /^((([1-9])|(1[0-2]))*(A|P)M)$/i;
Any help would be greatly appreciated!
The only problem I can see with your own code is that the hours field is optional (because you use a *) but you don't say what issues you're having.
You do have a lot of unnecessary captures. Every part of the pattern that is enclosed in parentheses will capture the corresponding part of the target string in internal variables called $1, $2 etc. Unless you really need those captures it is best to use non-capturing parentheses (?: ... ) instead of the plain ones ( ... ).
Character classes like [1-9] are a single entity and don't need enclosing in parentheses. You also haven't accounted for a leading zero on values less than ten, and you should use a character class [AP] instead of an alternation (?:A|P)
It looks like you need
/\d{1,2}[AP]M/i
But you don't say what you want to do with the times once you have found them.
This snippet of code demonstrates the functionality by putting all the times that it finds in a string into array @times and then printing it with space separators.
use strict;
use warnings;
for my $string (qw/ 9AM 9AM5PM 09AM05PM /) {
    my @times = $string =~ /\d{1,2}[AP]M/ig;
    print "@times\n";
}
output
9AM
9AM 5PM
09AM 05PM
If you really want to verify that the hour value is in range (are you likely to come across 35pm?) then you could write
my @times = $string =~ / (?: 1[012] | 0?[1-9] ) [AP]M /igx;
Note that the /x modifier makes whitespace insignificant within regular expressions, so that it can be used to clarify the form of the pattern.
You can try something like:
$string =~ /^((0?\d|1[0-2])[AP]M)+$/i;
As you can see here. Or:
$string =~ /^((0?\d|1[0-2])[AP]M){1,2}$/i;
If you want to allow at most two times together.
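The two ideas in this thread (validate the whole string, then pull out the individual times) can be combined. The following is a sketch, not a definitive solution; the names $time_re and %times_for are made up for illustration, and '35PM' is added as a deliberately invalid input.

```perl
use strict;
use warnings;

# One time: an hour from 1 to 12 (optional leading zero), then AM or PM.
my $time_re = qr/(?:1[0-2]|0?[1-9])[AP]M/i;

my %times_for;
for my $string (qw/ 9AM 9AM5PM 09am05pm 35PM /) {
    if ($string =~ /^(?:$time_re)+$/) {
        # The whole string is one or more valid times: extract each one.
        $times_for{$string} = [ $string =~ /$time_re/g ];
    }
    else {
        # Anything else is rejected as a whole.
        $times_for{$string} = [];
    }
    print "$string: @{ $times_for{$string} }\n";
}
```

Because the /i flag is compiled into the qr// object, it still applies when $time_re is interpolated into the larger anchored pattern.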

Search html file for random string using regex

I am trying to use Perl to search through an html file, looking for a semi-random string and store the match in a variable or print it out.
The string is the name of a jpg image and always follows the pattern of 9 digits followed by 6 lower case letters, i.e.
140005917smpxgj.jpg
But it is random every time. I am sure Perl can do this, but I will admit I am getting a bit confused.
Not too complicated. You may want to watch out for varying caps in the extension, e.g. JPG. If that is a concern, you may add (?i) before the extension.
You may also wish to prevent partial names, e.g. discard a match that has more than 9 digits. That is the (?<!\d) part: Make sure no digit characters precede the match.
ETA: Now extracts multiple matches too, thanks to ikegami.
use feature 'say';
my @match;
while (<>) {
    for (/(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g) {
        say;
        push @match, $_;
    }
}
Try this regex:
/\b\d{9}[a-z]{6}\.jpg/
perldoc perlre
use warnings;
use strict;
while (<DATA>) {
if (/ ( [0-9]{9} [a-z]{6} [.] jpg ) /x) {
print "$1\n";
}
}
__DATA__
foo 140005917smpxgj.jpg bar
sdfads 777666999abcdef.jpg dfgffgh
Prints:
140005917smpxgj.jpg
777666999abcdef.jpg
the solution regex is \d{9}[a-z]{6}\.jpg

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my @digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", @digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
It would depend on your flavor of regex and coding language (C#, Perl, etc.), but in C# you'd do something like
string rX = @"\D+";
input = Regex.Replace(input, rX, "");
return input.Substring(0, 5);
Note: I'm not sure about that Regex match (others here may have a better one), but basically since Regex itself doesn't "replace" anything, only match patterns, you'd have to look for any non-digit characters; once you'd matched that, you'd need to replace it with your language's version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
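The "loop over the first five digit matches" option mentioned above could look like this in Perl. This is only a sketch; the input is the example from the question and $first_five is a name chosen for illustration.

```perl
use strict;
use warnings;

# Example input from the question (two lines of text).
my $str = "15203 Main Street\nApartment 3 63110";

my $first_five = '';
# In scalar context, /g continues from the previous match position,
# so each pass through the loop picks up the next digit.
while ($str =~ /(\d)/g) {
    $first_five .= $1;
    last if length($first_five) == 5;   # stop once we have five digits
}

print "$first_five\n";
```

The pattern stays trivial (a single digit) and the "five" lives in ordinary code, which is the separation this answer argues for.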
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, @"\D+", "");
to remove all non number characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
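# Copy over
$temp = $str;
# Remove all non-digits (/g is needed to remove every one, not just the first)
$temp =~ s/\D//g;
# Get the first 5 digits, exactly; the parentheses capture them into $1.
$temp =~ /(\d{5})/;
# Grab the match; ASSUMES that there will be a match.
$first_digits = $1;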
$result =~ s/^(\d{5}).*/$1/s;
This replaces text that starts with exactly five digits (\d{5}), followed by any number of any characters (.*), with $1, which holds whatever the parentheses captured: the first five digits. The /s modifier lets . match newlines, so the rest of a multi-line input is removed as well.
If you want the first 5 characters of any kind:
$result =~ s/^(.{5}).*/$1/s;
Use whatever programming language you are using to evaluate this.
ie.
regex.replace(text, "^(.{5}).*", "$1");