perl regex to convert currency - regex

I need some help with a text cleaning/normalization process.
I'm stuck at a place where I need to convert a currency format:
input: $100 million output: 100 million dollar
input: eur20 million output: 20 million euros
I'm using Perl regexes for the cleaning process and would appreciate a regex that converts the input to the output.
This is my code so far:
s/([\$])([0-9\.])([million])/ $2 $3 dollars/g;
The example number is $4.2 million.
This is what I tried for converting the dollar symbol into the word "dollars" and moving it to the end of the phrase, but it does not give the expected result: it gives me ".2 million" as output.

[...] in a regex introduces a character class, so [million] is the same as [nolim], and it matches one of those characters.
I'd create a translation table for the currencies in a hash. From the keys of the hash, you can build a regex that matches them, and use it in the replacement:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };
my %currency = ( '$' => 'dollar', # or dollars?
                 eur => 'euros',
                 '€' => 'euros',
               );
my $regex = join '|', map quotemeta, keys %currency;
for my $input ('$100 million', 'eur20 million', '€13.2 thousand') {
    ( my $output = $input )
        =~ s/($regex)([0-9.]+ (?:million|thousand))/$2 $currency{$1}/g;
    say $output;
}

Your regex does not give the result you claim it does.
s/([\$])([0-9.])([million])/ $2 $3 dollars/g;
With the help of the /x modifier we can add whitespace (even newlines and comments) to the pattern to improve readability. Your pattern can then be re-written as
s/([\$])      # match a literal $ and capture that as $1
  ([0-9.])    # match ONE digit or a dot and capture as $2
  ([million]) # match ONE character of 'm', 'i', 'l', 'o', 'n'
              # and capture as $3
 / $2 $3 dollars/gx;
There is no way $100 million matches this pattern and results in .2 million. Possible inputs would be
$3i, $.o or $9m. They would give 3 i dollars, . o dollars, and 9 m dollars.
What you are looking for is a pattern like this:
s/\$        # a literal '$'
  ([\d.]+)  # one or more digits or dots, like e.g. '99.5',
            # captured as $1
  \s+       # one or more whitespace
  (million) # the literal text 'million', captured as $2
 /$1 $2 dollars/gx;
(or, as a one-liner: s/\$([\d.]+)\s+(million)/$1 $2 dollars/g;)
Note that $2 in this case always is million and you could also rewrite it as s/\$([\d.]+)\s+million/$1 million dollars/g; (omitting the () around million).
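To check the corrected substitution against the question's example, here is a minimal sketch. The sub name dollars_to_words and the list of magnitude words are illustrative assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: moves a leading '$' amount to the end as "dollars".
# The magnitude words are an assumed list; extend it as needed.
sub dollars_to_words {
    my ($text) = @_;
    $text =~ s/\$([\d.]+)\s+(thousand|million|billion)/$1 $2 dollars/g;
    return $text;
}

print dollars_to_words('example number is $4.2 million'), "\n";
```

Because the number is captured with [\d.]+ rather than a single-character class, the whole of 4.2 is kept and moved in front of the unit.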

Related

Perl regex exclude optional word from match

I have some strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract the data below from the example above.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    # print only when the line actually matched, so a stale $1 is never used
    print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative is to get only the values as the match, using \K to reset the start of the match:
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
    print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the right, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
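As a rough illustration of how the two groups come back in Perl, the pattern can be used in list context. The sub name parse_number_field is a hypothetical helper introduced only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, value), or an empty list if the
# line does not contain a whole-word "number:" or "icnnumber:" field.
sub parse_number_field {
    my ($line) = @_;
    my @groups = $line =~ /\b((?:icn)?number):([^\W_]+)/;
    return @groups;
}

for my $line ('icnnumber:9876AB54321_IN', 'number:987654321FR', 'icnnumber:987654321YQ') {
    my ($label, $value) = parse_number_field($line) or next;
    print "$label => $value\n";
}
```

Group 2 stops at the underscore because [^\W_] excludes it, so the _IN suffix never ends up in the value.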
Just another suggestion: if your strings are always valid, you may consider simply splitting on a character class and pulling the second element from the resulting list:
my $string = "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
    my $el = (split /[:_]/, $_)[1];
    print "$el\n";
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is my attempt anyway.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
The first approach captures everything up to an _ or a word boundary:
my @arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
The second approach captures a run of characters that are neither non-word characters (\W) nor _, and collects the matches in the array:
my @arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Get date in different format in Perl?

I need to get a date that could be in any of 3 possible formats.
11/20/2012
11.20.2012
11-20-2012
How can I achieve this in Perl? I'm trying a regex to get what I want. Here's my code.
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012"); #array values may vary in every run
foreach my $date (@dates){
    $date =~ /[-.\/\d+]/g;
    print "Date: $date \n";
}
I want the output to be the following (the code above doesn't print anything):
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
Where am I wrong? Please Help. Thanks
Note: As far as possible, I want to achieve this without using any CPAN module. I know there are a lot of CPAN modules that could provide what I want.
Your code almost produces what you want. I assume your input is a bit more complicated, or you have posted code that you are not actually running.
Either way, the problem is this
$date =~ /[-.\/\d+]/g;
First off, your + quantifier is inside the character class; it should come after it. Second, this is just a pattern match: you need to use it in list context and store its return value:
my ($match) = $date =~ /[-.\/\d]+/g;
print "Date: $match\n";
Then it will return the first of the strings found that contains one or more of dash, period, slash or a digit. Be aware that it will match other things as well, as it is a rather loose regex.
Why does it work? Because a pattern match in list context returns a list of the matches when the global /g modifier is used.
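A small sketch of the context difference may help here. The sample string and the sub name date_like_runs are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of digits, dashes, dots and
# slashes found in the string (a /g match evaluated in list context).
sub date_like_runs {
    my ($text) = @_;
    my @runs = $text =~ /[-.\/\d]+/g;
    return @runs;
}

my $text = 'from 11/20/2012 to 2012.11.25';    # assumed sample input

my @all     = date_like_runs($text);           # list: all matches
my ($first) = date_like_runs($text);           # one-element list: first match
my $scalar  = $text =~ /[-.\/\d]+/;            # scalar context: just true/false

print "all:    @all\n";
print "first:  $first\n";
print "scalar: $scalar\n";
```

The parentheses around $first are what force list context; without them the assignment would only store the success flag of the match.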
I highly recommend the DateTime::Format::Strptime module, which has a rich set of functionality. Think not only of parsing strings, but also of checking that the date is valid.
Why not search for the formats one at a time?
=~ m!(\d{2}/\d{2}/\d{4}|\d{4}\.\d{2}\.\d{2}|\d{2}-\d{2}-\d{4})!
should do the trick. Other than that, there's a module dealing with dates called DateTime.
Try matching the formats in turn. The regex below matches any of your permitted separators (/, ., or -) and then requires the same separator via backreference (\2 or \3). Otherwise, you have three possible separators times two possible positions for the year to make six alternatives in your pattern.
#! /usr/bin/env perl
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr<
    \b # begin on word boundary
    (
        (?: [0-9][0-9] ([-/.]) [0-9][0-9] \2 [0-9][0-9][0-9][0-9])
      | (?: [0-9][0-9][0-9][0-9] ([-/.]) [0-9][0-9] \3 [0-9][0-9])
    )
    \b # end on word boundary
>x;
foreach my $date (@dates) {
    if (my($match) = $date =~ /$date_pattern/) {
        print "Date: $match\n";
    }
}
Output:
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
On my first try at the code above, I had \2 in the YYYY-MM-DD alternative where I should have had \3, which failed to match. To spare us counting parentheses, version 5.10.0 added named capture buffers.
Named Capture Buffers
It is now possible to name capturing parenthesis in a pattern and refer to the captured contents by name. The naming syntax is (?<NAME>....). It's possible to backreference to a named buffer with the \k<NAME> syntax. In code, the new magical hashes %+ and %- can be used to access the contents of the capture buffers.
Using this handy feature, the code above becomes
#! /usr/bin/env perl
use 5.10.0; # named capture buffers
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
    \b # begin on word boundary
    (?<date>
        (?: [0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9][0-9][0-9])
      | (?: [0-9][0-9][0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9])
    )
    \b # end on word boundary
!x;
foreach my $date (@dates) {
    if ($date =~ /$date_pattern/) {
        print "Date: $+{date}\n";
    }
}
and produces the same output.
The code above still contains a lot of repetition. Using the (DEFINE) special case combined with named captures, we can make the pattern much nicer.
#! /usr/bin/env perl
use 5.10.0;
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
    \b (?<date> (?&YMD) | (?&DMY)) \b
    (?(DEFINE)
        (?<SEP> [-/.])
        (?<YYYY> [0-9][0-9][0-9][0-9])
        (?<MM> [0-9][0-9])
        (?<DD> [0-9][0-9])
        (?<YMD> (?&YYYY) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&DD))
        (?<DMY> (?&DD) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&YYYY))
    )
!x;
foreach my $date (@dates) {
    if ($date =~ /$date_pattern/) {
        print "Date: $+{date}\n";
    }
}
Yes, the subpattern named DMY also matches dates in MDY form. For now it suffices, and you ain’t gonna need it.

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order. If a 5 comes before a 4 anywhere, say the number is 322195458900023, then the 545 in it is a problem: 5 always has to come right after 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into the following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
    (\d)            # match a digit, save it in $1
    (\d)            # match another digit, save it in $2
    (??{            # start postponed regex
        $1 < $2     # if $1 is smaller than $2
        ? ""        # then return the empty string (i.e. succeed)
        : "(*FAIL)" # else return the *FAIL verb
    })              # close postponed regex
}x;                 # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
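For comparison, the array-based route mentioned above might look roughly like this. It is only a sketch: the sub name has_ascending_pair is made up, and it assumes the same rule as the one-liner (two adjacent digits where the first is smaller than the second).

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains two adjacent digits
# where the first is numerically smaller than the second.
sub has_ascending_pair {
    my ($string) = @_;
    my @chars = split //, $string;
    for my $i (0 .. $#chars - 1) {
        my ($left, $right) = @chars[$i, $i + 1];
        # skip pairs that are not both digits
        next unless $left =~ /^[0-9]$/ && $right =~ /^[0-9]$/;
        return 1 if $left < $right;
    }
    return 0;
}

print has_ascending_pair($_) ? "yes\n" : "no\n" for qw(45 54 9912);
```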
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: the string is composed of any number (zero or more) of pieces, where each piece is either a run of characters that are not a five, or a five that is followed by either a character that is not a four or the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
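To see how the two anchored patterns differ, they can be wrapped in small subs and run over a few inputs. The sub names and the sample inputs are assumptions chosen for this sketch.

```perl
use strict;
use warnings;

# Hypothetical wrapper: true if no 5 is immediately followed by a 4.
sub no_five_before_four {
    my ($s) = @_;
    return $s =~ /^(?:[^5]+|5(?=[^4]|$))*$/ ? 1 : 0;
}

# Hypothetical wrapper: true if every 4 and 5 appears only as the pair 45.
sub only_paired_45 {
    my ($s) = @_;
    return $s =~ /^(?:[^45]+|45)*$/ ? 1 : 0;
}

for my $s (qw(1452 154 12345 322195458900023)) {
    printf "%-16s no_five_before_four=%d only_paired_45=%d\n",
        $s, no_five_before_four($s), only_paired_45($s);
}
```

The second pattern is the stricter of the two: an input such as 154 passes the first check (its 5 is not followed by a 4) but fails the second, because its 5 is not part of a 45.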
You just need to look for any 5 that is not preceded by a 4; if one is found, the check fails:
if( $str =~ /(?<!4)5/ ) {
#Fail
}
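A quick way to try the lookbehind on the number from the question is sketched below. The sub name has_stray_five is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper: true if some 5 is NOT directly preceded by a 4.
sub has_stray_five {
    my ($str) = @_;
    return $str =~ /(?<!4)5/ ? 1 : 0;
}

for my $str (qw(1452 545 322195458900023)) {
    print "$str: ", (has_stray_five($str) ? "fail" : "ok"), "\n";
}
```

Note that this only looks at 5s. A 4 that is not followed by a 5 still passes, so whether that case also has to be rejected depends on how the requirement is read.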

how to extract a single digit in a number using regexp

set phoneNumber 1234567890
This number is a single string of digits. I want to divide it into 123 456 7890 using regexp. Is that possible without using the split function?
The following snippet:
regexp {(\d{3})(\d{3})(\d{4})} "8144658695" -> areacode first second
puts "($areacode) $first-$second"
Prints (as seen on ideone.com):
(814) 465-8695
This uses capturing groups in the pattern and subMatchVar... for Tcl regexp
References
http://www.hume.com/html84/mann/regexp.html
regular-expressions.info/Brackets for Capturing
On the pattern
The regex pattern is:
(\d{3})(\d{3})(\d{4})
\_____/\_____/\_____/
   1      2      3
It has 3 capturing groups (…). The \d is a shorthand for the digit character class. The {3} in this context means "exactly 3 repetitions of".
References
regular-expressions.info/Repetition, Character Class
my($number) = "8144658695";
$number =~ m/(\d\d\d)(\d\d\d)(\d\d\d\d)/;
my $num1 = $1;
my $num2 = $2;
my $num3 = $3;
print $num1 . "\n";
print $num2 . "\n";
print $num3 . "\n";
This is written in Perl and works assuming the number is in the exact format you specified. Hope this helps.
This site might help you with regex
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
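In Perl the three groups can also be collected in one step by evaluating the match in list context. The following is just a sketch; the sub name split_phone and the anchoring with ^ and $ are assumptions that go slightly beyond the snippets above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three parts of a 10-digit number,
# or an empty list if the input is not exactly 10 digits.
sub split_phone {
    my ($number) = @_;
    my @parts = $number =~ /^(\d{3})(\d{3})(\d{4})$/;
    return @parts;
}

my @parts = split_phone('1234567890');
print "@parts\n" if @parts;
```

Anchoring the pattern means an input with too few or too many digits yields no parts at all instead of a partial match.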

In Perl, how can I detect if a string has multiple occurrences of double digits?

I want to match 110110 but not 10110. That means a pair of two consecutive identical digits must occur at least twice. Is there a regex for that?
Should match: 110110, 123445446, 12344544644
Should not match: 10110, 123445
/(\d)\1.*\1\1/
This matches a string with 2 instances of a double digit, i.e. 11011 but not 10011.
\d matches any digit
\1 matches whatever the first group captured, effectively doubling the first digit
This will also match 1111. If there need to be other characters in between, change .* to .+
ooh, this looks neater
((\d)\2).*\1
If the two doubles may have different values, but there still have to be 2 sets of doubles, then you would simply need to add the first part again, as in
((\d)\2).*((\d)\4)
The bracketing means that $1 and $3 contain the double digits, and $2 and $4 contain the single digits (which are then doubled).
11233
$1=11
$2=1
$3=33
$4=3
If I understand correctly, your regexp will be:
m{
(\d)\1 # First repeated pair
.* # Anything in between
(\d)\2 # Second repeated pair
}x
For example:
for my $x (qw(110110 123445446 12344544644 10110 123445)) {
my $m = $x =~ m{(\d)\1.*(\d)\2} ? "matches" : "does not match";
printf "%-11s : %s\n", $x, $m;
}
110110 : matches
123445446 : matches
12344544644 : matches
10110 : does not match
123445 : does not match
If you're talking about all digits, this will do it:
00.*00|11.*11|22.*22|33.*33|44.*44|55.*55|66.*66|77.*77|88.*88|99.*99
It's just 10 different patterns OR'ed together, each of which checks for at least two occurrences of the desired 2-digit pattern.
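Rather than typing the alternation by hand, it can be generated with one alternative per digit from 0 through 9. This is a sketch only; the sub names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: builds "00.*00|11.*11|...|99.*99".
sub build_pair_alternation {
    return join '|', map { my $pair = "$_$_"; "$pair.*$pair" } 0 .. 9;
}

# Hypothetical helper: true if the same doubled digit occurs twice.
sub has_same_pair_twice {
    my ($s) = @_;
    my $re = build_pair_alternation();
    return $s =~ /$re/ ? 1 : 0;
}

for my $s (qw(110110 123445446 12344544644 10110 123445)) {
    printf "%-11s : %s\n", $s, has_same_pair_twice($s) ? 'matches' : 'does not match';
}
```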
Using Perl's more advanced REs, you can use the following for two consecutive digits twice:
(\d)\1.*\1\1
or, as one of your comments states, two consecutive digits followed somewhere by two more consecutive digits which may not be the same:
(\d)\1.*(\d)\2
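The difference between the two patterns shows up on inputs such as 11233. Here is a minimal sketch with both wrapped in subs; the sub names and the extra sample inputs are assumptions.

```perl
use strict;
use warnings;

# Hypothetical wrapper: the SAME doubled digit must appear twice.
sub same_pair_twice {
    my ($s) = @_;
    return $s =~ /(\d)\1.*\1\1/ ? 1 : 0;
}

# Hypothetical wrapper: any two doubled digits, possibly different.
sub two_pairs {
    my ($s) = @_;
    return $s =~ /(\d)\1.*(\d)\2/ ? 1 : 0;
}

for my $s (qw(110110 10110 11233 1111)) {
    printf "%-6s same_pair_twice=%d two_pairs=%d\n",
        $s, same_pair_twice($s), two_pairs($s);
}
```

Only the second pattern accepts 11233, since its two doubles (11 and 33) use different digits.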
Depending on what your data looks like, here's a minimal regex approach.
while(<DATA>){
    chomp;
    my @s = split /\s+/;
    foreach my $i (@s){
        if( $i =~ /123445/ && length($i) != 6){
            print $_."\n";
        }
    }
}
__DATA__
This is a line
blah 123445446 blah
blah blah 12344544644 blah
.... 123445 ....
this is last line
There is no reason to do everything in one regex... You can use the rest of Perl as well:
#!/usr/bin/perl -l
use strict;
use warnings;
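my @strings = qw( 11233 110110 10110 123445 123445446 12344544644 );
# statement modifiers cannot be stacked, so use "and" with a single "for"
is_wanted($_) and print for @strings;
sub is_wanted {
    my ($s) = @_;
    # each match contributes two captures (group and first)
    my @matches = $s =~ /(?<group>(?<first>[0-9])\k<first>)/g;
    return 1 < @matches / 2;
}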
__END__
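If I've understood your question correctly, then this, according to RegexBuddy (set to use Perl syntax), will match 110110 but not 10110. Note that Perl itself reads \10 as an octal escape unless the pattern has at least ten groups, so the backreference is written as \g{1} here:
(1{2})0\g{1}0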
The following is more general and will match any string in which a pair of digits is repeated later on in the string (note that \d{2} matches any two digits, not only two equal ones).
(\d{2})\d+\1\d*
The above will match the following examples:
110110
110011
112345611
2200022345
Finally, to find two occurrences of the same pair of digits when you don't care where they are in the string, try this:
\d*?(\d{2})\d+?\1\d*
This will match the examples above plus this one:
12345501355789
It's the two sets of double 5 in the above example that are matched.
[Update]
Having just seen your extra requirement of matching a string with two different double digits, try this:
\d*?(\d)\1\d*?(\d)\2\d*
This will match strings like the following:
12342202345567
12342202342267
Note that the 22 and 55 cause the first string to match and the pair of 22 cause the second string to match.