Unable to match Indian Rupee currency symbol using regex in Perl


Following is my text:
Total: ₹ 131.84
Thanks for choosing Uber, Pradeep
I would like to match the amount part, using the following code:
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
$amount = $1;
}
But it does not match. I tried regex debugging; here's the output:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: DIGIT (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
UTF-8 pattern...
Match failed
Freeing REx: "Total: \x{20B9} (\d+)"
The full code is at http://pastebin.com/TGdFX7hg.

Disclaimer: This feels more like a comment than an answer, but I need more space.
I've never used MIME::Parser and friends before, but from what I've read in the documentation, the following might work:
use Encode qw(decode);
# according to your code, $text_mail is a MIME::Entity object
my $charset = $text_mail->head->mime_attr('content-type.charset');
my $mail_body_raw = $text_mail->bodyhandle->as_string;
my $mail_body = decode $charset, $mail_body_raw;
The idea is to get the charset from the MIME::Head object, then use Encode to decode the body accordingly.
Of course, if you know that it's always going to be UTF-8 text, you could also hardcode that:
my $mail_body = decode 'UTF-8', $mail_body_raw;
After that, your regex may still fail to work because according to the debugging output in your question the character between ₹ and the number is actually not a simple space (ASCII 32, U+0020), but a non-breaking space (U+00A0). You should be able to match that with \s:
if ( $mail_body =~ /Total: \x{20B9}\s(\d+)/ ) {
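The decode-then-match idea can be sketched outside Perl as well. The following is a minimal Python illustration, not the asker's code: the sample bytes are taken from the debug output in the question (E2 82 B9 for the rupee sign, C2 A0 for the non-breaking space), the charset is assumed to be UTF-8, and the capture is widened to include the decimals.

```python
import re

# Raw body as it would arrive from the mail parser: UTF-8 bytes, not characters.
# \xe2\x82\xb9 is U+20B9 (rupee sign); \xc2\xa0 is U+00A0 (non-breaking space).
mail_body_raw = b"Total: \xe2\x82\xb9\xc2\xa0131.84\n\nThanks for choosing Uber, Pradeep"

# Decode first; the charset is assumed to be UTF-8 here.
mail_body = mail_body_raw.decode("utf-8")

# \s matches the non-breaking space in a Python str pattern.
match = re.search(r"Total: \u20B9\s(\d+(?:\.\d+)?)", mail_body)
amount = match.group(1) if match else None
print(amount)  # 131.84
```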

This is a bit of searching for the explanation, not an outright answer. Please bear with me.
I believe your $mail_body does not contain what you think it does. You posted the input data as plain text. Was that copied from a mail client?
If I take the code and the input data from the question and run it with use re 'debug' I get a different output.
use utf8;
use strict;
use warnings;
use re 'debug';
my $mail_body = qq{Total: ₹ 131.84
Thanks for choosing Uber, Pradeep};
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
my $amount = $1;
}
It will produce this:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: POSIXU[\d] (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
Found anchored substr "Total: %x{20b9} " at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <Total: > | 1:EXACT <Total: \x{20b9} >(5)
11 < %x{20b9} > <131.84%n%n>| 5:OPEN1(7)
11 < %x{20b9} > <131.84%n%n>| 7:PLUS(9)
POSIXU[\d] can match 3 times out of 2147483647...
14 <%x{20b9} 131> <.84%n%nTha>| 9: CLOSE1(11)
14 <%x{20b9} 131> <.84%n%nTha>| 11: END(0)
Match successful!
Freeing REx: "Total: \x{20B9} (\d+)"
Let's compare the line with the Matching REx to your output:
against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
As we can see, my output has the single character %x{20b9}, while yours has the bytes %342%202%271 (octal for E2 82 B9, the UTF-8 encoding of that character).
When I started trying this code I forgot to put use utf8 in my code, so I got a bunch of single characters when the regex engine tried to match:
%x{e2}%x{82}%x{b9}
It then rejected the match.
So my conclusion is: Perl doesn't know your input data is UTF-8, so it matches against the raw bytes instead of the decoded characters.
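The link between the two debug dumps can be checked by hand. This small Python sketch (Python is used purely for illustration) encodes the rupee sign as UTF-8 and prints the bytes in hex and octal; decoding the bytes as Latin-1 is only an assumption standing in for "each byte treated as its own character".

```python
rupee = "\u20b9"

# UTF-8 encoding of the rupee sign, shown as hex and as octal.
utf8_bytes = rupee.encode("utf-8")
hex_bytes = [f"{b:02x}" for b in utf8_bytes]
octal_bytes = [f"{b:o}" for b in utf8_bytes]
print(hex_bytes)    # ['e2', '82', 'b9']
print(octal_bytes)  # ['342', '202', '271']

# Treating every byte as its own character never yields U+20B9 ...
undecoded = utf8_bytes.decode("latin-1")
print(rupee in undecoded)                   # False
# ... but decoding the bytes as UTF-8 does.
print(rupee in utf8_bytes.decode("utf-8"))  # True
```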

Related

Regex to get total price with space as separator

I need to build a regex that would catch the total price. Here are some examples:
Total: 145.01 $
Total: 1 145.01 $
Total: 00.01 $
Total: 12 345.01 $
It needs to get any price that follows 'Total: ', without the '$'.
This is what I have so far: (?<=\bTotal:\s*)(\d+.\d+)
RegExr

I assume:
each string must begin with 'Total:' followed by three spaces, the prefix;
the last digit in the string must be followed by ' $' (one space), the suffix, which is at the end of the string;
the substring between the prefix and suffix must end '.dd', where 'd' represents any digit, the cents;
the substring between the prefix and cents must match one of the following patterns, where 'd' represents any digit: 'd', 'dd', 'ddd', 'd ddd', 'dd ddd', 'ddd ddd', 'd ddd ddd', 'dd ddd ddd', 'ddd ddd ddd', 'd ddd ddd ddd' and so on;
the return value is the substring between the prefix and suffix that meets the above requirements; and
spaces will be removed from the substring returned as a separate step at the end.
We can use the following regular expression.
r = /\ATotal: {3}(\d{1,3}(?: \d{3})*\.\d{2}) \$\z/
In Ruby (but if you don't know Ruby you'll get the idea):
arr = <<~_.split(/\n/)
Total: 145.01 $
Total: 1 145.01 $
Total: 00.01 $
Total: 12 345.01 $
Total: 1 241 345.01 $
Total: 1.00 $
Total: 1.00$
Total: 1.00 $x
My Total: 1.00 $
Total: 12 34.01 $
_
The following matches each string in the array arr and extracts the contents of capture group 1, which is shown on the right side of each line.
arr.each do |s|
puts "\"#{(s + '"[r,1]').ljust(30)}: #{s[r,1] || 'no match'}"
end
"Total: 145.01 $"[r,1] : 145.01
"Total: 1 145.01 $"[r,1] : 1 145.01
"Total: 00.01 $"[r,1] : 00.01
"Total: 12 345.01 $"[r,1] : 12 345.01
"Total: 1 241 345.01 $"[r,1] : 1 241 345.01
"Total: 1.00 $"[r,1] : no match
"Total: 1.00$"[r,1] : no match
"Total: 1.00 $x"[r,1] : no match
"My Total: 1.00 $"[r,1] : no match
"Total: 12 34.01 $"[r,1] : no match
The regular expression can be written in free-spacing mode to make it self-documenting.
r = /
\A # match the beginning of the string
Total:\ {3} # match 'Total:' followed by 3 spaces
( # begin capture group 1
\d{1,3} # match 1, 2 or 3 digits
(?:\ \d{3}) # match a space followed by 3 digits
* # perform the previous match zero or more times
\.\d{2} # match a period followed by 2 digits
) # end capture group 1
\ \$ # match a space followed by a dollar sign
\z # match end of string
/x # free-spacing regex definition mode
The regex can be seen in action here.
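The last assumption above, removing the spaces as a separate step, is not shown in the Ruby snippet. Here is a minimal Python sketch of the whole pipeline; the function name total_price is hypothetical, and the test strings are written with the three-space prefix that the regex requires.

```python
import re

# Same pattern as above; \Z in Python anchors at the very end of the string.
PRICE = re.compile(r"\ATotal: {3}(\d{1,3}(?: \d{3})*\.\d{2}) \$\Z")

def total_price(line):
    """Return the price with its space separators removed, or None."""
    m = PRICE.match(line)
    return m.group(1).replace(" ", "") if m else None

print(total_price("Total:   12 345.01 $"))  # 12345.01
print(total_price("Total:   12 34.01 $"))   # None
```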

bash: extract executed line numbers from gcov report

gcov is a GNU toolchain utility that produces code coverage reports (see documentation) formatted as follows:
-: 0:Source:../../../edg/attribute.c
-: 0:Graph:tmp.gcno
-: 0:Data:tmp.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include <stdio.h>
-: 2:
-: 3:int main (void)
1: 4:{
1: 5: int i, total;
-: 6:
1: 7: total = 0;
-: 8:
11: 9: for (i = 0; i < 10; i++)
10: 10: total += i;
-: 11:
1: 12: if (total != 45)
#####: 13: printf ("Failure\n");
-: 14: else
1: 15: printf ("Success\n");
1: 16: return 0;
-: 17:}
I need to extract the line numbers of the lines that were executed from a bash script. $ egrep --regexp='^\s+[1-9]' example_file.c.gcov seems to return the relevant lines. An example of typical output would be:
1: 978: attr_name_map = alloc_hash_table(NO_MEMORY_REGION_NUMBER,
79: 982: for (k = 0; k<KNOWN_ATTR_TABLE_LENGTH; ++k) {
78: 989: attr_name_map_entries[k].descr = &known_attr_table[k];
78: 990: *ep = &attr_name_map_entries[k];
1: 992:} /* init_attr_name_map */
519: 2085: new_attr_seen = FALSE;
519: 2103: p_attributes = last_attribute_link(p_attributes);
519: 2104: } while (new_attr_seen);
519: 2106: return attributes;
16: 3026:void transform_type_with_gnu_attributes(a_type_ptr *p_type,
16: 3041: for (ap = attributes; ap != NULL; ap = ap->next) {
1: 6979:void process_alias_fixup_list(void)
1: 6984: an_alias_fixup_ptr entries = alias_fixup_list, entry;
I subsequently must extract the line number strings. The expected output from this example would be:
978
982
989
990
992
2085
2103
2104
2106
3026
3041
6979
6984
Could someone suggest a reliable, robust way to achieve this?
NOTE:
My idea was to eliminate everything that is not placed between the first and the second instance of the character :, which I tried to do with sed without much success so far.

This is fairly simple to do using awk:
awk -F: '/^ +[0-9]/ {gsub(/ /, "", $2); print $2}' file.gcov
That is, use : as the field separator,
and for lines starting with spaces and digits,
remove the spaces from the 2nd field and print the 2nd field.
But if you really want to use sed,
and you want something robust, you could do this:
sed -e '/^  *[0-9][0-9]*:  *[0-9][0-9]*:/!d' -e 's/[^:]*: *//' -e 's/:.*//' file.gcov
What's happening here?
The first command uses a pattern to match lines starting with 1 or more spaces followed by 1 or more digits followed by a : followed by 1 or more spaces followed by 1 or more digits followed by a :. Then comes the interesting part, we invert this selection with ! and delete it with d. We effectively delete all other lines except the ones we need.
The second command is a simple substitution, removing a sequence of characters that are not :, followed by a :, followed by zero or more spaces. The pattern is applied from the beginning of the line so there is no need for a starting ^, and no need to specify strictly 1-or-more-spaces: thanks to the previous command we already know that there will be at least one.
The last command is even simpler: remove a : and everything after it.
Some versions of sed will give you shortcuts for a more compact writing style, for example [0-9]+ instead of [0-9][0-9]*, but the example above will work with a wider variety of implementations (notably BSD).
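For comparison, the same filtering rule (numeric execution count, colon, line number, colon) can be sketched in Python. This is an illustration rather than a replacement for the shell one-liners; the function name executed_lines and the sample text are made up for the example, with leading spaces laid out as gcov normally writes them.

```python
import re

# Execution count (digits), colon, line number, colon. Lines whose first
# field is '-' (not executable) or '#####' (never executed) do not match.
EXECUTED = re.compile(r"^\s*\d+:\s*(\d+):")

def executed_lines(gcov_text):
    """Return line numbers whose first field is a numeric execution count."""
    numbers = []
    for line in gcov_text.splitlines():
        m = EXECUTED.match(line)
        if m:
            numbers.append(int(m.group(1)))
    return numbers

sample = """\
        -:    3:int main (void)
        1:    4:{
       11:    9:  for (i = 0; i < 10; i++)
    #####:   13:    printf ("Failure\\n");
        1:   16:  return 0;
"""
print(executed_lines(sample))  # [4, 9, 16]
```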

Understanding \G and \K in regex

In a previous question, I asked to match chars that follow a specific pattern. In order to be more specific, I would like to consider this example:
We want to match all the x that follow b or d. We may want to replace these characters with o:
-a x x xx x x
-b x x x x xx x
-c x x x x x x
-d x x x xx x x
The result would be this:
-a x x xx x x
-b o o o o oo o
-c x x x x x x
-d o o o oo o o
anubhava answered my question with a pretty nice regex that has the same form as this one:
/([db]|\G)[^x-]*\Kx/g
Unfortunately I did not completely understand how \G and \K work. I would like a more detailed explanation of this specific case.
I tried to use the Perl regex debugger, but it is a bit cryptic.
Compiling REx "([db]|\G)[^x-]*\Kx"
Final program:
1: OPEN1 (3)
3: BRANCH (15)
4: ANYOF[bd][] (17)
15: BRANCH (FAIL)
16: GPOS (17)
17: CLOSE1 (19)
19: STAR (31)
20: ANYOF[\x00-,.-wy-\xff][{unicode_all}] (0)
31: KEEPS (32)
32: EXACT <x> (34)
34: END (0)

Correct regex is:
(-[db]|(?!^)\G)[^x-]*\Kx
Check this demo
As per the regex101 description:
\G - asserts position at the end of the previous match or the start of the string for the first match. \G will match start of line as well for the very first match hence there is a need of negative lookahead here (?!^)
\K - resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match. \K will discard all matched input hence we can avoid back-reference in replacement.
More details about \K
More details about \G

I would suggest not doing it in one regex. Your intent is much more clear if you do this:
if ( /^-[bd]/ ) { # If it's a line that starts with -b or -d...
s/x/o/g; # ... replace the x's with o's.
}
If that's too many lines for you, you could even do:
s/x/o/g if /^-[bd]/;
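The same two-step idea carries over to engines that lack \G and \K, such as Python's standard re module. A minimal sketch, with the hypothetical helper name replace_x and the sample text from the question:

```python
import re

def replace_x(text):
    """Replace every x with o, but only on lines starting with -b or -d."""
    out = []
    for line in text.splitlines():
        if re.match(r"-[bd]", line):  # step 1: is this a -b or -d line?
            line = line.replace("x", "o")  # step 2: plain replacement
        out.append(line)
    return "\n".join(out)

sample = "-a x x xx x x\n-b x x x x xx x\n-c x x x x x x\n-d x x x xx x x"
print(replace_x(sample))
```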

Regular expression in r. Grouping & Capturing

I'm trying to use regular expressions in R, using the stringr library. I was studying the str_match and str_replace functions. I don't understand why they give different results when I use parentheses for grouping:
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s))
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s), "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " SS"

Try using just the expression s instead of perl(s):
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s)
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s, "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " PIAZZALE "
I've had a look in the documentation for this library:
http://cran.r-project.org/web/packages/stringr/stringr.pdf
It suggests that while str_replace accepts POSIX patterns by default, and Perl patterns if supplied, str_match accepts only POSIX-style patterns and will treat a Perl pattern as such. They returned different values because they were using different expression engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you potentially use str_detect instead of str_match?
The difference between POSIX and perl that causes this:
The POSIX engine does not recognise lazy (non-greedy) quantifiers.
Your expression
(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})
would be seen as the perl equivalent of
(.+)( PIAZZALE | SS)(.+)([0-9]{5})
Here the first quantified group .+ matches as much as it can (the full string) before backtracking and evaluating the rest of the expression. It succeeds once the first .+ has backed off from the end of the string to MONT SS DPR, leaving " PIAZZALE " for the second capture group, a[3].
Simplified Explanation of Engine Inner Workings
Here is a simplified explanation of how the different engines are processing your string. All of your quantifiers/alternation are directly wrapped in capture groups so the numbered quantifiers in the following examples are also your capture groups:
Perl:
Quantifier 1: "M"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MO"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MON"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " "
Quantifier 4: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " D"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " DPR PIAZZALE CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
POSIX:
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 4783"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 478"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47"
Quantifier 2: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR P"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR "
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE 47838"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
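The effect of lazy versus greedy quantifiers on the second capture group can be reproduced in any backtracking engine. A short Python sketch (Python's re is used here only as a stand-in; it is not the engine stringr uses):

```python
import re

address = "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"

# Lazy quantifiers: the first group grows one character at a time.
lazy = re.match(r"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})", address)
# Greedy quantifiers: the first group starts with everything and backs off.
greedy = re.match(r"(.+)( PIAZZALE | SS)(.+)([0-9]{5})", address)

print(repr(lazy.group(2)))    # ' SS'
print(repr(greedy.group(2)))  # ' PIAZZALE '
```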

string padded with optional blank with max length

I have a problem building a regex. This is a sample of the text:
text 123 12345 abc 12 def 67 i 89 o 0 t 2
The numbers are sometimes padded with blanks to the max length (3).
e.g.:
"1" can be "1" or "1 "
"13" can be "13" or "13 "
My regex is at the moment this:
\b([\d](\s*)){1,3}\b
The results of this regex are the following: (. = blank for better visibility)
123.
12....
67.
89.
0....
2
But I need this: (. = blank for better visibility)
123
12.
67.
89.
0..
2
How can I tell the regex engine to count the blanks into the {1,3} option?

Try this:
\b(?:\d[\d\s]{0,2})(?:(?<=\s)|\b)
This will also cover strings like text 123 1 23 12345 123abc 12 def 67 i 89 o 0 t 2 and results in:
123
1.
23.
12.
67.
89.
0..
2

Does this do what you want?
\b(\d){1,3}\s*\b
This will also include whitespace (if available) after the selection.

I think you want this
\b(?:\d[\d\s]{0,2})(?!\d)
See it here on Regexr
The word boundary will not work at the end, because if the match ends in whitespace, there is no word boundary there. Therefore I use a negative lookahead (?!\d) to ensure that no digit follows.
But if you have a string like "1 23", it will match only the "1" and the "23", but not the whitespace after the "1".

Assuming you want to use the padded numbers somewhere else, break the problem apart into two; (simple) parsing the numbers, and (simple) formatting the numbers (including padding).
while ( $text =~ /\b(\d{1,3})\b/g ) {
printf( "%-3d\n", $1 );
}
Alternatively:
my @padded_numbers = map { sprintf( "%-3d", $_ ) } ( $text =~ /\b(\d{1,3})\b/g );
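The parse-then-format split looks much the same in other languages. A minimal Python sketch using the sample text from the question; note that it pads every number to width 3, including the last one, and that ljust keeps the digits as text rather than converting them to integers as %d does.

```python
import re

text = "text 123 12345 abc 12 def 67 i 89 o 0 t 2"

# Step 1: parse standalone numbers of 1 to 3 digits (12345 is skipped).
numbers = re.findall(r"\b\d{1,3}\b", text)

# Step 2: left-justify each number to a width of 3.
padded = [n.ljust(3) for n in numbers]
print(padded)  # ['123', '12 ', '67 ', '89 ', '0  ', '2  ']
```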
