How to get string from in between 2 strings - regex

I am trying to get a string that sits between two substrings. In this case the string I need to work on is a block of code. I'm not sure whether the problem is the regex or the search function, but I keep getting None back when I shouldn't. I need to get the offset on line 53, using GUSSET TO BACKPLATE LEFT GUS 1 as the start marker; I think ENDFOR could be the end marker. I'm just not sure how the syntax for something like this would work in Python. I have tried some of the examples I have seen online and have had no luck so far. Any help would be appreciated. I would also like to do it with re.compile, since the offsets could be accessed multiple times.
s = '''!GUSSET TO BACKPLATE LEFT GUS 1 ;
45: E_NO(8) ;
46: FOR R[191:COUNTER B]=1 TO R[199:CHANNELS] ;
47: ;
48: CALL CHAN_BP_TO_GR ;
49: ;
50: PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]-R[197:X OFFSET MM] ;
51: --eg:THESE OFFSETS ONLY APPLY TO THIS BLOCK AND INCREASE THE AMOUNT GIVEN
: EACH LOOP ;
52: !X OFFSET ;
53: PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+21 ;
54: !Y OFFSET ;
55: PR[GP1:2,2:OFFSET]=PR[GP1:2,2:OFFSET]+0 ;
56: !Z OFFSET ;
57: PR[GP1:2,3:OFFSET]=PR[GP1:2,3:OFFSET]+0 ;
58: ENDFOR ;'''
string1 = re.compile('!GUSSET TO BACKPLATE LEFT GUS 1 ;')
string2 = re.compile('PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;')
string3 = re.compile('ENDFOR ;')
result = re.search(r'!GUSSET TO BACKPLATE LEFT GUS 1 ;, (PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;),ENDFOR ;', s)
'.(PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;'
print(result)

As your text is multiline, . has to be allowed to match newlines, which is what the re.DOTALL flag does. (re.M only changes how ^ and $ behave; the pattern below uses neither, so that flag is harmless here but not required.)
!GUSSET.*PR\[GP1:2,1:OFFSET\]= will match all text up to the OFFSET assignment on line 53; then we match anything that's not a space or ; and save that to be returned by result.group(1) as shown below.
(?!ENDFOR).*ENDFOR.* then requires an ENDFOR somewhere after the captured text.
Be aware that the lookahead is only checked at the position right after the capture, and both .* are greedy, so if the text contains several ENDFOR blocks the match can still span more than one of them.
try
import re
result = re.search(r'!GUSSET.*PR\[GP1:2,1:OFFSET\]=([^; ]+)(?!ENDFOR).*ENDFOR.*', s, re.M|re.DOTALL)
print(result.group(1))
this will return
PR[GP1:2,1:OFFSET]+21
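For comparison, here is a minimal sketch of a two-step variant that uses re.compile, as the question asks, and a non-greedy .*? so the block ends at the first ENDFOR. The names BLOCK_RE, X_OFFSET_RE and x_offset are made up for illustration, lines 59 to 62 of the sample are a hypothetical second block added only to show that it is ignored, and the sketch assumes the X offset is always written as "+<digits>" on the line after the !X OFFSET comment.

```python
import re

# Shortened copy of the program text from the question.
# Lines 59-62 are hypothetical, added to show that a later block is ignored.
SAMPLE = """!GUSSET TO BACKPLATE LEFT GUS 1 ;
  46:  FOR R[191:COUNTER B]=1 TO R[199:CHANNELS] ;
  50:  PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]-R[197:X OFFSET MM] ;
  52:  !X OFFSET ;
  53:  PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+21 ;
  54:  !Y OFFSET ;
  55:  PR[GP1:2,2:OFFSET]=PR[GP1:2,2:OFFSET]+0 ;
  58:  ENDFOR ;
  59:  !SOME OTHER BLOCK ;
  60:  !X OFFSET ;
  61:  PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+99 ;
  62:  ENDFOR ;
"""

# Step 1: the shortest stretch of text from the start marker to the first ENDFOR.
BLOCK_RE = re.compile(r"!GUSSET TO BACKPLATE LEFT GUS 1 ;(.*?)ENDFOR ;", re.DOTALL)

# Step 2: inside that block, the assignment on the line after "!X OFFSET ;".
X_OFFSET_RE = re.compile(
    r"!X OFFSET ;\s*\d+:\s*PR\[GP1:2,1:OFFSET\]=PR\[GP1:2,1:OFFSET\]\+(\d+) ;"
)


def x_offset(program):
    """Return the X offset of the gusset block as an int, or None if not found."""
    block = BLOCK_RE.search(program)
    if block is None:
        return None
    hit = X_OFFSET_RE.search(block.group(1))
    return int(hit.group(1)) if hit else None


print(x_offset(SAMPLE))  # 21
```

Both patterns are compiled once, so x_offset can be called repeatedly without re-parsing the regex.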

Convert a word's characters into its ascii code list concisely in Raku

I'm trying to convert the word wall into its ascii code list (119, 97, 108, 108) like this:
my @ascii="abcdefghijklmnopqrstuvwxyz";
my @tmp;
map { push @tmp, $_.ord if $_.ord == @ascii.comb.any.ord }, "wall".comb;
say @tmp;
Is there a way to use the @tmp without declaring it in a separate line?
Is there a way to produce the ascii code list in one line instead of 3 lines? If so, how to do it?
Note that I have to use the @ascii variable i.e. I can't make use of the consecutively increasing ascii sequence (97, 98, 99 ... 122) because I plan to use this code for non-ascii languages too.
There are a couple of things we can do here to make it work.
First, let's tackle the @ascii variable. The @ sigil indicates a positional variable, but you assigned a single string to it. This creates a 1-element array ['abc...'], which will cause problems down the road. Depending on how general you need this to be, I'd recommend either creating the array directly:
my @ascii = <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my @ascii = 'a' .. 'z';
my @ascii = 'abcdefghijklmnopqrstuvwxyz'.comb;
or going ahead and handling the any part:
my $ascii-char = any <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my $ascii-char = any 'a' .. 'z';
my $ascii-char = 'abcdefghijklmnopqrstuvwxyz'.comb.any;
Here I've used the $ sigil, because any really specifies any single value, and so will function as such (which also makes our life easier). I'd personally use $ascii, but I'm using a separate name to make later examples more distinguishable.
Now we can handle the map function. Based on the above two versions of ascii, we can rewrite your map function to either of the following
{ push @tmp, $_.ord if $_ eq @ascii.any }
{ push @tmp, $_.ord if $_ eq $ascii-char }
Note that if you prefer to use ==, you can go ahead and create the numeric values in the initial ascii creation, and then use $_.ord. As well, personally, I like to name the mapped variable, e.g.:
{ push @tmp, $^char.ord if $^char eq @ascii.any }
{ push @tmp, $^char.ord if $^char eq $ascii-char }
where $^foo replaces $_ (if you use more than one, they map in alphabetical order to @_[0], @_[1], etc).
But let's get to the more interesting question here. How can we do all of this without needing to predeclare @tmp? Obviously, that just requires creating the array in the map loop. You might think that might be tricky for when we don't have an ASCII value, but the fact that an if statement returns Empty (or () ) if it's not run makes life really easy:
my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wall".comb;
my @tmp = map { $^char.ord if $^char eq @ascii.any }, "wall".comb;
If we used "wáll", the list collected by map would be 119, Empty, 108, 108, which is automagically returned as 119, 108, 108. Consequently, @tmp is set to just 119, 108, 108.
Yes there is a much simpler way.
"wall".ords.grep('az'.ords.minmax);
Of course this relies on a to z being an unbroken sequence. This is because minmax creates a Range object based on the minimum and maximum value in the list.
If they weren't in an unbroken sequence you could use a junction.
"wall".ords.grep( 'az'.ords.minmax | 'AZ'.ords.minmax );
But you said that you want to match other languages. Which to me screams regex.
"wall".comb.grep( /^ <:Ll> & <:ascii> $/ ).map( *.ord )
This matches Lowercase Letters that are also in ASCII.
Actually we can make it even simpler. comb can take a regex which determines which characters it takes from the input.
"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )
# (119, 97, 108, 108)
"ΓΔαβγδε".comb( / <:Ll> & <:Greek> / ).map( *.ord )
# (945, 946, 947, 948, 949)
# Does not include Γ or Δ, as they are not lowercase
Note that the above only works with ASCII if you don't have a combining accent.
"de\c[COMBINING ACUTE ACCENT]f".comb( / <:Ll> & <:ascii> / )
# ("d", "f")
The Combining Acute Accent combines with the e which composes to Latin Small Letter E With Acute.
That composed character is not in ASCII so it is skipped.
It gets even weirder if there isn't a composed value for the character.
"f\c[COMBINING ACUTE ACCENT]".comb( / <:Ll> & <:ascii> / )
# ("f́",)
That is because the f is lowercase and in ASCII. The composing codepoint gets brought along for the ride though.
Basically if your data has, or can have combining accents and if it could break things, then you are better off dealing with it while it is still in binary form.
$buf.grep: {
.uniprop() eq 'Ll' #
&& .uniprop('Block') eq 'Basic Latin' # ASCII
}
The above would also work for single character strings because .uniprop works on either integers representing a codepoint, or on the actual character.
"wall".comb.grep: {
.uniprop() eq 'Ll' #
&& .uniprop('Block') eq 'Basic Latin' # ASCII
}
Note again that this would have the same issues with composing codepoints since it works with strings.
You may also want to use .uniprop('Script') instead of .uniprop('Block') depending on what you want to do.
Here's a working approach using Raku's trans method (code snippet performed in the Raku REPL):
> my @a = "wall".comb;
[w a l l]
> @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108
Above, we handle an ascii string. Below I add the "é" character, and show a 2-step solution:
> my @a = "wallé".comb;
[w a l l é]
> my @b = @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') );
[119 97 108 108 é]
> @b.trans("é" => ords("é")).put
119 97 108 108 233
Nota bene #1: Although all the code above works fine, when I tried shortening the alphabet to 'a'..'z' I ended up seeing erroneous return values...hence the use of the full 'abcdefghijklmnopqrstuvwxyz'.
Nota bene #2: One question in my mind is trying to suppress output when trans fails to recognize a character (e.g. how to suppress assignment of "é" as the last element of @b in the second-example code above). I've tried adding the :delete argument to trans, but no luck.
EDITED: To remove unwanted characters, here's code using grep (à la @Brad Gilbert), followed by trans:
> my @a = "wallé".comb;
[w a l l é]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108

How to get the current line number in a multi-line list initializer of testcases?

Is there a way to reliably get the current line number during a Perl
multiline list assignment without explicitly using __LINE__? I am
storing testcases in a list and would like to tag each with its line
number.* That way I can do (roughly)
ok($_->[1], 'line ' . $_->[0]) for @tests.
And, of course, I would like to save typing compared to
putting __LINE__ at the beginning of each test case :) . I have
not been able to find a way to do so, and I have encountered some
confusing behaviour in the lines reported by caller.
* Possible XY, but I can't find a module to do it.
Update I found a hack and posted it as an answer. Thanks to @zdim for helping me think about the problem a different way!
MCVE
A long one, because I've tried several different options. my_eval,
L(), and L2{} are some I've tried so far — L() was the one
I initially hoped would work. Jump down to my @testcases to see how
I'm using these. When testing, do copy the shebang line.
Here's my non-MCVE use case, if you are interested.
#!perl
use strict; use warnings; use 5.010;
# Modified from https://www.effectiveperlprogramming.com/2011/06/set-the-line-number-and-filename-of-string-evals/#comment-155 by http://sites.google.com/site/shawnhcorey/
sub my_eval {
    my ( $expr ) = @_;
    my ( undef, $file, $line ) = caller;
    my $code = "# line $line \"$file\"\n" . $expr;
    unless(defined wantarray) {
        eval $code; die $@ if $@;
    } elsif(wantarray) {
        my @retval = eval $code; die $@ if $@; return @retval;
    } else {
        my $retval = eval $code; die $@ if $@; return $retval;
    }
}
sub L { # Prepend caller's line number
    my (undef, undef, $line) = caller;
    return ["$line", @_];
} #L
sub L2(&) { # Prepend caller's line number
    my $fn = shift;
    my (undef, undef, $line) = caller;
    return ["$line", &$fn];
} #L2
# List of [line number, item index, expected line number, type]
my @testcases = (
([__LINE__,0,32,'LINE']),
([__LINE__,1,33,'LINE']),
(L(2,34,'L()')),
(L(3,35,'L()')),
(do { L(4,36,'do {L}') }),
(do { L(5,37,'do {L}') }),
(eval { L(6,38,'eval {L}') }),
(eval { L(7,39,'eval {L}') }),
(eval "L(8,40,'eval L')"),
(eval "L(9,41,'eval L')"),
(my_eval("L(10,42,'my_eval L')")),
(my_eval("L(11,43,'my_eval L')")),
(L2{12,44,'L2{}'}),
(L2{13,45,'L2{}'}),
);
foreach my $idx (0..$#testcases) {
printf "%2d %-10s line %2d expected %2d %s\n",
$idx, $testcases[$idx]->[3], $testcases[$idx]->[0],
$testcases[$idx]->[2],
($testcases[$idx]->[0] != $testcases[$idx]->[2]) && '*';
}
Output
With my comments added.
0 LINE line 32 expected 32
1 LINE line 33 expected 33
Using __LINE__ expressly works fine, but I'm looking for an
abbreviation.
2 L() line 45 expected 34 *
3 L() line 45 expected 35 *
L() uses caller to get the line number, and reports a line later
in the file (!).
4 do {L} line 36 expected 36
5 do {L} line 45 expected 37 *
When I wrap the L() call in a do{}, caller returns the correct
line number — but only once (!).
6 eval {L} line 38 expected 38
7 eval {L} line 39 expected 39
Block eval, interestingly, works fine. However, it's no shorter
than __LINE__.
8 eval L line 1 expected 40 *
9 eval L line 1 expected 41 *
String eval gives the line number inside the eval (no surprise)
10 my_eval L line 45 expected 42 *
11 my_eval L line 45 expected 43 *
my_eval() is a string eval plus a #line directive based on
caller. It also gives a line number later in the file (!).
12 L2{} line 45 expected 44 *
13 L2{} line 45 expected 45
L2 is the same as L, but it takes a block that returns a list,
rather than
the list itself. It also uses caller for the line number. And it
is correct once, but not twice (!). (Possibly just because it's the last
item — my_eval reported line 45 also.)
So, what is going on here? I have heard of Deparse and wonder if this is
optimization-related, but I don't know enough about the engine to know
where to start investigating. I also imagine this could be done with source
filters or Devel::Declare, but that is well beyond my
level of experience.
Take 2
@zdim's answer got me started thinking about fluent interfaces, e.g., as in my answer:
$testcases2 # line 26
->add(__LINE__,0,27,'LINE')
->add(__LINE__,1,28,'LINE')
->L(2,29,'L()')
->L(3,30,'L()')
->L(3,31,'L()')
;
However, even those don't work here — I get line 26 for each of the ->L() calls. So it appears that caller sees all of the chained calls as coming from the $testcases2->... line. Oh well. I'm still interested in knowing why, if anyone can enlighten me!
The caller can get only the line numbers of statements, decided at compilation.
When I change the code to
my @testcases;
push @testcases, ([__LINE__,0,32,'LINE']);
push @testcases, ([__LINE__,1,33,'LINE']);
push @testcases, (L(2,34,'L()'));
push @testcases, (L(3,35,'L()'));
...
maintaining line numbers, it works (except for string evals).
So, on the practical side, using caller is fine with separate statements for calls.
Perl internals
The line numbers are baked into the op-tree at compilation and (my emphasis)
At run-time, only the line numbers of statements are available [...]
from ikegami's post on PerlMonks.
We can see this by running perl -MO=Concise script.pl where the line
2 nextstate(main 25 line_nos.pl:45) v:*,&,{,x*,x&,x$,$,67108864 ->3
is for the nextstate op, which sets the line number for caller (and warnings). See this post, and the nextstate example below.
A way around this would be to try to trick the compilation (somehow) or, better of course, to not assemble information in a list like that. One such approach is in the answer by cxw.
See this post for a related case and more detail.
nextstate example
Here's a multi-line function-call chain run through B::Concise (annotated):
$ perl -MO=Concise -e '$x
->foo()
->bar()
->bat()'
d <@> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3 <=== the only nextstate
c <1> entersub[t4] vKRS/TARG ->d
3 <0> pushmark s ->4
a <1> entersub[t3] sKRMS/LVINTRO,TARG,INARGS ->b
4 <0> pushmark s ->5
8 <1> entersub[t2] sKRMS/LVINTRO,TARG,INARGS ->9
5 <0> pushmark s ->6
- <1> ex-rv2sv sKM/1 ->7
6 <#> gvsv[*x] s ->7
7 <.> method_named[PV "foo"] s ->8
9 <.> method_named[PV "bar"] s ->a
b <.> method_named[PV "bat"] ->c
-e syntax OK
Even though successive calls are on separate lines, they are part of the same statement, so are all attached to the same nextstate.
Edit This answer is now wrapped in a CPAN module (GitHub)!
@zdim's answer got me thinking about fluent interfaces. Below are two hacks that work for my particular use case, but that don't help me understand the behaviour reported in the question. If you can help, please post another answer!
Hack 2 (newer) (the one now on CPAN)
I think this one is very close to minimal. In perl, you can call a subroutine through a reference with $ref->(), and you can leave out the second and subsequent -> in a chain of arrows. That means, for example, that you can do:
my $foo; $foo=sub { say shift; return $foo; };
$foo->(1)
(2)
(3);
Looks good, right? So here's the MCVE:
#!perl
use strict; use warnings; use 5.010;
package FluentAutoIncList2 {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        my $self = bless {lnum => shift // 0, arr => []}, $class;
        # Make a loader that adds an item and returns itself --- not $self
        $self->{loader} = sub { $self->L(@_); return $self->{loader} };
        return $self;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub load { goto &{ shift->{loader} } } # kick off loading
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList2
# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList2->new(__LINE__) # line 28
->add(__LINE__,0,36,'LINE')
->add(__LINE__,1,37,'LINE');
$testcases->load(2,38,'load')-> # <== Only need two arrows.
(3,39,'chain load') # <== After that, () are enough.
(4,40,'chain load')
(5,41,'chain load')
(6,42,'chain load')
(7,43,'chain load')
;
foreach my $idx (0..$testcases->last) {
printf "%2d %-10s line %2d expected %2d %s\n",
$idx, $testcases->{arr}->[$idx]->[3],
$testcases->{arr}->[$idx]->[0],
$testcases->{arr}->[$idx]->[2],
($testcases->{arr}->[$idx]->[0] !=
$testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 36 expected 36
1 LINE line 37 expected 37
2 load line 38 expected 38
3 chain load line 39 expected 39
4 chain load line 40 expected 40
5 chain load line 41 expected 41
6 chain load line 42 expected 42
7 chain load line 43 expected 43
All the chain load lines were loaded with zero extra characters compared to the original [x, y] approach. Some overhead, but not much!
Hack 1
Code:
By starting with __LINE__ and assuming a fixed number of lines per call, a counter will do the trick. This could probably be done more cleanly with a tie.
#!perl
use strict; use warnings; use 5.010;
package FluentAutoIncList {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        return bless {lnum => shift // 0, arr => []}, $class;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList
# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList->new(__LINE__) # line 28
->add(__LINE__,0,29,'LINE')
->add(__LINE__,1,30,'LINE')
->L(2,31,'L()')
->L(3,32,'L()')
->L(4,33,'L()')
;
foreach my $idx (0..$testcases->last) {
printf "%2d %-10s line %2d expected %2d %s\n",
$idx, $testcases->{arr}->[$idx]->[3],
$testcases->{arr}->[$idx]->[0],
$testcases->{arr}->[$idx]->[2],
($testcases->{arr}->[$idx]->[0] !=
$testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 29 expected 29
1 LINE line 30 expected 30
2 L() line 31 expected 31
3 L() line 32 expected 32
4 L() line 33 expected 33

bash: extract executed line numbers from gcov report

gcov is a GNU toolchain utility that produces code coverage reports (see documentation) formatted as follows:
-: 0:Source:../../../edg/attribute.c
-: 0:Graph:tmp.gcno
-: 0:Data:tmp.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include <stdio.h>
-: 2:
-: 3:int main (void)
1: 4:{
1: 5: int i, total;
-: 6:
1: 7: total = 0;
-: 8:
11: 9: for (i = 0; i < 10; i++)
10: 10: total += i;
-: 11:
1: 12: if (total != 45)
#####: 13: printf ("Failure\n");
-: 14: else
1: 15: printf ("Success\n");
1: 16: return 0;
-: 17:}
I need to extract, from within a bash script, the line numbers of the lines that were executed. $ egrep --regexp='^\s+[1-9]' example_file.c.gcov seems to return the relevant lines. An example of typical output would be:
1: 978: attr_name_map = alloc_hash_table(NO_MEMORY_REGION_NUMBER,
79: 982: for (k = 0; k<KNOWN_ATTR_TABLE_LENGTH; ++k) {
78: 989: attr_name_map_entries[k].descr = &known_attr_table[k];
78: 990: *ep = &attr_name_map_entries[k];
1: 992:} /* init_attr_name_map */
519: 2085: new_attr_seen = FALSE;
519: 2103: p_attributes = last_attribute_link(p_attributes);
519: 2104: } while (new_attr_seen);
519: 2106: return attributes;
16: 3026:void transform_type_with_gnu_attributes(a_type_ptr *p_type,
16: 3041: for (ap = attributes; ap != NULL; ap = ap->next) {
1: 6979:void process_alias_fixup_list(void)
1: 6984: an_alias_fixup_ptr entries = alias_fixup_list, entry;
I subsequently must extract the line number strings. The expected output from this example would be:
978
982
989
990
992
2085
2103
2104
2106
3026
3041
6979
6984
Could someone suggest a reliable, robust way to achieve this?
NOTE:
My idea was to eliminate everything that is not placed between the first and the second instance of the character :, which I tried to do with sed without much success so far.
This is fairly simple to do using awk:
awk -F: '/^ *[0-9]/ {gsub(/ /, "", $2); print $2}' file.gcov
That is, use : as the field separator,
and for lines starting with optional spaces followed by a digit (an execution count),
strip the spaces from the 2nd field and print the 2nd field.
But if you really want to use sed,
and you want something robust, you could do this:
sed -e '/^ *[0-9][0-9]*: *[0-9][0-9]*:/!d' -e 's/[^:]*: *//' -e 's/:.*//' file.gcov
What's happening here?
The first command uses a pattern to match lines starting with zero or more spaces followed by 1 or more digits followed by a : followed by zero or more spaces followed by 1 or more digits followed by a :. Then comes the interesting part: we invert this selection with ! and delete it with d. We effectively delete all lines except the ones we need.
The second command is a simple substitution that removes a sequence of characters that are not :, followed by a :, followed by zero or more spaces. The pattern is applied from the beginning of the line, so there is no need for a starting ^, and the first command has already guaranteed that the line has the expected shape.
The last command is even simpler: remove a : and everything after it.
Some versions of sed will give you shortcuts for a more compact writing style, for example [0-9]+ instead of [0-9][0-9]*, but the example above will work with a wider variety of implementations (notably BSD).
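If the bash script can call out to Python, the same filter can be sketched there as well. This is not the bash-only solution the question asks for; it is a minimal illustration of the matching rule used above (numeric count, colon, line number, colon). The names EXECUTED_RE and executed_lines and the shortened sample report are assumptions made for the example.

```python
import re

# Execution count (digits only), colon, line number, colon.
# Lines whose first field is "-" or "#####" do not match.
EXECUTED_RE = re.compile(r"^[ \t]*[0-9]+:[ \t]*([0-9]+):", re.MULTILINE)


def executed_lines(report):
    """Return the source line numbers that carry a numeric execution count."""
    return [int(number) for number in EXECUTED_RE.findall(report)]


# Shortened sample in the layout shown in the question.
SAMPLE = """\
        -:    0:Source:example.c
        -:    3:int main (void)
        1:    4:{
       11:    9:  for (i = 0; i < 10; i++)
       10:   10:    total += i;
    #####:   13:    printf ("Failure");
        1:   15:    printf ("Success");
"""

print(executed_lines(SAMPLE))  # [4, 9, 10, 15]
```

Anchoring the pattern with ^ matters here: without it, a line such as "-:    3:int main (void)" would also contain "spaces followed by a digit" and be picked up by mistake.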

Unable to match Indian Rupee currency symbol using regex in Perl

Following is my text:
Total: ₹ 131.84
Thanks for choosing Uber, Pradeep
I would like to match the amount part, using the following code:
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
$amount = $1;
}
But, it does not match, tried using regex debugging, here's the output:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: DIGIT (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
UTF-8 pattern...
Match failed
Freeing REx: "Total: \x{20B9} (\d+)"
The full code is at http://pastebin.com/TGdFX7hg.
Disclaimer: This feels more like a comment than an answer, but I need more space.
I've never used MIME::Parser and friends before, but from what I've read in the documentation, the following might work:
use Encode qw(decode);
# according to your code, $text_mail is a MIME::Entity object
my $charset = $text_mail->head->mime_attr('content-type.charset');
my $mail_body_raw = $text_mail->bodyhandle->as_string;
my $mail_body = decode $charset, $mail_body_raw;
The idea is to get the charset from the MIME::Head object, then use Encode to decode the body accordingly.
Of course, if you know that it's always going to be UTF-8 text, you could also hardcode that:
my $mail_body = decode 'UTF-8', $mail_body_raw;
After that, your regex may still fail to work because according to the debugging output in your question the character between ₹ and the number is actually not a simple space (ASCII 32, U+0020), but a non-breaking space (U+00A0). You should be able to match that with \s:
if ( $mail_body =~ /Total: \x{20B9}\s(\d+)/ ) {
This is a bit of searching for the explanation, not an outright answer. Please bear with me.
I believe your $mail_body does not contain what you think it does. You posted the input data as plain text. Was that copied from a mail client?
If I take the code and the input data from the question and run it with use re 'debug' I get a different output.
use utf8;
use strict;
use warnings;
use re 'debug';
my $mail_body = qq{Total: ₹ 131.84
Thanks for choosing Uber, Pradeep};
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
my $amount = $1;
}
It will produce this:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: POSIXU[\d] (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
Found anchored substr "Total: %x{20b9} " at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <Total: > | 1:EXACT <Total: \x{20b9} >(5)
11 < %x{20b9} > <131.84%n%n>| 5:OPEN1(7)
11 < %x{20b9} > <131.84%n%n>| 7:PLUS(9)
POSIXU[\d] can match 3 times out of 2147483647...
14 <%x{20b9} 131> <.84%n%nTha>| 9: CLOSE1(11)
14 <%x{20b9} 131> <.84%n%nTha>| 11: END(0)
Match successful!
Freeing REx: "Total: \x{20B9} (\d+)"
Let's compare the line with the Matching REx to your output:
mine:  against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
yours: against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
As we can see, my output has %x{20b9}, a single character, while yours has %342%202%271, which are the three UTF-8 bytes of that character written in octal.
When I started trying this code I forgot to put use utf8 in my code, so I got a bunch of single characters when the regex engine tried to match:
%x{e2}%x{82}%x{b9}
It then rejected the match.
So my conclusion is: Perl doesn't know your input data is utf8.
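The two effects discussed in these answers (undecoded bytes, and a non-breaking space instead of a plain one) are not specific to Perl. As a rough illustration, the Python sketch below rebuilds the byte sequence from the question's debug output (octal 342 202 271 302 240, i.e. U+20B9 followed by U+00A0 in UTF-8) and tries both patterns. The variable and function names are made up for the example, and latin-1 is used only as a stand-in for "each byte treated as one character".

```python
import re

# The rupee sign followed by any single whitespace character, then the digits.
AMOUNT_RE = re.compile(r"Total: \u20B9\s(\d+)")
# The question's pattern, with a plain ASCII space after the rupee sign.
PLAIN_SPACE_RE = re.compile(r"Total: \u20B9 (\d+)")

# Bytes as they appear in the debug output: U+20B9 then U+00A0, UTF-8 encoded.
raw = "Total: \u20b9\u00a0131.84\n\nThanks for choosing Uber, Pradeep".encode("utf-8")

# One character per byte, which is what the regex sees if the body is never decoded.
undecoded = raw.decode("latin-1")
# Properly decoded text.
decoded = raw.decode("utf-8")


def amount(pattern, text):
    """Return the captured digits, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(amount(AMOUNT_RE, undecoded))     # None: the rupee sign is still three separate bytes
print(amount(PLAIN_SPACE_RE, decoded))  # None: the separator is U+00A0, not a plain space
print(amount(AMOUNT_RE, decoded))       # 131
```

Only the combination of decoding first and matching the separator with \s finds the amount.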

Changing spaces with "prxchange", but not all spaces

I need to change the spaces in my text to underscores, but only the spaces that are between words, not the ones between digits, so, for an example
"The quick brown fox 99 07 3475"
Would become
"The_quick_brown_fox 99 07 3475"
I tried using this in a data step:
mytext = prxchange('s/\w\s\w/_/',-1,mytext);
But the result was not what I wanted
"Th_uic_row_ox 99 07 3475"
Any ideas on what I could do?
Thanks in advance.
Data One ;
X = "The quick brown fox 99 07 3475" ;
Y = PrxChange( 's/(?<=[a-z])\s+(?=[a-z])/_/i' , -1 , X ) ;
Put X= Y= ;
Run ;
You are changing
"W W"
to
"_"
when you want to change
"W W"
to
"W_W"
so
prxchange('s/(\w)\s(\w)/$1_$2/',-1,mytext);
Full example:
data test;
mytext='The quick brown fox 99 07 3475';
newtext = prxchange('s/([A-Za-z])\s([A-Za-z])/$1_$2/',-1,mytext);
put _all_;
run;
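The difference between the patterns in these answers can be tried outside SAS. The sketch below uses Python's re module rather than PRXCHANGE, on the assumption that the constructs involved (capture groups, lookbehind, lookahead) behave the same way in both Perl-style engines; the variable names are illustrative only.

```python
import re

text = "The quick brown fox 99 07 3475"

# Letter, space, letter all replaced by "_": the letters are consumed too.
consumed = re.sub(r"[A-Za-z]\s[A-Za-z]", "_", text)

# Capture the letters and put them back around the underscore.
captured = re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", text)

# Lookbehind/lookahead only test the neighbours; just the whitespace is replaced.
lookaround = re.sub(r"(?<=[A-Za-z])\s+(?=[A-Za-z])", "_", text)

print(consumed)    # Th_uic_row_ox 99 07 3475
print(captured)    # The_quick_brown_fox 99 07 3475
print(lookaround)  # The_quick_brown_fox 99 07 3475

# With one-letter words the two fixes differ: a captured letter cannot start
# the next match, while lookarounds leave every letter available.
print(re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", "a b c"))     # a_b c
print(re.sub(r"(?<=[A-Za-z])\s+(?=[A-Za-z])", "_", "a b c"))    # a_b_c
```

For the sample sentence both fixes give the same result; the lookaround form is the safer choice if the text can contain single-letter words.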
You can use the CALL PRXNEXT routine to find the position of each match, then use the SUBSTR function to replace the space with an underscore. I've changed your regular expression because \w matches any alphanumeric character, so it would also match the spaces between numbers. I'm not sure how you got your result using that expression.
Anyway, the code below should give you what you want.
data have;
mytext='The quick brown fox 99 07 3475';
_re=prxparse('/[a-z]\s[a-z]/i'); /* match a letter followed by a space followed by a letter, ignore case */
_start=1 /* starting position for search */;
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of 1st match */
do while(_position>0); /* loop through all matches */
substr(mytext,_position+1,1)='_'; /* replace ' ' with '_' for matches */
_start=_start-2; /* prevents the next start position jumping 3 ahead (the length of the regex search string) */
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of next match */
end;
drop _: ;
run;