I'm trying to use regular expressions in R, using the stringr library. I was studying the str_match and str_replace functions. I don't understand why they give different results when I use parentheses for grouping:
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s))
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s), "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " SS"
Try using just the expression s instead of perl(s):
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s)
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s, "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " PIAZZALE "
I've had a look in the documentation for this library:
http://cran.r-project.org/web/packages/stringr/stringr.pdf
It suggests that while str_replace accepts POSIX patterns by default and also Perl patterns if supplied, str_match only accepts POSIX-style patterns and will treat the pattern as such even when it is supplied as a Perl pattern. The two calls returned different values because they were using different expression engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you potentially use str_detect instead of str_match?
The difference between POSIX and perl that causes this:
The POSIX engine does not recognise lazy (non-greedy) quantifiers.
Your expression
(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})
would be seen as the perl equivalent of
(.+)( PIAZZALE | SS)(.+)([0-9]{5})
Here the first quantified class .+ matches as much as it can (the full string) before backtracking and evaluating the rest of the expression. It succeeds once the first .+ has backtracked from the end of the string far enough to consume only the characters MONT SS DPR, leaving " PIAZZALE " for the second capture group, a[3].
Simplified Explanation of Engine Inner Workings
Here is a simplified explanation of how the different engines are processing your string. All of your quantifiers/alternation are directly wrapped in capture groups so the numbered quantifiers in the following examples are also your capture groups:
Perl:
Quantifier 1: "M"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MO"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MON"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " "
Quantifier 4: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " D"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " DPR PIAZZALE CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
POSIX:
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 4783"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 478"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47"
Quantifier 2: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR P"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR "
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE 47838"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
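The two outcomes traced above can be checked outside R as well. Below is a minimal sketch using Python's re module (an assumption for illustration only: it is a backtracking engine, not the engine stringr uses), comparing the original lazy pattern with the greedy rewrite. The variable names are made up for the example.

```python
import re

text = "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"

# Original pattern with lazy quantifiers, and the same pattern with the "?" dropped.
lazy = r"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy = r"(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

lazy_groups = re.search(lazy, text).groups()
greedy_groups = re.search(greedy, text).groups()

print(lazy_groups[1])    # second capture group with lazy quantifiers
print(greedy_groups[1])  # second capture group with greedy quantifiers
```

With the lazy pattern the second group is " SS"; with the greedy one it is " PIAZZALE ", which mirrors the difference between the two stringr calls.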
Related
I'm trying to match and substitute a pattern.
Test String: {1-Emp Name: "John", "2-Emp pat" : 1123,"3-Emp lwd" : 20}, "4-Emp Pat" : 1234}
I'm trying to match the pattern with the word "pat" from the test string and substitute it.
Expected Result: {1-Emp Name: "John", "matched Pattern" : 1123,"3-Emp lwd" : 20}, "matched Pattern" : 1234}
My regex: ".+?(?i)Pat.+?(?=:)
You can use
Regex pattern: (?i)"[^"]* Pat\b[^"]*("\s*:)
Replacement pattern: "matched pattern$1
See the regex demo. Details:
(?i) - case insensitive inline modifier
" - a " char
[^"]* - zero or more chars other than "
Pat - space + Pat word
\b - word boundary
[^"]* - zero or more chars other than "
("\s*:) - Group 1 ($1): ", zero or more whitespaces, :.
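As a quick check of the pattern and replacement described above, here is a minimal sketch in Python (an assumption: the question does not name a language, and Python writes the Group 1 backreference as \1 instead of $1).

```python
import re

text = '{1-Emp Name: "John", "2-Emp pat" : 1123,"3-Emp lwd" : 20}, "4-Emp Pat" : 1234}'

# Same pattern as above; Group 1 keeps the closing quote, optional whitespace and the colon.
pattern = r'(?i)"[^"]* Pat\b[^"]*("\s*:)'

result = re.sub(pattern, r'"matched pattern\1', text)
print(result)
```

Only the two keys containing the word "pat" are rewritten; "3-Emp lwd" is left alone because its quoted text never contains a space followed by Pat.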
Text is
lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
I want to find the lemmas that contain K() inside the quoted text that follows them. I have tried the Perl regex (?s)^[ ]*lemma.*?"(?!").*?K\( but it overlaps two lemmas. The output should be: lemma A: "..." and lemma C: "...".
If the double quote is at the start of a line, you can match a newline and then the double quote.
Then match any character except a double quote until you reach K(
^[ ]*lemma\b.*\R"[^"]*K\(
^ Start of a line (assuming the multiline flag is enabled)
[ ]*lemma\b Match optional spaces and lemma
.*\R Match the rest of the line and a newline
"[^"]* Match " followed by optional chars other than "
K\( Match K(
Regex demo
You could use:
(?s)^[ ]*lemma[^"]*"[^"]*?K\(
[^"] means "any character but ""
See a demo here
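To see that the negated character class keeps each match inside a single lemma, here is a minimal sketch in Python (an assumption for illustration; the question itself uses Perl). The multiline flag is passed explicitly so that ^ matches at every line start, and the first line of each match is taken as the lemma header.

```python
import re

text = '''lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
'''

# [^"] also matches newlines, so no DOTALL flag is needed;
# re.M lets ^ anchor at the start of each line.
pattern = re.compile(r'^[ ]*lemma[^"]*"[^"]*?K\(', re.M)

hits = [m.group(0).splitlines()[0] for m in pattern.finditer(text)]
print(hits)
```

Because [^"]*? cannot step over the closing quote, a lemma without K( fails on its own instead of borrowing the K( of the next lemma.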
I have a regex that look like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1 Group2
test value0
test "value1", "value2" // or - value1", "value2
Is there any trick to ignore "" and [] and stick with two groups, group1 and group2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.
You may use this regex in PHP with a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This gives you two capture groups for both inputs, whether the value is enclosed in "..." or [...].
RegEx Demo
RegEx Details:
(?|..) is a branch reset group. Subpatterns declared within each alternative of this construct start over from the same group index.
(?|"([^"]*)"|\[([^]]*)]) is an alternation inside the branch reset group: if " is matched, the "([^"]*)" subpattern is used; otherwise the \[([^]]*)] subpattern is used. Either way the value lands in Group 2.
You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
See the regex demo.
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
In Java, you can't use \K and ?|, use capturing groups:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\"test\":[\"value1\", \"value2\"]";
// Group 2 holds a quoted value, Group 3 a bracketed one; only one of them participates in a match.
Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("Key: " + matcher.group(1));
    if (matcher.group(2) != null) {
        System.out.println("Value: " + matcher.group(2));
    } else {
        System.out.println("Value: " + matcher.group(3));
    }
}
See a Java demo.
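The same fallback works in other engines without \K and (?|...), for example Python's built-in re module. The following is a minimal sketch under that assumption; the helper name key_and_value is hypothetical, and the closing quote and bracket are added to the pattern so the whole value is consumed.

```python
import re

# Two ordinary capture groups replace the branch reset group:
# Group 2 for a quoted value, Group 3 for a bracketed one.
pattern = re.compile(r'"(test)"\s*:\s*(?:"([^"]*)"|\[([^\]]*)\])')

def key_and_value(line):
    """Return (key, value) for one input line, or None if nothing matches."""
    m = pattern.search(line)
    if m is None:
        return None
    # Exactly one of groups 2 and 3 participates in a successful match.
    value = m.group(2) if m.group(2) is not None else m.group(3)
    return m.group(1), value

print(key_and_value('"test":"value0"'))
print(key_and_value('"test":["value1", "value2"]'))
```

Collapsing the two groups in code gives the caller the two-value result the question asks for, even though the pattern itself still has three groups.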
I have this \"([^"]*)\"
and on data """Storno ISP""- ""Nesprávne nastavená modulácia KZ (G.DMT/G.992.1B), potrebné nastaviť adsl2+ (G.992.5B)""" "Fast" "Battery" "JNAKA".
I would like to match only "Fast" "Battery" "JNAKA".
Where am I wrong?
You may require no double quotes on each side:
(?<!")"([^"]+)"(?!")
See the regex demo
Details
(?<!") - no " immediately on the left is allowed
" - a " char
([^"]+) - Group 1: one or more chars other than "
" - a " char
(?!") - no " immediately on the right is allowed.
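The lookarounds described above can be tried on the sample data with a minimal Python sketch (an assumption: the question does not say which regex flavor is in use, but this syntax is common to most of them).

```python
import re

data = ('"""Storno ISP""- ""Nesprávne nastavená modulácia KZ (G.DMT/G.992.1B), '
        'potrebné nastaviť adsl2+ (G.992.5B)""" "Fast" "Battery" "JNAKA".')

# A quote with no quote to its left, one or more non-quote characters,
# and a closing quote with no quote to its right.
matches = re.findall(r'(?<!")"([^"]+)"(?!")', data)
print(matches)
```

The doubled and tripled quotes in the first part of the data are rejected either by the lookbehind or by the requirement of at least one non-quote character.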
Following is my text:
Total: ₹ 131.84
Thanks for choosing Uber, Pradeep
I would like to match the amount part, using the following code:
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
$amount = $1;
}
But it does not match. I tried regex debugging; here's the output:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: DIGIT (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
UTF-8 pattern...
Match failed
Freeing REx: "Total: \x{20B9} (\d+)"
The full code is at http://pastebin.com/TGdFX7hg.
Disclaimer: This feels more like a comment than an answer, but I need more space.
I've never used MIME::Parser and friends before, but from what I've read in the documentation, the following might work:
use Encode qw(decode);
# according to your code, $text_mail is a MIME::Entity object
my $charset = $text_mail->head->mime_attr('content-type.charset');
my $mail_body_raw = $text_mail->bodyhandle->as_string;
my $mail_body = decode $charset, $mail_body_raw;
The idea is to get the charset from the MIME::Head object, then use Encode to decode the body accordingly.
Of course, if you know that it's always going to be UTF-8 text, you could also hardcode that:
my $mail_body = decode 'UTF-8', $mail_body_raw;
After that, your regex may still fail to work because according to the debugging output in your question the character between ₹ and the number is actually not a simple space (ASCII 32, U+0020), but a non-breaking space (U+00A0). You should be able to match that with \s:
if ( $mail_body =~ /Total: \x{20B9}\s(\d+)/ ) {
This is a bit of searching for the explanation, not an outright answer. Please bear with me.
I believe your $mail_body does not contain what you think it does. You posted the input data as plain text. Was that copied from a mail client?
If I take the code and the input data from the question and run it with use re 'debug' I get a different output.
use utf8;
use strict;
use warnings;
use re 'debug';
my $mail_body = qq{Total: ₹ 131.84
Thanks for choosing Uber, Pradeep};
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
my $amount = $1;
}
It will produce this:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: POSIXU[\d] (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
Found anchored substr "Total: %x{20b9} " at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <Total: > | 1:EXACT <Total: \x{20b9} >(5)
11 < %x{20b9} > <131.84%n%n>| 5:OPEN1(7)
11 < %x{20b9} > <131.84%n%n>| 7:PLUS(9)
POSIXU[\d] can match 3 times out of 2147483647...
14 <%x{20b9} 131> <.84%n%nTha>| 9: CLOSE1(11)
14 <%x{20b9} 131> <.84%n%nTha>| 11: END(0)
Match successful!
Freeing REx: "Total: \x{20B9} (\d+)"
Let's compare the line with the Matching REx to your output:
against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
As we can see, my output has %x{20b9}, while yours has %342%202%271, which are the three UTF-8 bytes of that character written in octal (0xE2 0x82 0xB9).
When I started trying this code I forgot to put use utf8 in my code, so I got a bunch of single characters when the regex engine tried to match:
%x{e2}%x{82}%x{b9}
It then rejected the match.
So my conclusion is: Perl doesn't know your input data is utf8.
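The byte-versus-character difference can be reproduced in a few lines. This is a minimal sketch in Python rather than Perl (an assumption for illustration), with a hypothetical raw body built from the bytes shown in the question's debug output: the rupee sign followed by a non-breaking space.

```python
import re

# Hypothetical raw mail body: U+20B9 followed by U+00A0, encoded as UTF-8 bytes.
raw = "Total: \u20b9\u00a0131.84\n\nThanks for choosing Uber, Pradeep".encode("utf-8")

pattern = r"Total: \u20b9\s(\d+)"

# Undecoded: every byte becomes one character, roughly what the failing match saw.
undecoded = raw.decode("latin-1")
# Decoded: the three bytes E2 82 B9 become the single character U+20B9.
decoded = raw.decode("utf-8")

print(raw[7:12])  # the bytes behind the rupee sign and the non-breaking space
before = re.search(pattern, undecoded)
after = re.search(pattern, decoded)
print(before)
print(after.group(1))
```

The match only succeeds on the decoded text. The capture is 131 because (\d+) stops at the decimal point, as it does in the original pattern.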