Extract strings in quotation marks and brackets - regex

My test string is:
My name is "Ralph" ("France" is my country, "123" my age, ... , "an other text", ...)
I want to get the strings in quotation marks, but only those inside the brackets. In my example: the strings France and 123.
I've tested this pattern:
#\(.*"(.*)".*\)#
but it only matches the last string, 123. (I use preg_match_all(), so shouldn't it return every result?)
If I add the Ungreedy option, it matches only the first string, France. So I don't understand why it wasn't greedy without the U option. Is there a way to obtain the strings that are both in quotation marks and in brackets?
Thanks,
Raphaël N.

The only way I could get this to work is by using:
$subject = 'My 00123450 "Ralph" ("France" is my country, "123" my age, ... , "an other text", ...)';
$pattern = '/\((?:[^"\)]*"(.*?)")?(?:[^"\)]*"(.*?)")?(?:[^"\)]*"(.*?)")?[^"]*?\)/';
preg_match_all($pattern, $subject, $matches);
for ($i = 1; $i < count($matches); $i++)
{
print($i.': '.$matches[$i][0].";\n");
}
Output:
1: France;
2: 123;
3: an other text;
That regex only works for up to 3 occurrences of quoted strings inside a set of brackets. You could, however, extend it to grab up to N occurrences as follows.
The regex to find 1 to N quoted strings within each set of parentheses is:
n=1 /\((?:[^"\)]*"(.*?)")?[^"]*?\)/
n=2 /\((?:[^"\)]*"(.*?)")?(?:[^"\)]*"(.*?)")?[^"]*?\)/
n=3 /\((?:[^"\)]*"(.*?)")?(?:[^"\)]*"(.*?)")?(?:[^"\)]*"(.*?)")?[^"]*?\)/
To find 1 to N strings, repeat the section (?:[^"\)]*"(.*?)")? N times. For 1 to 100 strings within each set of parentheses you would have to repeat that section 100 times, and the regex would evaluate very slowly.
I realise that isn't by any means ideal, but it's my best effort at a one-pass solution.
In 2 passes:
$subject = 'My name is "Ralph" ("France" is my country, "123" my age, ... , "an other text", ...)';
$pattern = '/\(.*?\)/';
preg_match_all($pattern, $subject, $matches);
$pattern2 = '/".*?"/';
preg_match_all($pattern2, $matches[0][0], $matches2);
print_r($matches2);
This produces the correct output in 2 passes (for the first set of brackets only, since it reads $matches[0][0]). I'm eagerly awaiting an answer that shows how to do it in 1 pass, though. I've tried every variation I could think of, but can't get it to include overlapping matches.
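For comparison, one pass is possible without overlapping matches if each quoted string is checked with a lookahead. The following is a minimal sketch in Python rather than PHP. It assumes that parentheses are never nested, that every ")" closes a bracket, and that the quoted strings themselves contain no parentheses. The function name quoted_in_brackets is made up for illustration.

```python
import re

# A quoted string counts as "inside brackets" when a ")" follows it
# with no other parenthesis in between (assumes no nesting).
QUOTED_IN_BRACKETS = re.compile(r'"([^"()]*)"(?=[^()]*\))')

def quoted_in_brackets(text):
    """Return the contents of quoted strings that sit inside (...)."""
    return QUOTED_IN_BRACKETS.findall(text)

subject = ('My name is "Ralph" ("France" is my country, "123" my age, '
           '... , "an other text", ...)')
print(quoted_in_brackets(subject))
# prints: ['France', '123', 'an other text']
```

The pattern uses only a negated character class and a lookahead, both of which PCRE supports as well, so the same idea can be tried with preg_match_all.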

Keep it simple and do it in 2 steps:
$s = 'My name is "Ralph" ("France" is my country, "123" my age) and "I" am. ';
$str = preg_replace('#^.*?\(([^)]*)\).*$#', '$1', $s);
if (preg_match_all('/"([^"]*)"/', $str, $arr))
print_r($arr[0]);
OUTPUT:
Array
(
[0] => "France"
[1] => "123"
)

This should work for you:
\("([^"]+)".+?"([^"]+)"
Explained:
\(" - match an opening bracket and a double quote
([^"]+)" - capture everything inside the first pair of double quotes
.+?" - match anything up to the next double quote
([^"]+) - capture everything that isn't a double quote
" - match the closing double quote
As long as your data has exactly the shape of the sample (the bracket immediately followed by a quoted string, and two strings to extract), the regex should work.


Regex named group in different locations [duplicate]

I have an input which contains a substring in the format TXT number or number TXT. I would like to write a regexp which matches either format and returns only the number.
I came up with something like this:
$regex = '/TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/'
The problem is that the compiler says that a group with the name number is already defined, even though there is an or operator in between.
Is it possible in PHP to have 2 groups with the same name? If not, how can I write a regexp like that?
To write 2 groups with the same name you need to use the (?J) inline flag:
'/(?J)TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/'
Documentation:
J (PCRE_INFO_JCHANGED)
The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns. As of PHP 7.2.0 J is supported as modifier as well.
PHP demo:
$regex = '/(?J)TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/';
if (preg_match_all($regex, "TXT123 and 456TXT1", $matches, PREG_SET_ORDER, 0)) {
foreach ($matches as $m) {
echo $m["number"] . PHP_EOL;
}
}
Note that in your case, you do not need the groups:
'/TXT\K[0-9]+|[0-9]+(?=TXT)/'
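Here \K drops the already-matched TXT from the reported match, and the lookahead (?=TXT) checks for a trailing TXT without consuming it, so the whole match is just the number.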
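The group-free idea carries over to engines that lack both (?J) and \K. As a rough sketch, Python's re module supports neither, but a fixed-width lookbehind gives the same result on the sample input. The function name extract_numbers is an illustrative assumption.

```python
import re

# (?<=TXT) requires a preceding TXT, (?=TXT) a following one;
# neither is part of the match, so each match is only the digits.
NUMBER = re.compile(r"(?<=TXT)[0-9]+|[0-9]+(?=TXT)")

def extract_numbers(text):
    """Return every number written as 'TXT<digits>' or '<digits>TXT'."""
    return NUMBER.findall(text)

print(extract_numbers("TXT123 and 456TXT1"))
# prints: ['123', '456', '1']
```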
You could use a branch reset group (?| and add a space between the digits and the TXT.
(?|TXT (?<number>[0-9]+)|(?<number>[0-9]+) TXT)
For example
$re = '/(?|TXT (?<number>[0-9]+)|(?<number>[0-9]+) TXT)/';
$str = 'TXT 4
4 TXT';
preg_match_all($re, $str, $matches);
print_r($matches["number"]);
Output
Array
(
[0] => 4
[1] => 4
)

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, as they are not prefixed/suffixed with a comma?
Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
(Note the space after each of the two commas.)
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have tried a number of ways of doing this but cannot work out the best way to construct the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
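The same match-then-split idea can be sketched in Python to make the two steps explicit. This is only an illustration under the question's assumptions (list items contain no spaces or commas); the function name comma_items is hypothetical.

```python
import re

# One or more non-comma, non-space characters, followed by at least
# one ",item" group: a comma followed by a space can never match.
COMMA_LIST = re.compile(r"[^,\s]+(?:,[^,\s]+)+")

def comma_items(text):
    """Collect the items of every space-free comma-separated list."""
    items = []
    for match in COMMA_LIST.finditer(text):
        items.extend(match.group(0).split(","))
    return items

text = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose"
print(comma_items(text))
# prints: ['cat_dog', 'horse', 'rabbit', 'chicken-pig', 'pig', 'duck', 'goose']
```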
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Keep only the elements that have no comma at either end but at least one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
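For readers less used to chained list operators, the three steps above read roughly as follows in Python. This is a sketch of the same filter, not a translation of any library routine, and the name comma_lists is made up for the example.

```python
def comma_lists(text):
    """Split on whitespace, keep tokens with inner commas only, split those."""
    parts = []
    for token in text.split():                    # step 1: split on spaces
        has_inner_comma = (
            "," in token
            and not token.startswith(",")
            and not token.endswith(",")
        )
        if has_inner_comma:                       # step 2: the grep filter
            parts.extend(token.split(","))        # step 3: split each token
    return parts

line = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
print(comma_lists(line))
# prints: ['ANTELOPE-GIRAFFE', 'frOG', 'fish', 'crab', 'kangaROO-KOALA']
```

Extra conditions, such as rejecting two consecutive commas, would go into the has_inner_comma test.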
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
    # One array ref per input string; empty if no list was found.
    # Testing the match itself avoids reading a stale $1 after a failure.
    /$regex/ ? [ split /,/, $1 ] : []
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push(@arr, $&);
}
The regular expression matches three cases:
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

Convert string with preg_replace in PHP

I have this string
$string = "some words and then #1.7 1.7 1_7 and 1-7";
and I would like #1.7, 1.7, 1_7 and 1-7 each to be replaced by S1E07.
Of course, "1.7" is just an example; it could just as well be "3.15".
I managed to create the regular expression that would match the above 4 variants
/\#\d{1,2}\.\d{1,2}|\d{1,2}_\d{1,2}|\d{1,2}-\d{1,2}|\d{1,2}\.\d{1,2}/
but I cannot figure out how to use preg_replace (or something similar?) to actually replace the matches so they end up like S1E07
You need to use preg_replace_callback if you want to pad the number with a 0 when it is less than 10.
$string = "some words and then #1.7 1.7 1_7 and 1-7";
$string = preg_replace_callback('/#?(\d+)[._-](\d+)/', function($matches) {
// str_pad only pads when needed, so an input of "07" is not turned into "007"
return 'S'.$matches[1].'E'.str_pad($matches[2], 2, '0', STR_PAD_LEFT);
}, $string);
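The callback approach is not specific to PHP. As a hedged illustration, here is the same replacement in Python, where re.sub accepts a function as the replacement. The name to_episode_code is an assumption for the example, and the pattern adds word boundaries and a 1 to 2 digit limit as in the question.

```python
import re

# Optional "#", then season and episode separated by ".", "_" or "-".
EPISODE = re.compile(r"#?\b(\d{1,2})[-._](\d{1,2})\b")

def to_episode_code(text):
    """Rewrite 1.7, #1.7, 1_7 and 1-7 as S1E07 (episode zero-padded)."""
    def repl(match):
        season = int(match.group(1))
        episode = int(match.group(2))
        return f"S{season}E{episode:02d}"
    return EPISODE.sub(repl, text)

print(to_episode_code("some words and then #1.7 1.7 1_7 and 1-7"))
# prints: some words and then S1E07 S1E07 S1E07 and S1E07
```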
You could use this simple string replace:
preg_replace('/#?\b(\d{1,2})[-._](\d{1,2})\b/', 'S${1}E${2}', $string);
But it would not yield zero-padded numbers for the episode number:
// some words and then S1E7 S1E7 S1E7 and S1E7
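You could use the evaluation modifier (note that /e was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so use preg_replace_callback on current versions):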
preg_replace('/#?\b(\d{1,2})[-._](\d{1,2})\b/e', '"S".str_pad($1, 2, "0", STR_PAD_LEFT)."E".str_pad($2, 2, "0", STR_PAD_LEFT)', $string);
...and use str_pad to add the zeroes.
// some words and then S01E07 S01E07 S01E07 and S01E07
If you don't want the season number to be padded you can just take out the first str_pad call.
I believe this will do what you want it to...
/\#?([0-9]+)[._-]([0-9]+)/
In other words...
\#? - can start with the #
([0-9]+) - capture at least one digit
[._-] - look for one ., _ or -
([0-9]+) - capture at least one digit
And then you can use this to replace...
S$1E$2
Which will put out S then the first captured group, then E then the second captured group
You need to put brackets around the parts you want to reuse ==> capture them. Then you can access those values in the replacement string with $1 (or ${1} if the groups exceed 9) for the first group, $2 for the second one...
The problem here is that you would end up with $1 - $8, so I would rewrite the expression into something like this:
/#?(\d{1,2})[._-](\d{1,2})/
and replace with
S${1}E${2}
I tested it on writecodeonline.com:
$string = "some words and then #1.7 1.7 1_7 and 1-7";
$result = preg_replace('/#?(\d{1,2})[._-](\d{1,2})/', 'S${1}E${2}', $string);

How to extract groups of digits from a number using regexp

set phoneNumber 1234567890
This number is one run of digits. I want to divide it into 123 456 7890 using regexp. Is that possible without using the split function?
The following snippet:
regexp {(\d{3})(\d{3})(\d{4})} "8144658695" -> areacode first second
puts "($areacode) $first-$second"
Prints (as seen on ideone.com):
(814) 465-8695
This uses capturing groups in the pattern and subMatchVar... for Tcl regexp
References
http://www.hume.com/html84/mann/regexp.html
regular-expressions.info/Brackets for Capturing
On the pattern
The regex pattern is:
(\d{3})(\d{3})(\d{4})
\_____/\_____/\_____/
1 2 3
It has 3 capturing groups (…). The \d is a shorthand for the digit character class. The {3} in this context means "exactly 3 repetitions of".
References
regular-expressions.info/Repetition, Character Class
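The same three-group pattern works unchanged in other regex flavors. A small Python sketch, assuming the input is exactly ten digits (the function name split_phone is illustrative):

```python
import re

def split_phone(number):
    """Split a 10-digit string into (3, 3, 4)-digit groups, or return None."""
    match = re.fullmatch(r"(\d{3})(\d{3})(\d{4})", number)
    return match.groups() if match else None

print(" ".join(split_phone("1234567890")))
# prints: 123 456 7890
```

fullmatch is used so that strings with extra characters are rejected instead of being silently truncated.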
my($number) = "8144658695";
$number =~ m/(\d\d\d)(\d\d\d)(\d\d\d\d)/;
my $num1 = $1;
my $num2 = $2;
my $num3 = $3;
print $num1 . "\n";
print $num2 . "\n";
print $num3 . "\n";
This is written for Perl and works assuming the number is in the exact format you specified. Hope this helps.
This site might help you with regex
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

In Perl, how can I detect if a string has multiple occurrences of double digits?

I want to match 110110 but not 10110. That is, the string must contain at least two occurrences of a pair of identical consecutive digits. Is there a regex for that?
Should match: 110110, 123445446, 12344544644
Should not match: 10110, 123445
/(\d)\1.*\1\1/
This matches a string with 2 instances of a double number, ie 11011 but not 10011
\d matches any digit
\1 matches the first match effectively doubling the first entry
This will also match 1111. If there need to be other characters in between, change .* to .+
ooh, this looks neater
((\d)\2).*\1
If you want to find non-matching values, but there has to be 2 sets of doubles, then you would simply need to add the first part again as in
((\d)\2).*((\d)\4)
The bracketing would mean that $1 and $3 would contain the double digits and $2 and $4 contains the single digits (which are then doubled).
11233
$1=11
$2=1
$3=33
$4=3
If I understand correctly, your regexp will be:
m{
(\d)\1 # First repeated pair
.* # Anything in between
(\d)\2 # Second repeated pair
}x
For example:
for my $x (qw(110110 123445446 12344544644 10110 123445)) {
my $m = $x =~ m{(\d)\1.*(\d)\2} ? "matches" : "does not match";
printf "%-11s : %s\n", $x, $m;
}
110110 : matches
123445446 : matches
12344544644 : matches
10110 : does not match
123445 : does not match
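The pattern with two independent backreferences behaves the same way in other engines. As a quick cross-check, here is a Python sketch over the question's examples; the function name has_two_doubles is an assumption.

```python
import re

# A doubled digit, anything in between, then another doubled digit
# (which may or may not be the same digit as the first pair).
TWO_DOUBLES = re.compile(r"(\d)\1.*(\d)\2")

def has_two_doubles(s):
    """True if s contains two non-overlapping pairs of repeated digits."""
    return TWO_DOUBLES.search(s) is not None

for x in ["110110", "123445446", "12344544644", "10110", "123445"]:
    print(x, "matches" if has_two_doubles(x) else "does not match")
```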
If you're talking about all digits, this will do it:
00.*00|11.*11|22.*22|33.*33|44.*44|55.*55|66.*66|77.*77|88.*88|99.*99
It's just 10 different patterns OR'ed together, each of which checks for at least two occurrences of the desired 2-digit pattern.
Using Perl's more advanced REs, you can use the following for two consecutive digits twice:
(\d)\1.*\1\1
or, as one of your comments states, two consecutive digits followed somewhere by two more consecutive digits which may not be the same:
(\d)\1.*(\d)\2
depending on how your data is, here's a minimal regex way.
while(<DATA>){
chomp;
@s = split /\s+/;
foreach my $i (@s){
if( $i =~ /123445/ && length($i) != 6){
print $_."\n";
}
}
}
__DATA__
This is a line
blah 123445446 blah
blah blah 12344544644 blah
.... 123445 ....
this is last line
There is no reason to do everything in one regex... You can use the rest of Perl as well:
#!/usr/bin/perl -l
use strict;
use warnings;
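my @strings = qw( 11233 110110 10110 123445 123445446 12344544644 );
# Perl allows only one statement modifier, so use "and" instead of a nested "if"
is_wanted($_) and print for @strings;
sub is_wanted {
    my ($s) = @_;
    # Each match contributes two captures (the pair and its first digit)
    my @matches = $s =~ /(?<group>(?<first>[0-9])\k<first>)/g;
    return 1 < @matches / 2;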
}
__END__
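The counting idea, finding every doubled digit and requiring at least two of them, can be sketched in Python as well. This is an illustration only; count_doubles and is_wanted are names chosen for the example.

```python
import re

def count_doubles(s):
    """Count non-overlapping pairs of identical consecutive digits."""
    return len(re.findall(r"(\d)\1", s))

def is_wanted(s):
    # Wanted strings contain at least two doubled digits.
    return count_doubles(s) >= 2

strings = ["11233", "110110", "10110", "123445", "123445446", "12344544644"]
print([s for s in strings if is_wanted(s)])
# prints: ['11233', '110110', '123445446', '12344544644']
```

Counting outside the regex makes the threshold easy to change, for example to require three pairs.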
If I've understood your question correctly, then this, according to RegexBuddy (set to use Perl syntax), will match 110110 but not 10110. The backreference is written \g{1} (Perl 5.10 or later) because Perl would not read \10 as group 1 followed by a literal 0:
(1{2})0\g{1}0
The following is more general and will match any string in which a two-digit sequence is repeated later on in the string.
(\d{2})\d+\1\d*
The above will match the following examples:
110110
110011
112345611
2200022345
Finally, if you want to find two sets of double digits in a string and don't care where they are, try this:
\d*?(\d{2})\d+?\1\d*
This will match the examples above plus this one:
12345501355789
It's the two sets of double 5 in the above example that are matched.
[Update]
Having just seen your extra requirement of matching a string with two different double digits, try this:
\d*?(\d)\1\d*?(\d)\2\d*
This will match strings like the following:
12342202345567
12342202342267
Note that the 22 and 55 cause the first string to match and the pair of 22 cause the second string to match.