Extract groups separated by space - regex

I've got the following string (example):
Loader[data-prop data-attr="value"]
There can be 1 to n attributes. I want to extract every attribute (data-prop, data-attr="value"). I tried it in many different ways, for example with \[(?:(\S+)\s)*\], but I didn't get it right. The expression should be written in PREG style.

I suggest grabbing all the key-value pairs with a regex:
'~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~'
(see regex demo) and then merging the capture groups and dropping the stray quote captures.
See IDEONE demo
$re = '~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~';
$str = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";
preg_match_all($re, $str, $matches);
$arr = array();
for ($i = 0; $i < count($matches); $i++) {
if ($i != 0) {
$arr = array_merge(array_filter($matches[$i]),$arr);
}
}
print_r(preg_grep('~\A(?![\'"]\z)~', $arr));
Output:
Array
(
[3] => data-prop
[4] => data-attr="value"
[5] => more-here='data'
[6] => and-one-more=""
[7] => Loader
)
Notes on the regex (it only looks too complex):
(?:([^][]*)\b\[|(?!^)\G) - a boundary: we only start matching at a [ that is preceded with a word (a-zA-Z0-9_) character (with \b\[), or right after a successful match (with (?!^)\G). Also, ([^][]*) will capture into Group 1 the part before the [.
\s* - matches zero or more whitespace symbols
(\w+(?:-\w+)*...) - Group 2 starts with "words" like "word1" or "word1-word2"..."word1-wordn"
(?:=(["\'])?[^\]]*?\3)? - optional group (due to (?:...)?), still inside Group 2, matching
= - an equal sign
(["\'])? - Group 3 (auxiliary group to check for the value delimiter) capturing either ", ' or nothing
[^\]]*? - (value) zero or more characters other than ] as few as possible
\3 - the closing ' or " (the same value captured in Group 3).
Since we cannot avoid capturing the ' or " into Group 3, we can filter out the elements we are not interested in with preg_grep('~\A(?![\'"]\z)~', $arr), where \A(?![\'"]\z) matches any string that is not equal to ' or ", so only those strings are kept.
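For readers outside PHP: JavaScript regexes have no \G anchor, so the single-pattern approach above does not port directly. A minimal sketch of a two-pass alternative follows. It is not the method from the answer; extractAttributes is a hypothetical helper name, and the sketch assumes no ] appears inside a quoted value.

```javascript
// Two-pass extraction: isolate the bracket body first, then list the attributes.
// extractAttributes is a hypothetical name used only for this illustration.
function extractAttributes(input) {
  // Pass 1: capture the name before "[" and the raw text between the brackets.
  // Assumption: quoted values never contain "]".
  const outer = /^([^\][]*)\[([^\]]*)\]/.exec(input);
  if (!outer) return null;

  // Pass 2: an attribute is a hyphenated word, optionally followed by
  // ="...", ='...' or an unquoted value.
  const attrRe = /\w+(?:-\w+)*(?:=(?:"[^"]*"|'[^']*'|[^\s"']*))?/g;
  return { name: outer[1], attributes: outer[2].match(attrRe) || [] };
}

const parsed = extractAttributes(
  'Loader[data-prop data-attr="value" more-here=\'data\' and-one-more=""]'
);
console.log(parsed.name);       // Loader
console.log(parsed.attributes); // the four attributes, in source order
```

Because the quotes are matched by alternation rather than a backreference, no auxiliary quote group has to be filtered out afterwards.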

How about something like [\s\[]([^\s\]]+(="[^"]+)*)+
which gives
MATCH 1: data-prop
MATCH 2: data-attr="value"

Related

Match every thing between "****" or [****]

I have a regex that look like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1 Group2
test value0
test "value1", "value2" // or - value1", "value2
Is there any trick to ignore "" and [] and stick with two groups, Group 1 and Group 2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.
You may use this regex in PHP with a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This gives you 2 capture groups for both inputs, whether the value is enclosed in "..." or [...].
RegEx Demo
RegEx Details:
(?|..) is a branch reset group. Subpatterns declared within each alternative of this construct start over from the same group index
(?|"([^"]*)"|\[([^]]*)]) is an alternation inside a branch reset group (not a conditional): it matches either "([^"]*)" or \[([^]]*)], and in both cases the captured text lands in Group 2
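JavaScript regexes have no (?|...) construct, so the same two-group result has to be assembled after matching. A minimal sketch, assuming a helper named parseTestValue (hypothetical) that coalesces the two alternative groups into one value:

```javascript
// Emulating a branch reset group: two alternative captures, one logical value.
// parseTestValue is a hypothetical name used only for this illustration.
function parseTestValue(line) {
  const m = /"(test)"\s*:\s*(?:"([^"]*)"|\[([^\]]*)\])/.exec(line);
  if (!m) return null;
  // Only one of m[2] / m[3] takes part in a match; the other is undefined.
  const value = m[2] !== undefined ? m[2] : m[3];
  return [m[1], value];
}

console.log(parseTestValue('"test":"value0"'));             // key "test", value "value0"
console.log(parseTestValue('"test":["value1", "value2"]')); // key "test", value "\"value1\", \"value2\""
```

The explicit undefined check (rather than ||) keeps an empty value such as "test":"" from falling through to the other group.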
You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
See the regex demo.
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
In Java, you can't use \K and ?|, use capturing groups:
String s = "\"test\":[\"value1\", \"value2\"]";
Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("Key: " + matcher.group(1));
if (matcher.group(2) != null) {
System.out.println("Value: " + matcher.group(2));
} else {
System.out.println("Value: " + matcher.group(3));
}
}
See a Java demo.

Get regex to match multiple instances of the same pattern

So I have this regex - regex101:
\[shortcode ([^ ]*)(?:[ ]?([^ ]*)="([^"]*)")*\]
Trying to match on this string
[shortcode contact param1="test 2" param2="test1"]
Right now, the regex matches this:
[contact, param2, test1]
I would like it to match this:
[contact, param1, test 2, param2, test1]
How can I get regex to match the first instance of the parameters pattern, rather than just the last?
You may use
'~(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+)([^\s=]+)="([^"]*)"~'
See the regex demo
Details
(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+) - either the end of the previous match and 1+ whitespaces right after (\G(?!^)\s+) or (|)
\[shortcode - literal string
\s+ - 1+ whitespaces
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
([^\s=]+) - Group 2: 1+ chars other than whitespace and =
=" - a literal substring
([^"]*) - Group 3: any 0+ chars other than "
" - a " char.
PHP demo
$re = '~(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+)([^\s=]+)="([^"]*)"~';
$str = '[shortcode contact param1="test 2" param2="test1"]';
$res = [];
if (preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0)) {
foreach ($matches as $m) {
array_shift($m);
$res = array_merge($res, array_filter($m));
}
}
print_r($res);
// => Array( [0] => contact [1] => param1 [2] => test 2 [3] => param2 [4] => test1 )
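A repeated capture group only keeps its last iteration, which is why the original pattern returned param2 alone. In an engine without \G, such as JavaScript, one way around this is to validate the whole shortcode first and then iterate over the parameters. This is a sketch under that assumption, not the \G technique shown above; parseShortcode is a hypothetical helper name.

```javascript
// Two-pass shortcode parsing: match the whole tag, then walk its parameters.
// parseShortcode is a hypothetical name used only for this illustration.
function parseShortcode(input) {
  // Group 1: the shortcode name. Group 2: the raw run of key="value" pairs.
  const outer = /\[shortcode\s+(\S+)((?:\s+[^\s=]+="[^"]*")*)\s*\]/.exec(input);
  if (!outer) return null;

  const result = [outer[1]];
  // matchAll yields every pair, not just the last one.
  for (const m of outer[2].matchAll(/([^\s=]+)="([^"]*)"/g)) {
    result.push(m[1], m[2]);
  }
  return result;
}

console.log(parseShortcode('[shortcode contact param1="test 2" param2="test1"]'));
// contact, param1, test 2, param2, test1
```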
Try using the below regex.
regex101
Below is your use case,
var testString = '[shortcode contact param1="test 2" param2="test1"]';
var regex = /[\w\s]+(?=[\="]|\")/gm;
var found = testString.match(regex);
If you log found you will see the result as
["shortcode contact param1", "test 2", " param2", "test1"]
The regex matches runs of alphanumeric characters, underscores, and whitespace only if they are followed by = or ".
I hope this helps.

RegEx - groups, need [this:andthis] from a string

I hope this is a simple question, but I'm still getting my head around groups.
I have this string: this is some text [propertyFromId:34] and this is more text and I will have more like them. I need to get the content between the brackets then a group with the alpha-only text on left of the colon and group with the integer on the right of the colon.
So, full Match: propertyFromId:34, Group 1: propertyFromId, Group 2: 34
This is my starting point (?<=\[)(.*?)(?=])
Use
\[([a-zA-Z]+):(\d+)]
See the regex demo
Details:
\[ - a [ symbol
([a-zA-Z]+) - Group 1 capturing one or more alpha chars ([[:alpha:]]+ or \p{L}+ can be used, too)
: - a colon
(\d+) - Group 2 capturing one or more digits
] - a closing ] symbol.
PHP demo:
$re = '~\[([a-zA-Z]+):(\d+)]~';
$str = 'this is some text [propertyFromId:34] and this is more text';
preg_match_all($re, $str, $matches);
print_r($matches);
// => Array
// (
// [0] => Array
// (
// [0] => [propertyFromId:34]
// )
//
// [1] => Array
// (
// [0] => propertyFromId
// )
//
// [2] => Array
// (
// [0] => 34
// )
//
// )
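Note that with this pattern the full match includes the brackets, while the two groups hold the name and the number. A minimal JavaScript sketch of collecting every such tag follows; findTags is a hypothetical helper name, and the second tag in the sample text is invented for illustration.

```javascript
// Collect every [name:number] tag in a string.
// findTags is a hypothetical name used only for this illustration.
function findTags(text) {
  const re = /\[([a-zA-Z]+):(\d+)\]/g;
  return [...text.matchAll(re)].map((m) => ({
    full: m[0],        // includes the surrounding brackets
    name: m[1],        // Group 1: letters only
    id: Number(m[2]),  // Group 2: digits, converted to a number
  }));
}

// The [otherTag:7] part is an invented second example.
const tags = findTags('this is some text [propertyFromId:34] and more text [otherTag:7]');
console.log(tags);
```

If the brackets should not be part of the result, joining name and id with a colon rebuilds the bare propertyFromId:34 form.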

Challenging regular expression

There is a string in the following format:
It can start with any number of strings enclosed by double braces, possibly with white space between them (whitespace may or may not occur).
It may also contain strings enclosed by double-braces in the middle.
I am looking for a regular expression that can separate the start from the rest.
For example, given the following string:
{{a}}{{b}} {{c}} def{{g}}hij
The two parts are:
{{a}}{{b}} {{c}}
def{{g}}hij
I tried this:
/^({{.*}})(.*)$/
But, it captured also the g in the middle:
{{a}}{{b}} {{c}} def{{g}}
hij
I tried this:
/^({{.*?}})(.*)$/
But, it captured only the first a:
{{a}}
{{b}} {{c}} def{{g}}hij
This keeps matching {{, any non { or } character 1 or more times, }}, possible whitespace zero or more times and stores it in the first group. Rest of the string will be in the 2nd group. If there are no parts surrounded by {{ and }} the first group will be empty. This was in JavaScript.
var str = "{{a}}{{b}} {{c}} def{{g}}hij";
str.match(/^\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)/)
// [whole match, group 1, group 2]
// ["{{a}}{{b}} {{c}} def{{g}}hij", "{{a}}{{b}} {{c}} ", "def{{g}}hij"]
How about using preg_split:
$str = '{{a}}{{b}} {{c}} def{{g}}hij';
$list = preg_split('/(\s[^{].+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($list);
output:
Array
(
[0] => {{a}}{{b}} {{c}}
[1] => def{{g}}hij
)
I think I got it:
var string = "{{a}}{{b}} {{c}} def{{g}}hij";
console.log(string.match(/((\{\{\w+\}\})\s*)+/g));
// Output: [ '{{a}}{{b}} {{c}} ', '{{g}}' ]
Explanation:
( starts a group.
( another;
\{\{\w+\}\} looks for {{A-Za-z_0-9}}
) closes second group.
\s* Counts whitespace if it's there.
)+ closes the first group and looks for one or more occurrences of it.
When it gets any not-{{something}} type data, it stops.
P.S. Complex regexes cost CPU time.
You can use this:
(java)
String[] result = yourstr.split("\\s+(?!\\{)");
(php)
$result = preg_split('/\s+(?!{)/', '{{a}}{{b}} {{c}} def{{g}}hij');
print_r($result);
I don't know exactly why you want to split, but if the string always contains a def inside and you want to separate the string there into two halves, then you can try something like:
string text = "{{a}}{{b}} {{c}} def{{g}}hij";
Regex r = new Regex("def");
string[] split = new string[2];
int index = r.Match(text).Index;
split[0] = string.Join("", text.Take(index).Select(x => x.ToString()).ToArray<string>());
split[1] = string.Join("", text.Skip(index).Take(text.Length - index).Select(x => x.ToString()).ToArray<string>());
// Output: [ '{{a}}{{b}} {{c}} ', 'def{{g}}hij' ]

Regex - Ignore some parts of string in match

Here's my string:
address='St Marks Church',notes='The North East\'s premier...'
The regex I'm using to grab the various parts using match_all is
'/(address|notes)='(.+?)'/i'
The results are:
address => St Marks Church notes => The North East\
How can I get it to ignore the \' character for the notes?
Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:
$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => address="St Marks Church"
[1] => notes="The North East's premier..."
)
[1] => Array
(
[0] => address
[1] => notes
)
[2] => Array
(
[0] => St Marks Church
[1] => The North East's premier...
)
)
Another method with preg_split:
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
//split the values at the = sign
$parts[$key]=preg_split('!=!',$value);
foreach($parts[$key] as $k2=>$v2){
//trim the quotes out and remove the slashes
$parts[$key][$k2]=stripslashes(trim($v2,"'"));
}
}
Output looks like:
Array
(
[0] => Array
(
[0] => address
[1] => St Marks Church
)
[1] => Array
(
[0] => notes
[1] => The North East's premier...
)
)
Super slow old-skool method:
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
switch($string[$pos]){
case '=':
$mode = 'value';
break;
case ',':
$store[$key]=trim($value,"'");
$key=$value='';
$mode = 'key';
break;
default:
$$mode .= $string[$pos];
}
$pos++;
}
$store[$key]=trim($value,"'");
Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.
The following patterns will match the substrings required to build your desired associative array:
Patterns that generate a full-string match and 1 capture group:
/(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
/(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)
Patterns that generate 2 capture groups:
/(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
/(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)
Code: (Demo)
$string = "address='St Marks Church',notes='The North East\'s premier...'";
preg_match_all(
"/(address|notes)='\K(?:\\\'|[^'])*/",
$string,
$out
);
var_export(array_combine($out[1], $out[0]));
echo "\n---\n";
preg_match_all(
"/(address|notes)='((?:\\\'|[^'])*)/",
$string,
$out,
PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));
Output:
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
---
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
Patterns #1 and #3 use alternation to allow either escaped apostrophes (an apostrophe preceded by a backslash) or non-apostrophe characters.
Patterns #2 and #4 (will require an additional backslash when implemented with php demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
Some notes:
Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.
Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
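JavaScript has no \K, so a port needs the two-capture-group form. The sketch below uses a slightly different variation of patterns #1 and #3: (?:\\.|[^'\\])* consumes either any backslash-escaped character or any character that is neither an apostrophe nor a backslash. It also removes the escaping backslashes afterwards, which the PHP demo above does not do. parsePairs is a hypothetical helper name.

```javascript
// Escape-aware quoted-value matching, then unescaping.
// parsePairs is a hypothetical name used only for this illustration.
function parsePairs(input) {
  // Group 1: the key. Group 2: the raw value, which may contain \' sequences.
  const re = /(address|notes)='((?:\\.|[^'\\])*)'/g;
  const out = {};
  for (const [, key, raw] of input.matchAll(re)) {
    // Drop the backslash from each escaped character: \' becomes '.
    out[key] = raw.replace(/\\(.)/g, '$1');
  }
  return out;
}

// "\\'" in JavaScript source is a literal backslash followed by an apostrophe.
const sample = "address='St Marks Church',notes='The North East\\'s premier...'";
console.log(parsePairs(sample));
// address: St Marks Church
// notes:   The North East's premier...
```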
You should match up to an end quote that isn't preceded by a backslash, thus:
(address|notes)='(.*?[^\\])'
The [^\\] forces the character immediately preceding the closing ' to be anything but a backslash. It sits inside the capture group so that this last character of the value is still captured.