Regex expression symbol without space - regex

Here is my regex: {{[^\{\s}]+\}}
And my input is {{test1}}{{test2}}{{test3}}.
How can I get these three tests as an array using a regular expression?

I would use: ~\{\{([^}]+?)\}\}~
How you access the resulting array depends on your language.
[EDIT] Explanation:
~: the pattern delimiter
\{\{ and \}\}: match the braces literally, so they should be escaped
[^}]: matches any character inside {{}} other than }
+: repeats the pattern one or more times (for multiple characters)
?: makes the quantifier lazy, so it matches as few times as possible
(): captures the enclosed text
[EDIT] A PHP code sample to illustrate the matching:
<?php
$string= "{{test1}}{{test2}}{{test3}}";
if (preg_match_all("~\{\{([^}]+?)\}\}~s", $string, $matches))
{
print_r(array($matches));
// Do what you want
}
?>
will output this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => {{test1}}
[1] => {{test2}}
[2] => {{test3}}
)
[1] => Array
(
[0] => test1
[1] => test2
[2] => test3
)
)
)
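For comparison, here is a minimal JavaScript sketch of the same pattern. It assumes a runtime with String.prototype.matchAll (Node 12+ or a modern browser); the variable names are illustrative only and not part of the answer above.

```javascript
// Sketch: collect every {{...}} token and the text inside it.
const input = "{{test1}}{{test2}}{{test3}}";

// Same pattern as above, without the PHP delimiters; the g flag is required by matchAll.
const pattern = /\{\{([^}]+?)\}\}/g;

const matches = [...input.matchAll(pattern)];
const full = matches.map(m => m[0]);   // whole tokens, braces included
const inner = matches.map(m => m[1]);  // capture group 1, braces stripped

console.log(full);
console.log(inner);
```

Each element of matches plays the role of one column of PHP's $matches: index 0 is the whole match and index 1 is the first capture group.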

test[0-9]+
This matches all occurrences of testX where X is an integer of any size.
If you're trying to identify the braces instead, use this:
[{\}]

C# uses the Regex.Matches method, which returns a MatchCollection object.
Here is some code:
Regex r = new Regex(@"{{[^{\s}]+}}");
MatchCollection col = r.Matches("{{test1}}{{test2}}{{test3}}");
string[] arr = null;
if (col != null)
{
arr = new string[col.Count];
for (int i = 0; i < col.Count; i++)
{
arr[i] = col[i].Value;
}
}

Related

Regex pattern to match groups starting with pattern

I am extracting data from a text stream that is structured as follows:
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field starts a record and can be followed by any number of fields, each with a recType from 2 to 9 (each field also starts with a "/").
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
The data is simplified here. I know for certain that it only contains [A-Z0-9. ], but it can be of variable length (not just 4 characters as in the example).
Now, the following expression sort of works, but it's only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.
You can use a regex to match these blocks using 2 .NET regex features: 1) capture collection and 2) multiple capturing groups with the same name in the pattern. Then, we'll need some Linq magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - a digit from 2 to 9 (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+) - 1+ chars other than / (Group "fldData")
See the regex demo and the C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
var res = Regex.Matches(str, @"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
.Cast<Match>()
.Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
.Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
(first, second) => first + "=" + second))
.ToList();
foreach (var t in res)
Console.WriteLine(string.Join(" ", t));
}
}
I would suggest first splitting the string by /1, then using a pattern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
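A rough JavaScript sketch of this split-then-match idea follows. Two details are my assumptions rather than part of the suggestion above: the split uses a lookahead so that the /1 marker stays inside each chunk, and the character class is widened to [A-Z0-9. ] to follow the question's description of the data.

```javascript
// Sketch: cut the stream into records, then pull the fields out of each record.
const stream =
  "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";

// Split in front of every "/1-" or "/1=" without consuming it (zero-width lookahead).
const records = stream.split(/(?=\/1[-=])/);

// One field: slash, record type digit, "-" or "=", then the data.
const fieldPattern = /\/([1-9])[=-]([A-Z0-9. ]+)/g;

const groups = records.map(record =>
  [...record.matchAll(fieldPattern)].map(m => `${m[1]}=${m[2]}`)
);

console.log(groups.map(g => g.join(" ")).join("\n"));
```

Because the data never contains a slash, a "/1" followed by a separator can only be the start of a new record, which is what makes the split safe under these assumptions.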
A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is that your stream has a variable number of entries, and a variable number of capture groups is not supported. I also noticed that some of the values are separated by "=" rather than a dash, which your current regex doesn't address.
The problem comes when you add a quantifier to a capture group: in most regex engines (PCRE included) the group only remembers the last thing it captured, so it ends up catching the first and last fields and leaving out all the rest. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
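The "last capture wins" behaviour can be observed with a small JavaScript sketch. JavaScript does not accept the (?'name'...) syntax, so the groups are rewritten as (?<name>...); that translation and the sample record are my own additions for illustration.

```javascript
// Sketch: a quantified capture group keeps only its final iteration.
const record = "/1-XXXX/2-YYYY/9-ZZZZ";

const m = record.match(
  /\/1-(?<fld1>[A-Z]+)(?:\/(?<fldNo>[2-9])[-=](?<fldData>[A-Z]+))+/
);

// fld1 holds the first field; fldNo and fldData hold only the last repeated field.
console.log(m.groups.fld1, m.groups.fldNo, m.groups.fldData);
```

The middle field (2-YYYY) is matched but its captured text is overwritten by the next iteration, which is why a loop or a second pass is needed in engines without capture collections.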
If your streams were all the same length, a single regex could be used. However, there's a way to do it using a foreach loop with a much simpler regex working on each part of your stream (so it also validates the stream as it goes along!).
I'm not sure what language you're working with, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
/*
* Get an array of [num]-[letters] with explode. This will make an array that
* contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
*/
$arr = explode("/", substr($str, 1));
$sorted = array();
$key = 0;
/*
* Sort this data into key->values based on numeric ordering.
* If the next one has a lower or equal starting number than the one before it,
* a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
* entry to be made, just in case the stream doesn't always start with 1.
*/
foreach ($arr as $value)
{
// This will get the number at the start, and has the added bonus of making sure
// each bit is in the right format.
if (preg_match("/^([0-9]+)[=-]([A-Z0-9. ]+)$/", $value, $matches)) {
$newKey = (int)$matches[1];
$match = $matches[2];
} else
throw new Exception("This is not a valid data stream!");
// This bit checks if we've got a lower starting number than last time.
if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
$key += 1;
// Now sort them..
$sorted[$key][$newKey] = $match;
// This will be compared in the next iteration of the loop.
$lastKey = $newKey;
}
return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)

Extract groups separated by space

I've got following string (example):
Loader[data-prop data-attr="value"]
There can be 1 - n attributes. I want to extract every attribute. (data-prop,data-attr="value"). I tried it in many different ways, for example with \[(?:(\S+)\s)*\] but I didn't get it right. The expression should be written in PREG style..
I suggest grabbing all the key-value pairs with a regex:
'~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~'
(see regex demo) and then merging and filtering the captured groups (see IDEONE demo):
$re = '~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~';
$str = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";
preg_match_all($re, $str, $matches);
$arr = array();
for ($i = 0; $i < count($matches); $i++) {
if ($i != 0) {
$arr = array_merge(array_filter($matches[$i]),$arr);
}
}
print_r(preg_grep('~\A(?![\'"]\z)~', $arr));
Output:
Array
(
[3] => data-prop
[4] => data-attr="value"
[5] => more-here='data'
[6] => and-one-more=""
[7] => Loader
)
Notes on the regex (it only looks too complex):
(?:([^][]*)\b\[|(?!^)\G) - a boundary: we only start matching at a [ that is preceded with a word (a-zA-Z0-9_) character (with \b\[), or right after a successful match (with (?!^)\G). Also, ([^][]*) will capture into Group 1 the part before the [.
\s* - matches zero or more whitespace symbols
(\w+(?:-\w+)*) - captures into Group 2 "words" like "word1" or "word1-word2"..."word1-wordn"
(?:=(["\'])?[^\]]*?\3)? - optional group (due to (?:...)?) matching
= - an equal sign
(["\'])? - Group 3 (auxiliary group to check for the value delimiter) capturing either ", ' or nothing
[^\]]*? - (value) zero or more characters other than ] as few as possible
\3 - the closing ' or " (the same value captured in Group 3).
Since we cannot get rid of capturing ' or ", we can filter out the elements that we are not interested in with preg_grep('~\A(?![\'"]\z)~', $arr), where \A(?![\'"]\z) matches any string that is not equal to ' or ".
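The pattern above relies on \G, which JavaScript's regex engine does not have. As an illustration only, the same result can be approximated in two steps: first isolate the text between the brackets, then match the attributes inside it. Both patterns below are my own simplified substitutes, not the PCRE pattern from this answer, and they assume a single bracketed list with no nested brackets.

```javascript
// Sketch: two-step extraction instead of a single \G-anchored pattern.
const text = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";

// Step 1: the name before "[" and the raw attribute list inside the brackets.
const outer = text.match(/^([^\][]*)\[([^\]]*)\]/);

// Step 2: a hyphenated word, optionally followed by = and a quoted or bare value.
const attrPattern = /\w+(?:-\w+)*(?:=(?:"[^"]*"|'[^']*'|[^\s\]]+))?/g;

const name = outer[1];
const attributes = outer[2].match(attrPattern) ?? [];

console.log(name);
console.log(attributes);
```

Splitting the work this way also avoids the auxiliary quote-capturing group, so no filtering pass is needed afterwards.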
How about something like [\s\[]([^\s\]]+(="[^"]+)*)+
which gives
MATCH 1: data-prop
MATCH 2: data-attr="value"

Challenging regular expression

There is a string in the following format:
It can start with any number of strings enclosed by double braces, possibly with white space between them (whitespace may or may not occur).
It may also contain strings enclosed by double-braces in the middle.
I am looking for a regular expression that can separate the start from the rest.
For example, given the following string:
{{a}}{{b}} {{c}} def{{g}}hij
The two parts are:
{{a}}{{b}} {{c}}
def{{g}}hij
I tried this:
/^({{.*}})(.*)$/
But, it captured also the g in the middle:
{{a}}{{b}} {{c}} def{{g}}
hij
I tried this:
/^({{.*?}})(.*)$/
But, it captured only the first a:
{{a}}
{{b}} {{c}} def{{g}}hij
This keeps matching {{, any non { or } character 1 or more times, }}, possible whitespace zero or more times and stores it in the first group. Rest of the string will be in the 2nd group. If there are no parts surrounded by {{ and }} the first group will be empty. This was in JavaScript.
var str = "{{a}}{{b}} {{c}} def{{g}}hij";
str.match(/^\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)/)
// [whole match, group 1, group 2]
// ["{{a}}{{b}} {{c}} def{{g}}hij", "{{a}}{{b}} {{c}} ", "def{{g}}hij"]
How about using preg_split:
$str = '{{a}}{{b}} {{c}} def{{g}}hij';
$list = preg_split('/(\s[^{].+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($list);
output:
Array
(
[0] => {{a}}{{b}} {{c}}
[1] => def{{g}}hij
)
I think I got it:
var string = "{{a}}{{b}} {{c}} def{{g}}hij";
console.log(string.match(/((\{\{\w+\}\})\s*)+/g));
// Output: [ '{{a}}{{b}} {{c}} ', '{{g}}' ]
Explanation:
( starts a group.
( another;
\{\{\w+\}\} looks for {{A-Za-z_0-9}}
) closes second group.
\s* Counts whitespace if it's there.
)+ closes the first group and looks for one or more occurrences of it.
When it gets any not-{{something}} type data, it stops.
P.S. Complex regular expressions cost CPU time.
You can use this:
(java)
String[] result = yourstr.split("\\s+(?!\\{)");
(php)
$result = preg_split('/\s+(?!{)/', '{{a}}{{b}} {{c}} def{{g}}hij');
print_r($result);
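A small JavaScript sketch of the same split, added here to show one caveat: the pattern splits at every run of whitespace that is not followed by {, so it only yields exactly two parts when the remainder contains no whitespace of its own. The second input string is a made-up example.

```javascript
// Sketch: split on whitespace that is not followed by "{".
const splitPattern = /\s+(?!\{)/;

// The question's input: the only qualifying whitespace is the one before "def".
const parts = "{{a}}{{b}} {{c}} def{{g}}hij".split(splitPattern);
console.log(parts);

// Caveat: whitespace inside the remainder is a split point too.
const overSplit = "{{a}} def ghi".split(splitPattern);
console.log(overSplit);
```

If the remainder may contain spaces, matching the leading {{...}} run explicitly (as in the earlier JavaScript answer) is the safer route.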
I don't know exactly why you want to split, but if the string always contains a def and you want to separate the string there into two halves, you can try something like:
string text = "{{a}}{{b}} {{c}} def{{g}}hij";
Regex r = new Regex("def");
string[] split = new string[2];
int index = r.Match(text).Index;
split[0] = text.Substring(0, index); // everything before "def"
split[1] = text.Substring(index);    // "def" and everything after it
// Output: [ '{{a}}{{b}} {{c}} ', 'def{{g}}hij' ]

Regex - Ignore some parts of string in match

Here's my string:
address='St Marks Church',notes='The North East\'s premier...'
The regex I'm using to grab the various parts using match_all is
'/(address|notes)='(.+?)'/i'
The results are:
address => St Marks Church notes => The North East\
How can I get it to ignore the \' character for the notes?
Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:
$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => address="St Marks Church"
[1] => notes="The North East's premier..."
)
[1] => Array
(
[0] => address
[1] => notes
)
[2] => Array
(
[0] => St Marks Church
[1] => The North East's premier...
)
)
Another method with preg_split:
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
//split the values at the = sign
$parts[$key]=preg_split('!=!',$value);
foreach($parts[$key] as $k2=>$v2){
//trim the quotes out and remove the slashes
$parts[$key][$k2]=stripslashes(trim($v2,"'"));
}
}
Output looks like:
Array
(
[0] => Array
(
[0] => address
[1] => St Marks Church
)
[1] => Array
(
[0] => notes
[1] => The North East's premier...
)
)
Super slow old-skool method:
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
switch($string[$pos]){
case '=':
$mode = 'value';
break;
case ',':
$store[$key]=trim($value,"'");
$key=$value='';
$mode = 'key';
break;
default:
$$mode .= $string[$pos];
}
$pos++;
}
$store[$key]=trim($value,"'");
Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.
The following patterns will match the substrings required to build your desired associative array:
Patterns that generate a fullstring match and 1 capture group:
/(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
/(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)
Patterns that generate 2 capture groups:
/(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
/(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)
Code: (Demo)
$string = "address='St Marks Church',notes='The North East\'s premier...'";
preg_match_all(
"/(address|notes)='\K(?:\\\'|[^'])*/",
$string,
$out
);
var_export(array_combine($out[1], $out[0]));
echo "\n---\n";
preg_match_all(
"/(address|notes)='((?:\\\'|[^'])*)/",
$string,
$out,
PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));
Output:
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
---
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
Patterns #1 and #3 use alternatives to allow either apostrophes that are preceded by a backslash or any non-apostrophe characters.
Patterns #2 and #4 (will require an additional backslash when implemented with php demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
Some notes:
Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.
Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
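JavaScript has no \K, so the two-capture-group form (pattern #3) is the one that carries over most directly. The sketch below is an illustration under that assumption; the closing ' in the pattern and the final unescaping step are additions of mine and not part of the patterns listed above.

```javascript
// Sketch: match key='value' pairs where the value may contain \' escapes.
const input = "address='St Marks Church',notes='The North East\\'s premier...'";

// The value is any run of escaped apostrophes (\') or non-apostrophe characters.
const pairPattern = /(address|notes)='((?:\\'|[^'])*)'/gi;

const result = {};
for (const [, key, rawValue] of input.matchAll(pairPattern)) {
  // Turn \' back into a plain apostrophe.
  result[key] = rawValue.replace(/\\'/g, "'");
}

console.log(result);
```

Trying the \' alternative before [^'] is what lets the match run past the escaped apostrophe instead of stopping at it.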
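You should match up to an end quote that isn't preceded by a backslash, thus:
(address|notes)='(.*?[^\\])'
The [^\\] forces the character immediately preceding the closing ' to be anything but a backslash. It sits inside the capture group so that the last character of the value is not dropped from the capture.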

How to skip past an "=" then capture all comma-delimited words

I am doing this now with Instr/Split but have found regex in general to be much faster (this is an inner loop with 100K+ tests per run).
The general form is:
word0 = word1, word2, word3...
There are one or more words to the right of =. A word is defined as [\w.-]+. I need to allow for whitespace at any point as well. The = is required.
I want to return just word1, word2 and word3 in the Matches collection.
The = is what has me stumped. I either get one match or none depending on the pattern.
Here is some test code. Change RE.Pattern on line 17 to test.
Option Explicit
Test1 "word1, word2",""
Test1 " word0 = word1, word.2 , word3.qrs_t-1", "word1 word.2 word3.qrs_t-1"
Test1 "word0=word1", "word1"
WScript.Quit
Sub Test1(TestString, CorrectOutput)
Dim RE, Matches, Answer
Dim i, j
Set RE = New RegExp
RE.Global = True
RE.Pattern = "=([\w.-]+)"
Set Matches = RE.Execute(TestString)
Answer = "Input: " & vbTab & TestString & vbLf
Answer = Answer & "Correct:" & vbTab & CorrectOutput & vbLf
Answer = Answer & "Actual: " & vbTab
For i = 0 To Matches.Count -1
If i > 0 Then
Answer = Answer & " "
End If
Answer = Answer & Matches(i).value
Next
MsgBox Answer
End Sub
Use the following regular expression to extract the substring with the wordlist from the input string:
str = "..."
Set re = New RegExp
re.Pattern = "^.*?=((?:[^,]+)(?:,[^,]+)*)"
re.Global = True
Set m = re.Execute(str)
Then use a second expression to remove the separating commas and mangle the whitespace:
Set re2 = New RegExp
re2.Pattern = "\s*,\s*"
re2.Global = True
wordlist = ""
If m.Count > 0 Then
wordlist = Trim(re2.Replace(m(0).SubMatches(0), " "))
End If
WScript.Echo wordlist
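The same two-step idea (isolate what follows the =, then break the list apart) can be sketched in JavaScript. The function name wordsAfterEquals is hypothetical, and unlike the VBScript above this version returns an array and drops any token that does not match the question's word definition [\w.-]+.

```javascript
// Sketch: return the comma-delimited words to the right of the first "=".
function wordsAfterEquals(line) {
  // Step 1: require an "=" and keep everything after the first one.
  const eq = line.match(/^[^=]*=(.*)$/);
  if (!eq) return [];

  // Step 2: split on commas, trim whitespace, keep only well-formed words.
  return eq[1]
    .split(",")
    .map(word => word.trim())
    .filter(word => /^[\w.-]+$/.test(word));
}

console.log(wordsAfterEquals("word1, word2"));
console.log(wordsAfterEquals(" word0 = word1, word.2 , word3.qrs_t-1"));
console.log(wordsAfterEquals("word0=word1"));
```

The three calls mirror the three Test1 cases from the question.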
Description
Give this a try, it'll:
require a value before the equals sign
require an equals sign
require at least 1 value after the equals sign
return each of the 1 to 3 comma delimited chunks of text
trim spaces off all returned values
(?:^\s*?(\b[^=]*?\b)(?:\s{0,}[=]\s{0,}))(?:(['"]?)(\b[^,]*\b)\2\s*?)(?:$|(?:[,]\s*?(['"]?)(\b[^,]*\b)\4\s*?)(?:$|[,]\s*?(['"]?)(\b[^,]*\b)\6\s*?$))
Groups
group 0 matches the full string if it's valid
groups 1-7
value before the equals sign
Quote delimiter if there was one for value 1
first value in the list of values
Quote delimiter if there was one for value 2
second value in the list of values
Quote delimiter if there was one for value 3
third value in the list of values
VB.NET code example to demonstrate that the regex works
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
Dim sourcestring as String = "replace with your source string"
Dim re As Regex = New Regex("(?:^\s*?(\b[^=]*?\b)(?:\s{0,}[=]\s{0,}))(?:(['""]?)(\b[^,]*\b)\2\s*?)(?:$|(?:[,]\s*?(['""]?)(\b[^,]*\b)\4\s*?)(?:$|[,]\s*?(['""]?)(\b[^,]*\b)\6\s*?$))",RegexOptions.IgnoreCase OR RegexOptions.Multiline OR RegexOptions.Singleline)
Dim mc as MatchCollection = re.Matches(sourcestring)
Dim mIdx as Integer = 0
For each m as Match in mc
For groupIdx As Integer = 0 To m.Groups.Count - 1
Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames(groupIdx), m.Groups(groupIdx).Value)
Next
mIdx=mIdx+1
Next
End Sub
End Module
$matches Array:
(
[0] => Array
(
[0] => word0 = word1, word.2 , word3.qrs_t-1
)
[1] => Array
(
[0] => word0
)
[2] => Array
(
[0] =>
)
[3] => Array
(
[0] => word1
)
[4] => Array
(
[0] =>
)
[5] => Array
(
[0] => word.2
)
[6] => Array
(
[0] =>
)
[7] => Array
(
[0] => word3.qrs_t-1
)
)