I am extract data from a text stream which is data structured as such
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field precedes the record data which can then have any number of following fields, each with choice of recType from 2 to 9 (also, each field starts with a "/")
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
Data is for simplicity, I know for certain that its only contains [A-Z0-9. ] but can be variable length (not just 4 chars as per example)
Now, the following expression sort of works, but its only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.
You can use a regex to match these blocks using 2 .NET regex features: 1) capture collection and 2) multiple capturing groups with the same name in the pattern. Then, we'll need some Linq magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - 2 to 9 digit (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+)- 1+ chars other than / (Group "fldData")
See the regex demo, results:
See C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
var res = Regex.Matches(str, #"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
.Cast<Match>()
.Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
.Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
(first, second) => first + "=" + second))
.ToList();
foreach (var t in res)
Console.WriteLine(string.Join(" ", t));
}
}
I would suggest to first split the string by /1, then use a patern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is because your stream has a variable number of entries in it, and using a variable number of capture groups is not supported. I also noticed some of the values had "=" between them as well as the dash, which your current regex doesn't address.
The problem comes when you try and add a quantifier to a capture group - the group will only remember the last thing it captured, so if you add a quantifier, it will end up catching the first and last fields, leaving out all the rest of them. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
If your streams were all the same length, then a single regex could be used, but there's a way to do it using a foreach loop with a much simpler regex working on each part of your stream (so it verifies your stream as well when it goes along!)
Now I'm not sure what language you're working with when using this, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
/*
* Get an array of [num]-[letters] with explode. This will make an array that
* contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
*/
$arr = explode("/", substr($str, 1));
$sorted = array();
$key = 0;
/*
* Sort this data into key->values based on numeric ordering.
* If the next one has a lower or equal starting number than the one before it,
* a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
* entry to be made, just in case the stream doesn't always start with 1.
*/
foreach ($arr as $value)
{
// This will get the number at the start, and has the added bonus of making sure
// each bit is in the right format.
if (preg_match("/^([0-9]+)[=-]([A-Z]+)$/", $value, $matches)) {
$newKey = (int)$matches[1];
$match = $matches[2];
} else
throw new Exception("This is not a valid data stream!");
// This bit checks if we've got a lower starting number than last time.
if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
$key += 1;
// Now sort them..
$sorted[$key][$newKey] = $match;
// This will be compared in the next iteration of the loop.
$lastKey = $newKey;
}
return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)
Related
I hope this is a simple question, but I'm still getting my head around groups.
I have this string: this is some text [propertyFromId:34] and this is more text and I will have more like them. I need to get the content between the brackets then a group with the alpha-only text on left of the colon and group with the integer on the right of the colon.
So, full Match: propertyFromId:34, Group 1: propertyFromId, Group 2: 34
This is my starting point (?<=\[)(.*?)(?=])
Use
\[([a-zA-Z]+):(\d+)]
See the regex demo
Details:
\[ - a [ symbol
([a-zA-Z]+) - Group 1 capturing one or more alpha chars ([[:alpha:]]+ or \p{L}+ can be used, too)
: - a colon
(\d+) - Group 2 capturing one or more digits
] - a closing ] symbol.
PHP demo:
$re = '~\[([a-zA-Z]+):(\d+)]~';
$str = 'this is some text [propertyFromId:34] and this is more text';
preg_match_all($re, $str, $matches);
print_r($matches);
// => Array
// (
// [0] => Array
// (
// [0] => [propertyFromId:34]
// )
//
// [1] => Array
// (
// [0] => propertyFromId
// )
//
// [2] => Array
// (
// [0] => 34
// )
//
// )
Here is my regex: {{[^\{\s}]+\}}
And my input is {{test1}}{{test2}}{{test3}}.
How can I get these 3 tests by array using regex expression?
I would use: ~\{\{([^}]+?)\}\}~
and accessing array depends on your language!
[EDIT] add explanations
~: delimiter
\{\{, \}\}~: match characters literally. Should be
escaped.
[^}]: match anything inside {{}} until a }
+: repeat
pattern multiple times (for multiple characters)
?: is for 'lazy'
to match as few times as possible.
(): is to capture
:)
[EDIT] add PHP code sample for matching illustration:
<?php
$string= "{{test1}}{{test2}}{{test3}}";
if (preg_match_all("~\{\{([^}]+?)\}\}~s", $string, $matches))
{
print_r(array($matches));
// Do what you want
}
?>
will output this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => {{test1}}
[1] => {{test2}}
[2] => {{test3}}
)
[1] => Array
(
[0] => test1
[1] => test2
[2] => test3
)
)
)
test[0-9]+
This matches all occurences of testX where X is an integer of any size.
If you're trying to identify the braces instead, use this:
[{\}]
C# uses Matches Method returns MatchCollection object.
Here is some codes,
Regex r = new Regex(#"{{[^{\s}]+}}");
MatchCollection col = r.Matches("{{test1}}{{test2}}{{test3}}");
string[] arr = null;
if (col != null)
{
arr = new string[col.Count];
for (int i = 0; i < col.Count; i++)
{
arr[i] = col[i].Value;
}
}
I've got following string (example):
Loader[data-prop data-attr="value"]
There can be 1 - n attributes. I want to extract every attribute. (data-prop,data-attr="value"). I tried it in many different ways, for example with \[(?:(\S+)\s)*\] but I didn't get it right. The expression should be written in PREG style..
I suggest grabbing all the key-value pairs with a regex:
'~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~'
(see regex demo) and then
See IDEONE demo
$re = '~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~';
$str = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";
preg_match_all($re, $str, $matches);
$arr = array();
for ($i = 0; $i < count($matches); $i++) {
if ($i != 0) {
$arr = array_merge(array_filter($matches[$i]),$arr);
}
}
print_r(preg_grep('~\A(?![\'"]\z)~', $arr));
Output:
Array
(
[3] => data-prop
[4] => data-attr="value"
[5] => more-here='data'
[6] => and-one-more=""
[7] => Loader
)
Notes on the regex (it only looks too complex):
(?:([^][]*)\b\[|(?!^)\G) - a boundary: we only start matching at a [ that is preceded with a word (a-zA-Z0-9_) character (with \b\[), or right after a successful match (with (?!^)\G). Also, ([^][]*) will capture into Group 1 the part before the [.
\s* - matches zero or more whitespace symbols
(\w+(?:-\w+)*) - captures into Group 2 "words" like "word1" or "word1-word2"..."word1-wordn"
(?:=(["\'])?[^\]]*?\3)? - optional group (due to (?:...)?) matching
= - an equal sign
(["\'])? - Group 3 (auxiliary group to check for the value delimiter) capturing either ", ' or nothing
[^\]]*? - (value) zero or more characters other than ] as few as possible
\3 - the closing ' or " (the same value captured in Group 3).
Since we cannot get rid of capturing ' or ", we can preg_grep all the elements that we are not interested in with preg_grep('~\A(?![\'"]\z)~', $arr) where \A(?![\'"]\z) matches any string that is not equal to ' or ".
how about something like [\s\[]([^\s\]]+(="[^"]+)*)+
gives
MATCH 1: data-prop
MATCH 2: data-attr="value"
There is a string in the following format:
It can start with any number of strings enclosed by double braces, possibly with white space between them (whitespace may or may not occur).
It may also contain strings enclosed by double-braces in the middle.
I am looking for a regular expression that can separate the start from the rest.
For example, given the following string:
{{a}}{{b}} {{c}} def{{g}}hij
The two parts are:
{{a}}{{b}} {{c}}
def{{g}}hij
I tried this:
/^({{.*}})(.*)$/
But, it captured also the g in the middle:
{{a}}{{b}} {{c}} def{{g}}
hij
I tried this:
/^({{.*?}})(.*)$/
But, it captured only the first a:
{{a}}
{{b}} {{c}} def{{g}}hij
This keeps matching {{, any non { or } character 1 or more times, }}, possible whitespace zero or more times and stores it in the first group. Rest of the string will be in the 2nd group. If there are no parts surrounded by {{ and }} the first group will be empty. This was in JavaScript.
var str = "{{a}}{{b}} {{c}} def{{g}}hij";
str.match(/^\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)/)
// [whole match, group 1, group 2]
// ["{{a}}{{b}} {{c}} def{{g}}hij", "{{a}}{{b}} {{c}} ", "def{{g}}hij"]
How about using preg_split:
$str = '{{a}}{{b}} {{c}} def{{g}}hij';
$list = preg_split('/(\s[^{].+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($list);
output:
Array
(
[0] => {{a}}{{b}} {{c}}
[1] => def{{g}}hij
)
I think I got it:
var string = "{{a}}{{b}} {{c}} def{{g}}hij";
console.log(string.match(/((\{\{\w+\}\})\s*)+/g));
// Output: [ '{{a}}{{b}} {{c}} ', '{{g}}' ]
Explanation:
( starts a group.
( another;
\{\{\w+\}\} looks for {{A-Za-z_0-9}}
) closes second group.
\s* Counts whitespace if it's there.
)+ closes the first group and looks for oits one or more occurrences.
When it gets any not-{{something}} type data, it stops.
P.S. -> Complex RegEx takes CPU speed.
You can use this:
(java)
string[] result = yourstr.split("\\s+(?!{)");
(php)
$result = preg_split('/\s+(?!{)/', '{{a}}{{b}} {{c}} def{{g}}hij');
print_r($result);
I donĀ“t know exactly why are you want to split, but in case that the string contains always a def inside, and you want to separate the string from there in two halves, then, you can try something like:
string text = "{{a}}{{b}} {{c}} def{{g}}hij";
Regex r = new Regex("def");
string[] split = new string[2];
int index = r.Match(text).Index;
split[0] = string.Join("", text.Take(index).Select(x => x.ToString()).ToArray<string>());
split[1] = string.Join("", text.Skip(index).Take(text.Length - index).Select(x => x.ToString()).ToArray<string>());
// Output: [ '{{a}}{{b}} {{c}} ', 'def{{g}}hij' ]
Here's my string:
address='St Marks Church',notes='The North East\'s premier...'
The regex I'm using to grab the various parts using match_all is
'/(address|notes)='(.+?)'/i'
The results are:
address => St Marks Church notes => The North East\
How can I get it to ignore the \' character for the notes?
Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:
$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => address="St Marks Church"
[1] => notes="The North East's premier..."
)
[1] => Array
(
[0] => address
[1] => notes
)
[2] => Array
(
[0] => St Marks Church
[1] => The North East's premier...
)
)
Another method with preg_split:
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
//split the values at the = sign
$parts[$key]=preg_split('!=!',$value);
foreach($parts[$key] as $k2=>$v2){
//trim the quotes out and remove the slashes
$parts[$key][$k2]=stripslashes(trim($v2,"'"));
}
}
Output looks like:
Array
(
[0] => Array
(
[0] => address
[1] => St Marks Church
)
[1] => Array
(
[0] => notes
[1] => The North East's premier...
)
)
Super slow old-skool method:
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
switch($string[$pos]){
case $string[$pos]==='=':
$mode = 'value';
break;
case $string[$pos]===",":
$store[$key]=trim($value,"'");
$key=$value='';
$mode = 'key';
break;
default:
$$mode .= $string[$pos];
}
$pos++;
}
$store[$key]=trim($value,"'");
Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.
The following patterns will match the substrings required to buildyour desired associative array:
Patterns that generate a fullstring match and 1 capture group:
/(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
/(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)
Patterns that generate 2 capture groups:
/(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
/(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)
Code: (Demo)
$string = "address='St Marks Church',notes='The North East\'s premier...'";
preg_match_all(
"/(address|notes)='\K(?:\\\'|[^'])*/",
$string,
$out
);
var_export(array_combine($out[1], $out[0]));
echo "\n---\n";
preg_match_all(
"/(address|notes)='((?:\\\'|[^'])*)/",
$string,
$out,
PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));
Output:
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
---
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
Patterns #1 and #3 use alternatives to allow non-apostrophe characters or apostrophes not preceded by a backslash.
Patterns #2 and #4 (will require an additional backslash when implemented with php demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
Some notes:
Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.
Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
You should match up to an end quote that isn't preceded by a backslash thus:
(address|notes)='(.*?)[^\\]'
This [^\\] forces the character immediately preceding the ' character to be anything but a backslash.