Isolating URL paths with capturing groups

Isolating URL paths with capturing groups - regex

Is it possible to have n capture groups?
For example,
http://www.example.com/first-path
http://www.example.com/first-path/second-path
http://www.example.com/first-path/second-path/third-path
http://www.example.com/something.html
http://www.example.com/first-path?id=5
I am trying to capture first-path as group 1, second-path as group 2, and third-path as group 3 using http:\/\/(.*)\/(?!.*\/$)(.*), but it does not split the segments.
No specific programming language being used.

If you were using PHP, you could do something like this. The first split removes the leading http://www.example.com/ part, and the second then splits those values around the /:
$urls = array('http://www.example.com/first-path',
'http://www.example.com/first-path/second-path',
'http://www.example.com/first-path/second-path/third-path',
'http://www.example.com/something.html',
'http://www.example.com/first-path?id=5');
foreach ($urls as $url) {
$tail = preg_split('#https?://[^/]+/#', $url, -1, PREG_SPLIT_NO_EMPTY)[0];
$paths = preg_split('#/#', $tail);
print_r($paths);
}
Output:
Array
(
[0] => first-path
)
Array
(
[0] => first-path
[1] => second-path
)
Array
(
[0] => first-path
[1] => second-path
[2] => third-path
)
Array
(
[0] => something.html
)
Array
(
[0] => first-path?id=5
)
A similar thing could be done in Javascript:
let urls = ['http://www.example.com/first-path',
'http://www.example.com/first-path/second-path',
'http://www.example.com/first-path/second-path/third-path',
'http://www.example.com/something.html',
'http://www.example.com/first-path?id=5'];
console.log(urls.map(s => s.split(/https?:\/\/[^\/]+\//)[1].split(/\//)))
Output:
Array(5) […] 
0: Array [ "first-path" ]
1: Array [ "first-path", "second-path" ]
2: Array(3) [ "first-path", "second-path", "third-path" ]
3: Array [ "something.html" ]
4: Array [ "first-path?id=5" ]

Related

RegEx - groups, need [this:andthis] from a string

I hope this is a simple question, but I'm still getting my head around groups.
I have this string: this is some text [propertyFromId:34] and this is more text and I will have more like them. I need to get the content between the brackets then a group with the alpha-only text on left of the colon and group with the integer on the right of the colon.
So, full Match: propertyFromId:34, Group 1: propertyFromId, Group 2: 34
This is my starting point (?<=\[)(.*?)(?=])

Use
\[([a-zA-Z]+):(\d+)]
See the regex demo
Details:
\[ - a [ symbol
([a-zA-Z]+) - Group 1 capturing one or more alpha chars ([[:alpha:]]+ or \p{L}+ can be used, too)
: - a colon
(\d+) - Group 2 capturing one or more digits
] - a closing ] symbol.
PHP demo:
$re = '~\[([a-zA-Z]+):(\d+)]~';
$str = 'this is some text [propertyFromId:34] and this is more text';
preg_match_all($re, $str, $matches);
print_r($matches);
// => Array
// (
// [0] => Array
// (
// [0] => [propertyFromId:34]
// )
//
// [1] => Array
// (
// [0] => propertyFromId
// )
//
// [2] => Array
// (
// [0] => 34
// )
//
// )

Regex pattern to match groups starting with pattern

I am extract data from a text stream which is data structured as such
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field precedes the record data which can then have any number of following fields, each with choice of recType from 2 to 9 (also, each field starts with a "/")
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
Data is for simplicity, I know for certain that its only contains [A-Z0-9. ] but can be variable length (not just 4 chars as per example)
Now, the following expression sort of works, but its only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.

You can use a regex to match these blocks using 2 .NET regex features: 1) capture collection and 2) multiple capturing groups with the same name in the pattern. Then, we'll need some Linq magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - 2 to 9 digit (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+)- 1+ chars other than / (Group "fldData")
See the regex demo, results:
See C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
var res = Regex.Matches(str, #"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
.Cast<Match>()
.Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
.Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
(first, second) => first + "=" + second))
.ToList();
foreach (var t in res)
Console.WriteLine(string.Join(" ", t));
}
}

I would suggest to first split the string by /1, then use a patern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1

A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is because your stream has a variable number of entries in it, and using a variable number of capture groups is not supported. I also noticed some of the values had "=" between them as well as the dash, which your current regex doesn't address.
The problem comes when you try and add a quantifier to a capture group - the group will only remember the last thing it captured, so if you add a quantifier, it will end up catching the first and last fields, leaving out all the rest of them. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
If your streams were all the same length, then a single regex could be used, but there's a way to do it using a foreach loop with a much simpler regex working on each part of your stream (so it verifies your stream as well when it goes along!)
Now I'm not sure what language you're working with when using this, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
/*
* Get an array of [num]-[letters] with explode. This will make an array that
* contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
*/
$arr = explode("/", substr($str, 1));
$sorted = array();
$key = 0;
/*
* Sort this data into key->values based on numeric ordering.
* If the next one has a lower or equal starting number than the one before it,
* a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
* entry to be made, just in case the stream doesn't always start with 1.
*/
foreach ($arr as $value)
{
// This will get the number at the start, and has the added bonus of making sure
// each bit is in the right format.
if (preg_match("/^([0-9]+)[=-]([A-Z]+)$/", $value, $matches)) {
$newKey = (int)$matches[1];
$match = $matches[2];
} else
throw new Exception("This is not a valid data stream!");
// This bit checks if we've got a lower starting number than last time.
if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
$key += 1;
// Now sort them..
$sorted[$key][$newKey] = $match;
// This will be compared in the next iteration of the loop.
$lastKey = $newKey;
}
return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)

Regex expression symbol without space

Here is my regex: {{[^\{\s}]+\}}
And my input is {{test1}}{{test2}}{{test3}}.
How can I get these 3 tests by array using regex expression?

I would use: ~\{\{([^}]+?)\}\}~
and accessing array depends on your language!
[EDIT] add explanations
~: delimiter
\{\{, \}\}~: match characters literally. Should be
escaped.
[^}]: match anything inside {{}} until a }
+: repeat
pattern multiple times (for multiple characters)
?: is for 'lazy'
to match as few times as possible.
(): is to capture
:)
[EDIT] add PHP code sample for matching illustration:
<?php
$string= "{{test1}}{{test2}}{{test3}}";
if (preg_match_all("~\{\{([^}]+?)\}\}~s", $string, $matches))
{
print_r(array($matches));
// Do what you want
}
?>
will output this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => {{test1}}
[1] => {{test2}}
[2] => {{test3}}
)
[1] => Array
(
[0] => test1
[1] => test2
[2] => test3
)
)
)

test[0-9]+
This matches all occurences of testX where X is an integer of any size.
If you're trying to identify the braces instead, use this:
[{\}]

C# uses Matches Method returns MatchCollection object.
Here is some codes,
Regex r = new Regex(#"{{[^{\s}]+}}");
MatchCollection col = r.Matches("{{test1}}{{test2}}{{test3}}");
string[] arr = null;
if (col != null)
{
arr = new string[col.Count];
for (int i = 0; i < col.Count; i++)
{
arr[i] = col[i].Value;
}
}

Regex - Ignore some parts of string in match

Here's my string:
address='St Marks Church',notes='The North East\'s premier...'
The regex I'm using to grab the various parts using match_all is
'/(address|notes)='(.+?)'/i'
The results are:
address => St Marks Church notes => The North East\
How can I get it to ignore the \' character for the notes?

Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:
$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => address="St Marks Church"
[1] => notes="The North East's premier..."
)
[1] => Array
(
[0] => address
[1] => notes
)
[2] => Array
(
[0] => St Marks Church
[1] => The North East's premier...
)
)
Another method with preg_split:
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
//split the values at the = sign
$parts[$key]=preg_split('!=!',$value);
foreach($parts[$key] as $k2=>$v2){
//trim the quotes out and remove the slashes
$parts[$key][$k2]=stripslashes(trim($v2,"'"));
}
}
Output looks like:
Array
(
[0] => Array
(
[0] => address
[1] => St Marks Church
)
[1] => Array
(
[0] => notes
[1] => The North East's premier...
)
)
Super slow old-skool method:
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
switch($string[$pos]){
case $string[$pos]==='=':
$mode = 'value';
break;
case $string[$pos]===",":
$store[$key]=trim($value,"'");
$key=$value='';
$mode = 'key';
break;
default:
$$mode .= $string[$pos];
}
$pos++;
}
$store[$key]=trim($value,"'");

Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.
The following patterns will match the substrings required to buildyour desired associative array:
Patterns that generate a fullstring match and 1 capture group:
/(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
/(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)
Patterns that generate 2 capture groups:
/(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
/(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)
Code: (Demo)
$string = "address='St Marks Church',notes='The North East\'s premier...'";
preg_match_all(
"/(address|notes)='\K(?:\\\'|[^'])*/",
$string,
$out
);
var_export(array_combine($out[1], $out[0]));
echo "\n---\n";
preg_match_all(
"/(address|notes)='((?:\\\'|[^'])*)/",
$string,
$out,
PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));
Output:
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
---
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
Patterns #1 and #3 use alternatives to allow non-apostrophe characters or apostrophes not preceded by a backslash.
Patterns #2 and #4 (will require an additional backslash when implemented with php demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
Some notes:
Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.
Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.

You should match up to an end quote that isn't preceded by a backslash thus:
(address|notes)='(.*?)[^\\]'
This [^\\] forces the character immediately preceding the ' character to be anything but a backslash.

regex accents do not appears

I have this simple code that I need to match some words with accents but its not working like I need.
This is the code
<?
$ab=("BÉLICA HOL");
preg_match_all("/[A-ZÑÁÉÍÓÚ\.]+\b/", $ab,$match_mayusculas);
print_r($match_mayusculas);
?>
The result is this: Array ( [0] => Array ( [0] => BÃ‰LICA [1] => HOL ) )
Why?
If I do this
$ab=utf8_decode("BÉLICA HOL");
The result is Array ( [0] => Array ( [0] => B [1] => LICA [2] => HOL ) )
Where is my mistake?
Really thanks

This works
utf8_decode($match_mayusculas[0][0])
Thanks sp00m

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Isolating URL paths with capturing groups - regex

Related

RegEx - groups, need [this:andthis] from a string

Regex pattern to match groups starting with pattern

Regex expression symbol without space

Regex - Ignore some parts of string in match

regex accents do not appears

Categories

Resources