RegExp pattern to capture around two-characters delimiter - regex

I have a string which is something like:
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I want to retrieve the value associated to a key (say, key1). The following pattern:
::key1==([^:]*)
...will work only if there is no ':' character in the value, so I want the match to stop only at the substring ::. I can't find how to do that, as most examples I see are about single-character matching.
How do I modify the regexp pattern to match all characters between "::key1==" and the next "::" ?
Thanks!

You could use something like ::key1==(.*?):: instead. Assuming your regex flavor supports the lazy quantifier (the ? after .*), this should work.
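As a minimal sketch of the lazy-quantifier idea in Python (the sample string, with a single ':' placed inside the value, is an assumption made up for illustration):

```python
import re

# Sample input; the lone ':' inside the key1 value is an assumption for illustration.
text = "prefix::key0==value0::key1==val:ue1::key2==value2::"

# (.*?) is lazy: it grows one character at a time until '::' can match right after it.
match = re.search(r"::key1==(.*?)::", text)
value = match.group(1) if match else None
print(value)
```

A single ':' in the value does not end the capture, because the pattern only stops where two colons appear together.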

As mentioned in my comment to your question, if the entirety of your string is
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I would suggest exploding/splitting the string at :: instead of using a regex, as it will usually be faster. You didn't specify a language, but here is a PHP example:
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
// explode using :: as delimiter
$parts = explode('::', $string);
// for each element...
$matches = [];
foreach ($parts as $value) {
    // keep it if it has == in it
    if (strpos($value, '==') !== false) $matches[] = $value;
}
// output
echo "<pre>";print_r($matches);
output:
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
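The same split-based approach might look like this in Python. This is only a sketch: the dictionary name `pairs` is a hypothetical choice, and it assumes that keys never contain "==".

```python
text = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::"

pairs = {}
for chunk in text.split("::"):
    # partition() splits on the first '==' only, so a value may itself contain '=='.
    key, sep, value = chunk.partition("==")
    if sep:  # skip chunks without '==', such as 'prefix' and the trailing empty string
        pairs[key] = value

print(pairs["key1"])
```

Building a dictionary once makes every later lookup by key a plain index operation instead of another regex search.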
However, if you insist on the regex approach, here is a negative look-ahead alternative:
::((?:(?!::).)+)
php example
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
preg_match_all('~::((?:(?!::).)+)~',$string,$matches);
// output (group 1 holds the captures without the leading ::)
echo "<pre>";print_r($matches[1]);
output
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
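For comparison, here is a sketch of the same negative look-ahead pattern in Python. The sample string is shortened, and a single ':' is put inside one value (an assumption for illustration) to show that only a double colon ends a chunk.

```python
import re

# Shortened sample; the lone ':' inside the first value is an assumption for illustration.
text = "prefix::key0==val:ue0::key1==value1::"

# (?:(?!::).)+ consumes one character at a time, but only while
# the next two characters are not '::'.
chunks = re.findall(r"::((?:(?!::).)+)", text)
print(chunks)
```

The trailing '::' produces no entry, because the + quantifier needs at least one character after the delimiter.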

I think you're looking for a positive look-ahead:
::key0==(.*?)(?=::\w+==)
With the following:
prefix::key0==val::ue0::key1==value1::key2==value2::key3==value3::key4==value4::
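It correctly finds val::ue0. This assumes the keys conform to \w ([0-9A-Za-z_]). Note that the look-ahead requires another key to follow, so the last pair in the string (followed only by ::) will not match.
Also, a positive look-ahead may be a bit of overkill, but it will work if the value contains :: too.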
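A rough Python sketch of the look-ahead approach follows. The helper name `get_value` is hypothetical, and the extra `|$` alternative is an assumption added here so that the last pair, which is followed only by a trailing "::", can also be found.

```python
import re

text = "prefix::key0==val::ue0::key1==value1::key2==value2::"

def get_value(key, s):
    """Hypothetical helper: return the value for `key`, or None if it is absent."""
    # The look-ahead accepts either '::nextkey==' or a final '::' at the end of the string.
    pattern = r"::" + re.escape(key) + r"==(.*?)(?=::(?:\w+==|$))"
    m = re.search(pattern, s)
    return m.group(1) if m else None

print(get_value("key0", text))
print(get_value("key2", text))
```

Because the look-ahead does not consume the delimiter, the next pair can still be matched starting at its own "::".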

Related

Optionally prevent a string at the end of a wildcard from being matched

I have the following string:
12345 This could be anythingREMOVE
I need to match 12345 and This could be anything. Unfortunately, the format I need to parse also has a string at the end of the line that isn't always present (REMOVE in this example). How can I match what I'm looking for without REMOVE? I've tried the following pattern:
^(\d{5}) (.*)(?:REMOVE|$)
Unfortunately, REMOVE is picked up by the wildcard:
(
[0] => Array
(
[0] => 12345 This could be anythingREMOVE
)
[1] => Array
(
[0] => 12345
)
[2] => Array
(
[0] => This could be anythingREMOVE
)
)
If the trailing REMOVE is optional, why not just use this regex:
"/^(\d{5}) /"
However, if you really want to exclude REMOVE from the captured text, then use this:
$s = '12345 This could be anythingREMOVE';
if (preg_match("/^(\d{5}) (.*?)(?:REMOVE|)$/", $s, $arr))
var_dump($arr);
Output:
array(3) {
[0]=>
string(34) "12345 This could be anythingREMOVE"
[1]=>
string(5) "12345"
[2]=>
string(22) "This could be anything"
}
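A comparable sketch in Python, run against a line with the suffix and a line without it (the second sample line is an assumption for illustration):

```python
import re

# Lazy (.*?) plus an optional literal suffix anchored at the end of the string.
pattern = re.compile(r"^(\d{5}) (.*?)(?:REMOVE)?$")

with_suffix = pattern.match("12345 This could be anythingREMOVE").groups()
without_suffix = pattern.match("12345 This could be anything").groups()

print(with_suffix)
print(without_suffix)
```

The lazy quantifier is what matters here: a greedy (.*) would swallow REMOVE before the optional group gets a chance to match it.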
You can try this regex:
^(\d{5})((?:.(?!REMOVE))+.)
How It Works
^(\d{5}) -- Matches the start of the string, followed by five digits [0-9]. The parentheses capture the matched text.
((?:.(?!REMOVE))+ -- Matches any character that is not immediately followed by the sequence REMOVE, one or more times. It stops at the n in anything; it can't match the g because the g is followed by REMOVE.
.) -- Allows the g to match.
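A quick Python check of this pattern, as a sketch. Note that, as written, the pattern has no literal space between the two groups, so the second group keeps the leading space.

```python
import re

pattern = re.compile(r"^(\d{5})((?:.(?!REMOVE))+.)")

# Group 2 starts with the space that follows the digits.
with_suffix = pattern.match("12345 This could be anythingREMOVE").groups()
without_suffix = pattern.match("12345 This could be anything").groups()

print(with_suffix)
print(without_suffix)
```

Calling .strip() on the second group, or adding a literal space between the groups, would remove that leading space.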

regex: match a pattern between last occurrence of a "/" and the end of the line

I can't figure out how to match a pattern between LAST / and the end of the line.
I have tons of:
/usr/etc/blabla:/etc/bbb
/usr/etc/blabla:/etc/bffb.gh
/usr/etc/blabla:/local/fffusr
/usr/etc/blabla:/bin/dfusrd
/usr/etc/var:/etc/aaaaaf.ju
For example, I want to match "usr" only when it appears after the last /.
I'm using grep.
EDIT:
I've a small problem with this solution:
/([^/]+)$
It doesn't match if the pattern comes immediately after the /; for example, these:
/usr/etc/blabla:/bin/usrlala
/bin/bla/:/etc/usr
are not matched
FOUND IT: /([^/]*)$
It would be:
/([^/]+)$
But maybe you must escape the slash (/) depending on your language:
/\/([^\/]+)$/
Why do you want to use a regex for such a simple task?
If you're using PHP you can use
$pos = strrpos($line, '/');
to find the last occurrence of / and then copy everything after it:
$name = substr($line, $pos+1);
A regex is not the ultimate solution to everything. For simple string operations like this it is usually slower than the built-in string functions, and generally slower than a well-written hand-rolled parser.
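The same non-regex idea could be sketched in Python with str.rpartition, which splits at the last occurrence of a separator. The helper name `last_segment` is hypothetical.

```python
lines = [
    "/usr/etc/blabla:/etc/bbb",
    "/usr/etc/blabla:/etc/bffb.gh",
    "/usr/etc/blabla:/local/fffusr",
    "/usr/etc/blabla:/bin/dfusrd",
    "/usr/etc/var:/etc/aaaaaf.ju",
]

def last_segment(line):
    """Hypothetical helper: return everything after the last '/' in the line."""
    return line.rpartition("/")[2]

# Keep only the final segments that contain 'usr'.
hits = [last_segment(line) for line in lines if "usr" in last_segment(line)]
print(hits)
```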
echo "
/usr/etc/blabla:/etc/bbb
/usr/etc/blabla:/etc/bffb.gh
/usr/etc/blabla:/local/fffusr
/usr/etc/blabla:/bin/dfusrd
/usr/etc/var:/etc/aaaaaf.ju" | sed -n 's#.*/##;/.*usr.*/p'
fffusr
dfusrd
answer in javascript
var s = "/usr/etc/var:/etc/aaaaaf.ju"
s ; //# => /usr/etc/var:/etc/aaaaaf.ju
var last = s.match(/[^/]+$/);
last ; //# => aaaaaf.ju
Using PCRE:
$re = '/.+\/.*usr.*/i';
$string = '/usr/etc/blabla:/etc/bbb
/usr/etc/blabla:/etc/bffb.gh
/usr/etc/blabla:/local/fffusr
/usr/etc/blabla:/bin/dfusrd
/usr/etc/var:/etc/aaaaaf.ju';
$nMatches = preg_match_all($re, $string, $aMatches);
Result:
Array
(
[0] => Array
(
[0] => /usr/etc/blabla:/local/fffusr
[1] => /usr/etc/blabla:/bin/dfusrd
)
)
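One caveat with the pattern above: .* can also cross slashes, so a line such as /a/usr/b would match even though usr is not in the last segment. A stricter variant, sketched in Python under the assumption that the input is one multi-line string (the sample lines are chosen for illustration), excludes slashes after the final /:

```python
import re

# Sample lines chosen for illustration; the last one has 'usr' only in a middle segment.
text = """/usr/etc/blabla:/etc/bbb
/usr/etc/blabla:/local/fffusr
/bin/bla/:/etc/usr
/a/usr/b"""

# [^/\n]* keeps the match inside the final path segment of each line.
pattern = re.compile(r"^.*/[^/\n]*usr[^/\n]*$", re.MULTILINE)
matched = pattern.findall(text)
print(matched)
```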

Regex: How to "step back"

I am having some trouble cooking up a regex that produces this result:
Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3
How does one step back in a regex and discard the last match? That is, I need a comma that is followed by a space not to match. This is what I came up with...
\d+(,|\r)
Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Micheal2,3
The regex feature you're asking about is called a positive lookbehind. But in your case, I don't think you need it. Try this:
\d+(?:,\d+)*
In your example, this will match the comma delimited lists of numbers and exclude the names and trailing commas and whitespace.
Here is a short bit of test code written in PHP that verifies it on your input:
<?php
$input = "Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Micheal2,3";
$matches = array();
preg_match_all('/\d+(?:,\d+)*/', $input, $matches);
print_r($matches[0]);
?>
outputs:
Array
(
[0] => 1
[1] => 1,2
[2] => 1,2,3,4,5,6,7,18
[3] => 2,3
)
I believe \d+,(?!\s) will do what you want. The ?! is a negative lookahead, which only matches if what follows the ?! does not appear at this position in the search string.
>>> re.findall(r'\d+,(?!\s)', 'Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3')
['1,', '1,', '2,', '3,', '4,', '5,', '6,', '7,', '2,']
Or if you want to match the comma-separated list of numbers excluding the final comma use \d+(?:,\d+)*.
>>> re.findall(r'\d+(?:,\d+)*', 'Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3')
['1', '1,2', '1,2,3,4,5,6,7,18', '2,3']

Regex for capturing numbered text list

I have a text list that I am trying to capture data from using a regex.
Here is a sample of the text format:
(1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/
I have a Regex that currently captures this correctly, but I am having some difficulty with making it work under outlier conditions.
Here is my regex
/\(?\d\d?\)([^\)]+)(\/|\z)/
Unfortunately some of the data contains parentheses like this:
(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/
The substrings '(1998-1999)' and '(blah)' make it fail!
Anyone care to have a crack at this one?
Thank you :D
I would try this:
\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))
This rather scary looking regex does the following:
It looks for one or more digits wrapped in parentheses and captures them;
There must be at least one white space character after the digits in parentheses. This white space is ignored (not captured);
A non-greedy wildcard expression is used. This is (imho) preferable to using negated character classes (eg [^/]+) for this kind of problem;
The positive lookahead ((?=...)) says the expression must be followed by a forward slash and then one of:
one or more digits wrapped in parentheses; or
the string terminator.
To give you an example in PHP (you don't specify your language):
$s = '(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/';
preg_match_all('!\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))!', $s, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => (1) this is a sample string (1998-1999)
[1] => (2) something strange (blah)
[2] => (3) another bit of text
[3] => (4) the last one/ something!
)
[1] => Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
)
[2] => Array
(
[0] => this is a sample string (1998-1999)
[1] => something strange (blah)
[2] => another bit of text
[3] => the last one/ something!
)
)
Some notes:
You don't specify what you want to capture. I've assumed the list item number and the text. This could be wrong in which case just drop those capturing parentheses. Either way you can get the whole match;
I've dropped the trailing slash from the match. This may not be your intent. Again just change the capturing to suit;
I've allowed any number of digits for the item number. Your version allowed only two. If you prefer it that way replace \d+ with \d\d?.
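For readers working in Python, a rough equivalent might look like the sketch below. It uses \Z as the end-of-string anchor, and stripping the trailing whitespace from each item is an added assumption about what is wanted.

```python
import re

text = ("(1) this is a sample string (1998-1999) /(2) something strange (blah) "
        "/(3) another bit of text /(4) the last one/ something!/")

# The lookahead stops the lazy (.*?) only at '/' followed by '(digits)' or by end of string.
pattern = re.compile(r"\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\Z))")

items = [(int(number), body.strip()) for number, body in pattern.findall(text)]
for number, body in items:
    print(number, body)
```

Parenthesised text such as (1998-1999) is left alone because it is neither preceded by a slash nor made up only of digits.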
Prepend a / to the beginning of string, append a (0) to the end of the string, then split the whole string with the pattern \/\(\d+\), and discard the first and last empty elements.
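The padding-and-splitting idea could be sketched in Python as follows. The variable names are hypothetical, and the sketch assumes the input always ends with a slash, as in the sample.

```python
import re

text = ("(1) this is a sample string (1998-1999) /(2) something strange (blah) "
        "/(3) another bit of text /(4) the last one/ something!/")

# Pad both ends so that every item is enclosed by a '/(digits)' delimiter.
padded = "/" + text + "(0)"
parts = re.split(r"/\(\d+\)", padded)

# The first and last elements are the empty strings outside the padding.
items = [part.strip() for part in parts[1:-1]]
print(items)
```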
As long as / cannot appear in the text...
\(?\d?\d[^/]+

PHP/Javascript RegExp - Non-capturing group

I have three variations of a string:
1. view=(edit:29,30)
2. view=(edit:29,30;)
3. view=(edit:29,30;x:100;y:200)
I need a RegExp that:
capture up to and including ",30"
capture "x:100;y:200" - whenever there's a semicolon after the first match;
WILL NOT include leftmost semicolon in any of the groups;
entire string on the right of the first semicolon and up to ')' can/should be in the same group.
I came up with:
$pat = '/view=\((\w+)(:)([\d,]+)((;[^)]+){0,}|;)\)/';
Applied to 'view=(edit:29,30;x:100;y:200)' it yields:
Array
(
[0] => view=(edit:29,30;x:100;y:200)
[1] => edit
[2] => :
[3] => 29,30
[4] => ;x:100;y:200
[5] => ;x:100;y:200
)
THE QUESTION. How do I remove ';' from matches [4] and [5]?
IMPORTANT. The same RegExp should work with a string when no semicolons are present, as: 'view=(edit:29,30)'.
$pat = '/view=\((\w+)(:)([\d,]+)((;[^)]+){0,}|;)\)/';
$str = 'view=(edit:29,30;x:100;y:200)';
preg_match($pat, $str, $m);
print_r($m);
Thanks!
You don’t need to group everything. Try this regular expression:
/view=\((\w+):([\d,]+)(?:;([^)]+)?)?\)/
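A short Python sketch that runs this expression against the three variations from the question (the pattern is unchanged apart from dropping the PHP delimiters):

```python
import re

pattern = re.compile(r"view=\((\w+):([\d,]+)(?:;([^)]+)?)?\)")

samples = [
    "view=(edit:29,30)",
    "view=(edit:29,30;)",
    "view=(edit:29,30;x:100;y:200)",
]

# The third group is None when nothing follows the optional semicolon.
results = [pattern.search(sample).groups() for sample in samples]
for sample, groups in zip(samples, results):
    print(sample, groups)
```

The semicolon sits inside a non-capturing group, so it is consumed by the match but never appears in any captured group.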
I guess you want something like this:
$pattern = '/view=\\((\\w+):(\\d+,\\d+)(?:;((?:\\w+:\\d+;?)*))?\\)/';
Should return
[0] view=(edit:29,30;x:100;y:200)
[1] edit
[2] 29,30
[3] x:100;y:200