Trying to create a RegEx for the following patterns

Here are the patterns:
Red,Green (and so on...)
Red (+5.00),Green (+6.00) (and so on...)
Red (+5.00,+10.00),Green (+6.00,+20.00) (and so on...)
Red (+5.00),Green (and so on...)
Each attribute ("Red", "Green") can have 0, 1, or 2 modifiers (shown as "+5.00,+10.00", etc.).
I need to capture each attribute together with its modifiers as a single string (i.e. "Red (+5.00,+10.00)", "Green (+6.00,+20.00)").
Help?

Another example (PCRE):
((?:Red|Green)(?:\s\((?:\+\d+\.\d+,?)+\))?)
Explanation:
(...) // a capture group
(?:...) // a non-capturing group
Red|Green // matches Red or Green
(?:...)? // an optional non-capturing group
\s // matches any whitespace character
\( // matches a literal (
(?:...)+ // a non-capturing group that can occur one or more times
\+ // matches a literal +
\d+ // matches one or more digits
\. // matches a literal .
\d+ // matches one or more digits
,? // matches an optional comma
\) // matches a literal )
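As a quick check of the pattern explained above, here is a minimal JavaScript sketch. It assumes the same syntax works in JavaScript (it uses only constructs common to PCRE and JavaScript), drops the outer capture group because the whole match is already the wanted string, and uses a hypothetical helper name, extractAttributes.

```javascript
// The pattern from the answer, minus the outer capture group, with the g flag
// so that String.prototype.match returns every occurrence.
const attributePattern = /(?:Red|Green)(?:\s\((?:\+\d+\.\d+,?)+\))?/g;

// extractAttributes is a hypothetical helper name, not part of the original answer.
function extractAttributes(input) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return input.match(attributePattern) || [];
}

console.log(extractAttributes("Red,Green"));
console.log(extractAttributes("Red (+5.00),Green"));
console.log(extractAttributes("Red (+5.00,+10.00),Green (+6.00,+20.00)"));
```

Because the names Red and Green are hard-coded, any other attribute name is silently skipped.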
Update:
Or actually if you just want to extract the data, then
((?:Red|Green)(?:\s\([^)]+\))?)
would be sufficient.
Update 2: As pointed out in your comment, the following would match anything in the first part except , and (:
([^,(]+(?:\s\([^)]+\))?)
(does not work, too permissive)
To be more restrictive (allowing only letters, digits, and underscores), you can just use \w:
(\w+(?:\s\([^)]+\))?)
Update 3:
I see, the first of my alternatives does not work correctly, but \w works:
$pattern = "#\w+(?:\s\([^)]+\))?#";
$str = "foo (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)";
$matches = array();
preg_match_all($pattern, $str, $matches);
print_r($matches);
prints
Array
(
[0] => Array
(
[0] => foo (+15.00,-10.00)
[1] => bar (-10.00,+25)
[2] => baz
[3] => bing
[4] => bam (150.00,-5000.00)
)
)
Update 4:
Ok, I got something working, please check whether it always works:
(?=[^-+,.]+)[^(),]+(?:\s?\((?:[-+\d.]+,?)+\))?
With:
$pattern = "#(?=[^-+,.]+)[^(),]+(?:\s?\((?:[-+\d.]+,?)+\))?#";
$str = "5 lb. (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)";
preg_match_all gives me
Array
(
[0] => Array
(
[0] => 5 lb. (+15.00,-10.00)
[1] => bar (-10.00,+25)
[2] => baz
[3] => bing
[4] => bam (150.00,-5000.00)
)
)
Maybe there is a simpler regex, I'm not an expert...
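One possibly simpler alternative, which is not what the answer above does, is to split on the commas that sit outside parentheses instead of matching each item. The sketch below is in JavaScript and assumes the parentheses are balanced and never nested; splitAttributes is a hypothetical helper name.

```javascript
// Split on a comma only when it is NOT followed by a ")" that comes
// before any other parenthesis, i.e. when the comma is outside "(...)".
// Assumption: parentheses are balanced and not nested.
function splitAttributes(input) {
  return input.split(/,(?![^()]*\))/);
}

const sample = "5 lb. (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)";
console.log(splitAttributes(sample));
```

This approach does not validate the modifiers at all; it only decides where one item ends and the next begins.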

PCRE format:
(Red|Green)(\s\((?P<val1>.+?)(,){0,1}(?P<val2>.+?){0,1}\)){0,1}
Match from PHP:
preg_match_all("/(Red|Green)(\s\((?P<val1>.+?)(,){0,1}(?P<val2>.+?){0,1}\)){0,1}/ims", $text, $matches);

Here's my bid:
/
(?:^|,) # Match line beginning or a comma
(?: # parent wrapper to catch multiple "color (+#.##)" patterns
( # grouping pattern for picking off matches
(?:(?:Red|Green),?)+ # match the color prefix
\s\( # space then parenthesis
(?: # wrapper for repeated number groups
(?:\x2B\d+\.\d+) # pattern for the +#.##
,?)+ # end wrapper
\) # closing parenthesis
)+ # end matching pattern
)+ # end parent wrapper
/
Which translates to:
/(?:^|,)(?:((?:(?:Red|Green),?)+\s\((?:(?:\x2B\d+\.\d+),?)+\))+)+/
EDIT
Sorry, it was only catching the last pattern before. This will catch all matches (or should).

How to split string by slash which is not between numbers?

I am using preg_split function below:
$splitted = preg_split('#[/\\\\\_\s]+#u', $string);
Input: "925/123 Black/Jack"
Current split result:
[
0 => '925',
1 => '123',
2 => 'Black',
3 => 'Jack'
]
Split result I want:
[
0 => '925/123',
1 => 'Black',
2 => 'Jack'
]
You may use
preg_split('#(?:[\s\\\\_]|(?<!\d)/(?!\d))+#u', '925/123 Black/Jack')
Details
(?: - start of a non-capturing group:
[\s\\_] - a whitespace, \ or _
| - or
(?<!\d)/(?!\d) - a / that is neither preceded nor followed by a digit
)+ - end of a non-capturing group, repeat 1 or more times.
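The same idea carries over to other engines that support lookbehind. Below is a minimal JavaScript sketch of the pattern described above; it assumes a runtime with lookbehind support (recent Node.js or browsers), and splitTokens is a hypothetical helper name.

```javascript
// A run of delimiters: whitespace, backslash, underscore, or a "/" that is
// neither preceded nor followed by a digit.
const delimiter = /(?:[\s\\_]|(?<!\d)\/(?!\d))+/u;

// splitTokens is a hypothetical helper name used only for this illustration.
function splitTokens(input) {
  return input.split(delimiter);
}

console.log(splitTokens("925/123 Black/Jack"));
console.log(splitTokens("1/a")); // the "/" follows a digit, so it is kept
```

Note that a slash with a digit on only one side is also left alone, because each lookaround rejects the slash on its own.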
One option is to match one or more groups of digits separated by forward slashes, with whitespace boundaries on the left and on the right.
Then use (*SKIP)(*F) to discard that match, and otherwise match 1 or more of the characters listed in the character class. Note that you don't have to escape the underscore.
(?<!\S)\d+(?:/\d+)+(?!\S)(*SKIP)(*F)|[/\\_\s]+
Explanation
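(?<!\S)\d+(?:/\d+)+(?!\S) Match groups of digits separated by forward slashes, bounded by whitespace or the ends of the string
(*SKIP)(*F) Skip what was matched so far and fail, so that it is never used as a delimiter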
| Or
[/\\_\s]+ Match 1+ occurrences of any of the listed
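The (*SKIP)(*F) verbs are PCRE-specific. In an engine without them, a common workaround is to match the tokens you want to keep rather than the delimiters. The JavaScript sketch below does that; it is a different technique from the one above, it assumes lookbehind support, and tokenize is a hypothetical helper name.

```javascript
// First alternative: the protected form, digits separated by "/" with
// whitespace (or the string ends) on both sides.
// Second alternative: any run of characters that are not delimiters.
const tokenPattern = /(?<!\S)\d+(?:\/\d+)+(?!\S)|[^\/\\_\s]+/g;

// tokenize is a hypothetical helper name used only for this illustration.
function tokenize(input) {
  return input.match(tokenPattern) || [];
}

console.log(tokenize("925/123 Black/Jack"));
console.log(tokenize("a/925/123")); // not whitespace-bounded, so it is split
```

Unlike preg_split, this never produces empty strings for leading or trailing delimiters.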
For example
$string = "925/123 Black/Jack";
$pattern = "#(?<!\S)\d+(?:/\d+)+(?!\S)(*SKIP)(*F)|[/\\\\_\s]+#u";
$splitted = preg_split($pattern, $string);
print_r($splitted);
Output
Array
(
[0] => 925/123
[1] => Black
[2] => Jack
)
Your regex is unnecessarily complicated. You need to split your string on:
either a space (or, more generally, a sequence of whitespace characters),
or a slash
not preceded by a digit (negative lookbehind),
not followed by a digit (negative lookahead).
So the regex you need (enclosed in # chars, with doubled backslashes) is:
#(?<!\\d)/(?!\\d)|\\s+#
Example of code:
$string = "925/123 Black/Jack";
$pattern = "#(?<!\\d)/(?!\\d)|\\s+#";
$splitted = preg_split($pattern, $string);
print_r($splitted);
prints just what you want:
Array
(
[0] => 925/123
[1] => Black
[2] => Jack
)

Need to fix the perl regex to handle multiple cases

I'm trying to handle some cases of strings with the regex:
(.*note(?:'|")?\s*=>\s*)("|')?(.*?)\2(.*)
Strings:
note => "note goes here",
note => 'note goes here',
note => $note,
note => "$note",
note => '$note',
note => '$note'
note => $note . $note2 (can be longer; think of it as a key/value pair in a Perl hash)
# note => '$note',
There can be multiple spaces at the start, at the end, or in between. I need to capture the quote (" or '), the value (such as $note), and the comma or whatever else follows the note section. There can be a # at the beginning if the line is a comment, so I've included .* at the start. The given regex fails in case 3 because group 2 does not take part in the match there, so \2 has nothing to refer to.
Edit:
The requirement is that I'm reading a file and replacing the value of note with a tag, say NOTETAG, while everything around it stays the same, including quotes and spaces. For that:
we need to capture everything from the beginning up to where the value starts
we should capture the quotes too, so that I can write them back exactly
we need to capture the value of the note
we should capture whatever follows the note value as well.
e.g. note => "kamal" , will become note => "NOTETAG" , (notice that the trailing , is not consumed)
s{
\b
note
\s*
=>
\s*
\K
(?: (.*)
| '[^']*'
| "[^"]*"
)
}{
defined($1)
? $1 =~ s{\$note\b}{"NOTETAG"}gr
: '"NOTETAG"'
}exg;
You could try (note\s*=>\s*(?:"|')?)[^'",]+
Explanation:
(...) - capturing group
note - match note literally
\s* - match zero or more of whitespaces
=> - match => literally
(?:..) - non-capturing group
"|' - alternation: match either ' or "
? - match preceding pattern zero or one time
[^'",]+ - negated character class - match one or more characters (due to the + operator) other than ', ", and ,
As a replacement use \1NOTETAG, where \1 means first capturing group
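To see how that replacement behaves on the sample lines, here is a minimal JavaScript sketch. It assumes the pattern is used unchanged, writes the backreference as $1 (JavaScript's spelling of \1 in a replacement string), and uses a hypothetical helper name, tagNote.

```javascript
// Group 1 keeps "note =>", the surrounding spaces and an optional opening quote.
// The value itself ([^'",]+) is what gets replaced.
const notePattern = /(note\s*=>\s*(?:"|')?)[^'",]+/;

// tagNote is a hypothetical helper name used only for this illustration.
function tagNote(line) {
  return line.replace(notePattern, "$1NOTETAG");
}

console.log(tagNote('note => "note goes here",'));
console.log(tagNote("note => '$note',"));
console.log(tagNote("note => $note,"));
console.log(tagNote("# note => '$note',"));
```

Two limits are worth keeping in mind: for an unquoted value, any spaces before the trailing comma are consumed as part of the value, and an empty quoted value ("" or '') is not handled correctly.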

RegEx for capturing every single character except forward slash

I have the following three example strings:
"taxonomy": "abc/about_abc/bsc/archive/2009/presentations_dec"
"taxonomy": "about/archive/term"
"taxonomy": "_decommisioned/ntp-server.niehs.nih.gov/htdocs/results_status/resstatf"
I have tried with the following RegEx:
"taxonomy": "(\w+[^\/])\/?"?
The goal is to take each of those strings and explode them onto their own separate lines on the forward slash, so term1/term2/term3 equals
term1
term2
term3
I also don't know how many terms there are per line, which is why the optional groups are repeated the way they are. There could be a minimum of one and a maximum of 7. My full RegEx looks like this:
( "taxonomy": "(\w+[^\/])?\/?(\w+[^\/])?\/?(\w+[^\/])?\/?(\w+[^\/])?\/?(\w+[^\/])?\/?(\w+[^\/])?\/?(\w+[^\/])?\/?")
How do I adjust my capture group to get everything except the forward slashes?
As mentioned in the comments, the third string contains the part ntp-server.niehs.nih.gov, whose hyphen and dots are not matched by \w.
You might simplify your expression by using a negated character class to match anything that is not a forward slash, followed by a repeating group that matches a forward slash and then again 1+ characters that are not a forward slash.
Then you could split the match on the forward slash.
Pattern
"taxonomy": "\K[^/\n]+(?:/[^/\n]+)+(?=")
Explanation
"taxonomy": Match literally
"\K Match double quote and then forget what was matched using \K
[^/\n]+ Match 1+ times any character except a forward slash or a newline, using a negated character class
(?:/[^/\n]+)+ Repeating pattern to match /, then 1+ times not a /
(?=") Positive lookahead to assert what is on the right is a double quote
For example, if you use explode in php:
$pattern = '~"taxonomy": "\K[^/\n]+(?:/[^/\n]+)+(?=")~';
$strings = [
'"taxonomy": "abc/about_abc/bsc/archive/2009/presentations_dec"',
'"taxonomy": "about/archive/term"',
'"taxonomy": "_decommisioned/ntp-server.niehs.nih.gov/htdocs/results_status/resstatf"'
];
foreach ($strings as $string) {
preg_match($pattern, $string, $match);
print_r(explode('/', $match[0]));
}
Result:
Array
(
[0] => abc
[1] => about_abc
[2] => bsc
[3] => archive
[4] => 2009
[5] => presentations_dec
)
Array
(
[0] => about
[1] => archive
[2] => term
)
Array
(
[0] => _decommisioned
[1] => ntp-server.niehs.nih.gov
[2] => htdocs
[3] => results_status
[4] => resstatf
)
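\K is not available in every engine. Where it is missing, a capture group can play the same role. The JavaScript sketch below is a variation under stated assumptions rather than a port: it excludes the double quote from the character class so the value stops at the closing quote, it uses * instead of + so that a value with a single term also matches, and splitTaxonomy is a hypothetical helper name.

```javascript
// Group 1 holds the value between the quotes: one term, then zero or more
// "/term" repetitions. The quote is excluded so the match ends at the closing ".
const taxonomyPattern = /"taxonomy": "([^\/\n"]+(?:\/[^\/\n"]+)*)"/;

// splitTaxonomy is a hypothetical helper name used only for this illustration.
function splitTaxonomy(line) {
  const match = taxonomyPattern.exec(line);
  return match ? match[1].split("/") : [];
}

console.log(splitTaxonomy('"taxonomy": "about/archive/term"'));
console.log(splitTaxonomy('"taxonomy": "_decommisioned/ntp-server.niehs.nih.gov/htdocs/results_status/resstatf"'));
```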

Separating starting digits with regex

I want to separate the starting digits from strings such as
01.text
2 - something
3 more
to get
array (
[0] => 01.text
[1] => 01
[2] => text
)
array (
[0] => 2 - something
[1] => 2
[2] => something
)
array (
[0] => 3 more
[1] => 3
[2] => more
)
I tried a regex pattern of
^(\d+)\.+|\s+|-+(.*?)
but it doesn't work as I expected.
My problem is how to match . or - with or without space after the digits.
Your regex uses alternation, so it matches any one of three things: one or more digits at the start of the string (captured in group 1) followed by one or more dots, or one or more whitespace characters, or one or more hyphens followed by a group that lazily matches any characters (and therefore matches nothing).
You could update your regex to not use the alternation | and make the quantifier in the second group greedy.
In the first group capture one or more digits, then match your separators with a character class, followed by another capturing group that matches any character one or more times:
^(\d+)[.\s-]+(.+)
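A minimal JavaScript sketch of this pattern applied to the three sample strings follows; splitLeadingDigits is a hypothetical helper name, and the result is copied into a plain array so that it mirrors the layout asked for in the question.

```javascript
// Group 1: the leading digits. Then one or more separators (dot, whitespace
// or hyphen). Group 2: the rest of the string.
const leadingDigits = /^(\d+)[.\s-]+(.+)/;

// splitLeadingDigits is a hypothetical helper name used only for this illustration.
function splitLeadingDigits(row) {
  const match = leadingDigits.exec(row);
  // Copy into a plain array: [whole match, digits, remaining text].
  return match ? Array.from(match) : null;
}

for (const row of ["01.text", "2 - something", "3 more"]) {
  console.log(splitLeadingDigits(row));
}
```

Since the separator class uses +, a string with no separator after the digits, such as "4text", does not match at all.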
It's better to define a pattern for the strings you want to split, although I know that is not always possible. This regex matches all of your cases and gives you the array you want:
/^(\d+)[\.\-\s]*(.*)?$/
let rows = [
"01.text",
"2 - something",
"3 more"
];
let regex = /^(\d+)[\.\-\s]*(.*)?$/;
for(let row of rows) {
console.log(regex.exec(row))
}
Anyway, if you know of more separators in the file, add them to the [\.\-\s]* character class.

Will a lookahead in regular expressions always not capture or does it depend?

I've been reading some articles on non-capturing groups on this site and on the net
(such as http://www.regular-expressions.info/brackets.html and http://www.asiteaboutnothing.net/regexp/regex-disambiguation.html, What does the "?:^" regular expression mean?, What is a non-capturing group? What does a question mark followed by a colon (?:) mean?)
I am clear on the meaning of (?:foo). What I am unclear about is (?=foo). Is (?=foo) also always a non-capturing group, or does it depend?
No, (?=foo) will not capture "foo". No look-around assertion (positive or negative, lookahead or lookbehind) captures anything by itself; it only checks for the presence (or absence) of text.
For example, the regex:
(X(?=\d+))
matches "X" only when there's one or more digits after it. However, these digits are not a part of match group 1.
You can put a capture group inside the lookahead to capture that text. For example, the regex:
(X(?=(\d+)))
matches "X" only when there's one or more digits after it. And these digits are captured in match group 2.
A PHP demo:
<?php
$s = 'X123';
preg_match_all('/(X(?=(\d+)))/', $s, $matches);
print_r($matches);
?>
will print:
Array
(
[0] => Array
(
[0] => X
)
[1] => Array
(
[0] => X
)
[2] => Array
(
[0] => 123
)
)
Lookarounds are always non-capturing and zero-width.
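Most groups starting with (? are non-capturing, but not all of them: named groups such as (?<name>foo) or (?P<name>foo) do capture. Of the non-capturing forms, only (?:foo) works as a regular group.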
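One way to check which constructs create a group is to look at the length of the match array. The JavaScript sketch below does this for a non-capturing group, a lookahead, a lookahead with a capture group inside it, and a named group; it assumes a runtime with named group support, and the variable names are only illustrative. Note that a named group also begins with (? and yet it captures.

```javascript
// (?:...) groups without capturing: only the whole match is returned.
const nonCapturing = /(?:foo)bar/.exec("foobar");

// A lookahead checks for digits but neither consumes nor captures them.
const lookahead = /X(?=\d+)/.exec("X123");

// A capture group inside the lookahead does capture, while the overall
// match is still just "X".
const captureInside = /X(?=(\d+))/.exec("X123");

// A named group starts with "(?" but IS a capturing group.
const named = /(?<digits>\d+)/.exec("X123");

console.log(nonCapturing.length, lookahead.length);
console.log(captureInside[0], captureInside[1]);
console.log(named.groups.digits, named[1]);
```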