Regex for capturing numbered text list

I have a text list that I am trying to capture data from using a regex.
Here is a sample of the text format:
(1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/
I have a Regex that currently captures this correctly, but I am having some difficulty with making it work under outlier conditions.
Here is my regex
/\(?\d\d?\)([^\)]+)(\/|\z)/
Unfortunately some of the data contains parentheses like this:
(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/
The substrings '(1998-1999)' and '(blah)' make it fail!
Anyone care to have a crack at this one?
Thank you :D

I would try this:
\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))
This rather scary looking regex does the following:
It looks for one or more digits wrapped in parentheses and captures them;
There must be at least one white space character after the digits in parentheses. This white space is ignored (not captured);
A non-greedy wildcard expression is used. This is (imho) preferable to using negated character classes (e.g. [^/]+) for this kind of problem;
The positive lookahead ((?=...)) says the expression must be followed by a forward slash and then one of:
one or more digits wrapped in parentheses; or
the string terminator.
To give you an example in PHP (you don't specify your language):
$s = '(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/';
preg_match_all('!\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))!', $s, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => (1) this is a sample string (1998-1999)
[1] => (2) something strange (blah)
[2] => (3) another bit of text
[3] => (4) the last one/ something!
)
[1] => Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
)
[2] => Array
(
[0] => this is a sample string (1998-1999)
[1] => something strange (blah)
[2] => another bit of text
[3] => the last one/ something!
)
)
Some notes:
You don't specify what you want to capture. I've assumed the list item number and the text. This could be wrong in which case just drop those capturing parentheses. Either way you can get the whole match;
I've dropped the trailing slash from the match. This may not be your intent. Again just change the capturing to suit;
I've allowed any number of digits for the item number. Your version allowed only two. If you prefer it that way replace \d+ with \d\d?.
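For readers not using PHP, the same pattern can be sketched in Python. This is only an illustration under a few assumptions: Python's re module is used, \Z stands in for \z as the end-of-string anchor (older Python versions do not accept \z), and the helper name parse_numbered_list is made up for this example.

```python
import re

# Same pattern as the PHP example, with \Z as the end-of-string anchor.
ITEM = re.compile(r'\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\Z))')

def parse_numbered_list(text):
    """Return (number, text) pairs. Hypothetical helper, not part of any library."""
    return [(int(num), body.strip()) for num, body in ITEM.findall(text)]

sample = ('(1) this is a sample string (1998-1999) /(2) something strange (blah) '
          '/(3) another bit of text /(4) the last one/ something!/')

items = parse_numbered_list(sample)
for num, body in items:
    print(num, body)
```

The strip() call only removes the space left in front of each separating slash; drop it if that whitespace matters.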

Prepend a / to the beginning of string, append a (0) to the end of the string, then split the whole string with the pattern \/\(\d+\), and discard the first and last empty elements.
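A rough Python sketch of this split-based idea, assuming (as in the sample) that the string always ends with a slash, so that appending (0) produces a final /(0) separator:

```python
import re

sample = ('(1) this is a sample string (1998-1999) /(2) something strange (blah) '
          '/(3) another bit of text /(4) the last one/ something!/')

# Pad the string so every item is surrounded by a "/(n)" separator.
padded = '/' + sample + '(0)'

# Split on the separators; the first and last pieces are empty strings.
parts = re.split(r'/\(\d+\)', padded)
items = [part.strip() for part in parts[1:-1]]
print(items)
```

Note that this keeps only the item text; the item numbers are consumed by the separator pattern.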

As long as / cannot appear in the text...
\(?\d?\d[^/]+

Related

Replacing a single term in a regex pattern

I am using regexp_filter in Sphinx to replace terms
In most cases I can do so e.g. misspellings are easy:
regexp_filter = Backround => Background
Even swapping using capturing group notation:
regexp_filter = (Left)(Right) => \2\1
However, I am having more trouble when using a pattern match to find a given word I want to replace:
regexp_filter = (PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?(SearchTerm)\b => NewSearchTerm
Where NewSearchTerm would be the term I want to replace just \2 with (leaving \1 and the rest of the pattern alone).
So if I had the text 'Pizza and Taco Parlor' then:
regexp_filter = (Pizza)\W+(?:\w+\W+){1,6}?(Parlor)\b => Store
Would convert to 'Pizza and Taco Store'
I know in this case the SearchTerm is \2, but I am not sure how to convert it. I know I could append e.g. \2s to make it plural, but how can I actually replace it, since it is just one of several capturing groups and I only want to replace that group?
So, if I understand the question, you have strings that match the following criteria:
Begin with PatternWord1 or PatternWord2
Followed by one or more non-word characters (\W+)
Followed by between 1 and 6 words, each followed by non-word characters ((?:\w+\W+){1,6}?)
Followed by "SearchTerm"
Let's use this as a baseline:
PatternWord1 Hello SearchTerm
And you only want to replace SearchTerm in the string.
So you need another capturing group around everything you want to keep:
regexp_filter = ((PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?)(SearchTerm)\b => \1World
Your capturing groups would match (the first one includes the trailing space):
PatternWord1 Hello 
PatternWord1
SearchTerm
Your result would be:
PatternWord1 Hello World
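The effect of the extra outer group can be checked quickly outside Sphinx. The sketch below uses Python's re.sub purely as a stand-in; it assumes the replacement behaves the same way in regexp_filter, which is worth verifying in your own Sphinx setup.

```python
import re

# Outer group \1 wraps everything that must be kept; "Parlor" is now group 3.
pattern = re.compile(r'((Pizza)\W+(?:\w+\W+){1,6}?)(Parlor)\b')

# Keep group 1 and substitute a new word for the matched search term.
result = pattern.sub(r'\1Store', 'Pizza and Taco Parlor')
print(result)
```

Only the text matched after group 1 is replaced, so the words between Pizza and Parlor survive unchanged.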

Split by regex with capturing groups in lookahead produces repeating fragments in results

I was hoping for a one-liner to insert thousands separators into a string of digits with a decimal separator (example: 78912345.12). My first attempt was to split the string in places where there are either 3 or 6 digits left until the decimal separator:
console.log("5789123.45".split(/(?=([0-9]{3}\.|[0-9]{6}\.))/));
which gave me the following result (notice how fragments of original string are repeated):
[ '5', '789123.', '789', '123.', '123.45' ]
I found out that "problem" (please read problem here as my obvious misunderstanding) comes from using a group within lookahead expression. This simple expression works "correctly":
console.log("abcXdeYfgh".split(/(?=X|Y)/));
when executed prints:
[ 'abc', 'Xde', 'Yfgh' ]
But the moment I surround X|Y with parentheses:
console.log("abcXdeYfgh".split(/(?=(X|Y))/));
the resulting array looks like:
[ 'abc', 'X', 'Xde', 'Y', 'Yfgh' ]
Moreover, when I change the group to a non-capturing one, everything comes back to "normal":
console.log("abcXdeYfgh".split(/(?=(?:X|Y))/));
this yields again:
[ 'abc', 'Xde', 'Yfgh' ]
So, I could do the same trick (changing to a non-capturing group) within the original expression (and it indeed works), but I was hoping for an explanation of this behavior, which I cannot understand. I get identical results when trying the same thing in .NET, so it seems like something fundamental about how regular expression lookaheads work. This is my question: why does a lookahead with capturing groups produce those "strange" results?
Capturing groups inside a regex pattern inside a regex split method/function make the captured texts appear as separate elements in the resulting array (for most of the major languages).
Here is C#/.NET reference:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
Here is JavaScript reference:
If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array. However, not all browsers support this capability.
Just a note: the same behavior is observed with
PHP (with preg_split and PREG_SPLIT_DELIM_CAPTURE flag):
print_r(preg_split("/(?<=(X))/","XYZ",-1,PREG_SPLIT_DELIM_CAPTURE));
// --> [0] => X, [1] => X, [2] => YZ
Ruby (with string.split):
"XYZ".split(/(?<=(X))/) # => X, X, YZ
But it is the opposite in Java, the captured text is not part of the resulting array:
System.out.println(Arrays.toString("XYZ".split("(?<=(X))"))); // => [X, YZ]
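And in Python, with the re module, re.split could not split on a zero-width assertion before Python 3.7, so the string did not get split at all. Since Python 3.7, empty matches do split the string and the captured text is included, as in .NET and JavaScript:
print(re.split(r"(?<=(X))","XXYZ")) # => ['XXYZ'] before 3.7; ['X', 'X', 'X', 'X', 'YZ'] since 3.7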
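The X|Y experiment from the question can be reproduced in Python as a small sketch, assuming Python 3.7 or later (where re.split does split on empty matches). It shows the captured text being spliced into the result only when the group inside the lookahead is a capturing one.

```python
import re

text = "abcXdeYfgh"

# Capturing group inside the lookahead: captured text is added to the result.
with_capture = re.split(r'(?=(X|Y))', text)

# Non-capturing group: only the pieces between split points are returned.
without_capture = re.split(r'(?=(?:X|Y))', text)

print(with_capture)
print(without_capture)
```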
Here is a simple way to do it in JavaScript:
number.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",")
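The same replace-based idea carries over to Python's re.sub. This is a sketch only; add_thousands is a name invented for the example.

```python
import re

def add_thousands(number_text):
    """Insert commas before each group of three digits (hypothetical helper)."""
    # \B avoids a comma at the very start of a run of digits; the lookahead
    # requires a whole number of 3-digit groups up to the next non-digit.
    return re.sub(r'\B(?=(\d{3})+(?!\d))', ',', number_text)

formatted = add_thousands("5789123.45")
print(formatted)
```

One caveat with this pattern: it treats the fractional part as just another run of digits, so an input with more than three decimal places would also get commas after the decimal point.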
Including capture groups in a split pattern can produce extra elements,
especially when they are mixed with lookaheads.
You are on the right track but didn't have a natural anchor.
If you use a string where all the characters are the same type
(in your case digits) together with lookaheads, it's not good enough
to do the split incrementally based on a length of common characters.
The engine just bumps along one character at a time, splitting at that
position and including the captured text as elements.
You could handle this by consuming the capture in the process,
like (?=(\d{3}))\1 but that not only splits at the wrong place but
also injects an empty element into the array.
The solution is to use the natural anchor, the dot, and split at
multiples of 3 digits before the dot anchor.
This forces the engine to seek to each point that is a multiple of
3 digits away from the anchor.
Then your problem is solved: no need for captures, and the split is perfect.
Regex: (?=(?:[0-9]{3})+\.)
Formatted:
(?=
(?: [0-9]{3} )+
\.
)
C#:
string[] ary = Regex.Split("51234555632454789123.45", @"(?=(?:[0-9]{3})+\.)");
int size = ary.Length;
for (int i = 0; i < size; i++)
    Console.WriteLine(" {0} = '{1}' ", i, ary[i]);
Output:
0 = '51'
1 = '234'
2 = '555'
3 = '632'
4 = '454'
5 = '789'
6 = '123.45'
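For comparison, the same anchored split can be sketched in Python, assuming Python 3.7 or later (earlier versions of re.split do not split on empty matches). Joining the pieces with commas then gives the formatted number the question was after.

```python
import re

digits = "51234555632454789123.45"

# Split wherever a whole number of 3-digit groups remains before the dot.
pieces = re.split(r'(?=(?:[0-9]{3})+\.)', digits)
print(pieces)

# Re-join with the thousands separator.
formatted = ",".join(pieces)
print(formatted)
```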

Optionally prevent a string at the end of a wildcard from being matched

I have the following string:
12345 This could be anythingREMOVE
I need to match 12345 and This could be anything. Unfortunately, the format I need to parse also has a string at the end of the line that isn't always present (REMOVE in this example). How can I match what I'm looking for without REMOVE? I've tried the following pattern:
^(\d{5}) (.*)(?:REMOVE|$)
Unfortunately, REMOVE is picked up by the wildcard:
(
[0] => Array
(
[0] => 12345 This could be anythingREMOVE
)
[1] => Array
(
[0] => 12345
)
[2] => Array
(
[0] => This could be anythingREMOVE
)
)
If the trailing string REMOVE is optional, then why can't you use this regex:
"/^(\d{5}) /"
However if you really want to avoid REMOVE in matching pattern then use this:
$s = '12345 This could be anythingREMOVE';
if (preg_match("/^(\d{5}) (.*?)(?:REMOVE|)$/", $s, $arr))
var_dump($arr);
Output:
array(3) {
[0]=>
string(34) "12345 This could be anythingREMOVE"
[1]=>
string(5) "12345"
[2]=>
string(22) "This could be anything"
}
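The key point is that the wildcard is lazy and the pattern is anchored at the end, so the optional REMOVE is matched outside the capturing group. A small Python sketch of the same idea follows; parse_line is a made-up helper name, and (?:REMOVE)? is used as an equivalent spelling of (?:REMOVE|).

```python
import re

# Lazy wildcard plus an optional, non-captured suffix anchored at the end.
LINE = re.compile(r'^(\d{5}) (.*?)(?:REMOVE)?$')

def parse_line(line):
    """Return (number, text) or None. Hypothetical helper for illustration."""
    match = LINE.match(line)
    return match.groups() if match else None

print(parse_line('12345 This could be anythingREMOVE'))
print(parse_line('12345 This could be anything'))
```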
You can try this regex:
^(\d{5})((?:.(?!REMOVE))+.)
How It Works
^(\d{5}) -- Matches the start of the string, followed by five digits [0-9]. The parentheses are used to capture the matched text.
((?:.(?!REMOVE))+ -- Matches, one or more times, any character that is not immediately followed by the sequence REMOVE. It stops at the n in anything; it can't match the g because it is followed by REMOVE.
.) -- Allows the g to match.

Need to match any string between two delimiters

I need a regex to match parts of a string. For example, in the following string
Fault,10.224.2.3:4450,XX_XXX0039_XX.XX/0,AA,BBBBBB
I want to match the entire string and extract Fault,10.224.2.3:4450 and AA,BBBBBB. However, I want to ignore ,XX_XXX0039_XX.XX/0,.
Note that the string to ignore includes the delimiters, the commas (,). The string to ignore may contain the following characters:
./_0-9A-Za-z
The position of the period (.) is not fixed. Other examples of the pattern I want to ignore are:
,XX_XXX0039_XX.XX/0,
,XX_XX0039_XXXXX/1,
,X_XX0039_X/4,
I am using the regex in Simple Event Correlator.
(\w+,\d+.\d+.\d+.\d+:\d+).*?,(\w+,\w+)
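A stricter variant can be sketched in Python for testing. It is only an illustration, not SEC configuration: the dots in the IP address are escaped so they match literal periods, and the ignored field is restricted to the character set given in the question (./_0-9A-Za-z). The name extract_fields is hypothetical.

```python
import re

# Group 1: name plus IP:port. Middle field: only the allowed characters.
# Group 2: the last two comma-separated fields.
FIELDS = re.compile(
    r'^(\w+,\d+\.\d+\.\d+\.\d+:\d+),[./_0-9A-Za-z]+,(\w+,\w+)$'
)

def extract_fields(line):
    """Return the two wanted parts, or None if the line does not match."""
    match = FIELDS.match(line)
    return match.groups() if match else None

result = extract_fields('Fault,10.224.2.3:4450,XX_XXX0039_XX.XX/0,AA,BBBBBB')
print(result)
```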
The best is to avoid your delimiter ,
Regex:
[^,]+
Result:
[0] => XX_XXX0039_XX.XX/0
[1] =>
[2] => XX_XX0039_XXXXX/1
[3] =>
[4] => X_XX0039_X/4

RegExp pattern to capture around two-characters delimiter

I have a string which is something like:
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I want to retrieve the value associated to a key (say, key1). The following pattern:
::key1==([^:]*)
...will work only if there is no ':' character in the value, so I want to make sure the pattern matching stops only at the substring ::, but I can't find how to do that, as most examples I see are about single-character matching.
How do I modify the regexp pattern to match all characters between "::key1==" and the next "::" ?
Thanks!
Can you do something like this: ::key1==(.*?):: Assuming the language supports the lazy quantifier (the ? after .*), this should work.
As mentioned in my comment to your question, if the entirety of your string is
prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::
I would suggest exploding/splitting the string at :: instead of using a regex, as it will usually be faster. You didn't specify a language, but here is a PHP example:
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
// explode using :: as delimiter
$string = explode('::',$string);
// for each element...
foreach ($string as $value) {
// check if it has == in it
if (strpos($value,'==')!==false) $matches[] = $value;
}
// output
echo "<pre>";print_r($matches);
output:
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
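The same split-first approach can be sketched in Python. Here the pairs are additionally split at == and collected in a dictionary, which is an extension of the PHP example rather than a translation of it; like the PHP version, it assumes values never contain ::.

```python
text = ("prefix::key0==value0::key1==value1::key2==value2::"
        "key3==value3::key4==value4::")

# Split at the two-character delimiter, keep only the key==value pieces,
# then split each piece once at "==" to build a lookup table.
pairs = dict(
    part.split('==', 1)
    for part in text.split('::')
    if '==' in part
)

print(pairs['key1'])
```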
However, if you insist on the regex approach, here is a negative look-ahead alternative:
::((?:(?!::).)+)
php example
// string
$string = "prefix::key0==value0::key1==value1::key2==value2::key3==value3::key4==value4::";
preg_match_all('~::((?:(?!::).)+)~',$string,$matches);
// output: group 1 holds the captured key==value pairs
echo "<pre>";print_r($matches[1]);
output
Array
(
[0] => key0==value0
[1] => key1==value1
[2] => key2==value2
[3] => key3==value3
[4] => key4==value4
)
I think you're looking for a positive look-ahead:
::key0==(.*?)(?=::\w+==)
With the following:
prefix::key0==val::ue0::key1==value1::key2==value2::key3==value3::key4==value4::
It correctly finds val::ue0. This also assumes the keys conform to \w ([0-9A-Za-z_])
Also, a positive look-ahead may be a bit of overkill, but it will work if the value contains ::, too.
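A Python sketch of the look-ahead approach follows. It adds one assumption that is not in the answer above: an extra ::$ alternative in the look-ahead, so that the last key (which has no following ::key==) can still be found. The function name get_value is made up for the example.

```python
import re

def get_value(text, key):
    """Return the value for key, or None if absent (hypothetical helper)."""
    # Stop lazily at the next "::name==" or at a final "::" ending the string.
    pattern = r'::' + re.escape(key) + r'==(.*?)(?=::\w+==|::$)'
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = ("prefix::key0==val::ue0::key1==value1::key2==value2::"
        "key3==value3::key4==value4::")

print(get_value(text, 'key0'))
print(get_value(text, 'key4'))
```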