Regex split string by two consecutive pipe || - regex

I want to split the string below on two consecutive pipes (||) using a regex.
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+), but it also splits on a single pipe |.
Try out my example - Regex
value2 contains a single pipe, which should not be treated as a separator.
I am using a Lua script like this:
for pair in string.gmatch(params, "([^||]+)") do
print(pair)
end

You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
print(string.sub(s, offset, idx - 1) )
offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()||, see Lua's documentation on Patterns (Lua does not have regex support):
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
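For readers more used to regex engines, the same "find each || and slice between the positions" idea can be sketched in Python. This is only an illustration, not part of the Lua answer: the function name split_on_double_pipe is made up, and match.start() / match.end() play the role of Lua's () position capture.

```python
import re

def split_on_double_pipe(s):
    """Split s on the literal two-character sequence '||' only."""
    parts = []
    offset = 0
    for m in re.finditer(r"\|\|", s):
        # Everything between the previous separator and this one.
        parts.append(s[offset:m.start()])
        offset = m.end()
    # The part after the right-most '||' (may be an empty string).
    parts.append(s[offset:])
    return parts

print(split_on_double_pipe("value1=data1||value2=da|ta2||value3=test&user01|"))
# ['value1=data1', 'value2=da|ta2', 'value3=test&user01|']
```

A single | never matches \|\|, so value2=da|ta2 stays in one piece.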

The easiest way is to replace the two-character sequence || with a single character (e.g. ;) that is not used in the data, and then use that character as the separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
print(pair)
end
If every printable character can occur in the data, a non-printable character can be used instead, given by its code: string.char(10) == "\10" == "\n"
Even code 1 works: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )

Related

matlab regular expression to check if variables are set in a script

To check if one or more variables (like var1, var2, ...) are set in the script, I read the script into a list of strings (line by line) and check if a line looks like
var1=...
[var1,x]=...
[var2,x]=...
[x,y,var1,z]=...
Currently I use the following pattern
pattern = '^(\[.*)?(var1|var2|var3)(.*\])?=';
ix = regexp(s,pattern,'once');
It works for my purpose but I know it's not a safe pattern, because something like [x,vvvar1,y]=... also matches the pattern.
My current solution is to make separate patterns for each type of expressions, but I wonder if there is a unique pattern that can meet my needs.
Here are some examples, if I want to match any of abc or def,
pattern = '^(\[.*)?(abc|def)(.*\])?=';
%% good examples
regexp('x=1',pattern,'once') % output []
regexp('aabc=1',pattern,'once') % output []
regexp('abc=1',pattern,'once') % output 1
regexp('[other,abc]=deal(1,2)',pattern,'once') % output 1
%% bad examples
regexp('[x,aabcc]=deal(1,2)',pattern,'once') % output 1
regexp('[x,abcc,y]=deal(1,2,3)',pattern,'once') % output 1
You want to make sure there is at least one specific variable in the string.
You can use
^\[?(\w+,)*(abc|def)(,\w+)*]?=
See the regex demo.
Details:
^ - start of string
\[? - an optional literal [ char
(\w+,)* - zero or more sequences of one or more word chars followed by a comma
(abc|def) - either abc or def
(,\w+)* - zero or more sequences of a comma followed by one or more word chars
]? - an optional ] char
= - a = char.
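As a quick sanity check, the pattern can be run against the question's six examples. The sketch below uses Python's re module rather than MATLAB, on the assumption that the constructs involved (\w, *, ?, alternation) behave the same in both; the helper name assigns_target is invented for the illustration.

```python
import re

# Pattern copied from the answer; abc/def are the question's example names.
PATTERN = re.compile(r"^\[?(\w+,)*(abc|def)(,\w+)*]?=")

def assigns_target(line):
    """Return True if the left-hand side contains abc or def as a whole name."""
    return PATTERN.search(line) is not None

examples = [
    "x=1",
    "aabc=1",
    "abc=1",
    "[other,abc]=deal(1,2)",
    "[x,aabcc]=deal(1,2)",
    "[x,abcc,y]=deal(1,2,3)",
]
for line in examples:
    print(line, assigns_target(line))
# Only "abc=1" and "[other,abc]=deal(1,2)" print True.
```

The two "bad examples" from the question no longer match, because each name must now be delimited by a comma, a bracket, or the start of the string.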

Matlab: How to replace dynamic part of string with regexprep

I have strings like
#(foo) 5 + foo.^2
#(bar) bar(1,:) + bar(4,:)
and want the expression in the first group of parentheses (which could be anything) to be replaced by x in the whole string
#(x) 5 + x.^2
#(x) x(1,:) + x(4,:)
I thought this would be possible with regexprep in one step somehow, but after reading the documentation and fiddling around for quite a while, I have not found a working solution yet.
I know, one could use two commands: First, grab the string to be matched with regexp and then use it with regexprep to replace all occurrences.
However, I have the gut feeling this should be somehow possible with the functionality of dynamic expressions and tokens or the like.
Without the support of an infinite-width lookbehind, you cannot do that in one step with a single call to regexprep.
Use the first idea: extract the first word and then replace it with x when found in between word boundaries:
s = '#(bar) bar(1,:) + bar(4,:)';
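% MATLAB cannot index a function call result directly, so store the tokens first.
tokens = regexp(s, '^#\((\w+)\)', 'tokens', 'once');
word = tokens{1};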
s = regexprep(s, strcat('\<',word,'\>'), 'x');
Output: #(x) x(1,:) + x(4,:)
The ^#\((\w+)\) regex matches the #( at the start of the string, then captures alphanumeric or _ chars into Group 1 and then matches a ). tokens option allows accessing the captured substring, and then the strcat('\<',word,'\>') part builds the whole word matching regex for the regexprep command.
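The same two-step idea (capture the name, then replace it as a whole word) can be sketched in Python for comparison. This is not MATLAB code; the function name rename_first_arg is hypothetical, and \b stands in for MATLAB's \< and \> word boundaries.

```python
import re

def rename_first_arg(expr, new_name="x"):
    """Replace the name inside the leading #( ... ) everywhere in expr."""
    m = re.match(r"#\((\w+)\)", expr)
    if m is None:
        return expr  # no leading "#(name)", nothing to rename
    word = m.group(1)
    # \b on both sides so that "foo" does not match inside "foobar".
    return re.sub(r"\b" + re.escape(word) + r"\b", new_name, expr)

print(rename_first_arg("#(foo) 5 + foo.^2"))           # #(x) 5 + x.^2
print(rename_first_arg("#(bar) bar(1,:) + bar(4,:)"))  # #(x) x(1,:) + x(4,:)
```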

Split by regex with capturing groups in lookahead produces repeating fragments in results

I was hoping for a one-liner to insert thousands separators into a string of digits with a decimal separator (example: 78912345.12). My first attempt was to split the string at the places where there are either 3 or 6 digits left before the decimal separator:
console.log("5789123.45".split(/(?=([0-9]{3}\.|[0-9]{6}\.))/));
which gave me the following result (notice how fragments of original string are repeated):
[ '5', '789123.', '789', '123.', '123.45' ]
I found out that "problem" (please read problem here as my obvious misunderstanding) comes from using a group within lookahead expression. This simple expression works "correctly":
console.log("abcXdeYfgh".split(/(?=X|Y)/));
when executed prints:
[ 'abc', 'Xde', 'Yfgh' ]
But the moment I surround X|Y with parentheses:
console.log("abcXdeYfgh".split(/(?=(X|Y))/));
the resulting array looks like:
[ 'abc', 'X', 'Xde', 'Y', 'Yfgh' ]
Moreover, when I change the group to a non-capturing one, everything comes back to "normal":
console.log("abcXdeYfgh".split(/(?=(?:X|Y))/));
this yields again:
[ 'abc', 'Xde', 'Yfgh' ]
So, I could do the same trick (changing to a non-capturing group) in the original expression (and it does work), but I was hoping for an explanation of this behavior, which I cannot understand. I get identical results when doing the same in .NET, so it seems to be something fundamental about how regular expression lookaheads work. This is my question: why does a lookahead with capturing groups produce those "strange" results?
Capturing groups inside a regex pattern inside a regex split method/function make the captured texts appear as separate elements in the resulting array (for most of the major languages).
Here is C#/.NET reference:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
Here is JavaScript reference:
If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array. However, not all browsers support this capability.
Just a note: the same behavior is observed with
PHP (with preg_split and PREG_SPLIT_DELIM_CAPTURE flag):
print_r(preg_split("/(?<=(X))/","XYZ",-1,PREG_SPLIT_DELIM_CAPTURE));
// --> [0] => X, [1] => X, [2] => YZ
Ruby (with string.split):
"XYZ".split(/(?<=(X))/) # => X, X, YZ
But it is the opposite in Java, the captured text is not part of the resulting array:
System.out.println(Arrays.toString("XYZ".split("(?<=(X))"))); // => [X, YZ]
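And in Python, the behavior of the re module depends on the version: before Python 3.7, re.split could not split on a zero-width match, so the string was returned unsplit; since Python 3.7 it splits on empty matches and includes the captured text, like JavaScript:
print(re.split(r"(?<=(X))","XXYZ")) # => ['XXYZ'] before 3.7, ['X', 'X', 'X', 'X', 'YZ'] since 3.7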
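The capturing versus non-capturing difference from the question can be reproduced with a short Python sketch. It assumes Python 3.7 or later, where re.split accepts zero-width matches; on older versions neither call splits the string.

```python
import re

text = "abcXdeYfgh"

# Capturing group inside the lookahead: each captured "X"/"Y" is
# spliced into the result between the split pieces.
with_capture = re.split(r"(?=(X|Y))", text)

# Non-capturing group: same split points, nothing extra in the result.
without_capture = re.split(r"(?=(?:X|Y))", text)

print(with_capture)     # ['abc', 'X', 'Xde', 'Y', 'Yfgh']
print(without_capture)  # ['abc', 'Xde', 'Yfgh']
```

The lookahead itself consumes nothing, so the pieces still contain X and Y; the extra elements come only from the capture group.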
Here is a simple way to do it in Javascript
number.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",")
Capture groups mixed with lookaheads can produce extra elements
in the result of a split.
You are on the right track but didn't have a natural anchor.
If you use a string where all the characters are the same type
(in your case digits) together with lookaheads, it's not good enough
to do the split incrementally based on a length of common characters.
The engine just bumps along one character at a time, splitting on that
character and including the captured ones as elements.
You could handle this by consuming the capture in the process,
like (?=(\d{3}))\1 but that not only splits at the wrong place but
injects an empty element in the array.
The solution is to use the Natural Anchor, the DOT, then split at
multiples of 3 up to the dot anchor.
This forces the engine to seek to the point at which there are multiples
away from the anchor.
Then your problem is solved, no need for captures and the split is perfect.
Regex: (?=(?:[0-9]{3})+\.)
Formatted:
(?=
(?: [0-9]{3} )+
\.
)
C#:
string[] ary = Regex.Split("51234555632454789123.45", @"(?=(?:[0-9]{3})+\.)");
int size = ary.Length;
for (int i = 0; i < size; i++)
Console.WriteLine(" {0} = '{1}' ", i, ary[i]);
Output:
0 = '51'
1 = '234'
2 = '555'
3 = '632'
4 = '454'
5 = '789'
6 = '123.45'
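For comparison, here is a rough Python version of the same anchored split, joined back with commas. It assumes Python 3.7 or later (needed for splitting on a zero-width lookahead) and an input that contains a decimal point; the function name group_thousands is made up for the example.

```python
import re

def group_thousands(number_text):
    """Insert commas every three digits, counting back from the dot."""
    # Split wherever a whole number of 3-digit groups remains before the dot.
    parts = re.split(r"(?=(?:[0-9]{3})+\.)", number_text)
    # When the digit count is a multiple of 3, the lookahead also matches
    # at position 0 and yields a leading empty string, so drop empties.
    return ",".join(p for p in parts if p)

print(group_thousands("51234555632454789123.45"))  # 51,234,555,632,454,789,123.45
print(group_thousands("5789123.45"))               # 5,789,123.45
print(group_thousands("123456.78"))                # 123,456.78
```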

Tcl regular expressions

set d(aa1) 1
set d(aa2) 1
set d(aa3) 1
set d(aa4) 1
set d(aa5) 1
set d(aa6) 1
set d(aa7) 1
set d(aa8) 1
set d(aa9) 1
set d(aa10) 1
set d(aa11) 1
set regexp "a*\[1-9\]"
set res [array names d -glob $regexp]
puts "res = $res"
In this case, the result is:
res = aa11 aa6 aa2 aa7 aa3 aa8 aa4 aa9 aa5 aa1
But when I change the regexp from a*\[1-9\] to a*\[1-10\], the result becomes:
res = aa11 aa10 aa1
You have an error in your character class.
[1-10] does not mean a digit from 1 to 10
It means 1-1, which is a character ranging from 1 to 1 (i.e., simply a 1), or a 0. This explains your output.
To express a number from 1 to 10, use this: (?:10?|[2-9]) (one of several ways to do it).
Therefore your regex becomes a*(?:10?|[2-9])
Note that if your engine does not allow non-capturing groups, you need to remove the ?:, giving: a*(10?|[2-9])
You need to be sure what you're trying to match because glob style matching and regexp style matching are different in many aspects.
From the docs, glob has the following:
* matches any sequence of characters in string, including a null string.
? matches any single character in string.
[chars] matches any character in the set given by chars. If a sequence of the form x-y appears in chars, then any character between x and y, inclusive, will match. When used with -nocase, the end points of the range are converted to lower case first. Whereas {[A-z]} matches _ when matching case-sensitively (since _ falls between the Z and a), with -nocase this is considered like {[A-Za-z]} (and probably what was meant in the first place).
\x matches the single character x. This provides a way of avoiding the special interpretation of the characters *?[]\ in pattern.
Since you are using glob style matching, your current expression (a*\[1-9\]) matches an a, followed by any characters and any one of 1 through 9 (meaning it would also match something like abcjdne1).
If you want to match at least one a followed by numbers from 1 through 10, you will need something like this, using the -regexp mode:
set regexp {a+(?:[1-9]|10)}
set res [array names d -regexp $regexp]
Now, I believe this regexp is the more natural one for a beginner ((?:[1-9]|10) means either 1 through 9, or 10), but you can also use the form that zx81 suggested, (?:10?|[2-9]), meaning 1 with an optional 0 (for 10), or 2 through 9.
+ means that a must appear at least once for the array name to match.
If you now need to match the full names, you will need to use anchors:
^a+(?:[1-9]|10)$
Note: You cannot use glob matching if you want to match at least one a followed by digits, and alternation (the pipe used |) and quantifiers (? or + or *) the way they behave in regexp are not supported by glob matching.
One last thing: use braces so that you don't have to escape your pattern (unless the pattern contains a variable or a command substitution and you can't do otherwise).
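The character-class point and the two "1 to 10" forms can be checked with a small sketch. It is written in Python rather than Tcl, on the assumption that bracket ranges and alternation behave the same way in both regex engines for these simple patterns.

```python
import re

# [1-10] is the range 1-1 plus a literal 0, so only "0" and "1" match.
class_hits = [c for c in "0123456789" if re.fullmatch(r"[1-10]", c)]
print(class_hits)  # ['0', '1']

candidates = [str(n) for n in range(0, 12)]  # "0" through "11"

# Both suggested forms accept exactly the strings "1" through "10".
form_a = [c for c in candidates if re.fullmatch(r"(?:10?|[2-9])", c)]
form_b = [c for c in candidates if re.fullmatch(r"(?:[1-9]|10)", c)]
print(form_a == form_b)  # True
print(form_a)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
```

Note that fullmatch anchors both ends, which corresponds to the ^...$ version of the pattern above.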

Extract number not in brackets from this string using regular expressions [70-(90)]

[15-]
[41-(32)]
[48-(45)]
[70-15]
[40-(64)]
[(128)-42]
[(128)-56]
I have these values, from which I want to extract the numbers that are not in parentheses. If there is more than one, then add them together.
What is the regular expression to do this?
So the solution would look like this:
[15-] -> 15
[41-(32)] -> 41
[48-(45)] -> 48
[70-15] -> 85
[40-(64)] -> 40
[(128)-42] -> 42
[(128)-56] -> 56
You would be overcomplicating things with a regex approach (in this case, at least); also, regular expressions do not support mathematical operations, as pointed out by @richardtallent.
You can use an approach as shown here to extract a substring which omits the initial and final square brackets, and then, use the Split (as shown here) and split the string in two using the dash sign. Lastly, use the Instr function (as shown here) to see if any of the substrings that the split yielded contains a bracket.
If any of the substrings contain a bracket, then, they are omitted from the addition, or they are added up if otherwise.
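The strip, split and check steps described above can be sketched as follows. The sketch is in Python rather than VBA, purely to show the logic; the function name sum_unparenthesized is hypothetical, and it assumes every input has the [a-b] shape shown in the question.

```python
def sum_unparenthesized(token):
    """Add up the numbers in '[a-b]' that are not wrapped in parentheses."""
    inner = token.strip()[1:-1]   # drop the surrounding [ and ]
    total = 0
    for part in inner.split("-"):
        # Skip empty halves such as in "[15-]" and parenthesized ones like "(32)".
        if part and "(" not in part:
            total += int(part)
    return total

for token in ["[15-]", "[41-(32)]", "[70-15]", "[(128)-42]"]:
    print(token, "->", sum_unparenthesized(token))
# [15-] -> 15, [41-(32)] -> 41, [70-15] -> 85, [(128)-42] -> 42
```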
Regular expressions do not support performing math on the terms. You can loop through the groups that are matched and perform the math outside of Regex.
Here's the pattern to extract any number within the square brackets that is not in parentheses:
\[
(?:(?:\d+|\([^\)]*\))-)*
(\d+)
(?:-[^\]]*)*
\]
Each number will be returned in $1.
This works by looking for a number that is prefixed by any number of "words" separated by dashes, where the "words" are either numbers themselves or parenthesized strings, and followed by, optionally, a dash and some other stuff before hitting the end brace.
If VBA's RegEx doesn't support uncaptured groups (?:), remove all of the ?:'s and your captured numbers will be in $3 instead.
A simpler pattern also works:
\[
(?:[^\]]*-)*
(\d+)
(?:-[^\]]*)*
\]
This simply looks for numbers delimited by dashes and allowing for the number to be at the beginning or end.
Private Sub regEx()
Dim RegexObj As New VBScript_RegExp_55.RegExp
RegexObj.Pattern = "\[(\(?[0-9]*?\)?)-(\(?[0-9]*?\)?)\]"
Dim str As String
str = "[15-]"
Dim Matches As Object
Set Matches = RegexObj.Execute(str)
Dim result As Integer
Dim value1 As Integer
Dim value2 As Integer
Dim part1 As String
Dim part2 As String
part1 = Matches.Item(0).SubMatches.Item(0)
part2 = Matches.Item(0).SubMatches.Item(1)
' InStr returns 0 when "(" is absent, so compare against 0 explicitly.
' "Not InStr(...)" is a bitwise Not on a number and is never False here.
If InStr(1, part1, "(") = 0 And part1 <> "" Then
value1 = CInt(part1)
End If
If InStr(1, part2, "(") = 0 And part2 <> "" Then
value2 = CInt(part2)
End If
result = value1 + value2
MsgBox result
End Sub
Replace [15-] with each of the other strings to test them.
Ok! It's been 6 years and 6 months since the question was posted. Still, for anyone looking for something like that maybe now or in the future...
Step 1:
Trim Leading and Trailing Spaces, if any
Step 2:
Find/Search:
\]|\[|\(.*\)
Replace With:
<Leave this field Empty>
Step 3:
Trim Leading and Trailing Spaces, if any
Step 4:
Find/Search:
^-|-$
Replace With:
<Leave this field Empty>
Step 5:
Find/Search:
-
Replace With:
\+
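The five find/replace steps above can be chained in code. The sketch below uses Python's re.sub with the same two patterns; the function name to_sum_expression is invented for the example, and the output is the "a+b" text the steps produce, not the evaluated sum.

```python
import re

def to_sum_expression(token):
    """Apply steps 1-5: strip brackets and parenthesized parts, tidy dashes."""
    text = token.strip()                      # step 1
    text = re.sub(r"\]|\[|\(.*\)", "", text)  # step 2
    text = text.strip()                       # step 3
    text = re.sub(r"^-|-$", "", text)         # step 4
    return text.replace("-", "+")             # step 5

for token in ["[15-]", "[41-(32)]", "[70-15]", "[(128)-42]"]:
    print(token, "->", to_sum_expression(token))
# [15-] -> 15, [41-(32)] -> 41, [70-15] -> 70+15, [(128)-42] -> 42
```

One caveat: \(.*\) is greedy, so a string with two parenthesized numbers would lose everything between the first ( and the last ). None of the sample values has that shape.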