Tcl split regex over multiple lines - regex

I have a long RE to match dates in multiple files and I would like to split it out over multiple lines so it is easier to read and update. I am setting it as a variable and then calling that variable in the regex statement.
set ::eval::regexdate { \d[\/\.-]\d{2}[\/\.-]\d{4}|\d{2}[\/\.-]\d{2}[\/\.-]\d{4}|\d{4}[\/\.-]\d{2}[\/\.-]\d{2}|(([12]\d|3[01])|([12]\d|3[01])(th|nd|rd|st))\s(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{4}|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[\/\.-]\d{2}[\/\.-]\d{4} }
I am then calling it with the following regexp line...
if {[regexp "($::eval::regexdate)" $linefromfile all date]} {
Do something...
}
This all works fine if the RE is set as one long string, but if I try to break it out over multiple lines using (?x) as outlined in this post.
regexp pattern across multiple lines
set ::eval::regexdate {(?x)
\d[\/\.-]\d{2}[\/\.-]\d{4}|
\d{2}[\/\.-]\d{2}[\/\.-]\d{4}|
\d{4}[\/\.-]\d{2}[\/\.-]\d{2}|
(([12]\d|3[01])|([12]\d|3[01])(th|nd|rd|st))\s(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{4}|
(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[\/\.-]\d{2}[\/\.-]\d{4}
}
I get the following error...
`couldn't compile regular expression pattern: quantifier operand invalid.`
I am not sure why this is happening, my understanding is that using (?x) ignores all white space and comments so it should just stitch the lines back together to create the one long RE, no? Are the "|" operands causing an issue in the way that I have split the RE up?
Any help would be greatly appreciated in figuring out why it won't work when using (?x).
Thanks

The problem is the way you use the regexdate variable in your regxep command. As the post you reference indicates, (?x) should be at the start of the regular expression. However, by using "($::eval::regexdate)" you put parentheses around it, effectively making the expression ((?x)…). Putting parentheses around the complete regular expression is not very useful, as the regexp command will already put the full match in the first variable handed to it.
So, either omit the parentheses and use the complete match as the date:
regexp $::eval::regexdate $linefromfile date
Or move the (?x) to the call:
set ::eval::regexdate {
\d[\/\.-]\d{2}[\/\.-]\d{4}|
\d{2}[\/\.-]\d{2}[\/\.-]\d{4}|
\d{4}[\/\.-]\d{2}[\/\.-]\d{2}|
(([12]\d|3[01])|([12]\d|3[01])(th|nd|rd|st))\s(January|February|March|April|May|June|July|August|September|October|November|December)\s\d{4}|
(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[\/\.-]\d{2}[\/\.-]\d{4}
}
if {[regexp "(?x)($::eval::regexdate)" $linefromfile all date]} {
Do something...
}

Related

Regex Expression - text between quotes and brackets

i have the following JSON string that i need to parse:
{'ConnectionDetails':'{\'server\':\'johnssasd02\',\'database\':\'enterprise analytics\'}'}]}
i am already using the expression '([^']*)' to get everything in quotes, which correctly gets me the ConnectionDetails title. However i now need an expression to get me everything between '{ and '} in order to get the full path value. so i need to capture the following from above string:
{\'server\':\'johnssasd02\',\'database\':\'enterprise analytics\'}
but having trouble coming up the regex expression
thanks
In order to extract the data between the curly braces {} you can use the regex: \{(.*?)\}
i accomplished it within an SSIS derived column task where i removed unwanted characters from the input string. that way i don't have to worry about dealing with them using regex.

regular expression replace removes first and last character when using $1

I have string like this:
&breakUp=Mumbai;city,Puma;brand&
where Mumbai;city and Puma;brand are filters(let say) separated by comma(,). I have to add more filters like Delhi;State.
I am using following regular expression to find the above string:
&breakUp=.([\w;,]*).&
and following regular expression to replace it:
&breakUp=$1,Delhi;State&
It is finding the string correctly but while replacing it is removing the first and last character and giving the following result:
&breakUp=umbai;city,Puma;bran,Delhi;State&
How to resolve this?
Also, If I have no filters I don't want that first comma. Like
&breakUp=&
should become
&breakUp=Delhi;State&
How to do it?
My guess is that your expression is just fine, there are two extra . in there, that we would remove those:
&breakUp=([\w;,]*)&
In this demo, the expression is explained, if you might be interested.
To bypass &breakUp=&, we can likely apply this expression:
&breakUp=([^&]+)&
Demo
Your problem seems to be the leading and trailing period, they are matched to any character.
Try using this regex:
&breakUp=([\w;,]*)&

How to do regular Expression in AutoIt Script

In Autoit script Iam unable to do Regular expression for the below string Here the numbers will get changed always.
Actual String = _WinWaitActivate("RX_IST2_AM [PID:942564 NPID:10991 SID:498702881] sbivvrwm060.dev.ib.tor.Test.com:30000","")
Here the PID, NPID & SID : will be changing and rest of the things are always constant.
What i have tried below is
_WinWaitActivate("RX_IST2_AM [PID:'([0-9]{1,6})' NPID:'([0-9]{1,5})' SID:'([0-9]{1,9})' sbivvrwm060.dev.ib.tor.Test.com:30000","")
Can someone please help me
As stated in the documentation, you should write the prefix REGEXPTITLE: and surround everything with square brackets, but "escape" all including ones as the dots (.) and spaces () with a backslash (\) and instead of [0-9] you might use \d like "[REGEXPTITLE:RX_IST2_AM\ \[PID:(\d{1,6})\ NPID:(\d{1,5})\ SID:(\d{1,9})\] sbivvrwm060\.dev\.ib\.tor\.Test\.com:30000]" as your parameter for the Win...(...)-Functions.
You can even omit the round brackets ((...)) but keep their content if you don't want to capture the content to process it further like with StringRegExp(...) or StringRegExpReplace(...) - using the _WinWaitActivete(...)-Function it won't make sense anyways as it is only matching and not replacing or returning anything from your regular expression.
According to regex101 both work, with the round brackets and without - you should always use a tool like this site to confirm that your expression is actually working for your input string.
Not familiar with autoit, but remember that regex has to completely match your string to capture results. For example, (goat)s will NOT capture the word goat if your string is goat or goater.
You have forgotten to add a ] in your regex, so your pattern doesn't match the string and capture groups will not be extracted. Also I'm not completely sold on the usage of '. Based on this page, you can do something like StringRegExp(yourstring, 'RX_IST2_AM [PID:([0-9]{1,6}) NPID:([0-9]{1,5}) SID:([0-9]{1,9})]', $STR_REGEXPARRAYGLOBALMATCH) and $1, $2 and $3 would be your results respectively. But maybe your approach works too.

how to avoid to match the last letter in this regexp?

I have a quesion about regexp in tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following exgexp:
output: TIP_(/[0-9].*/[0-9])
but why it does not matches 12.3.4 or 12%?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.