How do I extract all matches with a Tcl regex?

Hi everybody. I need a regular expression that extracts all the hex numbers of the form H'xxxx from a string. I used the regexp below, but I get only one number instead of all the hex values. How do I get every hex number from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I get is 5 H4D52.

On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
#  1st match   2nd match   3rd match   4th match   5th match   6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
    puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, something like [0-9A-F]{4} is more specific: it matches exactly 4 hexadecimal digits.
If you insist on matching the &, then the last hex string (H'4DC9 in your input) cannot be matched.
This explains why you get 4D52 in the original script: that's the last match followed by &.
Maybe get rid of the &, or use (&|$) instead, i.e. a & or the end of the string $.
References
regular-expressions.info/Finite Repetition, Anchors
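As a minimal sketch of the (&|$) suggestion above: the variant below keeps the separator in the pattern but also accepts the end of the string, and collects only the captured digits. The non-capturing form (?:&|$) and the variable name hexValues are my own choices for illustration, not part of the original answer.

```tcl
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"

# (?:&|$) is a non-capturing group, so each match still contributes
# exactly two list elements: the whole match and the captured digits.
set pattern {H'([0-9A-F]{4})(?:&|$)}

set hexValues {}
foreach {whole hex} [regexp -all -inline $pattern $text] {
    lappend hexValues $hex
}
puts $hexValues
# 22EF 2354 4BD4 4C4B 4D52 4DC9
```

Using a capturing (&|$) instead would add a third element per match, so the foreach would then need three variables.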

I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string

Related

Check if string end with substring, Tcl

I have a search string .state_s[0] and two lists of strings:
{cache.state_s[0]} {cache.state_s[1]}
and
{cache.state_s[0]a} {cache.state_s[1]}
I need Tcl command(s) that check whether the search string matches any of the items in a list. Importantly, the solution should return a positive result only for the first list. I tried:
set pattern {.state_s[0]}
set escaped_pattern [string map {* \\* ? \\? [ \\[ ] \\] \\ \\\\} $pattern]
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
regexp $escaped_pattern $m1
regexp $escaped_pattern $m2
However, the above commands are returning "1" with both regexp calls.
Basically, I need a way to check if a substring (having special chars like [) is at the end of a string.
You have the elements as a list in a variable m1.
set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
But when you pass m1 to regexp, Tcl treats it as one whole string, not as a list.
Since both lists contain the text .state_s[0] somewhere in that string, regexp returns 1 for both.
If you want to apply the regular expression to each element, I would recommend lsearch with the -regexp flag.
% set m1 {{cache.state_s[0]} {cache.state_s[1]}}
{cache.state_s[0]} {cache.state_s[1]}
% set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
{cache.state_s[0]a} {cache.state_s[1]}
%
% lsearch -regexp -inline $m1 {\.state_s\[0]$};
cache.state_s[0]
% lsearch -regexp -inline $m2 {\.state_s\[0]$}
%
The pattern I have used here is {\.state_s\[0]$}. The trailing $ anchors the match at the end of the string, which ensures that the element has no further characters after the match. We don't have to escape the closing square bracket ] in Tcl.
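If no regular expression is wanted at all, the same "ends with" check can be sketched with plain string commands, which sidesteps the escaping problem entirely. The proc names endsWith and anyEndsWith are hypothetical helpers introduced for this example only; they are not built-in Tcl commands.

```tcl
# Hypothetical helper: true if $str ends with the literal text $suffix.
proc endsWith {str suffix} {
    set start [expr {[string length $str] - [string length $suffix]}]
    expr {$start >= 0 && [string range $str $start end] eq $suffix}
}

# Hypothetical helper: true if any element of $items ends with $suffix.
proc anyEndsWith {items suffix} {
    foreach item $items {
        if {[endsWith $item $suffix]} { return 1 }
    }
    return 0
}

set m1 {{cache.state_s[0]} {cache.state_s[1]}}
set m2 {{cache.state_s[0]a} {cache.state_s[1]}}
puts [anyEndsWith $m1 {.state_s[0]}]
# 1
puts [anyEndsWith $m2 {.state_s[0]}]
# 0
```

Because the suffix is compared with eq, characters such as [ and . need no escaping here.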

Matching a regexp in TCL PERL

I have the following pattern:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
I want to separate each Pattern block. I am using Tcl, and the regexp I am using does not do the job:
set updateList [regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list]
Which regexp should I use to separate each Pattern block?
I need the output as
Pattern[1]:
Key : "key1"
Value : 100

Pattern[2]:
Key : "key2"
Value : 20

Pattern[3]:
Key : "key3"
Value : 30

Pattern[4]:
Key : "key4"
Value : 220
Your pattern Pattern\[\d+\].*?Value.*?\n mixes quantifier types: both greedy and lazy. Tcl does not handle mixed quantifier types the way PCRE (PHP, Perl), .NET, etc. do: the greediness is set by the first quantifier found, and the subsequent quantifiers inherit it. So, since the + after \d is greedy, all the others (in .*?) are treated as greedy too, even though you declared them lazy. Also, the . matches a newline in Tcl regex, so your pattern matches everything from the first Pattern[ through to the last newline as a single match.
So, based on your regex, you can make the \d+ lazy with \d+? and replace \n at the end with (?:\n|$) to match both the newline and the end of string:
set RE {Pattern\[\d+?\].*?Value.*?(?:\n|$)}
set updateList [regexp -all -inline $RE $str]
See the IDEONE demo
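The preference rule described above can be seen on a much smaller input. This is only an illustrative sketch with a made-up test string s: the two patterns differ solely in whether the first quantifier is greedy or lazy.

```tcl
set s "1axbx"

# First quantifier (\d+) is greedy, so the whole expression prefers
# the longest match and the later .*? behaves greedily.
puts [regexp -inline {\d+.*?x} $s]
# 1axbx

# First quantifier (\d+?) is lazy, so the whole expression prefers
# the shortest match.
puts [regexp -inline {\d+?.*?x} $s]
# 1ax
```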
Alternative 1
Also, you can use a more verbose regex if your input string always has the same structure with all elements - Pattern, Key, Value - present:
set updateList [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
See the IDEONE demo, and here is the regex demo.
Since a . can match a newline, we need to use a [^\n] negated character class matching any character but a line feed.
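As an alternative to [^\n] that is not part of the original answer, regexp also has a -line option that switches on newline-sensitive matching, so that . stops at line ends. A small sketch on a shortened, made-up input:

```tcl
set str "Key : \"key1\"\nValue : 100\nKey : \"key2\"\nValue : 20"

# With -line, the dot never matches a newline, so the greedy .*
# stops at the end of each Value line.
set values [regexp -all -inline -line {Value.*} $str]
puts [llength $values]
# 2
puts [lindex $values 0]
# Value : 100
```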
Alternative 2
You can use an unrolled lazy subpattern matching Pattern[n]: and then any characters that do not start another Pattern[n]: sequence:
set RE {Pattern\[\d+\]:[^P]*(?:P(?!attern\[\d+\]:)[^P]*)*}
set updateList [regexp -all -inline $RE $str]
See another IDEONE demo and a regex101 demo
Try this
Pattern\[\d+\](.|\n)*?Value.*?\n
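In many regex flavors the dot . matches any character except a line break, which is why the newline is added explicitly here. (In Tcl the dot matches a newline by default; it stops at newlines only in newline-sensitive mode, e.g. with the -line option.) Be aware that your lines may end with a carriage return, so you might need to add \r as well.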
% set list { Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list
{Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list ;# only changing `\d+` to `\d+?`
{Pattern[1]:
Key : "key1"
Value : 100
} {Pattern[2]:
Key : "key2"
Value : 20
} {Pattern[3]:
Key : "key3"
Value : 30
} {Pattern[4]:
Key : "key4"
Value : 220
}
If $list does not end with a newline, you won't get the "Pattern[4]" element returned. In that case, change
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list
to
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $list
You want to capture blocks of lines and output them with blank lines in between. Your example data displays patterns on different levels that can be used to recognize which lines belong to which block.
The simplest pattern is this: every three lines in the input make up a block. This pattern suggests processing like this:
set lines [split [string trim $list \n] \n]
foreach {a b c} $lines {puts $a\n$b\n$c\n\n}
There is nothing in your example data that suggests that this wouldn't work. Still, there may be some complications that aren't reflected in your example data.
If there are stray blank lines in the input, you might need to get rid of them first:
set lines [lmap line $lines {if {[string is space $line]} continue else {set line}}]
If some blocks contain fewer or more lines than in your example, another simple pattern is that every block starts with a line that has optional(?) whitespace and the word Pattern. Those lines (except the first) should be preceded by a block-delimiter in the output:
set lines [split [string trim $list \n] \n]
puts [lindex $lines 0]
foreach line [lrange $lines 1 end] {
    if {[regexp {\s*Pattern} $line]} {
        puts \n$line
    } else {
        puts $line
    }
}
puts \n
If the lines don't actually begin with whitespace, you could use string match Pattern* $line instead of the regular expression.
Documentation: continue, foreach, if, lindex, lmap, lmap replacement, lrange, puts, regexp, set, split, string
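If the blocks are needed as a list rather than printed directly, the line-based approach above could be wrapped in a proc along these lines. This is a sketch only: splitBlocks and the variable input are hypothetical names, and it assumes every block starts with a line whose first non-blank text is Pattern.

```tcl
# Hypothetical helper: split $text into blocks, starting a new block
# at every line that begins (after optional whitespace) with "Pattern".
proc splitBlocks {text} {
    set blocks {}
    set current {}
    foreach line [split [string trim $text \n] \n] {
        if {[string match Pattern* [string trimleft $line]] && [llength $current] > 0} {
            lappend blocks [join $current \n]
            set current {}
        }
        lappend current $line
    }
    if {[llength $current] > 0} {
        lappend blocks [join $current \n]
    }
    return $blocks
}

set input "Pattern\[1\]:\nKey : \"key1\"\nValue : 100\nPattern\[2\]:\nKey : \"key2\"\nValue : 20\n"
set blocks [splitBlocks $input]
puts [llength $blocks]
# 2
puts [join $blocks \n\n]
```

Joining the blocks with \n\n prints them with one blank line in between.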

Need explanation of tcl regexp inline example in the man page please

While trying to understand regexp and its -inline option, I saw this example but couldn't understand how it works.
The man page is at: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm#M13
There, under the -inline option, this example is given:
regexp -inline -- {\w(\w)} " inlined "
=> {in n}
regexp -all -inline -- {\w(\w)} " inlined "
=> {in n li i ne e}
How does this "{\w(\w)}" yield "{in n}"? Can someone explain please.
Appreciate the help.
Thanks
If -inline but not -all is given, regexp returns a list consisting of one value for the entire region matched and one value for each submatch (regions captured by parentheses). To see what the entire match is, ignore the parentheses: the pattern is now {\w\w}, matching the first two word characters in the string (in). The first submatch is what you get if you skip one word character (the \w outside the parentheses) and then capture the next word character (the \w inside the parentheses), getting n.
If both -inline and -all are given, regexp does this repeatedly, restarting at the first character beyond the last entire match.
I think that to understand -inline, you must first understand that -inline puts the matches (and submatches) in a list. Because if you had...
regexp -- {\w(\w)} " inlined " m1 m2
You will have...
% puts $m1
in
% puts $m2
n
As the whole match in is stored in m1 while the submatch of the capture group n is stored in m2.
Putting those in a list (i.e. when using -inline) will give {in n}.
When you now have -all and -inline at the same time (assuming that you already know that -all retrieves all non-overlapping matches in regexp), you can no longer use variable names after the input string, so you get a list containing all the matches and submatches. If I name them m and s (for match and submatch respectively), you have:
in n li i ne e
m  s m  s m  s
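The m/s pairing can be unpacked with a two-variable foreach. A minimal sketch, where the list names matches and submatches are chosen just for this example:

```tcl
set matches {}
set submatches {}

# Each iteration consumes one (whole match, submatch) pair
# from the flat list returned by -all -inline.
foreach {m s} [regexp -all -inline -- {\w(\w)} " inlined "] {
    lappend matches $m
    lappend submatches $s
}
puts $matches
# in li ne
puts $submatches
# n i e
```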

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns 0 when I would expect 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are being interpreted as regexp syntax and read as "[9]" instead of "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
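If the match still has to be a regular expression (for example to anchor it or to require the _affin part), the escaping step can be wrapped in a small helper. A sketch only: regexQuote is a hypothetical name, not a built-in Tcl command, and the second test string is made up to show what the escaping prevents.

```tcl
# Hypothetical helper: backslash-escape every non-word character so
# the string is matched literally inside a regular expression.
proc regexQuote {str} {
    return [regsub -all {\W} $str {\\&}]
}

set pat {pair_shap_val[9]}
set re "^[regexQuote $pat]_affin"

puts [regexp $re {pair_shap_val[9]_affin_input}]
# 1
puts [regexp $re {pair_shap_val9_affin_input}]
# 0
```

Without the escaping, [9] would be read as a bracket expression and the second string would match instead of the first.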

How to match nth occurrence in a string using regular expression

How to match nth occurrence in a string using regular expression
set test {stackoverflowa is a best solution finding site
stackoverflowb is a best solution finding site stackoverflowc is a
best solution finding sitestackoverflowd is a best solution finding
sitestackoverflowe is a best solution finding site}
regexp -all {stackoverflow} $test
The above gives "5" as output.
regexp {stackoverflow} $test
The above matches only the first occurrence of stackoverflow (i.e. the one in stackoverflowa).
My requirement is to match the 5th occurrence of stackoverflow (i.e. the one in stackoverflowe) in the string given above.
Could someone please help? Thanks.
Try
set results [regexp -inline -all {stackoverflow.} $test]
# => stackoverflowa stackoverflowb stackoverflowc stackoverflowd stackoverflowe
puts [lindex $results 4]
I'll be back to explain this further shortly, making pancakes right now.
So.
The command returns a list (-inline) of all (-all) substrings of the string contained in test that match the string "stackoverflow" (less quotes) plus one character, which can be any character. This list is stored in the variable results, and by indexing with 4 (because indexing is zero-based), the fifth element of this list can be retrieved (and, in this case, printed).
The dot at the end of the expression wasn't in your expression: I added it to check that I really did get the right match. You can of course omit the dot to match "stackoverflow" exactly.
ETA (from Donal's comment): in many cases it's convenient to extract not the string itself, but its position and extent within the searched string. The -indices option gives you that (I'm not using the dot in the expression now: the index list makes it obvious which one of the "stackoverflow"s I'm getting anyway):
set indices [regexp -inline -all -indices {stackoverflow} $test]
# => {0 12} {47 59} {94 106} {140 152} {186 198}
You can then use string range to get the string match:
puts [string range $test {*}[lindex $indices 4]]
The lindex $indices 4 gives me the list 186 198; the {*} prefix makes the two elements in that list appear as two separate arguments in the invocation of string range.
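Putting the -indices approach into a reusable form might look like the sketch below. The proc name nthMatch and the string sample are hypothetical, and the helper assumes the pattern has no capture groups (with groups, -all -inline would interleave submatch indices into the list).

```tcl
# Hypothetical helper: return the n-th (1-based) match of $re in $str,
# or an empty string if there are fewer than n matches.
# Assumes $re contains no capturing parentheses.
proc nthMatch {re str n} {
    set spans [regexp -all -inline -indices -- $re $str]
    if {$n < 1 || $n > [llength $spans]} {
        return ""
    }
    return [string range $str {*}[lindex $spans [expr {$n - 1}]]]
}

set sample "stackoverflowa x stackoverflowb y stackoverflowc"
puts [nthMatch {stackoverflow.} $sample 3]
# stackoverflowc
```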