Tcl regexp with paren-delimited groups - regex

I have a group of strings that look like:
foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>
So it boils down to a prefix (foo, bar, pizza, ...) followed by any number of attribute names enclosed in angle brackets.
Both the prefix and the attributes may consist of any characters except angle brackets (which are only used for delimiting the attribute names).
Neither the prefix nor the attribute names may be empty.
Now I would like to have a regex in my Tcl application that gives me both the prefix and all the attributes (it's OK if they keep their delimiting brackets, though in the end I have to split them up into a list).
The trivial approach ^(.+)(<.+>)*$ doesn't work because the leading .+ is too greedy and eats away all the matches for the attribute names.
So I tried excluding the forbidden angle brackets ^(\[^<>\]+)(<.+>)*$ which works OK at first glance, but then I discovered that this would match fnork<<>><x<>>, violating the rule that attribute names must not contain any angle brackets (apart from the delimiting ones).
Third, I extended the forbidden characters to the attribute names ^(\[^<>\]+)(<\[^<>\]+>)*$, but now things are getting a bit shady: while the regex only matches valid strings (so neither prefix nor attribute names may contain any brackets), I no longer get all the attribute names as match parts:
% regexp -all -inline "^(\[^<>\]+)(<\[^<>\]+>)*" "A<xyz><123>"
A<xyz><123> A <123>
For whatever reason the <xyz> is not returned!
Any idea how to fix this?
side-note
The actual string I'm trying to parse uses square brackets and parentheses as delimiters, something like pizza[large](tomato)(olives)(cheese), where the [term] can appear 0 or 1 times, whereas the (term)s can appear 0 or more times.
But due to the nature of square brackets and parentheses this requires a fair amount of quoting, which is probably too much of a distraction to be useful here.

In this case, the trick is to use a fairly simple RE and post-process the results:
% regexp -all -inline {^([^<>]+)((?:<[^<>]+>)*)$} foo<xyz><123>
foo<xyz><123> foo <xyz><123>
% regexp -all -inline {[^<>]+} <xyz><123>
xyz 123
You were almost there, but were struggling with using (<[^<>]+>)*, which won't work as that only captures the group one of the times it matches. (I wasn't aware that it captured the last match, but since I rarely want either first or last but rather all, I use a different approach.)
Putting that all together and assuming you've got one big multi-line string that has all the pieces you want to look at in it (e.g., because you've read it from a file) you get:
set str "foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>"
# Find the matching lines and do the first-level extract on them
foreach {- prefix attribs} [regexp -all -line -inline {^([^<>]+)((?:<[^<>]+>)*)$} $str] {
    # Split the attribute names
    set attributes [regexp -all -inline {[^<>]+} $attribs]
    # Show that we've matched them for real
    puts "prefix='$prefix', attributes=[join $attributes ,]"
}
Which produces this output:
prefix='foo', attributes=xyz,123
prefix='bar', attributes=
prefix='pizza', attributes=oregano,tomato,mozzarella
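The same two-step idea should carry over to the square-bracket/parenthesis notation from the side note. This is only a sketch: parseTerm is a made-up helper name, and it assumes that the optional [term] comes before any (term)s and that none of the pieces may contain brackets or parentheses.

```tcl
# Hypothetical helper (not from the original answer): split
# "prefix[optional](attr)(attr)..." into {prefix optional attributeList}.
proc parseTerm {s} {
    # Step 1: validate the whole string and grab the three coarse pieces.
    # [^][()] is a bracket expression excluding ], [, ( and ).
    set re {^([^][()]+)(?:\[([^][()]+)\])?((?:\([^][()]+\))*)$}
    if {![regexp $re $s -> prefix optional attribs]} {
        return -code error "malformed term: $s"
    }
    # Step 2: post-process the run of (attr) groups into a proper list.
    set attributes [regexp -all -inline {[^()]+} $attribs]
    return [list $prefix $optional $attributes]
}

puts [parseTerm {pizza[large](tomato)(olives)(cheese)}]
puts [parseTerm {bar}]
```

Putting the patterns in braces avoids the backslash quoting that double-quoted patterns would need.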

Let's tokenize this.
package require string::token
set lex {[[] LB []] RB [(] LP [)] RP [^][()]+ t}
set str {pizza[large](tomato)(olives)(cheese)}
% set tokens [::string::token text $lex $str]
{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}
Having tokenized, we can parse, or evaluate the tokens as statements in a little language:
% set terms [lassign $tokens prefix]
proc t {str beg end} {
    string range $str $beg $end
}
proc LB {str beg end} {
    return "Optional term is: "
}
proc RB args {
    return \n
}
proc LP {str beg end} {
    rename LP {}
    proc LP args {
        return ", "
    }
    return "Arguments are: "
}
proc RP args {}
% puts "Prefix is: [eval [linsert $prefix 1 $str]]"
Prefix is: pizza
% join [lmap term $terms {eval [linsert $term 1 $str]}] {}
Optional term is: large
Arguments are: tomato, olives, cheese
Documentation: eval, join, lassign, linsert, lmap (for Tcl 8.5), lmap, package, proc, puts, rename, return, set, string::token (package)
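If you would rather collect the pieces into variables than print them, the token list can also be walked with a small state machine. A rough sketch, assuming the token list has exactly the shape shown above (it is pasted in literally here so the snippet runs without the string::token package); the variable names are only illustrative.

```tcl
set str {pizza[large](tomato)(olives)(cheese)}
# Token list as produced by ::string::token text $lex $str (copied from above)
set tokens {{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}}

set prefix ""
set optional ""
set attributes {}
set state prefix   ;# which slot the next text token belongs to

foreach tok $tokens {
    lassign $tok type beg end
    switch -- $type {
        LB { set state optional }
        LP { set state attribute }
        RB - RP { set state none }
        t {
            set text [string range $str $beg $end]
            switch -- $state {
                prefix    { set prefix $text }
                optional  { set optional $text }
                attribute { lappend attributes $text }
            }
        }
    }
}
puts "prefix='$prefix', optional='$optional', attributes=[join $attributes ,]"
```

Unlike the eval-based version, this does not define any procs named after token types, which may be preferable inside a larger program.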

I might have misread the requirements, but given that you have already "encoded" all structural details in your ad hoc notation, why not have the Tcl list machinery do the work?
set str {foo(xyz)(123)
bar
pizza[large](oregano)(tomato)(mozzarella)}
foreach line [split $str \n] {
    set line [string map {"[" " " "]" " " ")(" " " "(" " {" ")" "} "} $line]
    set suffix [lassign $line prefix]
    lassign $suffix a b
    if {[llength $suffix] == 2} {
        set optional $a
        set attributes $b
    } else {
        set optional ""
        set attributes $a
    }
    puts "prefix='$prefix', optional='$optional', attributes='[join $attributes ,]'"
}
I apologise: strictly speaking, my answer does not address the regex question. And there is less wizardry than in the other replies ;)

Related

Matching a regexp in TCL PERL

I have the following pattern:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
I want to separate out each Pattern block. I am using Tcl. The regexp that I am using does not do the job:
set updateList [regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list]
Which regexp should I use to separate out each Pattern block?
I need the output as
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
Your pattern Pattern\[\d+\].*?Value.*?\n contains mixed quantifiers: both greedy and lazy. Tcl does not handle mixed quantifier types the way you would expect from, say, PCRE (PHP, Perl) or .NET: the expression takes its greediness from the first quantifier found, and the subsequent quantifiers inherit it. So, since the + after \d is greedy, all the others (in .*?) are treated as greedy too, even though you declared them lazy. Also, the . matches a newline in a Tcl regex by default, so your pattern matches from the first Pattern[ through to the last Value line in one go.
So, based on your regex, you can make the \d+ lazy with \d+? and replace \n at the end with (?:\n|$) to match both the newline and the end of string:
set RE {Pattern\[\d+?\].*?Value.*?(?:\n|$)}
set updateList [regexp -all -inline $RE $str]
See the IDEONE demo
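The inheritance rule is easier to see on a tiny string. A minimal illustration (the sample string and variable names are made up for this sketch):

```tcl
set s "aaa<b>ccc<d>"

# The first quantifier (\w+) is greedy, so the whole RE prefers the longest
# match and the later .*? ends up behaving greedily as well.
set greedy [lindex [regexp -inline {\w+<.*?>} $s] 0]
puts $greedy   ;# aaa<b>ccc<d>

# The first quantifier (\w+?) is lazy, so the whole RE prefers the shortest match.
set lazy [lindex [regexp -inline {\w+?<.*?>} $s] 0]
puts $lazy     ;# aaa<b>
```

The only difference between the two patterns is the ? after \w+, which is the same change as going from \d+ to \d+? above.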
Alternative 1
Also, you can use a more verbose regex if your input string always has the same structure with all elements - Pattern, Key, Value - present:
set updateList [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
See the IDEONE demo, and here is the regex demo.
Since a . can match a newline, we need to use a [^\n] negated character class matching any character but a line feed.
Alternative 2
You can use an unrolled lazy subpattern matching Pattern[n]: and then any character that is not a starting point for a Pattern[n]: sequence:
set RE {Pattern\[\d+\]:[^P]*(?:P(?!attern\[\d+\]:)[^P]*)*}
set updateList [regexp -all -inline $RE $str]
See another IDEONE demo and a regex101 demo
Try this
Pattern\[\d+\](.|\n)*?Value.*?\n
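In many regex flavors the dot . matches any character except a line break, so you need to add it in explicitly (note that in Tcl's default mode . already matches a newline). Be aware that your lines may end with a carriage return character, so you might need to add \r in as well.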
% set list { Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list
{Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list ;# only changing `\d+` to `\d+?`
{Pattern[1]:
Key : "key1"
Value : 100
} {Pattern[2]:
Key : "key2"
Value : 20
} {Pattern[3]:
Key : "key3"
Value : 30
} {Pattern[4]:
Key : "key4"
Value : 220
}
If $list does not end with a newline, you won't get the "Pattern[4]" element returned. In that case, change
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list
to
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $list
You want to capture blocks of lines and output them with blank lines in between. Your example data displays patterns on different levels that can be used to recognize which lines belong to which block.
The simplest pattern is this: every three lines in the input make up a block. This pattern suggests processing like this:
set lines [split [string trim $list \n] \n]
foreach {a b c} $lines {puts $a\n$b\n$c\n\n}
There is nothing in your example data that suggests that this wouldn't work. Still, there may be some complications that aren't reflected in your example data.
If there are stray blank lines in the input, you might need to get rid of them first:
set lines [lmap line $lines {if {[string is space $line]} continue else {set line}}]
If some blocks contain less or more lines than in your example, another simple pattern is that every block starts with a line that has optional(?) whitespace and the word Pattern. Those lines (except the first) should be preceded by a block-delimiter in the output:
set lines [split [string trim $list \n] \n]
puts [lindex $lines 0]
foreach line [lrange $lines 1 end] {
    if {[regexp {^\s*Pattern} $line]} {
        puts \n$line
    } else {
        puts $line
    }
}
puts \n
If the lines don't actually begin with whitespace, you could use string match Pattern* $line instead of the regular expression.
Documentation: continue, foreach, if, lindex, lmap, lmap replacement, lrange, puts, regexp, set, split, string
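If you need the blocks as list elements rather than printed output, the same "a block starts at a Pattern line" rule can be used to group the lines. A sketch under the assumption that every block begins with a line whose first non-blank text is Pattern; the sample data is shortened to two blocks.

```tcl
set list {Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20}

set blocks {}
foreach line [split [string trim $list \n] \n] {
    if {[string match Pattern* [string trimleft $line]]} {
        # A new block starts here
        lappend blocks $line
    } elseif {[llength $blocks]} {
        # Continuation line: glue it onto the most recent block
        lset blocks end [lindex $blocks end]\n$line
    }
}
puts [join $blocks \n\n]
```

Lines that appear before the first Pattern line are silently dropped in this sketch.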

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns 0 when I would expect 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are being expanded and read as "[9]" (a bracket expression) instead of "\[9\]" (literal brackets).
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
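If this comes up often, the escaping step can be wrapped in a small helper. A sketch only: reQuote is a made-up name, not a built-in command.

```tcl
# Hypothetical helper: backslash-escape every non-word character so the
# string can be embedded in a regular expression as literal text.
proc reQuote {s} {
    # Without a variable name argument, regsub returns the substituted string.
    regsub -all {\W} $s {\\&}
}

set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [reQuote $pat]                            ;# pair_shap_val\[9\]
puts [regexp "[reQuote $pat]_affin.*" $aff]    ;# 1
```

Since \W does not match the underscore, identifiers like pair_shap_val pass through unchanged.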

need help in tcl command usage for regsub

I am new to Tcl and have an issue when using regsub. Consider the following scenario:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
regsub -all ${test2}_[1-9] $test1 [list] test1
I expected $test1 to end up as [prefix_abc_3 AAA_0].
However, regsub has also removed the partially matching string inside prefix_abc_3. Does anyone have an idea how to make regsub replace only exact words in a list?
I tried to find a solution on the net but could not find any hints. I would appreciate it if someone could help me.
\m and \M in regexps match the beginning and end of a word respectively. But you don't have a string of words in test1, but a list of elements, and sometimes there's a difference so don't mix the two. regsub only handles strings while lsearch works with lists:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
set test1 [lsearch -all -inline -not -regexp $test1 "^${test2}_\[1-9\]\$"]
If the pattern is that simple, you can use the -glob option (the default) instead of -regexp and maybe save some processor time.
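For reference, the glob variant might look like the following. Glob patterns have to match the whole element, so no anchors are needed; the variable name kept is just for this sketch.

```tcl
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
# The brackets are escaped only to stop Tcl from running a command named 1-9;
# lsearch itself receives the glob pattern abc_[1-9].
set kept [lsearch -all -inline -not -glob $test1 "${test2}_\[1-9\]"]
puts $kept   ;# prefix_abc_3 AAA_0
```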
What exactly did you execute?
When I type the commands above into tclsh, it displays an error -
% set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
prefix_abc_3 abc_1 abc_2 AAA_0
% set test2 abc
abc
% regsub -all ${test2}_[1-9] $test1 [list] test1
invalid command name "1-9"
I'm unsure what you are trying to do. You start by initialising test1 as a list. You then treat it as a string by passing it to regsub. This is a completely legal thing to do, but may indicate that you are confused by something. Are you trying to test your substitution by applying it four times, to each of prefix_abc_3, abc_1, abc_2 and AAA_0? You can certainly do that the way you are, but a more natural way would be to do
foreach test $test1 {
    regsub $pattern $test [list] testResult
    puts stdout $testResult
}
Then again, what are you trying to achieve with your substitution? It looks as though you are trying to replace the string abc with a null string, i.e. remove it altogether. Passing [list] as a null string is perfectly valid, but again may indicate confusion between lists and strings.
To achieve the result you want, all you need to do is add a leading space to your pattern, pass a space as the substitution string and escape the square brackets, i.e.
regsub -all " ${test2}_\[1-9\]" $test1 " " test1
but I suspect that this is a made-up example and you're really trying to do something slightly different.
Edit
To obtain a list that contains just those list entries that don't exactly match your pattern, I suggest
proc removeExactMatches {input} {
    set result [list]; # Initialise the result list
    foreach inputElement $input {
        if {![regexp {^abc_[0-9]$} $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}
set test1 [removeExactMatches [list prefix_abc_3 abc_1 abc_2 AAA_0]]
Notes:
i) I don't use regsub at all.
ii) Although it's safe and legal to switch around between lists and strings, it all takes time and it obscures what I'm trying to do, so I avoid it wherever possible. You seem to have a list of strings and you want to remove some of them, so that's what I use in my suggested solution. The regular expression commands in Tcl handle strings so I pass them strings.
iii) To ensure that the list elements match exactly, I anchor the pattern to the start and end of the string that I'm matching against using ^ and $.
iv) To prevent the interpreter from recognising the [1-9] in the regular expression pattern and trying to execute a (non-existent) command 1-9, I enclose the whole pattern string within curly brackets.
v) For greater generality, I might want to pass the pattern to the proc as well as the input list (of strings), in that case, I'd do
proc removeExactMatches {inputPattern input} {
    .
    .
    .
    set pattern "^"
    append pattern $inputPattern
    append pattern "\$"
    .
    .
    .
    if {![regexp $pattern $inputElement]} {
    .
    .
    .
}
set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
to minimise the number of characters that had to be escaped. (Actually I probably wouldn't use the quotation marks for the start and end anchors within the proc - they aren't really needed and I'm a lazy typist!)
Looking at your original question, it seems that you might want to vary only the abc part of the pattern, in which case you might want to just pass that to your proc and append the _[0-9] as well as the anchors within it - don't forget to escape the square brackets or use curly brackets if you go down this route.
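That last variant could be sketched roughly as follows. The proc name removeNumbered is made up for this example, and it assumes the prefix itself contains no regular expression metacharacters.

```tcl
# Hypothetical variant: only the prefix varies; the _[0-9] suffix and the
# ^...$ anchors are added inside the proc.
proc removeNumbered {prefix input} {
    # Backslashes keep Tcl from treating [0-9] as a command and $ as a variable
    set pattern "^${prefix}_\[0-9\]\$"
    set result [list]
    foreach element $input {
        if {![regexp -- $pattern $element]} {
            lappend result $element
        }
    }
    return $result
}

set test1 [removeNumbered abc {prefix_abc_3 abc_1 abc_2 AAA_0}]
puts $test1   ;# prefix_abc_3 AAA_0
```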

Handle commas in quoted strings in Tcl

I'm using the following line in Tcl to parse a comma-separated line of fields. Some of the fields may be quoted so they can contain commas:
set line {12,"34","56"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "56"
(It's a bit strange that the last field is quoted instead of braced but that's not the problem here)
However, when there is a comma in the quote, it does not work:
set line {12,"34","56,78"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "{56} 78"
I would expect:
{12} {34} {56,78}
Is there something wrong with my regexp or is there something Tcl-ish going on?
One option that comes to mind is using the CSV functionality in Tcllib. (No reason to reinvent the wheel unless you have to...)
http://tcllib.sourceforge.net/doc/csv.html
Docs Excerpt
::csv::split ?-alternate? line {sepChar ,} {delChar "}
Converts a line in CSV format into a list of the values contained in the line. The character used to separate the values from each other can be defined by the caller, via sepChar, but this is optional. The default is ",". The quoting character can be defined by the caller, but this is optional. The default is '"'. If the option -alternate is specified a slightly different syntax is used to parse the input. This syntax is explained below, in the section FORMAT.
The problem seems to be an extra comma: you only accept quoted strings if they have a comma after them, and do the same for non-quoted tokens. This works:
set fresult [regsub -all {(\")([^\"]+)(\")|([^,\"]+)} $line {{\2\4} } fields]
^(no commas)^
Working Example: http://ideone.com/O2hss
You can safely keep the commas out of the pattern: the regex engine will keep searching for new matches, skipping a comma it cannot match and starting again at the next character.
Bonus: this will also handle escaped quotes written as \" (if you need doubled quotes instead, you should be able to adapt it easily by using "" instead of \\.):
set fresult [regsub -all {"((?:[^"\\]|\\.)+)"|([^,"]+)} $line {{\1\2} } fields]
Example: http://ideone.com/ztkBh
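Since what you want in the end is a Tcl list, another way to apply the same pattern is to collect the captures with regexp -all -inline instead of building a string with regsub. A sketch, not a full CSV parser: it skips empty unquoted fields and does not handle escaped quotes.

```tcl
set line {12,"34","56,78"}
set fields {}
# Each match contributes three elements: the whole match, the quoted
# capture and the bare capture; the unused capture is an empty string.
foreach {whole quoted bare} [regexp -all -inline {"([^"]*)"|([^,"]+)} $line] {
    lappend fields $quoted$bare
}
puts $fields   ;# 12 34 56,78
```

Because lappend builds a real list, Tcl takes care of any quoting or bracing of the elements itself.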
Use the following regsub
% set line {12,"34","56,78"}
% regsub -all {(,")|(",)|"} $line " " line
% set line
12 34 56,78 <<< Result
Here all the occurrences of ," or ", or " (in order) are replaced by space
As you said to @Kobi, if you allow for empty fields, you should also allow for empty strings "":
{((\")([^\"]*)(\")|([^,\"]*))(,|$)} where the fields of interest shifted to 3 and 5
Expanded: { ( (\")([^\"]*)(\") | ([^,\"]*) ) (,|$) } I admit, I don't know if tcl allows (?:) non-capture grouping.

How do I extract all matches with a Tcl regex?

Hi everybody, I want to extract all the hex numbers of the form H'xxxx from a string. I used the regexp below, but I only get one number instead of all the hex values. How do I get all the hex numbers from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I am getting is 5 H4D52
On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
    ...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
#  1st match   2nd match   3rd match   4th match   5th match   6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
    puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, e.g. [0-9A-F]{4} is more specific for matching exactly 4 hexadecimal digits
If you insist on matching the &, then the last hex string (H'4DC9 in your input) can not be matched
This explains why you get 4D52 in the original script, because that's the last match with &
Maybe get rid of the &, or use (&|$) instead, i.e. a & or the end of the string $.
References
regular-expressions.info/Finite Repetition, Anchors
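If you do want to keep the separator in the pattern, the (&|$) variant adds a second group, so each match becomes a 3-tuple. A small sketch (the variable names are only illustrative):

```tcl
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set hexValues {}
# Three elements per match: whole match, hex digits, separator (& or empty at end)
foreach {whole hex sep} [regexp -all -inline {H'([0-9A-F]{4})(&|$)} $text] {
    lappend hexValues $hex
}
puts $hexValues   ;# 22EF 2354 4BD4 4C4B 4D52 4DC9
```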
I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string