Handle commas in quoted strings in Tcl - regex

I'm using the following line in Tcl to parse a comma-separated line of fields. Some of the fields may be quoted so that they can contain commas:
set line {12,"34","56"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "56"
(It's a bit strange that the last field is quoted instead of braced but that's not the problem here)
However, when there is a comma in the quote, it does not work:
set line {12,"34","56,78"}
set fresult [regsub -all {(\")([^\"]+)(\",)|([^,\"]+),} $line {{\2\4} } fields]
puts $fields
{12} {34} "{56} 78"
I would expect:
{12} {34} {56,78}
Is there something wrong with my regexp, or is there something Tcl-ish going on?

One option that comes to mind is using the CSV functionality in Tcllib. (No reason to reinvent the wheel unless you have to...)
http://tcllib.sourceforge.net/doc/csv.html
Docs excerpt:
::csv::split ?-alternate? line {sepChar ,} {delChar "}
Converts a line in CSV format into a list of the values contained in the line. The character used to separate the values from each other can be defined by the caller, via sepChar, but this is optional. The default is ",". The quoting character can be defined by the caller, but this is optional. The default is '"'. If the option -alternate is specified, a slightly different syntax is used to parse the input. This syntax is explained below, in the section FORMAT.
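As a minimal sketch of that suggestion, assuming Tcllib is installed so that the csv package can be loaded, the line from the question could be split like this (the variable names are only illustrative):

```tcl
package require csv

set line {12,"34","56,78"}
# csv::split returns a proper Tcl list; the quoted comma stays inside its field
set fields [csv::split $line]
puts $fields            ;# 12 34 56,78
puts [llength $fields]  ;# 3
```

Because the result is a real list, lindex, foreach and friends can be used on it directly, with no further regexp work.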

The problem seems to be the extra commas: you only accept a quoted string if it has a comma after it, and you do the same for non-quoted tokens. This works:
set fresult [regsub -all {(\")([^\"]+)(\")|([^,\"]+)} $line {{\2\4} } fields]
^(no commas)^
Working Example: http://ideone.com/O2hss
You can safely keep the commas out of the pattern. The regex engine will keep searching for new matches: it will skip a comma it cannot match, and start again at the next character.
Bonus: this will also handle escaped quotes, using \" (if you need to, you should be able to adapt it easily to use "" instead of \\.):
set fresult [regsub -all {"((?:[^"\\]|\\.)+)"|([^,"]+)} $line {{\1\2} } fields]
Example: http://ideone.com/ztkBh
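If a real Tcl list is wanted rather than a substituted string, one possible variation (a sketch, not part of the answer above) is to collect the matches with regexp -all -inline and the same kind of alternation. The names fields, quoted and bare are illustrative, and note that this simple pattern skips empty unquoted fields:

```tcl
set line {12,"34","56,78"}
set fields [list]
# Each match yields three values: the whole match, the quoted body, the bare token
foreach {- quoted bare} [regexp -all -inline {"([^"]*)"|([^,"]+)} $line] {
    # Only one of the two groups is non-empty for any given match
    lappend fields $quoted$bare
}
puts $fields   ;# 12 34 56,78
```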

Use the following regsub
% set line {12,"34","56,78"}
% regsub -all {(,")|(",)|"} $line " " line
% set line
12 34 56,78 <<< Result
Here all the occurrences of ," or ", or " (tried in that order) are replaced by a space.

As you said to @Kobi, if you allow for empty fields, you should also allow for empty strings "":
{((\")([^\"]*)(\")|([^,\"]*))(,|$)} where the fields of interest shifted to 3 and 5
Expanded: { ( (\")([^\"]*)(\") | ([^,\"]*) ) (,|$) } I admit, I don't know if tcl allows (?:) non-capture grouping.
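For what it's worth, Tcl's regular expressions do accept (?:...) non-capturing groups (the answers further down rely on them). A sketch of the empty-field pattern written that way, so that only the two field groups capture, might look like this; the sample line and variable names are made up for illustration:

```tcl
set line {12,"",34}
# Non-capturing groups for the alternation and the separator;
# group 1 is the quoted body, group 2 the unquoted token
set re {(?:"([^"]*)"|([^,"]*))(?:,|$)}
set fields [list]
foreach {- quoted bare} [regexp -all -inline $re $line] {
    lappend fields $quoted$bare
}
puts [llength $fields]   ;# 3 (the middle field is the empty string)
```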

Related

tcl regexp with paren-delimited groups

I have a group of strings that look like:
foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>
So it boils down to a prefix (foo, bar, pizza, ...) followed by any number of attribute names enclosed in angle brackets.
Both the prefix and the attributes may consist of any character with the exception of angle brackets (which are only used for delimiting the attribute names).
Neither the prefix nor the attribute names may be empty.
Now I would like to have a regex in my Tcl application that gives me both the prefix and all the attributes (it's OK if they keep their delimiting brackets, though in the end I have to split them up into a list).
The trivial approach ^(.+)(<.+>)*$ doesn't work because the trailing .+ is too greedy and eats away all the matches for the attribute names.
So I tried excluding the forbidden angle brackets, ^(\[^<>\]+)(<.+>)*$, which works OK at first glance, but then I discovered that this would match fnork<<>><x<>>, violating the rule that attribute names must not contain any angle brackets (apart from the delimiting ones).
Third, I extended the forbidden characters to the attribute names, ^(\[^<>\]+)(<\[^<>\]+>)*$, but now things are getting a bit shady: while the regex only matches valid strings (so neither prefix nor attribute names may contain any brackets), I no longer get all the attribute names back as match parts:
% regexp -all -inline "^(\[^<>\]+)(<\[^<>\]+>)*" "A<xyz><123>"
A<xyz><123> A <123>
For whatever reason the <xyz> is not returned!
Any idea how to fix this?
side-note
(The actual string I'm trying to parse uses square brackets and parentheses as delimiters, something like pizza[large](tomato)(olives)(cheese), where the [term] can appear 0 or 1 times, whereas the (term)s can appear 0 or more times.
But due to the nature of square brackets and parentheses this requires a fair amount of quoting, which is probably too much of a distraction to be useful here.)
In this case, the trick is to use a fairly simple RE and post-process the results:
% regexp -all -inline {^([^<>]+)((?:<[^<>]+>)*)$} foo<xyz><123>
foo<xyz><123> foo <xyz><123>
% regexp -all -inline {[^<>]+} <xyz><123>
xyz 123
You were almost there, but were struggling with using (<[^<>]+>)*, which won't work as that only captures the group one of the times it matches. (I wasn't aware that it captured the last match, but since I rarely want either first or last but rather all, I use a different approach.)
Putting that all together and assuming you've got one big multi-line string that has all the pieces you want to look at in it (e.g., because you've read it from a file) you get:
set str "foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>"
# Find the matching lines and do the first-level extract on them
foreach {- prefix attribs} [regexp -all -line -inline {^([^<>]+)((?:<[^<>]+>)*)$} $str] {
    # Split the attribute names
    set attributes [regexp -all -inline {[^<>]+} $attribs]
    # Show that we've matched them for real
    puts "prefix='$prefix', attributes=[join $attributes ,]"
}
Which produces this output:
prefix='foo', attributes=xyz,123
prefix='bar', attributes=
prefix='pizza', attributes=oregano,tomato,mozzarella
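The same two-stage idea can be carried over to the real notation from the side-note, with square brackets and parentheses. This is only a sketch under the assumptions stated there (one optional [term], then any number of (term)s); writing the pattern in braces keeps the quoting manageable, and the variable names are illustrative:

```tcl
set str {pizza[large](tomato)(olives)(cheese)}
# Group 1: prefix; group 2: optional [term] body; group 3: all (term)s together.
# Inside a bracket expression, a leading ] is taken literally, so [^][()]
# means "any character except ] [ ( )".
set re {^([^][()]+)(?:\[([^][()]+)\])?((?:\([^][()]+\))*)$}
if {[regexp $re $str -> prefix optional attribs]} {
    # Second stage: pull the individual names out of the (term)(term)... run
    set attributes [regexp -all -inline {[^()]+} $attribs]
    puts "prefix='$prefix', optional='$optional', attributes=[join $attributes ,]"
}
```

For the sample string this prints prefix='pizza', optional='large', attributes=tomato,olives,cheese.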
Let's tokenize this.
package require string::token
set lex {[[] LB []] RB [(] LP [)] RP [^][()]+ t}
set str {pizza[large](tomato)(olives)(cheese)}
% set tokens [::string::token text $lex $str]
{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}
Having tokenized, we can parse, or evaluate the tokens as statements in a little language:
% set terms [lassign $tokens prefix]
proc t {str beg end} {
    string range $str $beg $end
}
proc LB {str beg end} {
    return "Optional term is: "
}
proc RB args {
    return \n
}
proc LP {str beg end} {
    rename LP {}
    proc LP args {
        return ", "
    }
    return "Arguments are: "
}
proc RP args {}
% puts "Prefix is: [eval [linsert $prefix 1 $str]]"
Prefix is: pizza
% join [lmap term $terms {eval [linsert $term 1 $str]}] {}
Optional term is: large
Arguments are: tomato, olives, cheese
Documentation:
eval,
join,
lassign,
linsert,
lmap (for Tcl 8.5),
lmap,
package,
proc,
puts,
rename,
return,
set,
string::token (package)
I might have misread the requirements, but given that you have already "encoded" all structural details in your ad hoc notation, why not have the Tcl list machinery do the work?
set str {foo(xyz)(123)
bar
pizza[large](oregano)(tomato)(mozzarella)}
foreach line [split $str \n] {
    set line [string map {"[" " " "]" " " ")(" " " "(" " {" ")" "} "} $line]
    set suffix [lassign $line prefix]
    lassign $suffix a b
    if {[llength $suffix] == 2} {
        set optional $a
        set attributes $b
    } else {
        set optional ""
        set attributes $a
    }
    puts "prefix='$prefix', optional='$optional', attributes='[join $attributes ,]'"
}
I apologise: strictly speaking, my answer does not address the regex question. And there is less wizardry than in the other replies ;)

tcl regexp from variable and special characters

I am a bit confused.
My input string is " foo/1".
I want to put foo in a variable and use that variable in a regexp:
set line " foo/1"
set a foo
regexp "\s$a" $line does not work
I also noticed that it only works if I write out the exact string in curly braces:
regexp {\sfoo} $line works
regexp "\sfoo" $line doesn't work
Can somebody explain why?
Thanks
Quick answer:
"\\s" == {\s}
Long answer:
In Tcl, if you enclose a string in "", the interpreter performs backslash, variable and command substitution on its contents before the command sees it. This means that \s is processed as a backslash escape instead of being kept as two characters; since \s is not an escape that Tcl recognises, the backslash is simply dropped and regexp receives a plain s.
If you want a literal \ character inside a "" string you have to escape it as well: \\. In your case you would have to type "\\sfoo" (or "\\s$a" with the variable).
Strings enclosed in {} are passed through literally, so there is no need to double the backslash.
Using "" is good if you want to use variables or inline commands in the string, for example:
puts "The value $var and the command result: [someCommand $arg]"
The above will evaluate $var and [someCommand $arg] and put them into the string.
If you'd have used braces, for example:
puts {The value $var and the command result: [someCommand $arg]}
The string will not be evaluated. It will contain all the $ and [ characters, just like you typed them.
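A small sketch that puts the two quoting styles side by side, using the values from the question. The format call is just one possible way to combine a brace-quoted regex fragment with a variable:

```tcl
set line " foo/1"
set a foo

puts [string length "\s"]   ;# 1: the backslash is dropped, leaving just "s"
puts [string length {\s}]   ;# 2: braces keep the backslash

puts [regexp "\s$a" $line]               ;# 0: the pattern is really "sfoo"
puts [regexp "\\s$a" $line]              ;# 1: the pattern is \sfoo
puts [regexp [format {\s%s} $a] $line]   ;# 1: braces for the regex part, format for the variable
```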

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns a 0 when I would expect a 1. I'm fairly certain that when ${pat} is passed to the regexp engine, the brackets are being interpreted as a bracket expression, i.e. read as "[9]" instead of "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
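If the escaping is needed in several places, it could be wrapped in a small helper. This is only a sketch: regexEscape is a hypothetical name, not a built-in command, and it simply packages the regsub shown above:

```tcl
# Hypothetical helper: backslash-escape every non-word character so the
# string is matched literally when embedded in a larger regular expression
proc regexEscape {str} {
    regsub -all {\W} $str {\\&} escaped
    return $escaped
}

set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexEscape $pat]                           ;# pair_shap_val\[9\]
puts [regexp "[regexEscape $pat]_affin.*" $aff]   ;# 1
```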

perl regexp for multi line file

I have patterns in a file which look like this:
db::parameter nf
-data. Value. \
-data2. Value2. \
db::parameter ww
-data1. Value1. \
-data2. Value2. \
I need a regexp which will take the whole pattern into a variable, starting from db.
I tried to match the pattern until an empty line shows up:
while(<$infile>){
chomp;
If( $_=~/db:parameter\s+$){
print $_;}
P.S. I know the regexp is totally wrong, but I'm not that good at regexps.
If you want to use an empty line as a record separator, may I suggest using paragraph mode?
$/ = ""; # set input record separator to empty string
while (<>) { # proceed as usual
Using the empty string is a special case, as described in the documentation linked above:
Setting $/ to "\n\n" means something slightly different than setting to "" , if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.

need help in tcl command usage for regsub

I am new to Tcl and have run into an issue when using regsub. Consider the following scenario:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
regsub -all ${test2}_[1-9] $test1 [list] test1
I expected the resulting $test1 to be [prefix_abc_3 AAA_0].
However, regsub has also removed the partially matching part of prefix_abc_3. Does anyone have an idea how to make regsub remove only exact words in a list?
I tried to find a solution on the net but could not find any clues or hints. I'd appreciate it if someone here could help me.
\m and \M in regexps match the beginning and end of a word respectively. But you don't have a string of words in test1, but a list of elements, and sometimes there's a difference so don't mix the two. regsub only handles strings while lsearch works with lists:
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
set test1 [lsearch -all -inline -not -regexp $test1 "^${test2}_\[1-9\]\$"]
If the pattern is that simple, you can use the -glob option (the default) instead of -regexp and maybe save some processor time.
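A sketch of the -glob variant mentioned above, using the data from the question (kept is an illustrative variable name). A glob pattern has to match the whole element, so no anchors are needed:

```tcl
set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
set test2 abc
# -not inverts the match, -all -inline returns every non-matching element;
# the brackets are escaped so Tcl does not try to run a command named 1-9
set kept [lsearch -all -inline -not -glob $test1 "${test2}_\[1-9\]"]
puts $kept   ;# prefix_abc_3 AAA_0
```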
What exactly did you execute?
When I type the commands above into tclsh, it displays an error -
% set test1 [list prefix_abc_3 abc_1 abc_2 AAA_0]
prefix_abc_3 abc_1 abc_2 AAA_0
% set test2 abc
abc
% regsub -all ${test2}_[1-9] [list] test1
invalid command name "1-9"
I'm unsure what you are trying to do. You start by initialising test1 as a list. You then treat it as a string by passing it to regsub. This is a completely legal thing to do, but may indicate that you are confused by something. Are you trying to test your substitution by applying it four times, to each of prefix_abc_3, abc_1, abc_2 and AAA_0? You can certainly do that the way you are, but a more natural way would be to do
foreach test $test1 {
    regsub $pattern $test [list] testResult
    puts stdout $testResult
}
Then again, what are you trying to achieve with your substitution? It looks as though you are trying to replace the string abc with a null string, i.e. remove it altogether. Passing [list] as a null string is perfectly valid, but again may indicate confusion between lists and strings.
To achieve the result you want, all you need to do is add a leading space to your pattern, pass a space as the substitution string and escape the square brackets, i.e.
regsub -all " ${test2}_\[1-9\]" $test1 " " test1
but I suspect that this is a made-up example and you're really trying to do something slightly different.
Edit
To obtain a list that contains just those list entries that don't exactly match your pattern, I suggest
proc removeExactMatches {input} {
    set result [list]; # Initialise the result list
    foreach inputElement $input {
        if {![regexp {^abc_[0-9]$} $inputElement]} {
            lappend result $inputElement
        }
    }
    return $result
}
set test1 [removeExactMatches [list prefix_abc_3 abc_1 abc_2 AAA_0]]
Notes:
i) I don't use regsub at all.
ii) Although it's safe and legal to switch around between lists and strings, it all takes time and it obscures what I'm trying to do, so I avoid it wherever possible. You seem to have a list of strings and you want to remove some of them, so that's what I use in my suggested solution. The regular expression commands in Tcl handle strings, so I pass them strings.
iii) To ensure that the list elements match exactly, I anchor the pattern to the start and end of the string that I'm matching against using ^ and $.
iv) To prevent the interpreter from recognising the [1-9] in the regular expression pattern and trying to execute a (nonexistent) command 1-9, I enclose the whole pattern string within curly brackets.
v) For greater generality, I might want to pass the pattern to the proc as well as the input list (of strings); in that case, I'd do
proc removeExactMatches {inputPattern input} {
.
.
.
set pattern "^"
append pattern $inputPattern
append pattern "\$"
.
.
.
if {![regexp $pattern $inputElement]} {
.
.
.
}
set test1 [removeExactMatches {abc_[1-9]} {prefix_abc_3 abc_1 abc_2 AAA_0}]
to minimise the number of characters that had to be escaped. (Actually I probably wouldn't use the quotation marks for the start and end anchors within the proc - they aren't really needed and I'm a lazy typist!)
Looking at your original question, it seems that you might want to vary only the abc part of the pattern, in which case you might want to just pass that to your proc and append the _[0-9] as well as the anchors within it - don't forget to escape the square brackets or use curly brackets if you go down this route.
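A minimal sketch of that last variant, where only the varying stem is passed in and the proc adds the _[0-9] suffix and the anchors itself. The name removeNumbered is made up for illustration, and the sketch assumes the stem contains no regular-expression metacharacters:

```tcl
# Hypothetical helper: drop every element that is exactly <stem>_<digit>
proc removeNumbered {stem input} {
    # Brackets and the dollar sign are escaped because the pattern is built
    # inside double quotes; the resulting pattern is e.g. ^abc_[0-9]$
    set pattern "^${stem}_\[0-9\]\$"
    set result [list]
    foreach element $input {
        if {![regexp -- $pattern $element]} {
            lappend result $element
        }
    }
    return $result
}

set test1 [removeNumbered abc {prefix_abc_3 abc_1 abc_2 AAA_0}]
puts $test1   ;# prefix_abc_3 AAA_0
```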