Matching a regexp in TCL PERL - regex

I have the following pattern:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
I want to separate each Pattern block. I am using Tcl. The regexp I am using does not do the job:
set updateList [regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list]
Which regexp should I use to separate each Pattern block?
I need the output as:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220

Your pattern Pattern\[\d+\].*?Value.*?\n mixes greedy and lazy quantifiers. Tcl does not handle mixed quantifier types the way you would expect from, say, PCRE (PHP, Perl) or .NET: the first quantifier found sets the preference, and the subsequent quantifiers inherit it. So, since the + after \d is greedy, all the others (in .*?) are also greedy, even though you declared them lazy. Also, the . matches a newline in Tcl regex, so your pattern matches the whole input as a single block.
So, based on your regex, you can make the \d+ lazy with \d+? and replace \n at the end with (?:\n|$) to match either a newline or the end of the string:
set RE {Pattern\[\d+?\].*?Value.*?(?:\n|$)}
set updateList [regexp -all -inline $RE $str]
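The effect of the first quantifier can be checked on a small sample. The following is a minimal sketch, assuming a shortened two-block input without a trailing newline (the variable names are illustrative); it only counts how many matches each variant returns. The greedy variant should report one match and the lazy one two.

```tcl
# Shortened sample input (an assumption): two blocks, no trailing newline.
set str [join {
    {Pattern[1]:} {Key : "key1"} {Value : 100}
    {Pattern[2]:} {Key : "key2"} {Value : 20}
} \n]

# First quantifier (\d+) is greedy, so the whole RE prefers the longest match.
set greedy [regexp -all -inline {Pattern\[\d+\].*?Value.*?(?:\n|$)} $str]

# First quantifier (\d+?) is lazy, so the whole RE prefers the shortest match.
set lazy [regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $str]

puts "greedy: [llength $greedy] match(es)"
puts "lazy:   [llength $lazy] match(es)"
```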
Alternative 1
Also, you can use a more verbose regex if your input string always has the same structure with all elements - Pattern, Key, Value - present:
set updateList [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
Since a . can match a newline, we need to use a [^\n] negated character class matching any character but a line feed.
Alternative 2
You can use an unrolled lazy subpattern that matches Pattern[n]: and then any text that does not start another Pattern[n]: sequence:
set RE {Pattern\[\d+\]:[^P]*(?:P(?!attern\[\d+\]:)[^P]*)*}
set updateList [regexp -all -inline $RE $str]

Try this
Pattern\[\d+\](.|\n)*?Value.*?\n
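In many regex flavors the dot . matches any character except a line break, so the pattern adds \n explicitly. In Tcl the dot already matches a newline unless -line or -linestop is used, so the alternation is harmless there. Be aware that your lines may end with a carriage return, so you might need to add \r as well.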

% set list { Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list
{Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list ;# only changing `\d+` to `\d+?`
{Pattern[1]:
Key : "key1"
Value : 100
} {Pattern[2]:
Key : "key2"
Value : 20
} {Pattern[3]:
Key : "key3"
Value : 30
} {Pattern[4]:
Key : "key4"
Value : 220
}
If $list does not end with a newline, you won't get the "Pattern[4]" element returned. In that case, change
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list
to
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $list

You want to capture blocks of lines and output them with blank lines in between. Your example data displays patterns on different levels that can be used to recognize which lines belong to which block.
The simplest pattern is this: every three lines in the input make up a block. This pattern suggests processing like this:
set lines [split [string trim $list \n] \n]
foreach {a b c} $lines {puts $a\n$b\n$c\n\n}
There is nothing in your example data that suggests that this wouldn't work. Still, there may be some complications that aren't reflected in your example data.
If there are stray blank lines in the input, you might need to get rid of them first:
set lines [lmap line $lines {if {[string is space $line]} continue else {set line}}]
If some blocks contain fewer or more lines than in your example, another simple pattern is that every block starts with a line that has optional whitespace followed by the word Pattern. Those lines (except the first) should be preceded by a block delimiter in the output:
set lines [split [string trim $list \n] \n]
puts [lindex $lines 0]
foreach line [lrange $lines 1 end] {
    # Anchor the match so only lines that start a block are recognized
    if {[regexp {^\s*Pattern} $line]} {
        puts \n$line
    } else {
        puts $line
    }
}
puts \n
If the lines don't actually begin with whitespace, you could use string match Pattern* $line instead of the regular expression.
Documentation: continue, foreach, if, lindex, lmap, lmap replacement, lrange, puts, regexp, set, split, string

Related

Tcl regexp with paren-delimited groups

I have a group of strings that look like:
foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>
So it boils down to a prefix (foo, bar, pizza, ...) followed by any number of attribute names enclosed in angle brackets.
Both the prefix and the attributes may consist of any character except angle brackets (which are only used for delimiting the attribute names).
Neither the prefix nor the attribute names may be empty.
Now I would like to have a regex in my Tcl application that gives me both the prefix and all the attributes (it's OK if they keep their delimiting brackets, though in the end I have to split them up into a list).
The trivial approach ^(.+)(<.+>)*$ doesn't work because the first .+ is too greedy and eats away all the matches for the attribute names.
So I tried excluding the forbidden angle brackets, ^(\[^<>\]+)(<.+>)*$, which works OK at first glance. But then I discovered that this would match fnork<<>><x<>>, violating the rule that attribute names must not contain any angle brackets (apart from the delimiting ones).
Third, I extended the forbidden characters to the attribute names, ^(\[^<>\]+)(<\[^<>\]+>)*$, but now things are getting a bit shady: while the regex only matches valid strings (so neither prefix nor attribute names contain any brackets), I no longer get all the attribute names as match parts:
% regexp -all -inline "^(\[^<>\]+)(<\[^<>\]+>)*" "A<xyz><123>"
A<xyz><123> A <123>
For whatever reason the <xyz> is not returned!
Any idea how to fix this?
Side note:
The actual string I'm trying to parse uses square brackets and parentheses as delimiters, something like pizza[large](tomato)(olives)(cheese), where the [term] can appear 0 or 1 times, whereas the (term)s can appear 0 or more times.
(But due to the nature of square brackets and parentheses this requires a fair amount of quoting, which is probably too much of a distraction to be useful here.)
In this case, the trick is to use a fairly simple RE and post-process the results:
% regexp -all -inline {^([^<>]+)((?:<[^<>]+>)*)$} foo<xyz><123>
foo<xyz><123> foo <xyz><123>
% regexp -all -inline {[^<>]+} <xyz><123>
xyz 123
You were almost there, but were struggling with using (<[^<>]+>)*, which won't work as that only captures the group one of the times it matches. (I wasn't aware that it captured the last match, but since I rarely want either first or last but rather all, I use a different approach.)
Putting that all together and assuming you've got one big multi-line string that has all the pieces you want to look at in it (e.g., because you've read it from a file) you get:
set str "foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>"
# Find the matching lines and do the first-level extract on them
foreach {- prefix attribs} [regexp -all -line -inline {^([^<>]+)((?:<[^<>]+>)*)$} $str] {
    # Split the attribute names
    set attributes [regexp -all -inline {[^<>]+} $attribs]
    # Show that we've matched them for real
    puts "prefix='$prefix', attributes=[join $attributes ,]"
}
Which produces this output:
prefix='foo', attributes=xyz,123
prefix='bar', attributes=
prefix='pizza', attributes=oregano,tomato,mozzarella
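The same two-step idea can be carried over to the bracket-and-parenthesis notation from the side note. This is only a sketch under assumptions: the pattern and the variable names (re, optional, attribs) are illustrative and not part of the original answer, and the optional [term] is assumed to come directly after the prefix.

```tcl
set str {pizza[large](tomato)(olives)(cheese)}

# prefix, then an optional [term], then zero or more (term)s.
# [^][()] is a bracket expression excluding ], [, ( and ).
set re {^([^][()]+)(?:\[([^][()]+)\])?((?:\([^][()]+\))*)$}

if {[regexp $re $str -> prefix optional attribs]} {
    # Second step: split the run of (term)s into a list
    set attributes [regexp -all -inline {[^()]+} $attribs]
    puts "prefix='$prefix', optional='$optional', attributes=[join $attributes ,]"
}
```

Keeping the pattern in braces avoids most of the quoting the side note worries about, since Tcl performs no substitution inside braces.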
Let's tokenize this.
package require string::token
set lex {[[] LB []] RB [(] LP [)] RP [^][()]+ t}
set str {pizza[large](tomato)(olives)(cheese)}
% set tokens [::string::token text $lex $str]
{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}
Having tokenized, we can parse, or evaluate the tokens as statements in a little language:
% set terms [lassign $tokens prefix]
proc t {str beg end} {
    string range $str $beg $end
}
proc LB {str beg end} {
    return "Optional term is: "
}
proc RB args {
    return \n
}
proc LP {str beg end} {
    rename LP {}
    proc LP args {
        return ", "
    }
    return "Arguments are: "
}
proc RP args {}
% puts "Prefix is: [eval [linsert $prefix 1 $str]]"
Prefix is: pizza
% join [lmap term $terms {eval [linsert $term 1 $str]}] {}
Optional term is: large
Arguments are: tomato, olives, cheese
Documentation:
eval,
join,
lassign,
linsert,
lmap (for Tcl 8.5),
lmap,
package,
proc,
puts,
rename,
return,
set,
string::token (package)
I might have misread the requirements, but given that you have already "encoded" all structural details in your ad hoc notation, why not have the Tcl list machinery do the work?
set str {foo(xyz)(123)
bar
pizza[large](oregano)(tomato)(mozzarella)}
foreach line [split $str \n] {
    set line [string map {"[" " " "]" " " ")(" " " "(" " {" ")" "} "} $line]
    set suffix [lassign $line prefix]
    lassign $suffix a b
    if {[llength $suffix] == 2} {
        set optional $a
        set attributes $b
    } else {
        set optional ""
        set attributes $a
    }
    puts "prefix='$prefix', optional='$optional', attributes='[join $attributes ,]'"
}
I apologise: strictly speaking, my answer does not address the regex question. And there is less wizardry than in the other replies ;)

Tcl regexp not returning all matches

I am reading a file, the content is as below:
Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------
So I wanted to extract the information between every pair of dashed lines and put it into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:
regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data
As far as I know, the matched result is returned as a list containing the fullMatch and the subMatch.
I double-checked with the llength command; there is only one fullMatch and one subMatch.
llength $data
2
Why is there only 1 subMatch? There are supposed to be 5 matches, like below:
Aug2017:
--------------------------------------
Name Age Phone --> 1st Match
--------------------------------------
Jack 25 128736372
Peter 26 987840392 --> 2nd Match
--------------------------------------
Sep2017: --> 3rd Match
--------------------------------------
Name Age Phone --> 4th Match
--------------------------------------
Jared 21 874892032
Eric 24 847938427 --> 5th Match
--------------------------------------
So in this case, I am choosing the second list element (subMatch) with lindex.
lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1
However, the result I got looks like this; it seems to match from the beginning to the end of the content:
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
My impression was that regexp should start at the beginning and match sequentially to the end of the string, so I am not sure why the Tcl regex behaves like this. Am I missing something?
** The main thing I want to achieve here is to extract the data between the dashed separators; the above data is just an example.
Expected result: a list containing all matches
{ {Name Age Phone} -->1st match
{Jack 25 128736372
Peter 26 987840392} -->2nd match
{Sep2017:} -->3rd match
{Name Age Phone} -->4th match
{Jared 21 874892032
Eric 24 847938427} -->5th match
}
UPDATE:
I have slightly changed my Tcl regex as below, to include the lookahead and the suggestion by @glenn:
regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data
The result I got (10 submatches):
{ {----------------------
Name Age Phone} -->1st match
{Name Age Phone} -->2nd match
{----------------------
Jack 25 128736372
Peter 26 987840392} -->3rd match
{Jack 25 128736372
Peter 26 987840392} -->4th match
{----------------------
Sep2017:} -->5th match
{Sep2017:} -->6th match
...
...
}
It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.
Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
A regular expression-based filter, closely matched to your example lines:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {[regexp {:} $line]} continue
    if {![regexp {\d} $line]} continue
    puts $line
}
close $f
Rationale: only month name lines have colons, header lines and separators have no digits in them.
A filter that doesn't rely as much on regular expressions:
set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
    if {$skip < 1} {
        if {[regexp {\-{2,}} $line]} {
            set skip 4
        } else {
            puts $line
        }
    } else {
        incr skip -1
    }
}
close $f
This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.
(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
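The switch issue described in the note can be seen in a short experiment. This is a sketch for illustration only; the variable names line, failed and msg are made up here.

```tcl
set line "--------------------------------------"

# Unescaped leading dash: regexp tries to parse the pattern as a switch and raises an error.
set failed [catch {regexp {-{2,}} $line} msg]
puts "failed=$failed"

# Either mark the end of the switches with -- or escape the dash.
puts [regexp -- {-{2,}} $line]
puts [regexp {\-{2,}} $line]
```

The first call is wrapped in catch so that the script keeps running; msg then holds the error text reported by regexp.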
ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
close $f
Or:
package require fileutil
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
This should also work:
regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}
Enabling newline-sensitive matching, this matches and removes all lines consisting only of whitespace, dashes, optional non-newlines and either a newline character or the end-of-outer-string.
To collect a list of matches rather than just printing filtered lines:
set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        append matchtext $line\n
    } else {
        lappend matches $matchtext
        set matchtext {}
    }
}
After running this, the variable matches contains a list whose items are contiguous lines between separators.
Another way to do the same thing:
::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}
(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)
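For the five-element list asked for in the question, a plain line loop over the string in memory may be enough. The following is a sketch under assumptions: the sample data is shortened, the variable names (block, started) are illustrative, and text before the first separator is deliberately dropped because the expected result does not include the first month line.

```tcl
# Shortened sample data (an assumption); real input would be read from the file.
set data [join {
    Aug2017:
    -----
    {Name Age Phone}
    -----
    {Jack 25 128736372}
    {Peter 26 987840392}
    -----
    Sep2017:
    -----
    {Name Age Phone}
    -----
    {Jared 21 874892032}
    {Eric 24 847938427}
    -----
} \n]

set matches {}
set block {}
set started 0   ;# becomes 1 once the first separator has been seen
foreach line [split $data \n] {
    if {[regexp {^\s*-{2,}\s*$} $line]} {
        # A separator closes the current block; empty blocks are skipped.
        if {$started && [llength $block] > 0} {
            lappend matches [join $block \n]
        }
        set block {}
        set started 1
    } else {
        lappend block [string trim $line]
    }
}
puts [llength $matches]
foreach m $matches { puts "{$m}" }
```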
Documentation:
< (operator),
>= (operator),
append,
close,
continue,
fileutil (package),
gets,
if,
incr,
lappend,
open,
package,
puts,
regexp,
set,
textutil (package),
while,
Syntax of Tcl regular expressions

Extract value using regex when space is missing

I am trying to extract the text and the third column from the below output. My problem is that there is one line where the space is missing. Is it possible to extract that value in this case?
4086 process-working 841901 841901 1234 22
4297 procesor_stats_controller_fmm543182 543182 0 22
4028 ipv6_ma 3063025 3063025 -55 78
4280 tty-verifyd 694043 694043 0 22
My regex so far looks like this:
\d+\s+(\w+-?\w+)\s*\d+\s+\d+\s+(-?\d+)\s+\d+
Thank you
EDIT: it's actually a bug in the device, at least one space should be there, so I'll just let them fix it and then retry. Thank you for taking the time to answer this :)
In this case, I'd first split the line into fields:
foreach line $lines {
    set fields [regexp -inline -all {\S+} $line]
    if {[llength $fields] == 6} {
        puts [lindex $fields 2]
    } else {
        # extract the digits at the end of this field
        regexp {\d+$} [lindex $fields 1] value
        puts $value
    }
}
841901
543182
3063025
694043
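The problem is with the \w, which is equivalent to [a-zA-Z0-9_].
So it also matches the digits that follow the name when the space is missing.
Instead of letting the name end with \w, require its last character to be a letter or an underscore, [a-zA-Z_]. The fifth column can also be negative, so allow an optional minus sign there. A regular expression along these lines should work for you:
\d+\s+([a-zA-Z0-9_-]*[a-zA-Z_])\s*(\d+)\s+\d+\s+-?\d+\s+\d+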
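A quick way to check a pattern of this kind against all four sample lines is sketched below. The pattern is an assumption made for illustration: it lets the name contain digits (as in ipv6_ma) but requires it to end with a letter or an underscore, and it allows a minus sign in the fifth column. The variable names (re, rows, name, first, delta) are made up here.

```tcl
set lines {
    {4086 process-working 841901 841901 1234 22}
    {4297 procesor_stats_controller_fmm543182 543182 0 22}
    {4028 ipv6_ma 3063025 3063025 -55 78}
    {4280 tty-verifyd 694043 694043 0 22}
}

# Name: word characters or dashes, ending in a non-digit; the space after it is optional.
set re {^\d+\s+([a-zA-Z0-9_-]*[a-zA-Z_])\s*(\d+)\s+\d+\s+(-?\d+)\s+\d+$}

set rows {}
foreach line $lines {
    if {[regexp $re $line -> name first delta]} {
        lappend rows [list $name $first $delta]
        puts "$name $first $delta"
    }
}
```

This only works as long as a name never ends in a digit; if the device can emit such names, the missing space cannot be recovered from the text alone.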

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns a 0 when I would expect a 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are being read as the bracket expression "[9]" instead of the literal "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
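If the escaping is needed in several places, it can be wrapped in a small helper. This is a sketch only: reQuote is a hypothetical name, not a built-in Tcl command, and it simply applies the regsub shown above.

```tcl
# Hypothetical helper: backslash-escape every non-word character
# so the string can be embedded in a regular expression literally.
proc reQuote {s} {
    return [regsub -all {\W} $s {\\&}]
}

set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}

puts [reQuote $pat]
puts [regexp "^[reQuote $pat]_affin.*" $aff]
```

Escaping everything outside \w is broader than strictly necessary, but it avoids having to list the regex metacharacters by hand.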

How do I extract all matches with a Tcl regex?

Hi everybody, I want a solution for this regular expression problem: extract all the hex numbers of the form H'xxxx. I used the regexp below, but I get only one number instead of all the hex values. How do I get all the hex numbers from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I am getting is 5 H4D52
On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
    ...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
# 1st match 2nd match 3rd match 4th match 5th match 6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
    puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, e.g. [0-9A-F]{4} is more specific for matching exactly 4 hexadecimal digits
If you insist on matching the &, then the last hex string (H'4DC9 in your input) can not be matched
This explains why you get 4D52 in the original script, because that's the last match with &
Maybe get rid of the &, or use (&|$) instead, i.e. a & or the end of the string $.
References
regular-expressions.info/Finite Repetition, Anchors
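If the & should stay in the pattern, the (&|$) variant mentioned above adds a second capture group, so each match becomes a 3-tuple in the flat list. A minimal sketch, with illustrative variable names (whole, hex, sep, hexValues):

```tcl
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"

# Two capture groups: the hex digits, and the separator (& or end of string).
set hexValues {}
foreach {whole hex sep} [regexp -all -inline {H'([0-9A-F]+)(&|$)} $text] {
    lappend hexValues $hex
}
puts $hexValues
puts [llength $hexValues]
```

The last tuple has an empty sep, because the final hex number is followed by the end of the string rather than by &.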
I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string